Using OpenMP-ized C code with R

2011-08-11

What is OpenMP? In short, a standard compiler extension that makes it easy to distribute calculations over multiple processors in a shared-memory manner. This matters especially when dealing with large data: the simple separate-process approach usually requires as many copies of the working data as there are threads, which can be prohibitive in memory alone, not to mention the time wasted on copying.

The magic of OpenMP is that once you have C or Fortran code, in most cases you need nothing more than a few pragmas and additional compiler flags — thus the code remains as portable and readable as before the modification. And it is usually just nice and simple, not counting a few common parallelism traps and some quirks related to the fact that we want it to work with R.

In this post I don’t want to write yet another OMP tutorial (the web is full of them), but rather show how to use it with R. Thus, I’ll use a toy example: a function that calculates the cumulative sum in an unnecessarily demanding way:

 #include <R.h>
 #include <Rinternals.h>
 SEXP dumbCumsum(SEXP a){
  SEXP ans;
  PROTECT(a=coerceVector(a,REALSXP));
  PROTECT(ans=allocVector(REALSXP,length(a)));
  double* Ans=REAL(ans);
  double* A=REAL(a);
  for(int e=0;e<length(a);e++){
   Ans[e]=0.;
   for(int ee=0;ee<e+1;ee++)
    Ans[e]+=A[ee];
  }
  UNPROTECT(2);
  return(ans);
 }

There is a single outer for loop responsible for most of the computational time, and no race conditions (each iteration writes only to its own element of Ans), so the OMP-ized version will look like this:

 #include <R.h>
 #include <Rinternals.h>
 #include <omp.h>
 SEXP dumbCumsum(SEXP a){
  SEXP ans;
  PROTECT(a=coerceVector(a,REALSXP));
  PROTECT(ans=allocVector(REALSXP,length(a)));
  double* Ans=REAL(ans);
  double* A=REAL(a);
  #pragma omp parallel for
  for(int e=0;e<length(a);e++){
   Ans[e]=0.;
   for(int ee=0;ee<e+1;ee++)
    Ans[e]+=A[ee];
  }
  UNPROTECT(2);
  return(ans);
 }

Time for R-specific improvements. First of all, it is good to give the user an option to select the number of cores to use (for instance, someone with 16 cores may want to use the first 8 for one job and the other 8 for something else — without such an option they would have to stick to sequential execution); yet it is also nice to have a simple way to use the full capabilities of the system. To this end we will give our function an appropriate argument and use OMP functions to comply with it:

 #include <R.h>
 #include <Rinternals.h>
 #include <omp.h>
 SEXP dumbCumsum(SEXP a,SEXP reqCores){
  //Set the number of threads
  PROTECT(reqCores=coerceVector(reqCores,INTSXP));
  int useCores=INTEGER(reqCores)[0];
  int haveCores=omp_get_num_procs();
  if(useCores<=0 || useCores>haveCores) useCores=haveCores;
  omp_set_num_threads(useCores);
  //Do the job
  SEXP ans;
  PROTECT(a=coerceVector(a,REALSXP));
  PROTECT(ans=allocVector(REALSXP,length(a)));
  double* Ans=REAL(ans);
  double* A=REAL(a);
  #pragma omp parallel for
  for(int e=0;e<length(a);e++){
   Ans[e]=0.;
   for(int ee=0;ee<e+1;ee++)
    Ans[e]+=A[ee];
  }
  UNPROTECT(3);
  return(ans);
 }

This code also ensures that the number of threads won’t exceed the number of processors reported by the system (omp_get_num_procs counts logical processors); requesting more gives no speedup and incurs a performance loss from OS scheduling overhead.

Finally, time to resolve a small quirk — R has some code guarding the C call stack from overflows, which is not thread-aware and thus has a tendency to panic and bring the whole R session down when running parallel code. We need to disable it, using the trick featured in Writing R Extensions. First, we define CSTACK_DEFNS and include Rinterface.h to get access to the variable holding the stack limit:

 #define CSTACK_DEFNS 7
 #include "Rinterface.h"

and then set it to almost infinity in the code

 R_CStackLimit=(uintptr_t)-1;

Voilà, the stack is now unguarded — working with R has just become a bit more dangerous, but we can run parallel code without strange problems. The full code looks like this:

 #include <R.h>
 #include <Rinternals.h>
 #include <omp.h>
 #define CSTACK_DEFNS 7
 #include "Rinterface.h"
 SEXP dumbCumsum(SEXP a,SEXP reqCores){
  R_CStackLimit=(uintptr_t)-1;
  //Set the number of threads
  PROTECT(reqCores=coerceVector(reqCores,INTSXP));
  int useCores=INTEGER(reqCores)[0];
  int haveCores=omp_get_num_procs();
  if(useCores<=0 || useCores>haveCores) useCores=haveCores;
  omp_set_num_threads(useCores);
  //Do the job
  SEXP ans;
  PROTECT(a=coerceVector(a,REALSXP));
  PROTECT(ans=allocVector(REALSXP,length(a)));
  double* Ans=REAL(ans);
  double* A=REAL(a);
  #pragma omp parallel for
  for(int e=0;e<length(a);e++){
   Ans[e]=0.;
   for(int ee=0;ee<e+1;ee++)
    Ans[e]+=A[ee];
  }
  UNPROTECT(3);
  return(ans);
 }

Now, time to make sure that R will compile our function with OMP support (and thus make it parallel). To this end, we create a Makevars file (in src in the case of a package, or in the code directory when using dangling object files) with the following contents (for GCC):

 PKG_CFLAGS=-fopenmp
 PKG_LIBS=-lgomp
 

The first line makes the compiler parse OMP pragmas; the second links the GNU OpenMP runtime library, which provides the omp_* functions.
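As a side note, recent versions of R (2.13.0 and later, if I recall the version correctly — treat that as an assumption worth checking) provide a portable macro in Writing R Extensions, so the same Makevars works with compilers other than GCC:

```make
PKG_CFLAGS = $(SHLIB_OPENMP_CFLAGS)
PKG_LIBS = $(SHLIB_OPENMP_CFLAGS)
```

The macro expands to whatever OpenMP flags the compiler used to build R understands, and to nothing if that compiler has no OpenMP support.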

We are ready to test our example:

 $ R CMD SHLIB omp_sample.c
 $ R
 > dyn.load('omp_sample.so')
 > .Call('dumbCumsum',runif(100000),0L)

Try running some system monitor (like htop, or a GUI one that comes with your desktop environment) and watch your powerful CPU finally being fully utilized (-;

To temper the optimism a bit, a few words about limitations. Don’t even try to run any R code or use features like random number generation or Ralloc inside parallelized blocks — the R engine is not thread-safe, so this will end in a more or less spectacular failure.

Plus of course all issues of parallel programming and OMP itself also apply — but that is a different story.

Filed under R tips&tricks

Does Jon Skeet have mental powers that make us upvote his answers? (The effect of reputation on upvotes)

2011-08-04

Of course, since we all know Jon Skeet does have various powers, I will move on to a still unanswered question: whether a user’s reputation makes their answers receive more upvotes. I’ve seen this theory mentioned in multiple places (see any of the comments to Jon Skeet’s answers along the lines of “If this was posted by someone other than Jon Skeet, would it have gotten as many upvotes?”). It is similar to the question we were supposed to address in the currently dormant Polystats project as well.

Examining user and post data from the SO Data Dump, I first looked at the correlation between users’ reputation and their most recent posts. Below is a scatterplot, with the user’s current reputation (as of the June 2011 data dump) on the x-axis and the current score on their most recent non-community-wiki answer on the y-axis.

Scatterplot: user reputation (x-axis) vs. score of the most recent answer (y-axis)

One can discern a very slight correlation between reputation and the score per answer (score = upvotes – downvotes). While consistent with a theory of reputation effects, an obvious alternative explanation is simply that those with higher reputation give better answers (I doubt they bought their way to the top!). I’m sure this is true, but I suspect they always gave higher-quality answers, even before they had their high reputation. Hence, a natural comparison is to assess whether high-rep users get more upvotes compared to answers they gave when they did not have as high a reputation.

To assess whether this is true, below I have another scatterplot. The x-axis represents the sequential number of the non-community-wiki answer for a particular user, and the y-axis represents that post’s score minus the mean score of all of the user’s posts. The scatterplot on the top shows all other SO users with a reputation higher than 50,000, and the scatterplot on the bottom shows Jon Skeet (he deserves his own simply for the number of posts he has made as of this data dump — over 14,000!).

In simple terms, if reputation effects existed, you would expect to see a positive correlation between the post order number and the score. For the more statistically savvy: if one were to fit a regression line to this data, it would be referred to as a fixed effects regression model, in which one only assesses score variance within a user (i.e., a user becomes their own counterfactual). Since the above graphic is dominated by a few outlying answers (including one with over 4,000 upvotes!), I made the same graphic as above except with the Y axis restricted to between -10 and 25 and an increased transparency level.
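For concreteness, the within-user setup just described can be written out as a fixed effects model (the notation below is mine, not from the original analysis):

```latex
% score of user i's p-th answer; \alpha_i absorbs the user's baseline quality
\text{score}_{ip} = \alpha_i + \beta\, p + \varepsilon_{ip}
% subtracting each user's mean removes \alpha_i, leaving only within-user variance:
\text{score}_{ip} - \overline{\text{score}}_{i}
  = \beta \,( p - \bar{p}_{i} ) + ( \varepsilon_{ip} - \bar{\varepsilon}_{i} )
```

A positive estimate of \beta would be evidence of reputation effects.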

As one can see, there is not much of a correlation for Jon Skeet (and it appears slightly negative for mortal users). When I fit the actual regression line, the slope is slightly negative for both mortal users and Jon Skeet. What appears to happen is that various outlier answers garner an incredibly high number of upvotes, although these appear to have happened for Jon Skeet (and the other top users) even in their earlier posts. It seems likely that these particular posts attract a lot of attention (and hence give the appearance of reputation effects). But it appears that, on average, these high-rep users always had a high score per answer, even before they gathered a high reputation. Perhaps future analysis could examine whether high-rep users are more likely to receive these aberrantly high numbers of upvotes per answer.

This analysis does come with some caveats. If reputation effects are realized early on (say, within the first 100 posts), they wouldn’t be apparent in this graph. While other factors likely affect upvoting as well (such as views, tags, content), these seem less likely to be directly related to a poster’s reputation, especially when only examining within-user score deviations. If you disagree, do some analysis yourself!

I suspect I will be doing some more analysis of behavior on the Stack Exchange sites now that I have some working knowledge of the datasets, so if you’ve done some analysis (like what Qiaochu Yuan recently did), let me know in the comments. Hopefully this revives some interest in the Polystats project as well, as there are a host of other things we could be examining (besides doing a better job of explaining reputation effects than I did in this little bit of exploratory data analysis). Examples of other suggested theories for voting behavior are pity upvotes and sorting effects.

For those interested, the data manipulation was done in SPSS, and the plots were done in R with the ggplot2 package.

Filed under viewpoints

Using Emacs to work with R

2011-08-02

A simple yet efficient way to work with R consists in writing R code with your favorite text editor and sending it to the R console. This allows one to build efficient R code in an incremental fashion. A good editor might even provide syntax highlighting, parenthesis matching, and a way to send a selected portion of code to R directly. That may appear a crude way of working with R, especially to those used to statistical packages featuring a spreadsheet view of the data and a lot of menus and buttons to interact with. However, R is a statistical language and offers far more interactivity, even if it can hardly be reduced to a series of click-and-go actions. So, basically, let’s keep it simple and just use an R-aware text editor.

Well, install Emacs if it is not already present on your system, and you’re almost done. Emacs is a powerful tool (it’s difficult to call it just an editor) for programmers and anyone dealing with text files. It offers a lot of functionality and would already be suitable for the basic copy/paste workflow described above. But wait — there is more: the ESS package. With it, you get access to a lot of R-specific functionality, including syntax highlighting, auto-indentation of code, line-by-line evaluation, etc., and you won’t have to open an external R console: everything can be done from within Emacs.

A nice overview of Emacs capabilities has been given by Dirk Eddelbuettel in his answer to the following question on SO.

In the following we will describe one possible way of working with ESS, oriented towards users who have minimal experience with Emacs. First, you will need to learn how to perform basic text operations with Emacs. Since Emacs is very sophisticated, finding out how to simply select and copy text might be a challenge, so invest some time in learning to do that efficiently. Reading the manual might help. For Mac OS X, you can use Aquamacs, which supports native shortcuts, among other things.

Working with ESS does not differ from working with R, and the same rules for organizing code apply. We suggest using a separate directory for every project, residing in a parent directory called, for example, R, placed somewhere easily accessible. For Unix-type OSes (Linux and Mac OS X) this would be your home directory. For Windows, I recommend pointing the Emacs home directory to the directory where all your source resides; this might involve setting some environment variables on older versions of Windows.

As you can see, the initial investment might seem a bit daunting. But do not worry: the rewards, as with R, are great.

To start ESS, simply start Emacs and press M-x R. You should see something like this:

Start ESS with Emacs

This is how the Emacs window looks on Mac OS X using the Aquamacs Emacs distribution; you should see something similar with other Emacs distributions. One of the defining features of Emacs is its mini-buffer: the line at the bottom of the window, which here contains M-x R. Every shortcut you press is reflected in this mini-buffer. If Emacs is stuck and does not respond, look at the mini-buffer first; chances are that you inadvertently used some keyboard shortcut to invoke an Emacs command and it is now waiting for further input. The way out is to quit the current command in the mini-buffer with the shortcut C-g. This will probably be the most-used shortcut for first-time Emacs users.

After typing M-x R, press Enter. You should get the following prompt:

Select R directory

Select the directory (you can use Tab to complete long directory names) and press Enter. What you get is similar to usual R prompt:

R start

The improvement over the usual prompt is that it is also a simple text file containing everything R produced. So, for example, you can scroll to a previous command, put the cursor on it, press Enter, and it will be reproduced.

After starting the R process, we suggest dividing Emacs into two windows (in Emacs terminology). Then on the left you can have your source code, which will be sent to the R process on the right. The relevant shortcuts are C-x 3 for splitting windows vertically, C-x 1 for making the current buffer the only window, and C-x 2 for splitting windows horizontally. Here is how it looks after C-x 3:

Sample R ESS workflow

When sending code to R, it is advisable to keep a distinction between function definitions and other R statements. You can do this by keeping all your functions in one file, called 10code.R for example. Then you can simply load this file using the “load ESS file” option (shortcut C-c C-l). The advantage of this approach is that it sources all the functions and produces nothing in the R buffer. If there is an error in your code, ESS shows a message in the minibuffer and you can investigate it by pressing C-c `.

The other pieces of code are R statements, which should be kept self-explanatory: load data, clean data, fit a statistical model, inspect the results, produce the final output. The source code for these statements should reflect the current status of the project; the intention is that after your project is finished, sourcing those files should reproduce the entire project. You can use git or another version control system for tracking history, since Emacs has good support for working with them.

When working with R files, you can work with one R statement at a time, sending it to the R process via the “Eval function, paragraph, statement” command, aka C-c C-c. This command sends the current paragraph — i.e., the text delimited by empty lines — to the active R process. This is handy since you can group R statements into tasks and send a whole task to the R process at once, without having to select any text. C-c C-c also moves the cursor to the R window, so you can immediately inspect the output.

In sum, the basic workflow with R and ESS involves moving a lot between windows and buffers. To facilitate this, you can use the following shortcuts, which should be put in your .emacs file:

(define-key global-map [f1] 'Control-X-prefix)
(define-key global-map [f3] 'find-file)
(define-key global-map [f2] 'save-buffer)
(define-key global-map [f8] 'kill-buffer)
(define-key global-map [f5] 'switch-to-buffer)
(define-key global-map [f6] 'other-window)
(define-key global-map [f9] 'ess-load-file)

Other specific ESS settings you can use are the following:

(setq comint-input-ring-size 1000)
(setq ess-indent-level 4)
(setq ess-arg-function-offset 4)
(setq ess-else-offset 4)

This makes the indentation 4 characters wide (the default is 2), which is a personal preference for some, and increases the number of issued commands ESS saves in the history (comint-input-ring-size).

For working with R process directly, the following shortcuts can be very useful:

(add-hook 'inferior-ess-mode-hook
    '(lambda nil
          (define-key inferior-ess-mode-map [\C-up]
              'comint-previous-matching-input-from-input)
          (define-key inferior-ess-mode-map [\C-down]
              'comint-next-matching-input-from-input)
          (define-key inferior-ess-mode-map [\C-x \t]
              'comint-dynamic-complete-filename)
     )
 )

This recalls an R statement from your history, trying to match it against what is already on your line. So, for example, typing pl in the R process and pressing C-up (that’s Control and the up arrow) will cycle through all the statements starting with pl — recalling, for example, all the plot(... commands.

Another setting which you might find useful is:

 (setq ess-ask-about-transfile t)

This way ESS always asks where to save a transcript of the R process buffer. You can number these files by date, so you will always have another way to track what exactly you were doing. The only caveat of this option is that, for some reason, ESS then sets the R buffer to read-only after starting R. The shortcut for making a buffer writable is C-x C-q.

So this is one way of working with ESS and R. Other ways are possible; Emacs and ESS are very customizable, so you can set them up the way you like. You will soon learn that using Emacs to interact with R overcomes the limitations of juggling several task-oriented tools, while letting you document and version your code and Sweave or TeXify R output with ease. The following screencast shows a live Emacs session where the use of R is interleaved with various Emacs tools.

A short intro to Emacs+ESS on Vimeo

Challenge alert — material identification

2011-07-28

We start yet another series of posts — challenge alerts. This series is intended to share news about machine learning or data mining challenges which may be interesting to members of our community, possibly with a brief introduction to the problem. So if you hear about some contest, notify us on Skewed distribution.

Today, the recent event on TunedIt, where FIND Technologies Inc. asks participants to develop a method to distinguish various materials based on the passive electromagnetic signals they produce. They supply the participants with 3000 time series of 1500 samples each, every one corresponding to a measurement of the electric potential on the surface of one of three materials. Here is a plot of one of the given time series:

Sample signal

Sample signal, zoomed on the right panel.

Half of this set is annotated with a material class and given as a training set; the rest is a test set on which classes must be predicted. This is a `rolling’ challenge, i.e. participants can submit many predictions at any time and their results on a preliminary test set (different from the test set used to finally assess their accuracy) are instantly published. Unfortunately, the organizers have chosen the preliminary set from among the training samples, so overfitted submissions can reach arbitrarily high accuracy on the leaderboard. In fact this has already happened, so the real progress remains unknown. After registration, one can download a preliminary report which reveals some technical details about the problem. It also claims that circa 70% accuracy can be obtained in separating each pair of classes using a linear learner on wavelet spectra.

The main downside of the challenge is that it is quite frequently regarded as a scam, especially because there is no way to replicate the results from the preliminary report (and the method described therein fails on the challenge data) — more details can be found in the challenge thread on the TunedIt forum. Anyway, no one has broken the first, 50% accuracy, milestone so far.

The upside is that there are prizes: C$1,000 for breaking each of the 50, 60, 70, 80 and 90% milestones, and C$40,000 for reaching the final goal of 95% accuracy and transferring the intellectual property rights to FIND.

So, good luck — or have a nice time doing more productive things (=

Filed under Challenge alert

Two-way CRAN

2011-07-26

Sooner or later, every useR will manage to exhaust R’s built-in capabilities and land on CRAN looking for their dream needle in a haystack of 3k+ contributed packages. Probably most of you already know tools like Task Views or rseek which make finding something relevant a bit easier than digging through the full list or googling; however, all methods will eventually lead to a CRAN package page like this:

CRAN page for rgl package

Sample CRAN package page

Ok, but what’s the problem? We have the basic info here: sources, manuals, builds, dependencies… Well, compare this to modern application repositories like the Android Market, Mozilla Add-ons or the App Store; we immediately notice the lack of any form of user feedback, neither ratings nor reviews. And such stuff can be quite handy — imagine you have found several packages doing a certain thing; surely one of them will be the fastest, one the least bloated, one the most functional and one will integrate best with the rest of your code, but you probably won’t be able to deduce this easily from the manuals. Or you’ve found a package from 2003 and you don’t really want to check whether it is abandoned code dependent on a bunch of obsolete quirks or just solid code that simply hasn’t needed any modifications. Or you have been using foozifier for years, not knowing that bartools’ functionWithConfusingName does the same thing 50 times faster using 1/1000 of the RAM. Or you just thought you couldn’t baz in R, when the only problem was that the bazzing package author thought it was called wizzing.

Moreover, this is also useful for package authors — a comment is a much easier and more reusable way of leaving feedback than e-mail, so you can count on more reports, and thus catch more bugs and gather more good feature ideas.

What makes this story worse is that such a system is more or less already here; it is called Crantastic and works pretty well, except that it would certainly benefit from more traffic — so go there, check whether you are a registered user of all the packages you’ve installed, and start contributing; it really comes back!

Filed under R tips&tricks

Welcome to the CV blog!

2011-07-25

It is almost a year since CrossValidated was launched. Today we start a new activity at CrossValidated — a community blog. It is the fourth place (after the main site, meta and chat) for getting in touch with the community and contributing to it.

To get started, we plan to post series of posts about the following topics:

  • Question of the Week Each week there will be a survey on meta in which users will nominate and elect a recent splendid question — then we’ll ask the author of the best answer either to elaborate a bit, write a summary of the whole thread, start a miniseries of posts inspired by the QoW, or do something else in this manner.
  • Journal Club report A summary of each JC chat. While the JC is currently on summer break, you can expect some summaries of archived chats.
  • R tips&tricks We cannot ignore our most popular tag. Hopefully we will manage to attract some people from the R community on SO to participate here.
  • Challenge alert Announcements (with some introduction) of machine learning or data mining challenges.
Beyond those series, this place is devoted to general posts about data science as a whole: news, novel methods, algorithms and plots, good and bad practices, interesting data or problems — you name it.

Finally, the most important thing: remember that this is your place. Anyone can write a post here — just visit Skewed distribution, the blog’s chat room, and ask for an author account, or maybe just suggest a topic or share your critique. We are open to any idea and we hope we’ll manage to make this a useful and lively place for the community.

Filed under CrossValidated