Using OpenMP-ized C code with R

2011-08-11 by . 3 comments

Post to Twitter

What is OpenMP? Basically a standard compiler extension allowing one to easily distribute calculations over multiple processors in a shared-memory manner (this is especially important when dealing with large data — simple separate-process approach usually requires as many copies of the working data as there are threads, and this may easily be an overkill even in overall size, not to mention the time wasted for copying).

The magic of OpenMP is that once you have a C or Fortran code, in most cases you need nothing more than a few additional compiler flags — thus the code remains as portable and as readable as before the modification. And is usually just nice and simple, not counting few common parallelism traps and some quirks related to the fact we want it to work with R.

In this post I don’t want to make an OMP tutorial (the web is full of them), rather show how to use it with R. Thus, I’ll use a toy example: a function that calculates the cumulative sum in an unnecessary demanding way:

 #include <R.h>
 #include <Rinternals.h>
 SEXP dumbCumsum(SEXP a){
  SEXP ans;
  PROTECT(a=coerceVector(a,REALSXP));
  PROTECT(ans=allocVector(REALSXP,length(a)));
  double* Ans=REAL(ans);
  double* A=REAL(a);
  for(int e=0;e<length(a);e++){
   Ans[e]=0.;
   for(int ee=0;ee<e+1;ee++)
    Ans[e]+=A[ee];
  }
  UNPROTECT(2);
  return(ans);
 }

There is only one for loop responsible for most computational time and no race conditions, thus the OMP-ized version will look like this:

 #include <R.h>
 #include <Rinternals.h>
 #include <omp.h>
 SEXP dumbCumsum(SEXP a){
  SEXP ans;
  PROTECT(a=coerceVector(a,REALSXP));
  PROTECT(ans=allocVector(REALSXP,length(a)));
  double* Ans=REAL(ans);
  double* A=REAL(a);
  #pragma omp parallel for
  for(int e=0;e<length(a);e++){
   Ans[e]=0.;
   for(int ee=0;ee<e+1;ee++)
    Ans[e]+=A[ee];
  }
  UNPROTECT(2);
  return(ans);
 }

Time for R-specific improvements; first of all, it is good to give the user an option to select number of cores to use (for instance he has 16 cores and want to use first 8 for one job and next 8 for something else — without such option he would have to stick to sequential execution); yet it is also nice to have some simple option to use the full capabilities of the system. To this end we will give our function an appropriate argument and use OMP functions to comply with it:

 #include <R.h>
 #include <Rinternals.h>
 #include <omp.h>
 SEXP dumbCumsum(SEXP a,SEXP reqCores){
  //Set the number of threads
  PROTECT(reqCores=coerceVector(reqCores,INTSXP));
  int useCores=INTEGER(reqCores)[0];
  int haveCores=omp_get_num_procs();
  if(useCores<=0 || useCores>haveCores) useCores=haveCores;
  omp_set_num_threads(useCores);
  //Do the job
  SEXP ans;
  PROTECT(a=coerceVector(a,REALSXP));
  PROTECT(ans=allocVector(REALSXP,length(a)));
  double* Ans=REAL(ans);
  double* A=REAL(a);
  #pragma omp parallel for
  for(int e=0;e<length(a);e++){
   Ans[e]=0.;
   for(int ee=0;ee<e+1;ee++)
    Ans[e]+=A[ee];
  }
  UNPROTECT(3);
  return(ans);
 }

This code will also ensure that the number of threads won’t be larger than the number of physical cores; doing this gives no speedup and comes with a performance loss caused by OS scheduler.

Finally, time to resolve small quirk — R has some code to guard the C call stack from overflows, which is obviously not thread-aware and thus have a tendency to panic and screw the whole R session up when running parallel code. To this end we need to disable it using the trick featured in R-ext. First, we include Rinterface to have an access to the variable with stack limit

 #define CSTACK_DEFNS 7
 #include "Rinterface.h"

and then set it to almost infinity in the code

 R_CStackLimit=(uintptr_t)-1;

Voilà, the stack is now unprotected — the work with R just become a bit more dangerous, but we can run parallel stuff without strange problems. The full code looks like this:

 #include <R.h>
 #include <Rinternals.h>
 #include <omp.h>
 #define CSTACK_DEFNS 7
 #include "Rinterface.h"
 SEXP dumbCumsum(SEXP a,SEXP reqCores){
  R_CStackLimit=(uintptr_t)-1;
  //Set the number of threads
  PROTECT(reqCores=coerceVector(reqCores,INTSXP));
  int useCores=INTEGER(reqCores)[0];
  int haveCores=omp_get_num_procs();
  if(useCores<=0 || useCores>haveCores) useCores=haveCores;
  omp_set_num_threads(useCores);
  //Do the job
  SEXP ans;
  PROTECT(a=coerceVector(a,REALSXP));
  PROTECT(ans=allocVector(REALSXP,length(a)));
  double* Ans=REAL(ans);
  double* A=REAL(a);
  #pragma omp parallel for
  for(int e=0;e<length(a);e++){
   Ans[e]=0.;
   for(int ee=0;ee<e+1;ee++)
    Ans[e]+=A[ee];
  }
  UNPROTECT(3);
  return(ans);
 }

Now, time to make sure that R will compile our function with OMP support (and thus make it parallel). To this end, we create a Makevars file (in the src in case of package and in code directory when using dangling object files) with a following contents (for GCC):

 PKG_CFLAGS=-fopenmp
 PKG_LIBS=-lgomp
 

The first line will trigger parsing OMP pragmas, the latter will link the OMP library with omp_* functions.

We are ready to test our example:

 $ R CMD SHLIB omp_sample.c
 $ R
 > dyn.load('omp_sample.so')
 > .Call('dumbCumsum',runif(100000),0L)

Try to run sum system monitor (like htop or GUI one that comes with your desktop environment) and watch your powerful CPU being finally fully utilized (-;

To close with an optimistic aspect, few words about limitations. Don’t even try to run any R code or use features like random number generation or Ralloc inside parallelized blocks — R engine is not thread-safe and thus this will end in a more or less spectacular failure.

Plus of course all issues of parallel programming and OMP itself also apply — but that is a different story.

Filed under R tips&tricks

3 Comments

Subscribe to comments with RSS.

  • Thanks for trying to poopularize OpenMP with R! I have however a few key points I would like to take up with you: a) the example is bad: your algo for cumsum is quadratic in N (two loops) whereas a standard approach is linear; b) your use of the R API is pretty slow as you have slots of macro/function call overhead (and an Rcpp based-approach is easier, shorter and faster; c) you could have use inline to make the snippets more easily useable from R. The only merit really is that your OpenMP variants outperform the non-OpenMP ones — but then you can beat either pretty easily still. I may post a follow-up on my blog.

    • says:

      As I wrote it is intentional; I was looking for a simple example that would do something easy to follow and consume a lot of CPU, just not to overwhelm the compilation/core number selection/memory part. This is also why I haven’t used Rcpp.

      But I agree that it was an error to show this overuse of REAL and do not criticize it, so I have fixed that.

      Oh, and feel free to post the follow-up here; it is a community blog after all. Just visit our blog editors’ chat room.

  • Selva Prabhakaran says:

    Great post! Thank you so much for sharing..

    For those who want to learn R Programming, here is a great new course on youtube for beginners and Data Science aspirants. The content is great and the videos are short and crisp. New ones are getting added, so I suggest to subscribe.

    https://www.youtube.com/watch?v=BGWVASxyow8&list=PLFAYD0dt5xCzTQHDhMPZwBoaAXWeVhZzg&index=19

  • Leave a comment

    Log in
    with Stack Exchange
    or