FlashR: Enable Parallel, Scalable Data Analysis in R
In the era of big data, R is rapidly becoming one of the most popular tools for data analysis. But the R framework is relatively slow and unable to scale to large datasets. The general approach to speeding up an implementation in R is to implement the algorithms in C or FORTRAN and provide an R wrapper. Many works parallelize R and scale it to large datasets. For example, Revolution R Open parallelizes a limited set of matrix operations individually, which limits its performance. Others, such as Rmpi and R-Hadoop, expose a low-level programming interface to R users and require more explicit parallelization. It is challenging to provide a framework that has a high-level programming interface while achieving efficiency. FlashR is a matrix-oriented R programming framework that supports automatic parallelization and out-of-core execution for large datasets. FlashR reimplements matrix operations in the R base package and provides some generalized matrix operations to improve expressiveness. FlashR automatically fuses matrix operations to reduce data movement between CPU and disks. We implement machine learning algorithms such as k-means and Gaussian mixture models (GMM) in FlashR to benchmark its performance. On a large parallel machine, both in-memory and out-of-core execution of these R implementations in FlashR significantly outperform the ones in Spark MLlib. We believe FlashR significantly lowers the expertise required for writing parallel and scalable implementations of machine learning algorithms and provides new opportunities for large-scale machine learning in R. FlashR is implemented as an R package and is released as open source (http://flashx.io/).
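
The idea of fusing operations to reduce data movement can be illustrated with a small sketch. It is written in Python purely for illustration and is not FlashR's API or implementation: the names `ChunkedSource` and `LazyVector` are hypothetical, and the "disk" is simulated by a list read in fixed-size chunks. Element-wise operations are recorded rather than executed, and the final reduction applies all of them in a single pass, so each chunk is read once no matter how many operations are chained.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Chunk = List[float]


class ChunkedSource:
    """Hypothetical stand-in for data on disk: serves values chunk by chunk
    and counts how many chunks have been read."""

    def __init__(self, data: Sequence[float], chunk_size: int) -> None:
        self.data = list(data)
        self.chunk_size = chunk_size
        self.chunks_read = 0

    def read_chunks(self) -> Iterator[Chunk]:
        for start in range(0, len(self.data), self.chunk_size):
            self.chunks_read += 1
            yield self.data[start:start + self.chunk_size]


class LazyVector:
    """Hypothetical lazy vector: map() only records an operation;
    sum() runs every recorded operation in one pass over the data."""

    def __init__(
        self,
        read_chunks: Callable[[], Iterator[Chunk]],
        ops: Tuple[Callable[[float], float], ...] = (),
    ) -> None:
        self._read_chunks = read_chunks
        self._ops = ops

    def map(self, fn: Callable[[float], float]) -> "LazyVector":
        # No data is touched here; the operation is appended to the plan.
        return LazyVector(self._read_chunks, self._ops + (fn,))

    def sum(self) -> float:
        total = 0.0
        for chunk in self._read_chunks():
            for x in chunk:
                # Apply all recorded operations while the value is in memory.
                for fn in self._ops:
                    x = fn(x)
                total += x
        return total


# Sum of squared deviations from 3.5 over the values 1..6, stored in 3 chunks.
source = ChunkedSource([1, 2, 3, 4, 5, 6], chunk_size=2)
vector = LazyVector(source.read_chunks)
result = vector.map(lambda x: x - 3.5).map(lambda x: x * x).sum()

print(result)              # 17.5
print(source.chunks_read)  # 3: one pass, although two operations were chained
```

Evaluating each operation eagerly would instead read the input and write an intermediate vector for every step, which is the data movement that fusion is meant to avoid.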