Distributed Computing using parallel, Distributed R, and SparkR
Data volume is ever increasing, while single node performance is stagnate. To scale, analysts need to distribute computations. R has built-in support for parallel computing, and third-party contributions, such as Distributed R and SparkR, enable distributed analysis. However, analyzing large data in R remains a challenge, because interfaces to distributed computing environments, like Spark, are low-level and non-idiomatic. The user is effectively coding for the underlying system, instead of writing natural and familiar R code that produces the same result across computing environments. This talk focuses on how to scale R-based analyses across multiple cores and to leverage distributed machine learning frameworks through the ddR (Distributed Data structures in R) package, a convenient, familiar, and idiomatic abstraction that helps to ensure portability and reproducibility of analyses. The ddR package defines a framework for implementing interfaces to distributed environments behind the canonical base R API. We will discuss key programming concepts and demonstrate writing simple machine learning applications. Participants will learn about creating parallel applications from scratch as well as invoking existing parallel implementations of popular algorithms, like random forest and kmeans clustering.