Exploring the R / SQL boundary
Databases have a long history of delivering highly scalable solutions for storing, manipulating, and analyzing data, transaction processing and data warehousing, while R is the most widely used language for data analytics and machine learning due to its rich ecosystem of machine learning algorithms and data manipulation capabilities. But, when using these tools together, how do you decide how much processing to do in SQL before switching to R? In this talk, we will explore setting the R / SQL boundary under three scenarios: RODBC connections, dplyr data extractions, and in-database R processing, and examine the consequences of each of these approaches with respect to data exploration, feature engineering, modeling and predictions. We identify common performance killers such as excessive data movements and serial processing, and illustrate the techniques, with examples from both an open source database (Postgres) and a commercial database (Microsoft SQL Server).