Size of Datasets for Analytics and Implications for R
With so much hype about "big data" and the industry pushing distributed computing over traditional single-machine tools, one wonders about the future of R. In this talk I will argue that most data analysts/data scientists do not actually work with big data the majority of the time, and that using immature "big data" tools is therefore counterproductive. I will show that, contrary to widespread belief, the growth of dataset sizes used for analytics has actually been outpaced over the last 10 years by the growth in memory (RAM), making single-machine tools ever more attractive. Furthermore, base R and several widely used R packages have undergone significant performance improvements (I will present benchmarks to quantify this), making R an ideal tool for data analysis even on relatively large datasets. In particular, R has access (via CRAN packages) to excellent high-performance machine learning libraries (benchmarks will be presented), and high-performance and parallel computing facilities have been part of the R ecosystem for many years. Nevertheless, the R community should of course continue pushing the boundaries and extending R with new and ever more performant features.