How to keep your R code simple while tackling big datasets
Like many statistical analysis tools, R can be very memory intensive. A simple GAM (generalized additive model) or k-nearest-neighbor routine can consume many times the memory of the starting dataset, and R doesn't always behave nicely when it runs out of memory.
There are techniques for getting around memory limitations, such as partitioning the data or sampling it down, but they require extra work. It would be much nicer to run elegantly simple R analytics without that hassle.
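To make the "extra work" concrete, here is a minimal sketch in base R of both workarounds next to the single call they replace. The `claims` data frame, its columns, and the coefficient-averaging step are hypothetical illustrations, not material from the talk; averaging per-partition coefficients is only an approximation of the full-data fit.

```r
# Hypothetical stand-in for a dataset too large to model comfortably in memory.
set.seed(42)
n <- 100000
claims <- data.frame(
  age  = runif(n, 18, 90),
  cost = rexp(n, rate = 1 / 500)
)

# Workaround 1: sample down to a tenth of the rows before fitting.
sample_rows <- sample(nrow(claims), size = nrow(claims) %/% 10)
fit_sampled <- glm(cost ~ age, data = claims[sample_rows, ])

# Workaround 2: partition the rows, fit each piece separately,
# then combine the per-partition coefficients by averaging (an approximation).
partition_id <- rep(1:10, length.out = nrow(claims))
partitions <- split(claims, partition_id)
coefs <- sapply(partitions, function(part) coef(glm(cost ~ age, data = part)))
fit_partitioned <- rowMeans(coefs)

# The simple version: one call on the full data, which needs enough RAM.
fit_full <- glm(cost ~ age, data = claims)
```

Both workarounds add bookkeeping code and give answers that differ somewhat from the full-data fit, which is the hassle the last line avoids when memory allows.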
Using a very large public dataset from CMS.gov, Chuck will show GAM, GLM, decision tree, random forest, and k-nearest-neighbor routines that were prototyped and run on a laptop, then run unchanged against the entire dataset on a single, simple Linux instance with over a terabyte of RAM. This big computer is actually a collection of smaller off-the-shelf servers that TidalScale combines into a single virtual server with several terabytes of RAM.