bigKRLS: Optimizing non-parametric regression in R
Data scientists are increasingly interested in modeling techniques that involve relatively few parametric assumptions, particularly when analyzing large or complex datasets. Though many approaches have been proposed for this situation, Hainmueller and Hazlett's (2014) Kernel-regularized Least Squares (KRLS) offers statistical and interpretive properties that are attractive for theory development and testing. KRLS allows researchers to estimate the average marginal effect (the slope) of an explanatory variable but, unlike parametric regression techniques (whether classical or Bayesian), does not require researchers to know the functional form of the data generating process in advance. In conjunction with Tikhonov regularization (which prevents overfitting), KRLS offers researchers the ability to investigate heterogeneous causal effects in a reasonably robust fashion. Further, KRLS estimates offer researchers several avenues to investigate how those effects depend on other observable explanatory variables. We introduce bigKRLS, which markedly improves memory management over the existing R package; this is key because RAM usage is proportional to the square of the number of observations. In addition, we allow users to parallelize key routines (with the snow library) and to shift matrix algebra operations to a distributed platform if desired (with bigmemory and bigalgebra). As an example, we estimate a model from a voter turnout experiment. The results show how the effects of a randomized treatment (here, a get-out-the-vote message) depend on other variables. Finally, we briefly discuss which post-estimation quantities of interest will help users determine whether they have a sufficiently large sample size for the asymptotics on which KRLS relies.
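To make the quadratic memory cost and the notion of an average marginal effect concrete, the following is a minimal base-R sketch of the general KRLS idea. It is not the bigKRLS implementation and does not use the bigKRLS or KRLS APIs. The simulated data, the helper name `kernel_gib`, the hand-picked `lambda`, and the choice of bandwidth equal to the number of predictors are all assumptions made for illustration; in practice the regularization parameter would be chosen by a data-driven criterion rather than fixed by hand.

```r
# Illustrative sketch only: a dense Gaussian-kernel regularized least squares
# fit in base R. All names and settings below are hypothetical.

# Approximate size (in GiB) of a dense N x N kernel matrix of doubles.
# Storage grows with the square of the number of observations.
kernel_gib <- function(n) 8 * n^2 / 1024^3

set.seed(1)
N <- 200
X <- cbind(x1 = rnorm(N), x2 = rnorm(N))
y <- sin(X[, "x1"]) + 0.5 * X[, "x2"] + rnorm(N, sd = 0.1)

Xs  <- scale(X)          # standardize predictors
y_c <- y - mean(y)       # center the outcome

# Gaussian kernel; bandwidth set to the number of predictors (assumption).
sigma2 <- ncol(Xs)
K <- exp(-as.matrix(dist(Xs))^2 / sigma2)    # N x N matrix

# Tikhonov-regularized solution: coefs = (K + lambda * I)^{-1} y
lambda <- 0.5                                 # fixed by hand (assumption)
coefs  <- solve(K + lambda * diag(N), y_c)
yhat   <- as.vector(K %*% coefs) + mean(y)

# Pointwise derivative of the fitted surface at each observation with
# respect to each (standardized) predictor.
derivs <- sapply(seq_len(ncol(Xs)), function(d) {
  diffs <- outer(Xs[, d], Xs[, d], "-")       # diffs[j, i] = x_j - x_i
  as.vector((-2 / sigma2) * (K * diffs) %*% coefs)
})
colnames(derivs) <- colnames(X)

# Average marginal effects: column means of the pointwise derivatives.
ame <- colMeans(derivs)
```

The pointwise derivatives in `derivs` vary across observations, which is what allows heterogeneous effects to be examined against other covariates, while `kernel_gib(N)` shows how quickly a dense kernel matrix outgrows available RAM as N increases.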