Efficient in-memory non-equi joins using data.table

Description

A join operation combines two (or more) tables on shared columns according to a condition. An equi-join is the case where that condition is defined by the binary operator $==$. It is a special case of the $\theta$-join, whose condition may use any of the binary operators {<, <=, ==, >=, >}. This talk presents recent developments in the data.table package that extend its equi-join functionality to any or all of these binary operators very efficiently. For example, X[Y, on = .(X.a >= Y.a, X.b < Y.a)] performs a range join. Many databases are fully capable of performing both equi- and non-equi joins. The R/Bioconductor packages IRanges and GenomicRanges contain efficient implementations for dealing with interval ranges alone. However, so far there are no direct in-memory R implementations of non-equi joins that we are aware of. We believe this is an extremely useful feature that many R users can benefit from.
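To make the idea concrete, here is a minimal sketch of a non-equi join in data.table. The tables, column names (id, start, end, pos) and values are hypothetical and chosen only for illustration. The sketch writes the conditions with bare column names in on=, where the left-hand name refers to a column of X and the right-hand name to a column of Y. This differs slightly from the X.a / Y.a notation used in the abstract above.

```r
library(data.table)

# Hypothetical toy data: X holds closed intervals [start, end], Y holds points.
X <- data.table(id    = c("x1", "x2", "x3"),
                start = c(1L, 5L, 10L),
                end   = c(4L, 8L, 12L))
Y <- data.table(pos = c(3L, 6L, 9L))

# For each row of Y, find the rows of X whose interval contains pos,
# i.e. start <= pos AND end >= pos. In each on= condition the left-hand
# name is a column of X and the right-hand name is a column of Y.
# nomatch = NULL drops rows of Y that fall in no interval.
res <- X[Y, on = .(start <= pos, end >= pos), nomatch = NULL,
         .(id, pos = i.pos)]

print(res)
```

The j expression selects the columns explicitly (with i.pos naming the pos column of Y), so the result lists each matched interval id next to the point that fell inside it. With this toy data, the point 9 lies in no interval and is dropped.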
