Big Data, Multivariate, Window based Feature Engineering using Microsoft R Server

IoT data is characterized by long time series recorded from multiple sensors at the same time. Such sensors are used to monitor intricate systems, and machine learning can be used to understand complex patterns in the sensor data (called features) and their association with events of interest, such as failures or abnormal system behaviors (prediction labels).

Because recordings span long periods of time and IoT information is rich and complex, the important aspect to capture for prediction is the multivariate relationship between different sensors, rather than the statistics of isolated sensors. Multivariate, time-window-based feature engineering is a fundamental step in building advanced predictive modeling solutions for IoT systems, but it is difficult to implement in Big Data cases with standard row-focused Map/Reduce tools like Hive.

Here we discuss how Microsoft R Server (MRS) can be used to perform several such complex multivariate feature engineering methods for big data. The approach employs a reusable software engineering pattern that exposes the content of the current window for custom processing and also allows communication between different data chunks as needed.
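The pattern described above can be illustrated outside of MRS. The following is a minimal sketch in plain Python, not the MRS API and not the code used in the experiment; the names `windowed_features`, `featurize`, and `carry` are hypothetical. It assumes the data arrives as an ordered sequence of chunks and shows the two ideas named in the text: a user-supplied function receives the content of the current window, and a small piece of state is handed from one chunk to the next so that windows can span chunk boundaries.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple

# One multivariate sample: (sensor_1, sensor_2, ...). Hypothetical layout.
Row = Tuple[float, ...]


def windowed_features(
    chunks: Iterable[Sequence[Row]],
    window: int,
    featurize: Callable[[List[Row]], float],
) -> Iterator[List[float]]:
    """Yield one list of features per chunk, with windows spanning chunk edges."""
    # State passed from chunk to chunk: the last (window - 1) rows seen so far.
    carry: deque = deque(maxlen=window - 1)
    for chunk in chunks:
        # Prepend the carried rows so the first windows of this chunk are complete.
        rows = list(carry) + list(chunk)
        features = []
        for end in range(window, len(rows) + 1):
            # Expose the content of the current window to the custom function.
            features.append(featurize(rows[end - window:end]))
        yield features
        # Keep only the tail needed by the next chunk.
        carry.extend(chunk)
```

Under these assumptions, the chunked result matches what a single pass over the unchunked data would produce, because every window that straddles a boundary is rebuilt from the carried rows. Each chunk must be processed in order for this to hold.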

As an example, this flexible framework is then used to preprocess signals by applying window-based operators that extract statistics of one signal, such as local minima, maxima, and quantiles, conditioned on the amplitude of a second, binary signal. Such multivariate, time-window-based features provide a more complex view of the data than global univariate statistics computed with standard Hive queries. This generic processing pattern for IoT data can be extended in a straightforward way to multiple continuous-amplitude signals that are relevant to specific individual problems.
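To make the conditioned statistics concrete, here is a minimal sketch in plain Python of what a per-window operator of this kind might compute. It is an illustration under stated assumptions, not the MRS implementation: the function name `conditioned_window_stats` is hypothetical, each sample is assumed to be a `(value, flag)` pair where `flag` is the binary signal, "local" minimum and maximum are taken to mean the extremes within the window, and quartiles are chosen as the example quantiles.

```python
import statistics
from typing import Dict, Optional, Sequence, Tuple


def conditioned_window_stats(
    window: Sequence[Tuple[float, int]],
) -> Optional[Dict[str, float]]:
    """Summarize the continuous signal over samples where the binary signal is 1."""
    # Keep only the values recorded while the binary signal was on.
    selected = [value for value, flag in window if flag == 1]
    if len(selected) < 2:
        # Too few samples in this window to estimate quantiles.
        return None
    # Quartiles by linear interpolation over the selected values.
    q25, median, q75 = statistics.quantiles(selected, n=4, method="inclusive")
    return {
        "min": min(selected),
        "max": max(selected),
        "q25": q25,
        "median": median,
        "q75": q75,
    }
```

A function like this could be applied to each window of a two-signal stream, yielding one feature record per window. Windows in which the binary signal is rarely on return `None` here; how to handle them (skip, impute, or flag) is a modeling choice the source does not specify.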

The code used to create this experiment will soon be available in a GitHub repository.