Big Data @ Microsoft
Until recently, data was gathered for well-defined objectives such as auditing, forensics, reporting and line-of-business (LOB) operations; now, exploratory and predictive analysis are ubiquitous, and the default is to capture and store any and all data in anticipation of potential future value. Differences in data heterogeneity, scale and usage are leading to a new generation of data management and analytic systems, where the emphasis is on supporting a wide range of large datasets that are stored uniformly and analyzed seamlessly using whatever techniques are most appropriate, including traditional tools such as SQL and BI and newer tools for, e.g., ML and stream analytics. These new systems are necessarily based on scale-out architectures for both storage and computation. Hadoop has become a key building block in the new generation of scale-out systems. On the storage side, HDFS has provided a cost-effective and scalable substrate for storing large heterogeneous datasets. However, as key customer and system touch points are instrumented to log data and IoT apps become common, enterprise data is growing at a staggering pace, and the need to leverage different storage tiers (from tape to memory) poses new challenges, leading to caching technologies such as Spark. On the analytics side, resource managers such as YARN have opened the door for analytics tools to bypass MapReduce and directly exploit shared system resources while computing close to data copies. This trend is especially significant for iterative computations such as graph analytics and ML, for which MapReduce is seen as a poor fit. Although Hadoop is widely recognized and used externally, Microsoft has long been at the forefront of Big Data analytics, with Cosmos and Scope supporting all internal customers. These internal services are a key part of our strategy going forward and now enable new state-of-the-art external services such as Azure Data Lake. I will examine these trends and ground the talk by discussing the Microsoft Big Data stack.