Building a community fountain around your data stream
The increasing availability of real-time data sources and the Internet of Things movement have pushed data analysis pipelines towards stream processing. But what does this really mean for my applications, and how do I have to change my code and workflow? In a new era of “Kappa architecture,” it’s easier than ever to use the same programming model for both batch and stream processing. For those interested in the design and operations side, I will cover high-level design considerations for architecting a modular and scalable stream processing infrastructure that can support the flexibility of different use cases and can welcome a community of users who are more familiar with batch processing. For the fast-batching Pythonistas, I’ll talk about some of the advantages of using streaming tech in a data processing pipeline and how to make your life easier with 1) built-in replication, scalability, and stream “rewind” for data distribution with Kafka, 2) structured messages with strictly enforced schemas and dynamic typing for fast parsing with Avro, and 3) a stream processing interface that is similar to batch with Spark that you can even use in a Jupyter notebook. When you’re ready to jump into the stream, or at least take a drink from the fountain, I’ll point you to an open source, containerized (with Docker), streaming ecosystem testbed that you can deploy to mock a stream of data and take your streaming analytics on a dry run over an astronomical data stream.