IOT Analytics Architecture Whiteboard with David Crook



When working with IOT, one of the more common questions we get is this: "What is the typical architecture in IOT scenarios?" In this video, David Crook uses a whiteboard to diagram and discuss a very common architecture for dealing with IOT devices, and then addresses some questions the audience had at the end of the talk.



The Discussion

  •

    Great explanation. What is your suggested datastore? Would Azure Table storage be OK, or how should I store the files to be able to use them in Hadoop (or Spark)?

  •

    @Lars: wasb:// Blob storage is HDFS-compliant; however, for new products I would suggest Azure Data Lake, as it is also HDFS-compliant, has effectively unlimited storage capacity, and will support U-SQL and Azure Data Lake Analytics packages.
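The two HDFS-compliant path schemes mentioned in this reply can be sketched as follows; this is a minimal illustration, and the storage account, container, and directory names are hypothetical placeholders:

```python
# Hypothetical names for illustration only.
storage_account = "myiotaccount"
container = "telemetry"

# WASB path: Azure Blob storage exposed through the HDFS-compliant wasb:// driver,
# usable directly from Hadoop or Spark jobs.
wasb_path = f"wasb://{container}@{storage_account}.blob.core.windows.net/events/2016/"

# Azure Data Lake Store path: also HDFS-compliant, same consumption pattern.
adl_path = f"adl://{storage_account}.azuredatalakestore.net/events/2016/"

print(wasb_path)
print(adl_path)
```

Either path can then be handed to a Hadoop or Spark read call in place of an hdfs:// URI.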

  •

    OK, thanks a lot for your feedback. I really like this architecture. Do you have any customers doing data validation in the Stream Analytics part before saving data to storage? I mean, building a model of the devices the data comes from and then using machine learning to mark data as questionable? Would that be possible for very large amounts of data?

  •

    I don't see why not.  I'm interested to hear the use case.  One thing to note is that because I can do something doesn't mean I necessarily will.  For example, to generate an ML model on the fly in a stream means you have access to basically a windowed snapshot of the data, which is likely not very much data. You could theoretically bring in the historical stores as well, but then in my opinion you are defeating the purpose of Stream Analytics.

    I would generate an ML model from my historical stores first, then dynamically pull up that model from the stream and compare incoming objects to it.  I also do normalization of windowed objects (if necessary) in the stream.  You have to architect your ML algorithm fairly intelligently to use it in Stream Analytics, because to update the query itself you need to recycle the stream job.  You could theoretically stand up a second job, then shut down the first.  I haven't tried it, but it should work.

    As for the quantity of data that this will handle, you get up to 16 channels per hub, and here is the page for input:

    You can then have different stream jobs listening to one or many channels and, if necessary, nest them by having the output of one feed into the input of another.

    Sounds like a great session topic :) 
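The pattern described in this reply (fit a model offline on historical data, then load it in the stream job and score windowed events against it) can be sketched in Python. This is a hedged illustration only: the trivial mean/standard-deviation threshold stands in for a real ML model, and the readings are made up.

```python
import statistics
from collections import deque

# 1) Offline: derive model parameters from the historical store.
#    (Hypothetical sensor readings; a real pipeline would query the
#    historical store and train a proper model here.)
historical = [20.1, 19.8, 20.5, 21.0, 20.2, 19.9, 20.4, 20.0]
model = {
    "mean": statistics.mean(historical),
    "std": statistics.pstdev(historical),
}

def score(value, model, k=3.0):
    """Flag readings more than k standard deviations from the historical mean."""
    return abs(value - model["mean"]) > k * model["std"]

# 2) In the stream: keep only a small window of recent events (the
#    "windowed snapshot" mentioned above) and score each incoming event
#    against the pre-built model rather than retraining on the fly.
window = deque(maxlen=5)
incoming = [20.3, 20.1, 35.7, 19.8]  # 35.7 is an injected anomaly
flags = []
for reading in incoming:
    window.append(reading)
    flags.append(score(reading, model))

print(flags)  # the anomalous reading is marked True
```

Swapping in an updated model then means republishing the model artifact (or recycling the job), not retraining inside the window.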



  • Giuseppe Mascarella

    Great job in making it so simple and easy to remember.

  •

    great job...I really like this architecture...

  •

    Good one. Keep up the great work..

    I have a use case where I need to calculate the standard deviation of all incoming messages from devices. Where does this calculation logic fit in? Is it inside the Data Lake?

  •

    @Deepak: I think where you calculate it depends on your use case.  You can calculate it in the Data Lake on historical data and use it wherever you want if you like.  You can also calculate it in the data stream on a rolling window of values.  I think it depends on what you are calculating that deviation for, as well as where, when, and why you want to use it.

    Performance, of course, as well as the structure of the data streams, is also important to consider.  For example, if the solution has multiple hubs for the same devices/data, you may have to do it at the data lake or come up with a stream aggregation methodology to push everything in properly.  Remember, there are ways to reduce data at the first processor and still get the same answers at the aggregation processor, if speed/size are issues for you.

    I suppose the short answer is "it depends greatly," as there are so many different ways to solve the problem.
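    The "reduce data at the first processor and still get the same answers" idea can be sketched with mergeable partial aggregates: each stream job emits only (count, sum, sum of squares) per window, and a downstream aggregation job merges those partials and still recovers the exact standard deviation. The readings and the split across hubs below are hypothetical.

    ```python
    import math
    import statistics

    def partial(values):
        """Compact summary a first-stage processor would emit per window."""
        return (len(values), sum(values), sum(v * v for v in values))

    def merge(a, b):
        """Combine two partial summaries; associative, so order doesn't matter."""
        return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

    def stddev(agg):
        """Population standard deviation recovered from a merged summary."""
        n, s, sq = agg
        return math.sqrt(sq / n - (s / n) ** 2)

    hub_a = [20.1, 19.8, 20.5]   # readings seen by one stream job
    hub_b = [21.0, 20.2, 19.9]   # readings seen by another

    merged = merge(partial(hub_a), partial(hub_b))
    print(stddev(merged))
    # matches computing the deviation over all readings in one place:
    print(statistics.pstdev(hub_a + hub_b))
    ```

    Only three numbers per window cross the wire instead of every raw reading, which is why this helps when speed or size is the constraint.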
