Need some advice on Post-Processing large amounts of data.

Design and Architecture · Tags: c#, database, sql-server, sysadmin, question

me dagsunde com wrote (#1):
Scenario: We have (as of today; there will be more) 200 devices sending telemetry into a SQL Server, at a rate of 5 rows per device per second. Each row contains 6 decimal values to be processed into another dataset/table containing the min/max/avg of these values over 1-minute intervals. I have never worked with data lakes and similar tech, so I wonder: should I read up on storing the raw telemetry in a data lake and set up the post-processing there, or just go with my existing SQL Server and create a C# job that post-processes the data in the background myself? TIA...
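For reference, the "existing SQL Server + C# job" option can be as small as a worker that runs one set-based query per minute. A minimal sketch, assuming hypothetical Telemetry and TelemetryMinute tables with DeviceId, RecordedUtc, and Value1..Value6 columns (all names are illustrative, not from this thread):

```csharp
// Sketch of the "background C# job" option. Table and column names
// (Telemetry, TelemetryMinute, Value1..Value6) are illustrative assumptions.
using Microsoft.Data.SqlClient;

const string connectionString = "<azure-sql-connection-string>"; // placeholder

// One set-based statement folds the previous minute into min/max/avg rows.
// Only Value1 is shown; Value2..Value6 repeat the same three aggregates.
const string aggregateSql = """
    INSERT INTO TelemetryMinute (DeviceId, MinuteUtc, Min1, Max1, Avg1)
    SELECT DeviceId,
           DATEADD(minute, DATEDIFF(minute, 0, RecordedUtc), 0),
           MIN(Value1), MAX(Value1), AVG(Value1)
    FROM   Telemetry
    WHERE  RecordedUtc >= @from AND RecordedUtc < @to
    GROUP BY DeviceId, DATEADD(minute, DATEDIFF(minute, 0, RecordedUtc), 0);
    """;

using var timer = new PeriodicTimer(TimeSpan.FromMinutes(1));
while (await timer.WaitForNextTickAsync())
{
    // Aggregate the minute that just closed, not the one still in progress.
    DateTime now = DateTime.UtcNow;
    DateTime to = new(now.Year, now.Month, now.Day, now.Hour, now.Minute, 0, DateTimeKind.Utc);
    DateTime from = to.AddMinutes(-1);

    await using var conn = new SqlConnection(connectionString);
    await conn.OpenAsync();
    await using var cmd = new SqlCommand(aggregateSql, conn);
    cmd.Parameters.AddWithValue("@from", from);
    cmd.Parameters.AddWithValue("@to", to);
    await cmd.ExecuteNonQueryAsync();
}
```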

Pete OHanlon wrote (#2):

There are so many "it depends" that need to be looked at here. We'll start with a simple question that should, hopefully, start us towards the answer: do you need to do anything else with the data you are inserting? Once you have created these six values, is the telemetry data ever used again?

Dag Sunde wrote (#3):

        The telemetry data are also needed later, yes.

Pete OHanlon wrote (#4):

Where are you hosting the data? Is it with a cloud provider such as Azure or AWS, or are you managing it yourself?

Dag Sunde wrote (#5):

The DB, and the code that receives the data, is hosted in Azure. Data is received as MQTT messages by Azure IoT Hub; a C# worker subscribes to the incoming messages and stores them, normalized, in an Azure SQL DB.
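For readers who haven't used IoT Hub: the ingest side described here might look roughly like the following sketch, which reads from the hub's built-in Event Hub-compatible endpoint. The connection string, consumer group, and JSON payload shape are all assumptions, not details from this thread.

```csharp
// Rough sketch of a worker reading telemetry from the IoT Hub's built-in
// Event Hub-compatible endpoint. Connection string and payload shape are
// illustrative assumptions.
using System.Linq;
using System.Text.Json;
using Azure.Messaging.EventHubs.Consumer;

const string endpoint = "<event-hub-compatible-connection-string>"; // placeholder

await using var consumer = new EventHubConsumerClient(
    EventHubConsumerClient.DefaultConsumerGroupName, endpoint);

await foreach (PartitionEvent evt in consumer.ReadEventsAsync())
{
    // Assumed payload: {"deviceId": 42, "values": [v1, v2, v3, v4, v5, v6]}
    using JsonDocument doc = JsonDocument.Parse(evt.Data.EventBody.ToMemory());
    int deviceId = doc.RootElement.GetProperty("deviceId").GetInt32();
    decimal[] values = doc.RootElement.GetProperty("values")
        .EnumerateArray().Select(v => v.GetDecimal()).ToArray();

    // ...normalize and insert into the Azure SQL DB here...
}
```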

Mircea Neacsu wrote (#6), replying to #1:

What worries you most? Is it the data volume? A back-of-the-napkin calculation (200 devices × 5 rows per second = 1,000 rows per second) shows something like 10-20k per second. Almost any DB system could handle that.

me dagsunde com wrote (#7):

Volume, yes. As of today the volume is not a problem at all, but when we get thousands of devices online I'm a little afraid that the job generating the aggregates will struggle or lead to locks. Also, who knows what other post-processing needs will occur in the future.

Mircea Neacsu wrote (#8):

How many more devices do you anticipate connecting: 20,000? 200,000? At the 20,000 level I think you should still be safe; at 200,000, not so sure. Keep in mind, however, that inserting is the more probable bottleneck, since every insert has to update all those indexes, allocate pages, etc. Querying should be much faster.
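One common way to soften that insert bottleneck (a general technique, not something proposed in this thread) is to buffer incoming rows briefly and flush them in batches with SqlBulkCopy rather than issuing one INSERT per row. A sketch with placeholder table and column names:

```csharp
// Sketch: batch inserts with SqlBulkCopy instead of one INSERT per row.
// Table and column names are placeholders.
using System.Data;
using Microsoft.Data.SqlClient;

static DataTable NewBatch()
{
    var table = new DataTable();
    table.Columns.Add("DeviceId", typeof(int));
    table.Columns.Add("RecordedUtc", typeof(DateTime));
    for (int i = 1; i <= 6; i++)
        table.Columns.Add($"Value{i}", typeof(decimal));
    return table;
}

static async Task FlushAsync(string connectionString, DataTable batch)
{
    if (batch.Rows.Count == 0) return;
    await using var conn = new SqlConnection(connectionString);
    await conn.OpenAsync();
    using var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.Telemetry" };
    await bulk.WriteToServerAsync(batch);
    batch.Clear();
}

// Usage (simplified): add rows as readings arrive, flush on a timer or when
// the batch reaches a few thousand rows.
DataTable batch = NewBatch();
batch.Rows.Add(42, DateTime.UtcNow, 1m, 2m, 3m, 4m, 5m, 6m);
await FlushAsync("<connection-string>", batch);
```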

me dagsunde com wrote (#9):

                    Thanks! :)

Pete OHanlon wrote (#10), replying to #5:

So, you have data coming in from 200 devices, 5 times a second. Let's think about the problem areas here:

1. When you add data at that volume, you're adding 1,000 items a second. Not a problem for SQL Server.
2. I'm assuming it's one aggregation per device every minute, so you're looking at a read of 60K rows every minute, with a write of 200 rows.

I would be tempted to invert the problem slightly. Rather than write to one source, write to two: write the incoming messages to SQL Server and, at the same time, write them to something like a Redis cache. Then, every minute, get your data from Redis and use that to calculate the values for the other table, and write to that (a sketch follows below). This approach follows (somewhat) the CQRS pattern, so you're separating the reads and writes from each other. The beauty is, this scales well.
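To make the shape of that concrete: a minimal sketch of the dual-write idea using StackExchange.Redis, with an assumed key layout of one Redis list per device per minute. The key format and the CSV encoding are purely illustrative.

```csharp
// Sketch of the dual-write idea: readings are pushed into a per-device,
// per-minute Redis list on the write path; once that minute closes, the list
// is read back, aggregated, and deleted. Key layout/encoding are assumptions.
using System.Linq;
using StackExchange.Redis;

ConnectionMultiplexer redis = await ConnectionMultiplexer.ConnectAsync("<redis-connection>");
IDatabase db = redis.GetDatabase();

// Write path: called alongside the SQL insert for each incoming reading.
static Task BufferAsync(IDatabase db, int deviceId, decimal[] values, DateTime utc) =>
    db.ListRightPushAsync($"telemetry:{deviceId}:{utc:yyyyMMddHHmm}",
                          string.Join(',', values));

// Read path: called once per device after a minute closes. Assumes the list
// is non-empty; a real job would skip devices with no readings.
static async Task<(decimal Min, decimal Max, decimal Avg)[]> AggregateAsync(
    IDatabase db, int deviceId, DateTime closedMinuteUtc)
{
    string key = $"telemetry:{deviceId}:{closedMinuteUtc:yyyyMMddHHmm}";
    RedisValue[] raw = await db.ListRangeAsync(key);
    decimal[][] rows = raw
        .Select(v => ((string)v!).Split(',').Select(decimal.Parse).ToArray())
        .ToArray();
    var result = Enumerable.Range(0, 6)
        .Select(i => (rows.Min(r => r[i]), rows.Max(r => r[i]), rows.Average(r => r[i])))
        .ToArray();
    await db.KeyDeleteAsync(key); // the minute is folded; drop the buffer
    return result;                // six (min, max, avg) triples -> SQL write
}

// Example: buffer one reading, then aggregate the minute that just closed.
await BufferAsync(db, 42, new decimal[] { 1m, 2m, 3m, 4m, 5m, 6m }, DateTime.UtcNow);
var stats = await AggregateAsync(db, 42, DateTime.UtcNow.AddMinutes(-1));
```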

me dagsunde com wrote (#11):

That was actually a brilliant idea, as I already have Redis running :) The number of devices will hopefully increase 10- to 100-fold, but this will still scale, and so will it with additional transformations in the future. Thanks! The only drawback with this solution is that I have no excuse to sit down and play with data lakes and all their tools... :laugh:

jschell wrote (#12), replying to #1:

In one of the sub-threads you said the following: "The number of devices will hopefully increase 10 to a 100-fold." You should have a very hard idea of what this increase looks like before you start making decisions. You should know the following:

1. What is the timeline for the increase? Also, will the increase be gradual or abrupt?
2. What is the maximum? If you owned the market, what size would you need to support?

Typically when I size, I do the following:

1. Document my assumptions.
2. Document my sources: where I got the assumptions and how I derived them.
3. Increase the size from that by 3 to 10 (see the worked numbers below).

Don't attempt to create a system that will work for the next 100 years; one that works for 10 is fine. After you have those numbers, then you start looking for solutions that will handle them. Keep in mind, of course, that your architecture/design should support the larger values, but the implementation does not need to support them now. You just want to make sure the implementation does not preclude sizing it up.
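As a quick worked version of that sizing exercise, using the 10x and 100x device counts mentioned earlier in the thread and a 10x headroom multiplier:

```csharp
// Worked sizing numbers: today's fleet plus the hoped-for 10x and 100x,
// at 5 rows per device per second, with a 10x design-headroom multiplier.
const int rowsPerDevicePerSecond = 5;

foreach (int devices in new[] { 200, 2_000, 20_000 })
{
    long perSecond = (long)devices * rowsPerDevicePerSecond;
    long perDay = perSecond * 86_400; // seconds per day
    Console.WriteLine(
        $"{devices,6:N0} devices: {perSecond,8:N0} rows/s, {perDay,15:N0} rows/day; " +
        $"design headroom (x10): {perDay * 10:N0} rows/day");
}
// 200 devices    ->   1,000 rows/s,        86,400,000 rows/day
// 20,000 devices -> 100,000 rows/s,     8,640,000,000 rows/day
```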

Dag Sunde wrote (#13):

Thanks! Very good points. The question was more about what a data lake would offer in terms of post-processing and other benefits, since I am not familiar with the technology. I have no problem implementing an efficient solution with Azure/SQL/C# for the foreseeable future; I just needed some insight into a tech I'm not familiar with.
