Need some advice on Post-Processing large amounts of data.
-
Scenario: We have (as of today; there will be more) 200 devices sending telemetry into an SQL Server database, each at a rate of 5 rows per second. Each row contains 6 decimal values to be processed into another dataset/table containing the min/max/avg of these values over 1-minute intervals. I have never worked with data lakes and similar tech, so I wonder: should I read up on storing the raw telemetry in a data lake and set up post-processing there, or just stick with my existing SQL Server and write a C# job that post-processes the data in the background myself? TIA...
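For concreteness, a minimal sketch of what the C#-job option could look like, assuming hypothetical Telemetry and TelemetryMinute tables and the Microsoft.Data.SqlClient package, and showing only the first of the six values:

using Microsoft.Data.SqlClient;

// Hypothetical schema: Telemetry(DeviceId, Ts, V1..V6),
// TelemetryMinute(DeviceId, Minute, MinV1, MaxV1, AvgV1, ...).
// Run once a minute, aggregating the minute that has just completed.
static void AggregateMinute(string connectionString, DateTime minuteStart)
{
    const string sql = @"
        INSERT INTO TelemetryMinute (DeviceId, Minute, MinV1, MaxV1, AvgV1)
        SELECT DeviceId, @from, MIN(V1), MAX(V1), AVG(V1)  -- repeat for V2..V6
        FROM Telemetry
        WHERE Ts >= @from AND Ts < @to
        GROUP BY DeviceId;";

    using var conn = new SqlConnection(connectionString);
    using var cmd = new SqlCommand(sql, conn);
    cmd.Parameters.AddWithValue("@from", minuteStart);
    cmd.Parameters.AddWithValue("@to", minuteStart.AddMinutes(1));
    conn.Open();
    cmd.ExecuteNonQuery();
}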
There are so many "it depends" factors that need to be looked at here. We'll start with a simple question that should, hopefully, point us towards the answer. Question: Do you need to do anything else with the data that you are inserting? Once you have created these six values, is the telemetry data ever used again?
-
Where are you hosting the data? Is it with a cloud provider such as Azure or AWS, or are you managing it yourself?
-
What worries you most? Is it the data volume? A back-of-the-napkin calculation shows something like 10-20k per second. Almost any DB system could handle that.
Mircea
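For reference, a worked version of that back-of-the-napkin calculation, using the numbers from the original post: 200 devices × 5 rows per second = 1,000 rows per second, and at 6 decimal values per row that is 6,000 values per second.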
-
Volume, yes. As of today, the volume is not a problem at all. But when we get thousands of devices online, I'm a little bit afraid that the job generating the aggregates will struggle or lead to locks. Also, who knows what other post-processing needs will occur in the future.
-
How many more devices do you anticipate connecting: 20,000? 200,000? At the 20,000 level, I think you should still be safe. At 200,000, I'm not so sure. Keep in mind, however, that inserting is the more probable bottleneck, as it has to update all those indexes, allocate pages, etc. Querying should be much faster.
Mircea
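If inserts do become the bottleneck at higher device counts, batching them is the usual first fix. A minimal sketch using SqlBulkCopy, assuming incoming rows are buffered into a DataTable and a hypothetical Telemetry destination table:

using System.Data;
using Microsoft.Data.SqlClient;

// Buffer incoming rows in memory and flush them in batches
// instead of inserting row by row.
static void FlushBatch(string connectionString, DataTable buffered)
{
    using var bulk = new SqlBulkCopy(connectionString)
    {
        DestinationTableName = "Telemetry", // hypothetical table name
        BatchSize = 5_000                   // one round trip per 5,000 rows
    };
    bulk.WriteToServer(buffered);
    buffered.Clear();
}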
-
Thanks! :)
-
The DB, and the code that receives the data, is hosted in Azure. Data is received as MQTT messages by Azure IoT Hub. A C# worker subscribes to the incoming messages and stores them, normalized, in an Azure SQL DB.
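A minimal sketch of that ingestion path, assuming the IoT Hub built-in Event Hubs-compatible endpoint and the Azure.Messaging.EventHubs package (the connection string and the parsing step are placeholders):

using Azure.Messaging.EventHubs.Consumer;

// Read telemetry from the IoT Hub built-in (Event Hubs-compatible) endpoint.
await using var consumer = new EventHubConsumerClient(
    EventHubConsumerClient.DefaultConsumerGroupName,
    builtInEndpointConnectionString); // hypothetical connection string

await foreach (PartitionEvent evt in consumer.ReadEventsAsync(cancellationToken))
{
    string body = evt.Data.EventBody.ToString(); // raw message payload
    // Parse the 6 decimal values and insert (ideally batched) into Azure SQL.
}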
So, you have data coming in from 200 devices, 5 times a second. Let's think about the problem areas here:
- When you add data, at that volume, you're adding 1000 items a second. Not a problem in SQL Server.
- I'm assuming it's one aggregation per device every minute, so you're looking at a read of 60K rows every minute, with a write of 200 rows.
I would be tempted to invert the problem slightly. Rather than write to one source, write to two. So, write the incoming messages to SQL Server and, at the same time, write them to something like Redis Cache. Then, every minute, get your data from Redis and use that to calculate the values for the other table, and write to that. This approach follows (somewhat) the CQRS pattern, so you're separating the reads and writes from each other. The beauty is, this scales well.
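A minimal sketch of the Redis side of that idea, assuming StackExchange.Redis and a hypothetical per-device, per-minute key scheme (deviceId, redisConnectionString, and previousMinuteKey are placeholders):

using StackExchange.Redis;

var redis = await ConnectionMultiplexer.ConnectAsync(redisConnectionString);
IDatabase db = redis.GetDatabase();

// On ingest: write each reading to SQL Server AND push it onto a
// per-device, per-minute Redis list.
string key = $"telemetry:{deviceId}:{DateTime.UtcNow:yyyyMMddHHmm}";
await db.ListRightPushAsync(key, "1.23,4.56,7.89,0.12,3.45,6.78"); // the 6 values, serialized
await db.KeyExpireAsync(key, TimeSpan.FromMinutes(5)); // raw buffer is only needed briefly

// Once a minute: read the completed minute's list, compute min/max/avg,
// write the aggregate row to SQL Server, then drop the key.
RedisValue[] readings = await db.ListRangeAsync(previousMinuteKey);
await db.KeyDeleteAsync(previousMinuteKey);

The expiry guards against a crashed aggregator leaving stale buffers behind.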
-
That was actually a brilliant idea, as I already have Redis running :) The number of devices will hopefully increase 10- to 100-fold, but this will still scale, and so will additional transformations in the future. Thanks! The only drawback with this solution is that I have no excuse to sit down and play with data lakes and all their tools... :laugh:
-
In one of the sub-threads you said the following: "The number of devices will hopefully increase 10 to a 100-fold". You should have a very clear idea of what this increase looks like before you start making decisions. You should know the following:
1. What is the timeline for the increase? Also, will the increase be gradual or abrupt?
2. What is the maximum? If you owned the market, what size would you need to support?
Typically, when I size, I do the following:
1. Document my assumptions.
2. Document my sources - where I got the assumptions and how I derived them.
3. Increase the size from that by 3 to 10.
Don't attempt to create a system that will work for the next 100 years; one that works for 10 is fine. After you have those numbers, then you start looking for solutions that will handle them. Keep in mind, of course, that your architecture/design should support the larger values, but the implementation does not need to support them now. You just want to make sure the implementation does not preclude sizing it up.
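To make that sizing exercise concrete, a back-of-envelope calculation at the thread's stated rate of 5 rows per device per second (the device counts are illustrative):

// Rows per second and per day at various device counts, at 5 rows/device/second.
foreach (int devices in new[] { 200, 2_000, 20_000, 200_000 })
{
    long rowsPerSecond = devices * 5L;
    long rowsPerDay = rowsPerSecond * 86_400L;
    Console.WriteLine($"{devices,7:N0} devices -> {rowsPerSecond,9:N0} rows/s, {rowsPerDay,14:N0} rows/day");
}

At the top end that is a million rows per second and roughly 86 billion rows per day, which is the kind of scale at which options beyond a single SQL Server are worth evaluating.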
-
Thanks! Very good points. The question was more about what a data lake would offer in terms of post-processing and other benefits, since I am not familiar with the technology. I have no problem implementing an efficient solution with Azure/SQL/C# for the foreseeable future; I just needed some insight into a tech I'm not familiar with.