Need some advice on Post-Processing large amounts of data.

Design and Architecture · Tags: c#, database, sql-server, sysadmin, question

me dagsunde com wrote (#1):
Scenario: We have (as of today; there will be more) 200 devices sending telemetry into a SQL Server, at a rate of 5 rows per device per second. Each row contains 6 decimal values to be processed into another dataset/table containing the min/max/avg of these values over 1-minute intervals. I have never worked with data lakes and similar tech, so I wonder: should I read up on storing the raw telemetry in a data lake and set up the post-processing there, or just go with my existing SQL Server and create a C# job that post-processes the data in the background myself? TIA...
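For reference, the "existing SQL Server + C# job" option can be as small as a worker that runs one set-based query per minute. A minimal sketch, assuming hypothetical Telemetry and TelemetryMinute tables with DeviceId, RecordedUtc, and Value1..Value6 columns (all names are illustrative, not from this thread):

```csharp
// Sketch of the "background C# job" option. Table and column names
// (Telemetry, TelemetryMinute, Value1..Value6) are illustrative assumptions.
using Microsoft.Data.SqlClient;

const string connectionString = "<azure-sql-connection-string>"; // placeholder

// One set-based statement folds the previous minute into min/max/avg rows.
// Only Value1 is shown; Value2..Value6 repeat the same three aggregates.
const string aggregateSql = """
    INSERT INTO TelemetryMinute (DeviceId, MinuteUtc, Min1, Max1, Avg1)
    SELECT DeviceId,
           DATEADD(minute, DATEDIFF(minute, 0, RecordedUtc), 0),
           MIN(Value1), MAX(Value1), AVG(Value1)
    FROM   Telemetry
    WHERE  RecordedUtc >= @from AND RecordedUtc < @to
    GROUP BY DeviceId, DATEADD(minute, DATEDIFF(minute, 0, RecordedUtc), 0);
    """;

using var timer = new PeriodicTimer(TimeSpan.FromMinutes(1));
while (await timer.WaitForNextTickAsync())
{
    // Aggregate the minute that just closed, not the one still in progress.
    DateTime now = DateTime.UtcNow;
    DateTime to = new(now.Year, now.Month, now.Day, now.Hour, now.Minute, 0, DateTimeKind.Utc);
    DateTime from = to.AddMinutes(-1);

    await using var conn = new SqlConnection(connectionString);
    await conn.OpenAsync();
    await using var cmd = new SqlCommand(aggregateSql, conn);
    cmd.Parameters.AddWithValue("@from", from);
    cmd.Parameters.AddWithValue("@to", to);
    await cmd.ExecuteNonQueryAsync();
}
```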

Pete OHanlon wrote (#2):

There are so many "it depends" that need to be looked at here. We'll start with a simple question that should, hopefully, start us towards the answer: do you need to do anything else with the data you are inserting? Once you have created these six values, is the telemetry data ever used again?

Dag Sunde wrote (#3):

        The telemetry data are also needed later, yes.

Pete OHanlon wrote (#4):

Where are you hosting the data? Is it with a cloud provider such as Azure or AWS, or are you managing it yourself?

Dag Sunde wrote (#5):

The DB, and the code that receives the data, is hosted in Azure. Data is received as MQTT messages by Azure IoT Hub; a C# worker subscribes to the incoming messages and stores them, normalized, in an Azure SQL DB.
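For readers who haven't used IoT Hub: the ingest side described here might look roughly like the following sketch, which reads from the hub's built-in Event Hub-compatible endpoint. The connection string, consumer group, and JSON payload shape are all assumptions, not details from this thread.

```csharp
// Rough sketch of a worker reading telemetry from the IoT Hub's built-in
// Event Hub-compatible endpoint. Connection string and payload shape are
// illustrative assumptions.
using System.Linq;
using System.Text.Json;
using Azure.Messaging.EventHubs.Consumer;

const string endpoint = "<event-hub-compatible-connection-string>"; // placeholder

await using var consumer = new EventHubConsumerClient(
    EventHubConsumerClient.DefaultConsumerGroupName, endpoint);

await foreach (PartitionEvent evt in consumer.ReadEventsAsync())
{
    // Assumed payload: {"deviceId": 42, "values": [v1, v2, v3, v4, v5, v6]}
    using JsonDocument doc = JsonDocument.Parse(evt.Data.EventBody.ToMemory());
    int deviceId = doc.RootElement.GetProperty("deviceId").GetInt32();
    decimal[] values = doc.RootElement.GetProperty("values")
        .EnumerateArray().Select(v => v.GetDecimal()).ToArray();

    // ...normalize and insert into the Azure SQL DB here...
}
```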

Mircea Neacsu wrote (#6), replying to #1:

What worries you most? Is it the data volume? A back-of-the-napkin calculation (200 devices × 5 rows per second = 1,000 rows per second) shows something like 10-20k per second. Almost any DB system could handle that.

me dagsunde com wrote (#7):

Volume, yes. As of today the volume is not a problem at all, but when we get thousands of devices online I'm a little afraid that the job generating the aggregates will struggle or lead to locks. Also, who knows what other post-processing needs will occur in the future.

Mircea Neacsu wrote (#8):

How many more devices do you anticipate connecting: 20,000? 200,000? At the 20,000 level I think you should still be safe; at 200,000, not so sure. Keep in mind, however, that inserting is the more probable bottleneck, since every insert has to update all those indexes, allocate pages, etc. Querying should be much faster.
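One common way to soften that insert bottleneck (a general technique, not something proposed in this thread) is to buffer incoming rows briefly and flush them in batches with SqlBulkCopy rather than issuing one INSERT per row. A sketch with placeholder table and column names:

```csharp
// Sketch: batch inserts with SqlBulkCopy instead of one INSERT per row.
// Table and column names are placeholders.
using System.Data;
using Microsoft.Data.SqlClient;

static DataTable NewBatch()
{
    var table = new DataTable();
    table.Columns.Add("DeviceId", typeof(int));
    table.Columns.Add("RecordedUtc", typeof(DateTime));
    for (int i = 1; i <= 6; i++)
        table.Columns.Add($"Value{i}", typeof(decimal));
    return table;
}

static async Task FlushAsync(string connectionString, DataTable batch)
{
    if (batch.Rows.Count == 0) return;
    await using var conn = new SqlConnection(connectionString);
    await conn.OpenAsync();
    using var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.Telemetry" };
    await bulk.WriteToServerAsync(batch);
    batch.Clear();
}

// Usage (simplified): add rows as readings arrive, flush on a timer or when
// the batch reaches a few thousand rows.
DataTable batch = NewBatch();
batch.Rows.Add(42, DateTime.UtcNow, 1m, 2m, 3m, 4m, 5m, 6m);
await FlushAsync("<connection-string>", batch);
```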

me dagsunde com wrote (#9):

                    Thanks! :)

Pete OHanlon wrote (#10), replying to #5:

So, you have data coming in from 200 devices, 5 times a second. Let's think about the problem areas here:

1. When you add data at that volume, you're adding 1,000 items a second. Not a problem for SQL Server.
2. I'm assuming it's one aggregation per device every minute, so you're looking at a read of 60K rows every minute, with a write of 200 rows.

I would be tempted to invert the problem slightly. Rather than write to one source, write to two: write the incoming messages to SQL Server and, at the same time, write them to something like a Redis cache. Then, every minute, get your data from Redis and use that to calculate the values for the other table, and write to that (a sketch follows below). This approach follows (somewhat) the CQRS pattern, so you're separating the reads and writes from each other. The beauty is, this scales well.
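To make the shape of that concrete: a minimal sketch of the dual-write idea using StackExchange.Redis, with an assumed key layout of one Redis list per device per minute. The key format and the CSV encoding are purely illustrative.

```csharp
// Sketch of the dual-write idea: readings are pushed into a per-device,
// per-minute Redis list on the write path; once that minute closes, the list
// is read back, aggregated, and deleted. Key layout/encoding are assumptions.
using System.Linq;
using StackExchange.Redis;

ConnectionMultiplexer redis = await ConnectionMultiplexer.ConnectAsync("<redis-connection>");
IDatabase db = redis.GetDatabase();

// Write path: called alongside the SQL insert for each incoming reading.
static Task BufferAsync(IDatabase db, int deviceId, decimal[] values, DateTime utc) =>
    db.ListRightPushAsync($"telemetry:{deviceId}:{utc:yyyyMMddHHmm}",
                          string.Join(',', values));

// Read path: called once per device after a minute closes. Assumes the list
// is non-empty; a real job would skip devices with no readings.
static async Task<(decimal Min, decimal Max, decimal Avg)[]> AggregateAsync(
    IDatabase db, int deviceId, DateTime closedMinuteUtc)
{
    string key = $"telemetry:{deviceId}:{closedMinuteUtc:yyyyMMddHHmm}";
    RedisValue[] raw = await db.ListRangeAsync(key);
    decimal[][] rows = raw
        .Select(v => ((string)v!).Split(',').Select(decimal.Parse).ToArray())
        .ToArray();
    var result = Enumerable.Range(0, 6)
        .Select(i => (rows.Min(r => r[i]), rows.Max(r => r[i]), rows.Average(r => r[i])))
        .ToArray();
    await db.KeyDeleteAsync(key); // the minute is folded; drop the buffer
    return result;                // six (min, max, avg) triples -> SQL write
}

// Example: buffer one reading, then aggregate the minute that just closed.
await BufferAsync(db, 42, new decimal[] { 1m, 2m, 3m, 4m, 5m, 6m }, DateTime.UtcNow);
var stats = await AggregateAsync(db, 42, DateTime.UtcNow.AddMinutes(-1));
```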

me dagsunde com wrote (#11):

That was actually a brilliant idea, as I already have Redis running :) The number of devices will hopefully increase 10- to 100-fold, but this will still scale, and so will it with additional transformations in the future. Thanks! The only drawback with this solution is that I have no excuse to sit down and play with data lakes and all their tools... :laugh:

jschell wrote (#12), replying to #1:

In one of the sub-threads you said the following: "The number of devices will hopefully increase 10 to a 100-fold." You should have a very hard idea of what this increase looks like before you start making decisions. You should know the following:

1. What is the timeline for the increase? Also, will the increase be gradual or abrupt?
2. What is the maximum? If you owned the market, what size would you need to support?

Typically when I size, I do the following:

1. Document my assumptions.
2. Document my sources: where I got the assumptions and how I derived them.
3. Increase the size from that by 3 to 10 (see the worked numbers below).

Don't attempt to create a system that will work for the next 100 years; one that works for 10 is fine. After you have those numbers, then you start looking for solutions that will handle them. Keep in mind, of course, that your architecture/design should support the larger values, but the implementation does not need to support them now. You just want to make sure the implementation does not preclude sizing it up.
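As a quick worked version of that sizing exercise, using the 10x and 100x device counts mentioned earlier in the thread and a 10x headroom multiplier:

```csharp
// Worked sizing numbers: today's fleet plus the hoped-for 10x and 100x,
// at 5 rows per device per second, with a 10x design-headroom multiplier.
const int rowsPerDevicePerSecond = 5;

foreach (int devices in new[] { 200, 2_000, 20_000 })
{
    long perSecond = (long)devices * rowsPerDevicePerSecond;
    long perDay = perSecond * 86_400; // seconds per day
    Console.WriteLine(
        $"{devices,6:N0} devices: {perSecond,8:N0} rows/s, {perDay,15:N0} rows/day; " +
        $"design headroom (x10): {perDay * 10:N0} rows/day");
}
// 200 devices    ->   1,000 rows/s,        86,400,000 rows/day
// 20,000 devices -> 100,000 rows/s,     8,640,000,000 rows/day
```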

Dag Sunde wrote (#13):

Thanks! Very good points. The question was more about what a data lake would offer in terms of post-processing and other benefits, since I am not familiar with the technology. I have no problem implementing an efficient solution with Azure/SQL/C# for the foreseeable future; I just needed some insight into a tech I'm not familiar with.
