Need some advice on Post-Processing large amounts of data.

Design and Architecture · Tags: csharp, database, sql-server, sysadmin, question
Pete OHanlon (#4), replying to Dag Sunde:

> The telemetry data are also needed later, yes.

    Where are you hosting the data? Is it in a cloud provider such as Azure or AWS or are you managing it yourself?

Dag Sunde (#5), replying to Pete OHanlon:

The DB and the code that receives the data are hosted in Azure. Data is received as MQTT messages by Azure IoT Hub. A C# worker subscribes to the incoming messages and stores them normalized in an Azure SQL DB.
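
Roughly, the worker looks like this (a simplified sketch, not the production code: the connection strings, table, and payload shape are all placeholders; `EventProcessorClient` is one common way to read IoT Hub's Event Hub-compatible endpoint):

```csharp
// Illustrative sketch only: connection strings, table and payload shape are
// placeholders, not the actual production code described above.
using System.Text.Json;
using Azure.Messaging.EventHubs;
using Azure.Messaging.EventHubs.Consumer;
using Azure.Storage.Blobs;
using Microsoft.Data.SqlClient;

// IoT Hub exposes an Event Hub-compatible endpoint; EventProcessorClient
// consumes it, with checkpoints kept in blob storage.
var checkpoints = new BlobContainerClient("<storage-connection-string>", "checkpoints");
var processor = new EventProcessorClient(
    checkpoints,
    EventHubConsumerClient.DefaultConsumerGroupName,
    "<iot-hub-event-hub-compatible-connection-string>");

processor.ProcessEventAsync += async args =>
{
    // Hypothetical payload: { "DeviceId": "...", "Ts": "...", "V": [6 decimals] }
    var msg = JsonSerializer.Deserialize<Telemetry>(args.Data.EventBody.ToString())!;

    // One normalized row per reading. Connection pooling makes the per-event
    // open cheap, though batching wins at scale (see the SqlBulkCopy sketch below).
    await using var conn = new SqlConnection("<sql-connection-string>");
    await conn.OpenAsync();
    var cmd = new SqlCommand(
        "INSERT INTO dbo.Telemetry (DeviceId, Ts, V1, V2, V3, V4, V5, V6) " +
        "VALUES (@d, @t, @v1, @v2, @v3, @v4, @v5, @v6)", conn);
    cmd.Parameters.AddWithValue("@d", msg.DeviceId);
    cmd.Parameters.AddWithValue("@t", msg.Ts);
    for (var i = 0; i < 6; i++)
        cmd.Parameters.AddWithValue($"@v{i + 1}", msg.V[i]);
    await cmd.ExecuteNonQueryAsync();

    await args.UpdateCheckpointAsync(); // at-least-once delivery
};
processor.ProcessErrorAsync += _ => Task.CompletedTask; // log properly in real code

await processor.StartProcessingAsync();

record Telemetry(string DeviceId, DateTime Ts, decimal[] V);
```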

Mircea Neacsu (#6), replying to me dagsunde com:

> Scenario: we have 200 devices today (will be more) sending telemetry into a SQL Server at 5 rows per device per second. Each row contains 6 decimal values to be processed into another dataset/table containing the min/max/avg of those values over 1-minute intervals. I have never worked with data lakes and similar tech, so I wonder: should I read up on storing the raw telemetry in a data lake and set up post-processing there, or just go with my existing SQL Server and write a C# job that post-processes the data in the background myself? TIA...

What worries you most? Is it the data volume? A back-of-the-napkin calculation shows something like 10-20k per second. Almost any DB system could handle that.

me dagsunde com (#7), replying to Mircea Neacsu:

Volume, yes. As of today, the volume is not a problem at all. But when we get thousands of devices online, I'm a little afraid that the job generating the aggregates will struggle or lead to locks. Also, who knows what other post-processing needs will come up in the future.
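
For concreteness, the kind of background job I mean (a sketch; table and column names are placeholders). Aggregating only a minute that is fully in the past keeps the job away from the rows currently being inserted:

```csharp
// Sketch of a minute-aggregation job; table/column names are placeholders.
// It only ever touches a closed window (the previous minute), so it does not
// compete with in-flight inserts at the head of the table.
using Microsoft.Data.SqlClient;

const string AggregateSql = @"
INSERT INTO dbo.TelemetryMinute (DeviceId, MinuteTs, V1Min, V1Max, V1Avg) -- ...repeat for V2..V6
SELECT DeviceId,
       DATEADD(minute, DATEDIFF(minute, 0, Ts), 0),  -- truncate to the minute
       MIN(V1), MAX(V1), AVG(V1)                     -- ...repeat for V2..V6
FROM dbo.Telemetry
WHERE Ts >= @from AND Ts < @to
GROUP BY DeviceId, DATEADD(minute, DATEDIFF(minute, 0, Ts), 0);";

async Task AggregateClosedMinuteAsync(string connString, DateTime minuteUtc)
{
    await using var conn = new SqlConnection(connString);
    await conn.OpenAsync();
    var cmd = new SqlCommand(AggregateSql, conn);
    cmd.Parameters.AddWithValue("@from", minuteUtc);             // start of the closed minute
    cmd.Parameters.AddWithValue("@to", minuteUtc.AddMinutes(1)); // exclusive end
    await cmd.ExecuteNonQueryAsync();
}
```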

Mircea Neacsu (#8), replying to me dagsunde com:

How many more devices do you anticipate connecting: 20,000? 200,000? At the 20,000 level, I think you should still be safe. At 200,000, I'm not so sure. Keep in mind, however, that inserting is the more probable bottleneck, as it has to maintain all those indexes, allocate pages, etc. Querying should be much faster.
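
To illustrate the insert-overhead point: buffering incoming rows and flushing them as one batch amortizes the index maintenance and page allocation. A sketch with placeholder names (single-threaded for brevity; a real worker would guard the buffer or use a channel):

```csharp
// Sketch: batch inserts with SqlBulkCopy instead of one INSERT per reading.
// Placeholder table/column names; single-threaded for brevity.
using System.Data;
using Microsoft.Data.SqlClient;

var buffer = new DataTable();
buffer.Columns.Add("DeviceId", typeof(string));
buffer.Columns.Add("Ts", typeof(DateTime));
for (var i = 1; i <= 6; i++) buffer.Columns.Add($"V{i}", typeof(decimal));

// Called from the message handler instead of issuing a per-row INSERT.
void Enqueue(string deviceId, DateTime ts, decimal[] v) =>
    buffer.Rows.Add(deviceId, ts, v[0], v[1], v[2], v[3], v[4], v[5]);

// Called on a timer (e.g. once a second) or when the buffer reaches N rows.
async Task FlushAsync(string connString)
{
    if (buffer.Rows.Count == 0) return;
    await using var conn = new SqlConnection(connString);
    await conn.OpenAsync();
    using var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.Telemetry" };
    await bulk.WriteToServerAsync(buffer); // one round trip for the whole batch
    buffer.Clear();
}
```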

me dagsunde com (#9), replying to Mircea Neacsu:

              Thanks! :)

Pete OHanlon (#10), replying to Dag Sunde:

So, you have data coming in from 200 devices, 5 times a second. Let's think about the problem areas here:

1. When you add data at that volume, you're adding 1,000 items a second. Not a problem in SQL Server.
2. I'm assuming it's one aggregation per device every minute, so you're looking at a read of 60K rows every minute, with a write of 200 rows.

I would be tempted to invert the problem slightly. Rather than write to one source, write to two: write the incoming messages to SQL Server and, at the same time, write them to something like a Redis cache. Then, every minute, get your data from Redis, use it to calculate the values for the other table, and write to that. This approach follows (somewhat) the CQRS pattern, so you're separating the reads and writes from each other. The beauty is, this scales well.
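
Something along these lines, as a sketch of the idea (using StackExchange.Redis; the key scheme and all names are invented for illustration):

```csharp
// Sketch of the dual-write idea: push each reading to a per-device,
// per-minute Redis list, then aggregate closed minutes from Redis and
// write only the results to SQL. Key scheme and names are invented.
using System.Globalization;
using StackExchange.Redis;

var redis = await ConnectionMultiplexer.ConnectAsync("<redis-connection-string>");
var db = redis.GetDatabase();

// On each incoming message, in addition to the SQL Server write:
async Task WriteReadingAsync(string deviceId, DateTime tsUtc, decimal[] v)
{
    var key = $"telemetry:{deviceId}:{tsUtc:yyyyMMddHHmm}"; // one list per device per minute
    var payload = string.Join(',', v.Select(x => x.ToString(CultureInfo.InvariantCulture)));
    await db.ListRightPushAsync(key, payload);
    await db.KeyExpireAsync(key, TimeSpan.FromMinutes(5));  // lists clean themselves up
}

// Once a minute, per device, aggregate one value column of the minute that just closed:
async Task<(decimal Min, decimal Max, decimal Avg)?> AggregateAsync(
    string deviceId, DateTime closedMinuteUtc, int valueIndex)
{
    var key = $"telemetry:{deviceId}:{closedMinuteUtc:yyyyMMddHHmm}";
    var entries = await db.ListRangeAsync(key);
    if (entries.Length == 0) return null;
    var values = entries
        .Select(e => decimal.Parse(e.ToString().Split(',')[valueIndex],
                                   CultureInfo.InvariantCulture))
        .ToArray();
    return (values.Min(), values.Max(), values.Average()); // written to the aggregate table
}
```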

me dagsunde com (#11), replying to Pete OHanlon:

That was actually a brilliant idea, as I already have Redis running :) The number of devices will hopefully increase 10- to 100-fold, but this will still scale, and so will any additional transformations in the future. Thanks! The only drawback with this solution is that I have no excuse to sit down and play with data lakes and all their tools... :laugh:

jschell (#12), replying to me dagsunde com:

In one of the sub-threads you said the following: "The number of devices will hopefully increase 10 to a 100-fold." You should have a very hard idea of what this increase looks like before you start making decisions. You should know the following:

1. What is the timeline for the increase? Also, will the increase be gradual or abrupt?
2. What is the maximum? If you owned the market, what size would you need to support?

Typically, when I size, I do the following:

1. Document my assumptions.
2. Document my sources: where I got the assumptions and how I derived them.
3. Increase the size from that by 3 to 10.

Don't attempt to create a system that will work for the next 100 years; one that works for 10 is fine. After you have those numbers, then you start looking for solutions that will handle them. Keep in mind, of course, that your architecture/design should support the larger values, but the implementation does not need to support them now. You just want to make sure the implementation does not preclude sizing it up.
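
As a worked example of that arithmetic (every number below is an invented assumption, which is exactly the point: write them down):

```csharp
// Worked example of the sizing exercise above. Every number is an invented
// assumption -- which is the point: write the assumptions down.
const int devicesToday        = 200; // from the thread
const int growthFactor        = 100; // assumption: the "10 to 100-fold" upper end
const int rowsPerDevicePerSec = 5;   // from the thread
const int safetyFactor        = 3;   // the 3-10 multiplier, low end

var designTargetRowsPerSec =
    devicesToday * growthFactor * rowsPerDevicePerSec * safetyFactor;
// 200 * 100 * 5 * 3 = 300,000 rows/sec: the figure the architecture should
// not preclude, even if today's implementation handles far less.
Console.WriteLine(designTargetRowsPerSec);
```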

Dag Sunde (#13), replying to jschell:

Thanks! Very good points. The question was more about what a data lake would offer in terms of post-processing and other benefits, since I am not familiar with the technology. I have no problem implementing an efficient solution with Azure/SQL/C# for the foreseeable future; I just needed some insight into a tech I'm not familiar with.
