Code Project / General Programming / C#

How to sort a big volume of data?

Tags: question, csharp, tutorial (16 posts, 4 posters)
#1 michal kreslik wrote:

Hello, how can I sort a big volume (hundreds of MBs) of ASCII data in C#? Thanks, Michal
#2 snorkie wrote:

Can you provide more information about the data? Maybe even a small example of the data.

Hogan
#3 michal kreslik wrote:

Sure, I'll be thankful for any help. Basically, I've got txt files that contain information about Forex price ticks. Each row contains the information I'm interested in:

- date/time
- price

The txt files overlap in such a way that the ending part of one file contains exactly the same ticks as the beginning part of the next file (this was done to ensure that no price ticks were omitted during the preceding export). Obviously, this is true neither for the very first nor for the very last file. The price ticks within a single file are guaranteed not to be duplicated.

To get a continuous data stream that contains only unique ticks (rows), the simple solution would be to load all rows from all files and sort them on: DateTime ASC, FileID ASC, RowNumber ASC. Then I could go through the sorted rows and remove those that have the same DateTime as the last valid DateTime but a new FileID. In other words, only ticks with the same FileID may share a DateTime, which ensures there are no duplicates.

Unfortunately, the DataTable object throws an OutOfMemoryException if I attempt to DataTable.Select() such a big chunk of data (about 15 million rows). The same happens with DataTable.DefaultView.Sort. It works on a smaller data sample with no memory exceptions, though.

Any ideas? Thanks much for any input!

Michal
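Since each file is time-ordered and the overlap sits only at file boundaries, the dedup rule described above can be applied in a single streaming pass, with no giant in-memory sort at all. A minimal sketch in Python (the thread is about C#, but the idea ports directly; the `merge_ticks` name and the `timestamp,price` CSV layout are assumptions for illustration):

```python
import csv
from datetime import datetime

def merge_ticks(paths):
    """Stream time-ordered tick files in export order, skipping the
    leading rows of each file that overlap the previous file."""
    boundary = None  # last timestamp emitted from earlier files
    for path in paths:
        file_last = boundary
        with open(path, newline="") as f:
            for ts, price in csv.reader(f):
                t = datetime.fromisoformat(ts)
                if boundary is not None and t <= boundary:
                    continue  # duplicate of a tick from the previous file
                file_last = t
                yield t, float(price)
        # only advance the boundary between files, so equal timestamps
        # within one file (same FileID) are kept
        boundary = file_last
```

Because `boundary` is advanced only when a file ends, ticks sharing a DateTime inside the same file survive, matching the "same FileID may share a DateTime" rule.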
#4 snorkie wrote:

Michal, I'm no whiz like the other maniacs here on the forum, but here are some ideas. While they will take a few steps, they would work.

Option 1: Download SQL Server Express and load your data into a SQL Server table. Once there, you can do a SELECT statement and ORDER BY anything you want.

Option 2: Take the two pieces of data and combine them into a single delimited string. Insert the strings into an ArrayList and then call ArrayList.Sort(). I tried this on my computer; it took 2 minutes 45 seconds and 1.5 GB of RAM to run, so if you have a powerful machine, this could work.

    ArrayList al = new ArrayList();
    for (int x = 0; x < 15000000; x++)
    {
        al.Add(System.Guid.NewGuid().ToString());
    }
    MessageBox.Show("Finished loading.");
    al.Sort();
    MessageBox.Show("Finished Sort.");

Anyway, I hope this helps give you a direction. Personally, I would go with the SQL Server option, as it is a bit more robust and will handle large amounts of data much better!

Hogan
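If neither SQL Server nor holding everything in RAM is attractive, the classic middle ground is an external merge sort: sort bounded chunks in memory, spill each as a sorted run to a temp file, then k-way merge the runs. A sketch in Python (the `external_sort` helper and the chunk size are illustrative, not from the thread; each input line is assumed to end with a newline):

```python
import heapq
import os
import tempfile

def _spill(sorted_lines):
    """Write one sorted run to a temp file and return its path."""
    fd, path = tempfile.mkstemp(text=True)
    with os.fdopen(fd, "w") as f:
        f.writelines(sorted_lines)
    return path

def external_sort(lines, chunk_size=1_000_000):
    """Sort an iterable of newline-terminated lines in bounded memory:
    sort fixed-size chunks in RAM, spill them to disk, then merge."""
    runs, chunk = [], []
    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            runs.append(_spill(sorted(chunk)))
            chunk = []
    if chunk:
        runs.append(_spill(sorted(chunk)))
    files = [open(r) for r in runs]
    try:
        # heapq.merge streams the k sorted runs without loading them
        yield from heapq.merge(*files)
    finally:
        for f in files:
            f.close()
            os.unlink(f.name)
```

Peak memory is one chunk plus one buffered line per run, which is why this scales to files far larger than RAM.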
#5 michal kreslik wrote:

Hello again, Hogan. The SQL solution is what I've been working on in the meantime while posting the question here, and it's probably going to be the best one. I was just curious whether I could do this in an easy way without SQL.

Option 2: I was thinking about using the ArrayList too, but somehow I was so obsessed with the DataTable that I ruled this option out :) Thanks for your help!

Anyway, why is the DataTable throwing the OutOfMemoryException? Is it only designed to handle small data samples? I seriously doubt it.

Thanks, Michal
#6 Spacix One wrote:

You would have to load the data into XML to sort it without a database for that much information. I was able to process 1,000,000 XML lines with System.Xml in under 9 seconds in a benchmark I ran today for work, as another senior coder was wondering how my program using hundreds of XML files would perform...

-Spacix
All your skynet questions belong to solved
#7 michal kreslik wrote:

That's impressive. I wouldn't even have thought about XML, as the format itself is a kind of synonym for "SLOW" for me :) But still, I was not able to find any caveat from Microsoft concerning loading big chunks of data into a DataTable. The OutOfMemoryException occurred during normal operation. I've got 2 GB of RAM on my box with Win XP SP2, and the used RAM was only something like 1.4 GB at the time, so that was definitely not a lack of physical memory. So "there's something rotten in the state of DataTable" :) What is it?

Michal
#8 Spacix One wrote:

Then my guess would be that it is a permissions issue limiting the application...

-Spacix
#9 michal kreslik wrote:

It's strange, as the DataTable throws an OutOfMemoryException if there are more than about 12,646,480 rows (I arrived at this number by interval halving). However, the exception does not repeat itself reliably: sometimes the DataTable can sort 12,646,480 rows and sometimes it can't. Above 12,646,480 rows, the probability of the DataTable throwing an exception quickly rises, and below it, it quickly decreases. I really wonder what this number of rows is related to. The number doesn't resemble any power of 2, and I tried logarithms of base 2 to 100 with no luck, too.

Michal
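One explanation consistent with a non-round, intermittent threshold (an assumption, not something the thread confirms): growable .NET containers such as ArrayList, and the index arrays behind a DataView sort, grow by doubling, and each doubling needs a fresh contiguous allocation while the old array is still alive. 12,646,480 sits between 2^23 and 2^24, so the failing step is the jump to a roughly 16-million-slot array; whether that single contiguous allocation succeeds in a fragmented 32-bit address space can vary from run to run. A small Python check of the arithmetic:

```python
def next_capacity(n, initial=4):
    """Smallest capacity in the doubling sequence (4, 8, 16, ...)
    that can hold n items - models ArrayList-style growth."""
    cap = initial
    while cap < n:
        cap *= 2
    return cap

rows = 12_646_480
# the observed threshold falls strictly between two doublings,
# which is why it is not itself a round power of two
assert 2**23 < rows < 2**24
print(next_capacity(rows))  # → 16777216, the array the resize must allocate
```

So the "magic" number need not relate to anything by a clean formula; it is wherever that one big allocation starts failing on that particular box.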
#10 Thomas Krojer wrote:

How fast is the SQL Server solution (import, sort, export)?
#11 michal kreslik wrote:

Obviously, the SQL-based solution is much slower, as it stores the data on disk as opposed to working directly in memory. Importing the data is very slow (0.9 ms per row) compared to the DataTable, while sorting is lightning fast. However, I can accomplish the task with SQL, which can't be said of the DataTable-oriented solution.

Michal
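Much of a per-row cost like 0.9 ms is usually statement overhead rather than disk speed; batching rows inside a single transaction (or, in .NET 2.0+, using SqlBulkCopy) typically cuts import time dramatically. A sketch of the batching idea, using Python's built-in sqlite3 as a stand-in database (the `ticks` schema and `bulk_load` helper are invented for illustration):

```python
import sqlite3

def bulk_load(rows, db_path=":memory:", batch=10_000):
    """Insert (timestamp, price) rows in large batches inside one
    transaction; per-row statements and per-row commits are what
    make naive imports slow."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS ticks (ts TEXT, price REAL)")
    buf = []
    with con:  # one transaction for the whole load
        for row in rows:
            buf.append(row)
            if len(buf) >= batch:
                con.executemany("INSERT INTO ticks VALUES (?, ?)", buf)
                buf.clear()
        if buf:
            con.executemany("INSERT INTO ticks VALUES (?, ?)", buf)
    return con
```

The same shape applies to SQL Server: accumulate rows and hand them over in bulk instead of one INSERT round trip per tick.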
#12 Thomas Krojer wrote:

A few years ago, I wrote a sort routine for sorting a BIG number of records using the "insertion sort" algorithm (I'm confused about the naming of the algorithm; maybe it was called "insertion sort" only in this one book). The main idea: for fixed-length records with a known lower and upper key (which you know after the first read cycle), it's possible to sort the file with only 2 read cycles and 1 write cycle. If you need more, I'll post something.
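What Thomas describes sounds less like textbook insertion sort and more like a counting (distribution) sort: one read pass builds a histogram of the keys, prefix sums turn the histogram into output offsets, and a second read pass writes each record straight to its final position. An in-memory sketch in Python (an assumption about his method; for a file of fixed-length records, `out` would be a preallocated file written with seeks):

```python
def distribution_sort(records, key, key_range):
    """Two read passes plus one write pass, for keys known to lie
    in range(key_range)."""
    counts = [0] * key_range
    for r in records:             # first read pass: histogram
        counts[key(r)] += 1
    offsets = [0] * key_range     # prefix sums -> start offset per key
    total = 0
    for k in range(key_range):
        offsets[k] = total
        total += counts[k]
    out = [None] * total
    for r in records:             # second read pass doubles as write pass
        k = key(r)
        out[offsets[k]] = r       # place record at its final position
        offsets[k] += 1
    return out
```

Because records are placed in input order within each key, the sort is stable, which is exactly what the FileID/RowNumber tiebreak earlier in the thread needs.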
#13 michal kreslik wrote:

Please go ahead and post more. I have been working with a huge SQL database of Forex price ticks for almost a year now. By now, it consists of about 270 million rows. Every fresh idea on how to help with the pre-processing of the data before importing it into the SQL database is warmly welcome! :)

Thanks, Michal
#14 Thomas Krojer wrote:

Sorry for the delay, I was in heavy trouble, so I had no time... Please post a snippet of the data file; I'll implement this insertion sort and post it.
#15 michal kreslik wrote:

Hi, Thomas, I've resolved the issue in the meantime. Thanks for your help, Michal
#16 Thomas Krojer wrote:

Sorry again, but how did you manage it? How is the performance?

Greetings, Thomas