Code Project
How to sort a big volume of data?

  • michal kreslik wrote:

    Sure, I'll be thankful for any help. Basically, I've got txt files containing Forex price ticks. Each row holds the two fields I'm interested in:

    - date/time
    - price

    The txt files overlap in such a way that the ending part of one file contains exactly the same ticks as the beginning part of the next file (this was done to ensure that no price ticks were omitted during the preceding export). Obviously, this holds for neither the very first nor the very last file. The price ticks within a single file are guaranteed not to be duplicates.

    To get a continuous data stream containing only unique ticks (rows), the simple solution would be to load all rows from all files and sort them on: DateTime ASC, FileID ASC, RowNumber ASC. I could then walk the sorted rows and remove those with the same DateTime as the last valid DateTime but a new FileID. In other words, only ticks with the same FileID may share a DateTime, which ensures there are no duplicates.

    Unfortunately, the DataTable object throws an OutOfMemoryException when I attempt to DataTable.Select() such a big chunk of data (about 15 million rows). The same happens with DataTable.DefaultView.Sort. It works on a smaller data sample with no memory exceptions, though. Any ideas? Thanks much for any input! Michal
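Since each file is already in time order and overlaps only at its boundaries, the deduplication described above can be done as a streaming pass with no global sort at all. A minimal sketch, not code from the thread; the file names and the `timestamp;price` row format are assumptions:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

static class TickMerger
{
    const string Format = "yyyy-MM-dd HH:mm:ss.fff";  // assumed tick format

    // Each element of 'files' is one file's rows, already in time order.
    // For every file after the first, leading rows whose timestamp is not
    // newer than the last row kept so far are dropped -- that is exactly
    // the overlap region -- and everything after it is kept unchanged,
    // so equal timestamps within one file survive.
    public static List<string> Merge(IEnumerable<IEnumerable<string>> files)
    {
        var result = new List<string>();
        DateTime boundary = DateTime.MinValue;

        foreach (var rows in files)
        {
            bool pastOverlap = false;
            DateTime last = boundary;

            foreach (string row in rows)
            {
                DateTime t = DateTime.ParseExact(
                    row.Split(';')[0], Format, CultureInfo.InvariantCulture);

                if (!pastOverlap)
                {
                    if (t <= boundary) continue;   // still inside the overlap
                    pastOverlap = true;
                }
                result.Add(row);
                last = t;
            }
            boundary = last;
        }
        return result;
    }

    static void Main()
    {
        // Hypothetical file names; the real export naming is not known here.
        var files = new[] { "ticks1.txt", "ticks2.txt" }
            .Select(f => File.ReadLines(f));
        File.WriteAllLines("merged.txt", Merge(files));
    }
}
```

Because each file streams through once, peak memory is one row plus the output, regardless of the total tick count.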

    snorkie
    #4

    michal, I'm no whiz like the other maniacs here on the forum, but here are some ideas. While they take a few steps, they would work.

    Option 1: Download SQL Server Express and load your data into a SQL Server table. Once there, you can do a SELECT statement and ORDER BY anything you want.

    Option 2: Combine the two pieces of data into a single string with a delimiter, insert the strings into an ArrayList, and then call ArrayList.Sort(). I tried this on my computer; it took 2 minutes 45 seconds and 1.5 GB of RAM to run. So if you have a powerful machine, this could work.

    // Fill an ArrayList with 15 million strings, then sort it in place.
    ArrayList al = new ArrayList();
    for (int x = 0; x < 15000000; x++)
    {
        al.Add(System.Guid.NewGuid().ToString());
    }
    MessageBox.Show("Finished loading.");
    al.Sort();
    MessageBox.Show("Finished Sort.");

    Anyway, I hope this helps give you a direction. Personally, I would go with the SQL Server option, as it is a bit more robust and will handle large amounts of data much better! Hogan
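A lighter-weight variant of Option 2, sketched here as an alternative rather than snorkie's actual code: a typed List<T> of small structs avoids the per-element boxing and concatenated-string allocations of an ArrayList, so the same 15 million rows need far less memory. The Tick layout is an assumption based on the fields named in the question:

```csharp
using System;
using System.Collections.Generic;

struct Tick : IComparable<Tick>
{
    public DateTime Time;
    public int FileId;
    public int RowNumber;
    public decimal Price;

    // Sort key: DateTime ASC, FileID ASC, RowNumber ASC -- the order the
    // original question asks for. Including RowNumber makes the result
    // deterministic even though List<T>.Sort is not a stable sort.
    public int CompareTo(Tick other)
    {
        int c = Time.CompareTo(other.Time);
        if (c != 0) return c;
        c = FileId.CompareTo(other.FileId);
        return c != 0 ? c : RowNumber.CompareTo(other.RowNumber);
    }
}

class Program
{
    static void Main()
    {
        var ticks = new List<Tick>(15000000);   // pre-size to avoid regrowth copies
        // ... load ticks from the txt files here ...
        ticks.Sort();                           // in-place, no extra boxed copies
    }
}
```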


      michal kreslik
      #5

      Hello again, Hogan. The SQL solution is what I've been working on in the meantime while posting the question here, and it's probably going to be the best one. I was just curious whether I could do this in an easy way without SQL. Option 2: I was thinking about using the ArrayList too, but somehow I was so obsessed with the DataTable that I ruled this option out :) Thanks for your help! Anyway, why is the DataTable throwing the OutOfMemoryException? Is it only designed to handle small data samples? I seriously doubt it. Thanks, Michal


        Spacix One
        #6

        The data would have to be XML (or loaded into XML) to sort that much information without a database. I was able to process 1,000,000 XML lines with System.Xml in about 9 seconds in a benchmark I ran today for work, as another senior coder was wondering how my program using hundreds of XML files would perform...


        -Spacix All your skynet questions[^] belong to solved


          michal kreslik
          #7

          That's impressive. I wouldn't even think of XML, as the format itself is a kind of synonym for "SLOW" to me :) But still, I was not able to find any caveat from Microsoft about loading big chunks of data into a DataTable. The OutOfMemoryException occurred during normal operation; I've got 2 GB of RAM on my box with Win XP SP2, and only about 1.4 GB was in use at the time. So it was definitely not a lack of physical memory. So "there's something rotten in the state of DataTable"... :) What is it? Michal


            Spacix One
            #8

            Then my guess would be it is a permissions issue limiting the application...




              michal kreslik
              #9

              It's strange: the DataTable throws an OutOfMemoryException when there are more than about 12,646,480 rows (I arrived at this number by interval halving). However, the exception doesn't reproduce reliably; sometimes the DataTable can sort 12,646,480 rows and sometimes it can't. With more rows than that, the likelihood of the DataTable throwing the exception quickly rises, and with fewer rows it quickly decreases. I REALLY wonder what this number of rows relates to. It doesn't resemble any power of 2, and I tried logarithms of base 2 through 100 with no luck, too. Michal


                Thomas Krojer
                #10

                How fast is the SQL Server solution (import, sort, export)?


                  michal kreslik
                  #11

                  Obviously, the SQL-based solution is much slower, as it stores the data on disk as opposed to working directly in memory. Importing the data is very slow (0.9 ms per row) compared to the DataTable; sorting, however, is lightning fast. Still, I can accomplish the task with SQL, which can't be said of the DataTable-oriented solution. Michal
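Row-by-row INSERTs at around 0.9 ms each are the usual import bottleneck; SqlBulkCopy streams rows to the server in batches and is typically far faster. A sketch only: the connection string and the "Ticks" table name are assumptions, not details from the thread.

```csharp
using System.Data;
using System.Data.SqlClient;

class BulkLoader
{
    // Pushes an in-memory DataTable of ticks to SQL Server using the
    // bulk-load path instead of individual INSERT statements.
    public static void Load(DataTable ticks)
    {
        string conn = "Server=.;Database=Forex;Integrated Security=true"; // assumed
        using (var bulk = new SqlBulkCopy(conn))
        {
            bulk.DestinationTableName = "Ticks";   // assumed table name
            bulk.BatchSize = 50000;                // commit in chunks, not one big batch
            bulk.WriteToServer(ticks);
        }
    }
}
```

For data this size it may also be worth loading from the txt files directly with the `bcp` utility or `BULK INSERT`, skipping the DataTable entirely.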


                    Thomas Krojer
                    #12

                    A few years ago, I wrote a sort routine for sorting a BIG number of records using the "insertion sort" algorithm (I'm a bit confused about the naming; maybe it was called "insertion sort" only in that one book...). The main idea: for fixed-length records with a known lower and upper key (which you know after the first read cycle), it's possible to sort the file with only 2 read cycles and 1 write cycle. If you need more, I'll post something.
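The technique described, two reads and one write with a known key range and fixed-length records, reads like an external counting sort rather than insertion sort: pass one builds a histogram of keys, a prefix sum turns it into each key's starting offset, and pass two writes every record straight to its final slot. An in-memory sketch of the core idea under that interpretation (the record layout is assumed, not from the thread):

```csharp
using System;

static class CountingSort
{
    // keys[i] is record i's sort key, known to lie in [0, keyRange).
    // Returns the final slot of each record; with fixed-length records,
    // slot * recordLength is the byte offset for the single write pass.
    public static int[] SortedPositions(int[] keys, int keyRange)
    {
        var count = new int[keyRange + 1];
        foreach (int k in keys)
            count[k + 1]++;                         // read pass 1: histogram

        for (int i = 1; i <= keyRange; i++)
            count[i] += count[i - 1];               // prefix sum -> start offsets

        var position = new int[keys.Length];
        for (int i = 0; i < keys.Length; i++)
            position[i] = count[keys[i]]++;         // read pass 2: final slot
        return position;                            // stable: ties keep file order
    }
}
```

No comparisons are performed at all, which is why the cost stays linear in the number of records plus the key range.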


                      michal kreslik
                      #13

                      Please go ahead and post more. I have been working with a huge SQL database of Forex price ticks for almost a year now; by now, it consists of about 270 million rows. Every fresh idea on how to help with pre-processing the data before importing it into the SQL database is warmly welcome! :) Thanks, Michal


                        Thomas Krojer
                        #14

                        Sorry for the delay, I was in heavy trouble, so I had no time... Please post a snippet of the data file, and I'll implement this insertion sort and post it.


                          michal kreslik
                          #15

                          Hi, Thomas, I've resolved the issue in the meantime. Thanks for the help, Michal


                            Thomas Krojer
                            #16

                            Sorry again: how did you manage it? How is the performance? Greetings, Thomas
