How to sort a big volume of data?
-
Sure, I'll be thankful for any help. Basically, I've got txt files that contain information about Forex price ticks. Each row contains the information I'm interested in: date/time and price.

The txt files overlap in such a way that the ending part of one file contains exactly the same ticks as the beginning part of the next file (this was done during the preceding export to ensure that no price ticks were omitted). Obviously, this applies to neither the very first nor the very last file. The price ticks within a single file are guaranteed to be unique.

To get a continuous data stream that contains only unique ticks (rows), the simple solution would be to load all rows from all files and sort them on: DateTime ASC, FileID ASC, RowNumber ASC. Then I could walk through the sorted rows and remove those that have the same DateTime as the last valid row but a new FileID. In other words, only ticks with the same FileID may share a DateTime, which ensures there are no duplicates.

Unfortunately, the DataTable object throws an OutOfMemoryException on me if I attempt to DataTable.Select() such a big chunk of data (about 15 million rows). The same happens with DataTable.DefaultView.Sort. It works on a smaller data sample with no memory exceptions being thrown, though. Any ideas? Thanks much for any input! Michal
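The sort-then-dedup approach Michal describes can be sketched roughly like this (the Tick class and its field names are made up for illustration; they are not taken from the actual files):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical tick record; the field names are assumptions, not from the files.
class Tick
{
    public DateTime Time;
    public int FileId;
    public int RowNumber;
    public decimal Price;
}

static class TickDedup
{
    // Sort on (Time, FileId, RowNumber), then keep a tick only when its Time
    // differs from the last kept tick's Time, or it comes from the same file.
    public static List<Tick> SortAndDedup(List<Tick> ticks)
    {
        ticks.Sort(delegate(Tick a, Tick b)
        {
            int c = a.Time.CompareTo(b.Time);
            if (c == 0) c = a.FileId.CompareTo(b.FileId);
            if (c == 0) c = a.RowNumber.CompareTo(b.RowNumber);
            return c;
        });

        List<Tick> result = new List<Tick>();
        foreach (Tick t in ticks)
        {
            Tick last = result.Count > 0 ? result[result.Count - 1] : null;
            if (last != null && last.Time == t.Time && last.FileId != t.FileId)
                continue; // overlap duplicate re-exported by the next file
            result.Add(t);
        }
        return result;
    }
}
```

This keeps the dedup rule from the post: a tick is dropped only when it repeats the last kept DateTime from a different FileID.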
michal, I'm no whiz like the other maniacs here on the forum, but here are some ideas. While they take a few steps, they would work.

Option 1. Download SQL Server Express and load your data into a SQL Server table. Once there, you can do a SELECT statement and ORDER BY anything you want.

Option 2. Take the two pieces of data and combine them into a single string with a delimiter. Insert the strings into an ArrayList and then call ArrayList.Sort(). I tried this on my computer; it took 2 minutes 45 seconds and 1.5 GB of RAM to run... so if you have a powerful machine, this could work.

ArrayList al = new ArrayList();
for (int x = 0; x < 15000000; x++)
{
    al.Add(System.Guid.NewGuid().ToString());
}
MessageBox.Show("Finished loading.");
al.Sort();
MessageBox.Show("Finished Sort.");

Anyway, I hope this helps give you a direction. Personally, I would go with the SQL Server option as it is a bit more robust and will handle large amounts of data much better! Hogan
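For what it's worth, the same benchmark can be written with the generic List&lt;string&gt; instead of ArrayList. Strings are not boxed in either collection, so the memory footprint is similar, but Sort() skips the per-comparison casts ArrayList needs, and pre-sizing the list avoids repeated internal array reallocation while loading. A minimal sketch:

```csharp
using System;
using System.Collections.Generic;

static class GenericSortDemo
{
    // Same idea as the ArrayList snippet above, but generic and pre-sized.
    public static List<string> BuildAndSort(int count)
    {
        List<string> list = new List<string>(count); // avoid reallocation
        for (int i = 0; i < count; i++)
            list.Add(Guid.NewGuid().ToString());
        list.Sort(StringComparer.Ordinal); // ordinal compare is the cheapest
        return list;
    }
}
```

The ordinal comparer is a deliberate choice here: culture-aware string comparison is noticeably slower and unnecessary for delimiter-joined keys.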
-
Hello, Hogan, again. Well, the SQL solution is what I've been working on in the meantime while posting the question here, and it's probably going to be the best one. I was just curious whether I could do this in an easy way without SQL. Option 2: I was thinking about using the ArrayList too, but somehow I was so obsessed with the DataTable that I ruled this option out :) Thanks for your help! ANYWAY, why is the DataTable throwing the OutOfMemoryException? Is it only designed to handle small data samples? I seriously doubt it. Thanks, Michal
-
The data would have to be XML (or loaded into XML) to sort without a database for that much information. I was able to process 1,000,000 XML lines with System.Xml in just over 9 seconds in a benchmark I ran today for work, as another senior coder was wondering how my program using hundreds of XML files would perform...
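The way to get that kind of throughput out of System.Xml is the forward-only XmlReader, which streams the file instead of building a DOM, so memory use stays flat no matter how large the document is. A minimal sketch (the element name "tick" is just an example):

```csharp
using System.Xml;

static class XmlStreamDemo
{
    // Forward-only, read-only streaming: one node in memory at a time.
    public static int CountElements(XmlReader reader, string elementName)
    {
        int count = 0;
        while (reader.Read())
            if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                count++;
        return count;
    }
}
```

The same loop shape works for extracting attribute values row by row, rather than just counting.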
-Spacix All your skynet questions[^] belong to solved
-
That's impressive. I wouldn't even have thought of XML, as the format itself is practically a synonym for "SLOW" to me :) But still, I was not able to find any caveat from Microsoft concerning loading big chunks of data into a DataTable. The OutOfMemoryException occurred during normal operation; I've got 2 GB of RAM in my box with Win XP SP2, and the RAM in use was only about 1.4 GB at the time. So it was definitely not a lack of physical memory. So "there's something rotten in the state of DataTable"... :) What is it? Michal
-
Then my guess would be it is a permissions issue limiting the application...
-Spacix All your skynet questions[^] belong to solved
-
It's strange, as the DataTable throws an OutOfMemoryException if there are more than about 12,646,480 rows (I arrived at this number by interval halving). However, the exception does not repeat itself reliably: sometimes the DataTable can sort 12,646,480 rows and sometimes it can't. With more rows than 12,646,480 the certainty that the DataTable will throw quickly rises, and with fewer rows it quickly decreases. I REALLY wonder what this number of rows is related to. It doesn't resemble any power of 2, and I tried logarithms of base 2 through 100 with no luck, too. Michal
-
How fast is the SQL Server solution (import, sort, export)?
-
Obviously, the SQL-based solution is much slower, as it stores the data to disk as opposed to working directly in memory. Importing the data is very slow (0.9 ms per row) compared to the DataTable, while sorting is lightning fast. However, I can accomplish the task with SQL, which can't be said of the DataTable-oriented solution. Michal
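If the 0.9 ms per row comes from issuing one INSERT per row, SqlBulkCopy usually cuts import time dramatically by streaming the rows to SQL Server in batches. A sketch, not tested against a live server; the connection string and table name are placeholders:

```csharp
using System.Data;
using System.Data.SqlClient;

static class BulkImport
{
    // Bulk-load a pre-filled DataTable into SQL Server in streamed batches
    // instead of issuing one INSERT statement per row.
    public static void Load(DataTable ticks, string connectionString)
    {
        using (SqlBulkCopy bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.Ticks"; // placeholder table name
            bulk.BatchSize = 10000;                  // commit in chunks
            bulk.WriteToServer(ticks);
        }
    }
}
```

The DataTable passed in only needs to match the destination table's column layout; for very large imports, the overload taking an IDataReader avoids materializing everything in memory first.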
-
A few years ago, I wrote a sort routine for sorting a BIG number of records, using the "insertion sort" algorithm (I'm a bit confused about the naming; maybe it was called "insertion sort" only in that one book...). The main idea: for fixed-length records with a known lower and upper key (which you know after the first read cycle), it's possible to sort the file with only 2 read cycles and 1 write cycle. If you need more, I'll post something.
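What Thomas describes sounds like a counting (distribution) sort rather than insertion sort: one pass to count records per key, a prefix sum to turn counts into output offsets, then a second pass that writes each record straight to its final position; that is exactly 2 reads and 1 write. An in-memory sketch of the idea, assuming integer keys in a known range:

```csharp
static class CountingSort
{
    // Distribution sort for records with an integer key in [0, maxKey].
    // Two passes over the input (count, then place) and one pass of writes,
    // mirroring the 2-read/1-write file-based scheme described above.
    public static int[] Sort(int[] keys, int maxKey)
    {
        int[] counts = new int[maxKey + 2];
        foreach (int k in keys)                      // pass 1: histogram
            counts[k + 1]++;
        for (int i = 1; i < counts.Length; i++)      // prefix sums = start offsets
            counts[i] += counts[i - 1];

        int[] output = new int[keys.Length];
        foreach (int k in keys)                      // pass 2: place each record
            output[counts[k]++] = k;
        return output;
    }
}
```

On disk, the "write" pass seeks to `offset[key] * recordLength` in the output file, which is why fixed-length records are required.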
-
Please go ahead and post more. I have been working with a huge SQL database of Forex price ticks for almost a year now. By now, it consists of about 270 million rows. Every fresh idea on how to help with the pre-processing of the data before importing it into the SQL database is warmly welcome! :) Thanks, Michal
-
Sorry for the delay, I was in heavy trouble, so I had no time... Please post a snippet of the data file; I'll implement this insertion sort and post it.
-
Hi, Thomas, I've resolved the issue in the meantime. Thanks for the help, Michal
-
Sorry again; how did you manage it, and how is the performance? Greetings, Thomas