
Merging sorted data blocks.

Algorithms
Tags: question, data-structures, performance, tutorial, lounge
17 Posts, 6 Posters
Member 4194593 wrote:

Stefan, I just read the CodeProject "Daily News" for today. Interesting article, "What would Feynman do?". It seems that you are suggesting a Feynman solution. I am not talking about a 6 core, 16 GB server with a farm of hundreds of drives; I am talking about my simple PC with 4 GB of memory and 2 hard drives running Windows XP. I want a utility string sort solution. You say "several orders of magnitude better!" I do not think this is very accurate.

I'll tell you what I will do. I will supply you with an executable and a small template file. The execution of the executable will create all of my test files (5 files, 6 GB each, the same files that I am using). You can then write your version of a single program to sort these files (case sensitive), then execute Windows sort.exe for each file to create a baseline time on your machine, then execute your program for each file and report all of your timing statistics here. You have an advantage in that you already have my current times and you can refine your algorithm until you get better times. Note that it is not the time for any single test, or the overall time of all of the tests, that matters; it is the percentage of improvement over the baseline sort.exe times that is important. This is done to even out timing differences between different machines. I just do not believe several orders of magnitude. Interested in trying?

Note: I currently do not have a complete solution. I still need the final merge step for the last two files; that is what I was trying to get suggestions about in this thread. I will complete the program in some way and post the final results for the final 2 files. Dave.
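For concreteness, the comparison proposed here normalizes each machine against its own sort.exe baseline, so only the relative improvement matters. A tiny sketch of that arithmetic in C++ (not part of Dave's tooling; the numbers are made up):

    #include <cstdio>

    // Improvement over the baseline, as a percentage of the baseline time.
    // Example: baseline 600 s, custom sort 200 s -> (600 - 200) / 600 = 66.7%.
    static double improvement_pct(double baseline_secs, double custom_secs) {
        return 100.0 * (baseline_secs - custom_secs) / baseline_secs;
    }

    int main() {
        std::printf("%.1f%%\n", improvement_pct(600.0, 200.0));  // prints 66.7%
        return 0;
    }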

Stefan_Lang wrote (#8):

Sorry, 'several orders of magnitude' was probably overstating it. I've been thinking of one particular case where a database solution outperformed a file-based solution by a factor of ~15-20. But when I think about it, part of the acceleration may have been due to the algorithms being used, so 'one order of magnitude' might be closer to the real thing.

I've also thought of another case with a 120 GB database of geographical data that turned out to be so fast that my algorithms to dig some geometrical properties out of a given (and comparatively small) result set were much more time-critical than the two-dimensional search to find the data in a particular geographical region. On a side note, that was about 15 years ago, so I doubt the database server was all that much faster than your current PC. In both cases I was just the 'user' of the data, though - I am not a database specialist. I just wanted to point out that your opinion of databases might be premature, not that I could write down a solution in 5 minutes. Anyway, if you want me to try you'd have to wait for a month or two, as the single hard disk in my 5-year-old PC has barely enough room to fit one of those 6 GB files; forget about 5. X|

Member 4194593 wrote:

What is the best (by that I mean the fastest) way to merge some smaller files of sorted data into a larger file, and then later merge the larger files into a single sorted file? Some particulars:

Input: text files, each 6 GB, short records of 500,000,000 10-digit numbers in three formats (sequential ascending, sequential descending, and random selection order), and long records with null strings to 65533-byte strings in random order, first with many duplicate strings and then as unique strings where the last 10 digits of each line are replaced with one of the random short strings. A total of 5 different files for 5 different tests.

Memory is allocated with VirtualAlloc (3 GB enabled in boot.ini and in the Link line, and 4 GB of memory in the system); memory blocks of 2 GB, 1 GB, and 47 MB, plus 2 more small blocks, are allocated.

The first step does the initial sort. The 2 GB buffer is used as an input buffer, the 1 GB buffer is used for building mod-16 aligned and sized strings and for the nodes for a RedBlackTree, and the 47 MB buffer is used to hold pointers to the strings for the sort tree and for the records moved after the sort. For the short records, this results in about 211 sorted blocks written to a file. This step is complete for now.

The next step combines the 47 MB sorted blocks into a big sort block (2 GB) using the 1 GB buffer for tree nodes; 28 sort blocks are combined into each big block (limited by the node space available - 16 bytes for each record in the tree), and there are 8 big sort blocks written to a file. This step is complete for now, but could be changed if there is a better merge solution.

The final step needs to merge these big blocks into a single output file, but it has to be done in a fragmented mode because there is not enough memory to hold the files. The 2 GB buffer is split into pieces (8 in this case for the 8 big sort blocks). The header from each sort block, the initial pointers from each sort block, and the initial associated records from each sort block are read into the 8 pieces of the 2 GB buffer (with an initial header pointing to each big sort block). Essentially you need to sort the big blocks in ascending sequence by the first record in each big block (how to do this efficiently?), then select and output all the records in the first big block that are lower than the first record in the second big block (how to do this efficiently?), then unlink the first buffer and re-link it with the other set of buffers in ascending sequence (how to do this efficiently?)
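The final step described above is the classic k-way merge problem: keep one cursor per big sort block and repeatedly emit the smallest front record. A min-heap (priority queue) keyed on each block's current record answers all three "how to do this efficiently?" questions with one structure, at roughly N log k comparisons for N records and k blocks. Below is a minimal sketch in C++ rather than the MASM the program actually uses; the per-block files, line-oriented records, and file names are assumptions for illustration only, and stream buffering stands in for the hand-managed slices of the 2 GB buffer.

    // A minimal k-way merge sketch (not the poster's code).
    // Assumes each "big sort block" was written as its own file of
    // newline-terminated records in sorted order; file names are hypothetical.
    #include <fstream>
    #include <queue>
    #include <string>
    #include <vector>

    struct Cursor {
        std::string line;   // current front record of this block
        size_t      block;  // index of the block the record came from
    };

    // Order the heap so the smallest record is on top
    // (std::priority_queue is a max-heap by default).
    struct Greater {
        bool operator()(const Cursor& a, const Cursor& b) const { return a.line > b.line; }
    };

    int main() {
        std::vector<std::string> names = {"big0.txt", "big1.txt", "big2.txt", "big3.txt",
                                          "big4.txt", "big5.txt", "big6.txt", "big7.txt"};
        std::vector<std::ifstream> blocks;
        std::priority_queue<Cursor, std::vector<Cursor>, Greater> heap;

        // Prime the heap with the first record of each block.
        for (size_t i = 0; i < names.size(); ++i) {
            blocks.emplace_back(names[i]);
            std::string line;
            if (std::getline(blocks[i], line)) heap.push({line, i});
        }

        std::ofstream out("merged.txt");
        // Repeatedly emit the globally smallest record and advance that block.
        while (!heap.empty()) {
            Cursor c = heap.top();
            heap.pop();
            out << c.line << '\n';
            std::string next;
            if (std::getline(blocks[c.block], next)) heap.push({next, c.block});
        }
        return 0;
    }

A production version would read large aligned chunks into each block's slice of the 2 GB buffer and refill a slice only when its records are exhausted, but the heap logic stays the same.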

musefan wrote (#9):

Just out of curiosity, why do you need to sort such large amounts of data? Is somebody planning on reading these files?

Illogical thoughts make me ill

Replying to musefan (#9 above):

Member 4194593 wrote (#10):

musefan, Yes, someone will read that data. I had originally created a program to randomize a list. I wanted to verify that I had selected each entry in the original list once and only once. A sort would re-order the randomized list; then a simple file compare (fc.exe) of the original file and the sorted file could verify the correctness of the randomization. Sort.exe was very slow (compared to the randomizing), so I went down that rabbit hole to develop a faster sort, and got lost in a maze of twisty little tunnels..... (what an Adventure!) Another method would have been to bitmap the randomized records and see that they did not overlap, but this would work only for numeric strings, and the randomize program worked with any string data. To bitmap that would require a string sort anyway to equate a string with an entry number. Dave.
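For what it's worth, the bitmap check mentioned above can be sketched like this. It is not Dave's code; it assumes the randomized file simply contains the integers 0..N-1, one per line, which is the only case where a bitmap works without first equating strings to entry numbers.

    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        const uint64_t N = 500000000ULL;    // number of records expected
        std::vector<bool> seen(N, false);   // one bit per possible record (~60 MB)

        std::ifstream in("randomized.txt"); // hypothetical file of integers 0..N-1
        std::string line;
        uint64_t count = 0;
        while (std::getline(in, line)) {
            uint64_t v = std::stoull(line);
            if (v >= N || seen[v]) {        // out of range, or already seen: not a permutation
                std::cout << "not a permutation\n";
                return 1;
            }
            seen[v] = true;
            ++count;
        }
        std::cout << (count == N ? "every record appears exactly once\n"
                                 : "records are missing\n");
        return 0;
    }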

Replying to Stefan_Lang (#8 above):

Member 4194593 wrote (#11):

Stefan, I'd be happy to let you try; misery loves company! If you send me an email I'll respond with an attached zip of the create.exe file and the template file (includes a READ.ME and some testing batch files). To ease the file size problems, create the test files one at a time, then write the file to an external USB drive, test with sort.exe and with your sort program (writing the output files and any intermediate files to the external drive) and compare the results, then save the timing results, then delete the test file and the sorted files, then create the next test file. That should keep the files down to 18 GB. As long as the input and output files are on the same drives for both tests, the relative times should be a good indicator of the relative performance.

To create the files one at a time, create null files for each of the named test files (in the create.exe directory) and create.exe will skip all creations, then delete the file you want to test. Create.exe will create that file (since it is missing); test the file two ways, save the timing results, then replace the test file with a null file and delete the next file to test. OBTW, create.exe requires SSE. Dave.

Replying to Member 4194593's original question (quoted in full above):

Lost User wrote (#12):

Member 4194593 wrote:

Which way is best (overall) for merging, and why?

What Luc said, because he's right. You were competing with Sort.exe, which isn't intended to work with large files. Now, a database doesn't have any problems with sorting, because it needn't load the entire data set into memory. If you can manipulate the data, then you can add a "key" to each item that needs to be sorted. Next, you could build an index: fetch item from disk by key, compare, update index. Building an index on a database can be a relatively time-consuming process, but I imagine that it's faster than forcing Windows to swap a lot of virtual memory, pushing all other processes to the background. You could also partition your blocks; have the process that "generates" the first block only filter out the lines that start with an "A", the second with a "B".

I are Troll :suss:
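The partitioning idea at the end of this post amounts to a bucket pass: route each record to a bucket by its leading byte, sort each (much smaller) bucket independently, and concatenate the buckets in order. A rough C++ sketch of that idea follows - not a database and not the poster's code; the file names are invented, and it assumes each bucket fits in memory (heavily skewed data would need a second split).

    #include <algorithm>
    #include <fstream>
    #include <string>
    #include <vector>

    int main() {
        // Phase 1: split the input into 256 bucket files by the first byte of each
        // record, so every record in bucket i sorts before every record in bucket j
        // whenever i < j (byte-wise, case-sensitive ordering).
        std::vector<std::ofstream> buckets(256);
        for (int i = 0; i < 256; ++i)
            buckets[i].open("bucket" + std::to_string(i) + ".txt");

        std::ifstream in("input.txt");          // hypothetical unsorted input
        std::string line;
        while (std::getline(in, line)) {
            unsigned char key = line.empty() ? 0 : static_cast<unsigned char>(line[0]);
            buckets[key] << line << '\n';
        }
        for (auto& b : buckets) b.close();

        // Phase 2: sort each bucket in memory and append it to the output.
        std::ofstream out("sorted.txt");
        for (int i = 0; i < 256; ++i) {
            std::ifstream bin("bucket" + std::to_string(i) + ".txt");
            std::vector<std::string> lines;
            std::string s;
            while (std::getline(bin, s)) lines.push_back(s);
            std::sort(lines.begin(), lines.end());
            for (const auto& l : lines) out << l << '\n';
        }
        return 0;
    }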

Replying to Lost User (#12 above):

Member 4194593 wrote (#13):

Eddy, OK. You suggest using a database. What database do you suggest? If I had this database, would I need to program it or configure it in some way? Is it as easy to use as executing sort.exe? Don't get me wrong, I've been programming since '65; I am familiar with databases and their overhead. I still do not believe that any database solution would be as fast, as cheap, or as easy to use as something like sort.exe. My offer to you is the same as to Stefan_Lang above. Send me an email so I can reply with a zip of an executable (or, if you prefer, the source and a makefile to create the executable). This executable will create the 5 large files (6 GB each) that I was testing with. Write your own sort to sort these files, sort the same files with sort.exe, and compare the percentage by which your sort is faster than sort.exe for each file and for the overall total testing (comparing this way evens the playing field for CPU, memory, and disk speed differences between your system and mine). Dave.

Replying to Member 4194593 (#13 above):

Lost User wrote (#14):

Member 4194593 wrote:

You suggest using a database.

No, my suggestion would involve Hadoop and a lot of PS2's :)

Member 4194593 wrote:

What database do you suggest?

SQL Server if you can afford it, although it would be an expensive solution just for sorting. SQL Express would do too, I guess. You can only pack 4 GB into a database in the free version, but there's no problem in adding several databases to the server.

Member 4194593 wrote:

If I had this database, would I need to program it or configure it in some way?

Ideally, it'd use a bulk-import. You could also have it generate the index, but it'd be more fun to build such a thing by hand. The advantage would be that you can easily partition your data, even over multiple computers.

Member 4194593 wrote:

Is it as easy to use as executing sort.exe?

From the user's point of view, this would be easier, as the data simply appears to be already sorted once the index is applied.

Member 4194593 wrote:

Don't get me wrong, I've been programming since '65; I am familiar with databases and their overhead. I still do not believe that any database solution would be as fast, as cheap, or as easy to use as something like sort.exe.

There are free databases that are scalable and well-documented. Yes, I get your point - but most people won't have a dataset that's this large, so the use-case shouldn't be built around your average user :)

Member 4194593 wrote:

Write your own sort to sort these files, sort the same files with sort.exe, and compare the percentage by which your sort is faster than sort.exe for each file and for the overall total testing (comparing this way evens the playing field for CPU, memory, and disk speed differences between your system and mine).

My gratitude for the challenge, but I have neither the time nor the hardware. Have you considered writing a CodeProject article on the subject?

I are Troll :suss:

Replying to Lost User (#14 above):

Member 4194593 wrote (#15):

Eddy, Thank you for the detailed reply. I think I'll stick to what I know. About writing an article - I never considered it. No one here on CodeProject would like my code, or what I was trying to do, or the way I was doing it; they wouldn't like it on so many different levels that I wouldn't even try to enumerate them. Dave.

Replying to Member 4194593 (#15 above):

Lost User wrote (#16):

Member 4194593 wrote:

No one here on CodeProject would like my code

That's part of the job; when someone comes along and sings "try this", we'll cut it up and look at it from all sides to see whether or not it's better than our old habits. Now, I might not like the use-case, but that doesn't mean it's not an interesting topic to a lot of people (otherwise you'd have 0 replies)... and sometimes it's more interesting how you got a result - keep in mind that most people have problems with pointers :)

I are Troll :suss:

Replying to Lost User (#16 above):

Member 4194593 wrote (#17):

Eddy,

1) My program is a console app.
2) It is totally MASM (except API calls for I/O).
3) It honors the Intel ABI, but in addition it saves and restores all registers around calls and saves flags around all API calls.
4) It uses mainly register calling sequences, not calling arguments (I think that is correct - from my ancient Fortran AutoCoder days, "Parameters" are what you expect, "Arguments" are what you get).
5) It returns no values, only zero or non-zero flags - any returned values are saved in fixed locations.
6) It uses no call stack frame - ebp is used as a general-purpose register.
7) It has tons of commentary to explain What I did, so I can pick up the program 20 years from now and understand Why I did what I did.
8) It uses many levels of conditional assembly to do validations and/or timing displays, not checking option flags and branching around unused code - the unused code is not even assembled.
9) There are conditional code segments in the RedBlackTree implementation to validate correctness and to display the tree structure (a short 31-element tree build).
10) The conditional code snippets save and restore any used registers to prevent any side effects.
11) It does not use anonymous symbols; instead, it uses lots of meaningful long labels everywhere, including places where a label is not needed but separates code groups.
12) Need I say more? Nobody will like it here.

Dave.
