
Merging sorted data blocks.

Algorithms
Tags: question, data-structures, performance, tutorial, lounge
17 Posts, 6 Posters
Member 4194593 wrote:

Stefan, I just read the CodeProject "Daily News" for today. Interesting article, "What would Feynman do?". It seems that you are suggesting a Feynman solution. I am not talking about a 6 core, 16 GB server with a farm of hundreds of drives; I am talking about my simple PC with 4 GB of memory and 2 hard drives running Windows XP. I want a utility string sort solution. You say "several orders of magnitude better!" I do not think this is very accurate.

I'll tell you what I will do. I will supply you with an executable and a small template file. The execution of the executable will create all of my test files (5 files, 6 GB each, the same files that I am using). You can then write your version of a single program to sort these files (case sensitive), then execute Windows sort.exe for each file to create a baseline time on your machine, then execute your program for each file and report all of your timing statistics here. You have an advantage in that you already have my current times and you can refine your algorithm until you get better times. Note that it is not the time for any single test, or the overall time of all of the tests, that matters; it is the percentage of improvement over the baseline sort.exe times that is important. This is done to even out timing differences between different machines. I just do not believe several orders of magnitude. Interested in trying?

Note: I currently do not have a complete solution. I still need the final merge step for the last two files; that is what I was trying to get suggestions about in this thread. I will complete the program in some way and post the final results for the final 2 files. Dave.
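For concreteness, the comparison proposed here normalizes each machine against its own sort.exe baseline, so only the relative improvement matters. A tiny sketch of that arithmetic in C++ (not part of Dave's tooling; the numbers are made up):

    #include <cstdio>

    // Improvement over the baseline, as a percentage of the baseline time.
    // Example: baseline 600 s, custom sort 200 s -> (600 - 200) / 600 = 66.7%.
    static double improvement_pct(double baseline_secs, double custom_secs) {
        return 100.0 * (baseline_secs - custom_secs) / baseline_secs;
    }

    int main() {
        std::printf("%.1f%%\n", improvement_pct(600.0, 200.0));  // prints 66.7%
        return 0;
    }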

Stefan_Lang wrote (#8):

Sorry, 'several orders of magnitude' was probably overstating it. I've been thinking of one particular case where a database solution outperformed a file-based solution by a factor of ~15-20. But when I think about it, part of the acceleration may have been due to the algorithms being used, so 'one order of magnitude' might be closer to the real thing.

I've also thought of another case with a 120 GB database of geographical data that turned out to be so fast that my algorithms to dig some geometrical properties out of a given (and comparatively small) result set were much more time-critical than the two-dimensional search to find the data in a particular geographical region. On a side note, that was about 15 years ago, so I doubt the database server was all that much faster than your current PC. In both cases I was just the 'user' of the data, though - I am not a database specialist. I just wanted to point out that your opinion of databases might be premature, not that I could write down a solution in 5 minutes. Anyway, if you want me to try you'd have to wait for a month or two, as the single hard disk in my 5-year-old PC has barely enough room to fit one of those 6 GB files; forget about 5. X|

Member 4194593 wrote:

What is the best (by that I mean the fastest) way to merge some smaller files of sorted data into a larger file, and then later merge the larger files into a single sorted file? Some particulars:

Input: text files, each 6 GB, short records of 500,000,000 10-digit numbers in three formats (sequential ascending, sequential descending, and random selection order), and long records with null strings to 65533-byte strings in random order, first with many duplicate strings and then as unique strings where the last 10 digits of each line are replaced with one of the random short strings. A total of 5 different files for 5 different tests.

Memory is allocated with VirtualAlloc (3 GB enabled in boot.ini and in the Link line, and 4 GB of memory in the system); memory blocks of 2 GB, 1 GB, and 47 MB, plus 2 more small blocks, are allocated.

The first step does the initial sort. The 2 GB buffer is used as an input buffer, the 1 GB buffer is used for building mod-16 aligned and sized strings and for the nodes for a RedBlackTree, and the 47 MB buffer is used to hold pointers to the strings for the sort tree and for the records moved after the sort. For the short records, this results in about 211 sorted blocks written to a file. This step is complete for now.

The next step combines the 47 MB sorted blocks into a big sort block (2 GB) using the 1 GB buffer for tree nodes; 28 sort blocks are combined into each big block (limited by the node space available - 16 bytes for each record in the tree), and there are 8 big sort blocks written to a file. This step is complete for now, but could be changed if there is a better merge solution.

The final step needs to merge these big blocks into a single output file, but it has to be done in a fragmented mode because there is not enough memory to hold the files. The 2 GB buffer is split into pieces (8 in this case for the 8 big sort blocks). The header from each sort block, the initial pointers from each sort block, and the initial associated records from each sort block are read into the 8 pieces of the 2 GB buffer (with an initial header pointing to each big sort block). Essentially you need to sort the big blocks in ascending sequence by the first record in each big block (how to do this efficiently?), then select and output all the records in the first big block that are lower than the first record in the second big block (how to do this efficiently?), then unlink the first buffer and re-link it with the other set of buffers in ascending sequence (how to do this efficiently?)
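The final step described above is the classic k-way merge problem: keep one cursor per big sort block and repeatedly emit the smallest front record. A min-heap (priority queue) keyed on each block's current record answers all three "how to do this efficiently?" questions with one structure, at roughly N log k comparisons for N records and k blocks. Below is a minimal sketch in C++ rather than the MASM the program actually uses; the per-block files, line-oriented records, and file names are assumptions for illustration only, and stream buffering stands in for the hand-managed slices of the 2 GB buffer.

    // A minimal k-way merge sketch (not the poster's code).
    // Assumes each "big sort block" was written as its own file of
    // newline-terminated records in sorted order; file names are hypothetical.
    #include <fstream>
    #include <queue>
    #include <string>
    #include <vector>

    struct Cursor {
        std::string line;   // current front record of this block
        size_t      block;  // index of the block the record came from
    };

    // Order the heap so the smallest record is on top
    // (std::priority_queue is a max-heap by default).
    struct Greater {
        bool operator()(const Cursor& a, const Cursor& b) const { return a.line > b.line; }
    };

    int main() {
        std::vector<std::string> names = {"big0.txt", "big1.txt", "big2.txt", "big3.txt",
                                          "big4.txt", "big5.txt", "big6.txt", "big7.txt"};
        std::vector<std::ifstream> blocks;
        std::priority_queue<Cursor, std::vector<Cursor>, Greater> heap;

        // Prime the heap with the first record of each block.
        for (size_t i = 0; i < names.size(); ++i) {
            blocks.emplace_back(names[i]);
            std::string line;
            if (std::getline(blocks[i], line)) heap.push({line, i});
        }

        std::ofstream out("merged.txt");
        // Repeatedly emit the globally smallest record and advance that block.
        while (!heap.empty()) {
            Cursor c = heap.top();
            heap.pop();
            out << c.line << '\n';
            std::string next;
            if (std::getline(blocks[c.block], next)) heap.push({next, c.block});
        }
        return 0;
    }

A production version would read large aligned chunks into each block's slice of the 2 GB buffer and refill a slice only when its records are exhausted, but the heap logic stays the same.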

musefan wrote (#9):

Just out of curiosity, why do you need to sort such large amounts of data? Is somebody planning on reading these files?

Illogical thoughts make me ill

Replying to musefan (#9 above):

Member 4194593 wrote (#10):

musefan, Yes, someone will read that data. I had originally created a program to randomize a list. I wanted to verify that I had selected each entry in the original list once and only once. A sort would re-order the randomized list; then a simple file compare (fc.exe) of the original file and the sorted file could verify the correctness of the randomization. Sort.exe was very slow (compared to the randomizing), so I went down that rabbit hole to develop a faster sort, and got lost in a maze of twisty little tunnels..... (what an Adventure!) Another method would have been to bitmap the randomized records and see that they did not overlap, but this would work only for numeric strings, and the randomize program worked with any string data. To bitmap that would require a string sort anyway to equate a string with an entry number. Dave.
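For what it's worth, the bitmap check mentioned above can be sketched like this. It is not Dave's code; it assumes the randomized file simply contains the integers 0..N-1, one per line, which is the only case where a bitmap works without first equating strings to entry numbers.

    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        const uint64_t N = 500000000ULL;    // number of records expected
        std::vector<bool> seen(N, false);   // one bit per possible record (~60 MB)

        std::ifstream in("randomized.txt"); // hypothetical file of integers 0..N-1
        std::string line;
        uint64_t count = 0;
        while (std::getline(in, line)) {
            uint64_t v = std::stoull(line);
            if (v >= N || seen[v]) {        // out of range, or already seen: not a permutation
                std::cout << "not a permutation\n";
                return 1;
            }
            seen[v] = true;
            ++count;
        }
        std::cout << (count == N ? "every record appears exactly once\n"
                                 : "records are missing\n");
        return 0;
    }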

Replying to Stefan_Lang (#8 above):

Member 4194593 wrote (#11):

Stefan, I'd be happy to let you try; misery loves company! If you send me an email I'll respond with an attached zip of the create.exe file and the template file (includes a READ.ME and some testing batch files). To ease the file size problems, create the test files one at a time, then write the file to an external USB drive, test with sort.exe and with your sort program (writing the output files and any intermediate files to the external drive) and compare the results, then save the timing results, then delete the test file and the sorted files, then create the next test file. That should keep the files down to 18 GB. As long as the input and output files are on the same drives for both tests, the relative times should be a good indicator of the relative performance.

To create the files one at a time, create null files for each of the named test files (in the create.exe directory) and create.exe will skip all creations, then delete the file you want to test. Create.exe will create that file (since it is missing); test the file two ways, save the timing results, then replace the test file with a null file and delete the next file to test. OBTW, create.exe requires SSE. Dave.

Replying to Member 4194593's original question (quoted in full above):

Lost User wrote (#12):

Member 4194593 wrote:

Which way is best (overall) for merging, and why?

What Luc said, because he's right. You were competing with Sort.exe, which isn't intended to work with large files. Now, a database doesn't have any problems with sorting, because it needn't load the entire data set into memory. If you can manipulate the data, then you can add a "key" to each item that needs to be sorted. Next, you could build an index: fetch item from disk by key, compare, update index. Building an index on a database can be a relatively time-consuming process, but I imagine that it's faster than forcing Windows to swap a lot of virtual memory, pushing all other processes to the background. You could also partition your blocks; have the process that "generates" the first block only filter out the lines that start with an "A", the second with a "B".

I are Troll :suss:
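The partitioning idea at the end of this post amounts to a bucket pass: route each record to a bucket by its leading byte, sort each (much smaller) bucket independently, and concatenate the buckets in order. A rough C++ sketch of that idea follows - not a database and not the poster's code; the file names are invented, and it assumes each bucket fits in memory (heavily skewed data would need a second split).

    #include <algorithm>
    #include <fstream>
    #include <string>
    #include <vector>

    int main() {
        // Phase 1: split the input into 256 bucket files by the first byte of each
        // record, so every record in bucket i sorts before every record in bucket j
        // whenever i < j (byte-wise, case-sensitive ordering).
        std::vector<std::ofstream> buckets(256);
        for (int i = 0; i < 256; ++i)
            buckets[i].open("bucket" + std::to_string(i) + ".txt");

        std::ifstream in("input.txt");          // hypothetical unsorted input
        std::string line;
        while (std::getline(in, line)) {
            unsigned char key = line.empty() ? 0 : static_cast<unsigned char>(line[0]);
            buckets[key] << line << '\n';
        }
        for (auto& b : buckets) b.close();

        // Phase 2: sort each bucket in memory and append it to the output.
        std::ofstream out("sorted.txt");
        for (int i = 0; i < 256; ++i) {
            std::ifstream bin("bucket" + std::to_string(i) + ".txt");
            std::vector<std::string> lines;
            std::string s;
            while (std::getline(bin, s)) lines.push_back(s);
            std::sort(lines.begin(), lines.end());
            for (const auto& l : lines) out << l << '\n';
        }
        return 0;
    }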

Replying to Lost User (#12 above):

Member 4194593 wrote (#13):

Eddy, OK. You suggest using a database. What database do you suggest? If I had this database, would I need to program it or configure it in some way? Is it as easy to use as executing sort.exe? Don't get me wrong, I've been programming since '65; I am familiar with databases and their overhead. I still do not believe that any database solution would be as fast, as cheap, or as easy to use as something like sort.exe. My offer to you is the same as to Stefan_Lang above. Send me an email so I can reply with a zip of an executable (or, if you prefer, the source and a makefile to create the executable). This executable will create the 5 large files (6 GB each) that I was testing with. Write your own sort to sort these files, sort the same files with sort.exe, and compare the percentage by which your sort is faster than sort.exe for each file and for the overall total testing (comparing this way evens the playing field for CPU, memory, and disk speed differences between your system and mine). Dave.

Replying to Member 4194593 (#13 above):

Lost User wrote (#14):

Member 4194593 wrote:

You suggest using a database.

No, my suggestion would involve Hadoop and a lot of PS2's :)

Member 4194593 wrote:

What database do you suggest?

SQL Server if you can afford it, although it would be an expensive solution just for sorting. SQL Express would do too, I guess. You can only pack 4 GB into a database in the free version, but there's no problem in adding several databases to the server.

Member 4194593 wrote:

If I had this database, would I need to program it or configure it in some way?

Ideally, it'd use a bulk-import. You could also have it generate the index, but it'd be more fun to build such a thing by hand. The advantage would be that you can easily partition your data, even over multiple computers.

Member 4194593 wrote:

Is it as easy to use as executing sort.exe?

From the user's point of view, this would be easier, as the data simply appears to be already sorted once the index is applied.

Member 4194593 wrote:

Don't get me wrong, I've been programming since '65; I am familiar with databases and their overhead. I still do not believe that any database solution would be as fast, as cheap, or as easy to use as something like sort.exe.

There are free databases that are scalable and well-documented. Yes, I get your point - but most people won't have a dataset that's this large, so the use-case shouldn't be built around your average user :)

Member 4194593 wrote:

Write your own sort to sort these files, sort the same files with sort.exe, and compare the percentage by which your sort is faster than sort.exe for each file and for the overall total testing (comparing this way evens the playing field for CPU, memory, and disk speed differences between your system and mine).

My gratitude for the challenge, but I have neither the time nor the hardware. Have you considered writing a CodeProject article on the subject?

I are Troll :suss:

Replying to Lost User (#14 above):

Member 4194593 wrote (#15):

Eddy, Thank you for the detailed reply. I think I'll stick to what I know. About writing an article - I never considered it. No one here on CodeProject would like my code, or what I was trying to do, or the way I was doing it; they wouldn't like it on so many different levels that I wouldn't even try to enumerate them. Dave.

Replying to Member 4194593 (#15 above):

Lost User wrote (#16):

Member 4194593 wrote:

No one here on CodeProject would like my code

That's part of the job; when someone comes along and sings "try this", we'll cut it up and look at it from all sides to see whether or not it's better than our old habits. Now, I might not like the use-case, but that doesn't mean it's not an interesting topic to a lot of people (otherwise you'd have 0 replies)... and sometimes it's more interesting how you got a result - keep in mind that most people have problems with pointers :)

I are Troll :suss:

Replying to Lost User (#16 above):

Member 4194593 wrote (#17):

Eddy,

1) My program is a console app.
2) It is totally MASM (except API calls for I/O).
3) It honors the Intel ABI, but in addition it saves and restores all registers around calls and saves flags around all API calls.
4) It uses mainly register calling sequences, not calling arguments (I think that is correct - from my ancient Fortran AutoCoder days, "Parameters" are what you expect, "Arguments" are what you get).
5) It returns no values, only zero or non-zero flags - any returned values are saved in fixed locations.
6) It uses no call stack frame - ebp is used as a general-purpose register.
7) It has tons of commentary to explain What I did, so I can pick up the program 20 years from now and understand Why I did what I did.
8) It uses many levels of conditional assembly to do validations and/or timing displays, not checking option flags and branching around unused code - the unused code is not even assembled.
9) There are conditional code segments in the RedBlackTree implementation to validate correctness and to display the tree structure (a short 31-element tree build).
10) The conditional code snippets save and restore any used registers to prevent any side effects.
11) It does not use anonymous symbols; instead, it uses lots of meaningful long labels everywhere, including places where a label is not needed but separates code groups.
12) Need I say more? Nobody will like it here.

Dave.
