Best coding practice.
I have a MASM macro that generates assembly code to implement a Quicksort. All of it is working, including support for embedded keys in records, signed/unsigned and big-endian/little-endian data, keys of BYTEs/WORDs/DWORDs/FLOATs/DOUBLEs/EXTENDED floats, null-terminated strings of BYTEs or WORDs (element count = 0), multiple elements for any data type (counts from 1 to 2048), and a secondary key of any data type, used as a tie-breaker for equal primary keys to make the Quicksort stable. I will add support for MMX and XMM keys later, but the current quest is to add the capability to declare Ascending or Descending independently for the primary key and the secondary key (one can be ascending and the other descending, or both the same).

I see three ways to do it: invert the conditional jumps and keep the comparison order the same; invert the comparison order and keep the conditional jumps the same; or swap the indices at entry and restore them at exit (XCHG) and keep all of the rest of the extensive code the same. Note that the source code is extensive, but the macro invocation conditionally assembles only what is required, so there is not a lot of run-time checking of what should be done next. Inserting the XCHGs for each comparison would add a little code and a few execution cycles, but the compare logic is the hottest code of the whole sort. Conditionally assembling one version for ascending order and a different version for descending order is basically a copy-and-paste with either the indices exchanged (no extra code or cycles) or the conditional jumps changed. But maintaining the two copies in parallel seems dangerous - fixes to one set of code would always have to be applied to the other set, and we all know how often that strategy fails - duplicate versions aren't!
So the question is: which way should I do this - swap the indices, swap the comparison order, or create two versions, one for ascending and the other for descending? Please feel free to jump in with both feet and kick up a little dust, but this is a Quicksort, so I don't want to hear about C++ or C# and overloaded operators with all of the bloat that goes along with an HLL.

Hmm, I just thought of a way to allow both the primary key and the secondary key to be null-terminated strings: the key at the lower offset has to end where the higher-offset key begins, and the higher-offset key has to end at the end of the record. More goodies to come.
Hi, Wow, quite a question. I'm not sure why you are doing all this in assembly, but whatever the language, this would be my approach: for each comparison, calculate a difference, which is negative if object1 < object2, zero if they are equal, and positive if object1 > object2. Then multiply that difference by a factor which is +1 for ascending and -1 for descending. The result indicates precede (negative), don't know (zero), or follow (positive). When you have multiple sort criteria, handle the primary one first; if that results in "don't know", continue with the next criterion. [EDITED because of HTML eater] :)
Luc Pattyn [Forum Guidelines] [My Articles]
- before you ask a question here, search CodeProject, then Google - the quality and detail of your question reflects on the effectiveness of the help you are likely to get - use the code block button (PRE tags) to preserve formatting when showing multi-line code snippets
modified on Tuesday, April 7, 2009 9:14 PM
Luc, Thank you for the reply. And here I was worried about adding two XCHGs, and you want me to put in a MUL? OMG! :)

Actually, you can get the same effect for descending by just changing the order of the compare arguments, i.e. compare object2 to object1. This is exactly what C++ does for an IF: it keeps your comparison (e.g. >=, generating a JAE or JGE depending on sign) but reverses the compared values and skips around the body of the IF when the condition fails. This, however, leaves the code quite different from the code for an ascending sort, and somewhat harder to maintain and enhance. That is why I was thinking about XCHGing the indices and keeping the same code for both sort directions.

Another difficulty with that approach is that floating-point compares do not affect the usual flags, only the flags in the FPU. You have to extract the FPU flags and make special tests to get the effect of a JA or JB, and those tests are of a TEST nature followed by a JE or JNE, not a JA or JB. I have all of this working, and I am now testing the tie-breaker/secondary-key code. Dave Augustine.
If you don't want to multiply by +1/-1, you can conditionally negate the difference; that's bound to be cheaper than two conditional XCHGs. However, a single MUL is probably the fastest on any Pentium or better, thanks to the separate integer units and the absence of any unpredictable change of program flow. An alternative is self-modifying code, where you execute either a NOP or a NEG on the difference. :-D
Luc, Thank you again. When I get done, I will post (if not too big) or send you some zips of an ascending and a descending DWORD sort, and the same for float - just the actual quicksort code. Dave.
Luc, I have completed development on my Quicksort. The following is the timing for the sort using just an array of DWORDs. The rest of the tests work as planned: ascending/descending, signed/unsigned, big-endian/little-endian, two different sort keys allowed, BYTEs/WORDs/DWORDs/floats/doubles/extended floats, BYTE or WORD null-terminated strings, up to 2048 elements per key, and a fully automated test and tracing environment to verify results. Do you have a feel for what a good timing benchmark might be for a sort of this size?
The following timings are only for the actual sorting, not the setup. The data is prepared, the start time is taken, the data is sorted, the end time is taken, the time difference is accumulated, and the test loops back 1023 more times.

The random data was originally created by randomly selecting a number from the set, saving the number, shrinking the set, and recalculating a new random number on a decremented range. Shrinking a 1024*1024 DWORD buffer for each extraction was taking too long, so the method was changed to just create a random number in the range 0 to ((1024*1024)-1), save it, decrement the range, and calculate another random number for the decremented range. Since there are potentially many duplicates, and probably not all numbers are present, the checking was changed from comparing the sorted array against a known ascending or descending array to individually comparing each DWORD pair (element n to element n+1) and verifying that it is not above (for an ascending sort) or not below (for a descending sort).

Note: the sort completed with only a 4096-byte stack, sorting 1024*1024 DWORDs.
Building Ascending data.
Building Descending data.
1024*1024 Ascending DWORDS sorted 1024 times Ascending, time in msec: 37144, good compare. 36.27 msec/iteration.
1024*1024 Ascending DWORDS sorted 1024 times Descending, time in msec: 46130, good compare. 45.05 msec/iteration.
1024*1024 Descending DWORDS sorted 1024 times Ascending, time in msec: 47689, good compare. 45.57 msec/iteration.
1024*1024 Descending DWORDS sorted 1024 times Descending, time in msec: 53870, good compare. 52.61 msec/iteration.
Building Random data.
1024*1024 Random DWORDS sorted 1024 times Ascending, time in msec: 158389, good compare. 154.67 msec/iteration.
1024*1024 Random DWORDS sorted 1024 times Descending, time in msec: 141898, good compare. 138.57 msec/iteration.

I do have code samples (.LST output) if you are interested. Dave.
I don't fully understand your test cases; however, here are two questions you can ask yourself: 1. Why is there a big difference between "Ascending DWORDS sorted 1024 times Ascending" and "Descending DWORDS sorted 1024 times Descending"? 2. How does the timing compare with a regular quicksort as offered in some libraries? :)
Luc, What's with the membership change? After 5 years and 1,000,000 posts, do they take away your Gold Membership? You must have been a very naughty boy.

The timing differences, I believe, are just normal Windows getting in the way of everything. My development machine is not connected to the internet and is only doing one thing (executing my program), so why is there all this activity as indicated by the Task Manager? Paging with 4 million bytes involved probably plays a part in it also. I will execute the two tests again with tracing enabled and pick some shorter tests (my standard test data is the character set A-Z, for easy visual checking of results). The tracing will report the number of compares and exchanges, as well as report each exchange as it occurs.

My algo starts by swapping the middle element to the right end (the pivot), then determines that there is nothing to do (because the data is already in order), so it just re-swaps the middle and the end; but then it sorts the front half and the back half, both of which are in order, so the same thing happens again twice, then 4 times... If the sort has to reverse the order (ascending data sorted descending), then after the first swap to put the middle at the end, it ends up swapping end elements, working toward the middle (since everything is on the wrong side of the pivot), and finishes with a re-swap of the middle and the end. Dave.
Hi Dave, Don't worry, I'm using a new account to avoid all the crap gold members get to digest on the home page. And it's not 1 million posts; even CG does not reach 100K!

If all numbers are ascending and that is not the order you want, everything gets swapped. If all numbers are descending and that is not the order you want, everything gets swapped. I expect these two test cases to have the same speed, and I understood they did. If all numbers are ascending and that is the order you want, no swaps occur. If all numbers are descending and that is the order you want, no swaps occur. I expect these two test cases to have the same speed as well, but I understood they did not; they were quite different, and by much more than some Windows unpredictability could explain.
Member 4194593 wrote:
all this activity as indicated by the Task Manager? Paging with 4 million bytes involved probably plays a part in it
Huh. What is your hardware? How could 4 MB be cause for paging? If this is a miniature PC, try the same code on a real one! :)
Luc, I found an error in the Descending/Descending test where I had supplied the ascending data instead of the descending data, and I corrected that. I also modified the timing procedure to create all of the data first, then execute the tests, and added a loop to run all of the tests 10 times. I added a call to SetPriorityClass to set the priority to HIGH_PRIORITY_CLASS.

I added code to put the start of each procedure on a 16-byte-aligned location, to make sure that the same code in the ascending and descending routines stays on the same relative boundaries and to avoid differences in hardware branch prediction. This would only make a 1- or 2-cycle difference per compare, but with billions of compares across the test runs it adds up - note that something as small as a 26-record sort takes 109 compares and 14 exchanges. I also put the loops for the left scan and the right scan on 16-byte boundaries to aid the hardware branch prediction. I modified the code to clean up the floating-point jump-on-condition tests (I had to re-order my register usage to free up EAX, then debug and re-test), and I modified the test code to add a timing test for floats. The following are the results from the first two iterations of the tests (almost the same times as for DWORDs):
Building Ascending data.
Building Descending data.
Building Random data.
Floating tests / DWORD tests
1024*1024 Ascending Floats sorted 1024 times Ascending, time in msec: 32359, good compare. 32068, good compare.
1024*1024 Ascending Floats sorted 1024 times Descending, time in msec: 36407, good compare. 37310, good compare.
1024*1024 Descending Floats sorted 1024 times Ascending, time in msec: 36749, good compare. 37276, good compare.
1024*1024 Descending Floats sorted 1024 times Descending, time in msec: 36674, good compare. 38611, good compare.
1024*1024 Random Floats sorted 1024 times Ascending, time in msec: 130539, good compare. 132599, good compare.
1024*1024 Random Floats sorted 1024 times Descending, time in msec: 131883, good compare. 131187, good compare.

1024*1024 Ascending Floats sorted 1024 times Ascending, time in msec: 32005, good compare. 32005, good compare.
1024*1024 Ascending Floats sorted 1024 times Descending, time in msec: 37254, good compare. 36943, good compare.
1024*1024 Descending Floats sorted 1024 times Ascending, time in msec: 37126, good compare. 35773, good compare.
OK, this looks good. The timing anomaly has disappeared. Well done. :)
Luc, It was an interesting search. The only "mistake" was the use of ascending data instead of descending data for the test. The rest of the changes were to avoid branch-prediction problems - keeping all of the tight loops on a 16-byte boundary and keeping the loops down to 16 bytes or less. Dave.