Glad that you like it. A dozen guys helped me in benchmarking (and thus improving) my C etudes; having only two old laptops (Core 2) limits my testing a lot :mad: I feel I am overusing their goodwill, so a "dedicated" room would serve well, avoiding the constant need for favors. Not everyone can save money for a new toy in this nasty worldwide economic crisis.
Sanmayce
Benchmark room

My suggestion is for CodeProject to have a room for benchmarking, meaning at least one guy (either from CP or a member) offering fellow members some assistance in computing/benchmarking some code. Guys lacking computational power X| or wanting to see how a particular code/function is gonna perform on some newer, more expensive machine (in my case an Intel 6700K) would benefit from such a FAVOR! Imagine a fast CodeProject machine used as a tool to improve the performance of some project - a real help it would be! :laugh: In my case I need some 3 hours of single-threaded 6700K|5960X time running my latest decompressor Goldenboy.
-
Need assistance in running decompression Nakamichi vs LZ4/Zstd showdown

Oh, I am a bad typer:
>This type of post belongs in the Collaboration / Beta Testing forum[^].
:) I tried it before with no response whatsoever, so I thought that Nakamichi, being a FREE tool, could be benchmarked by the CP community in here. I can't really get it: these three console decompressors are in the TOP 5 decompression-speed-wise, and no responses, hm.
-
Need assistance in running decompression Nakamichi vs LZ4/Zstd showdown

Didn't get the last three words, my bad. What is the right forum then? I couldn't find it among the other sub-forums.
>This type of post belongs in the Collaboration / Beta Testing forum[^].
:) I tried it before with no response whatsoever, so I thought that Nakamichi, being a FREE tool, could be benchmarked by the CP community in here. I can't really get it: these three console decompressors are in the TOP 5 decompression-speed-wise, and no responses, hm.
-
Need assistance in running decompression Nakamichi vs LZ4/Zstd showdown

Please help me benchmark my C code on some modern PC (AMD would be great). I have access only to a Core 2 Q9550s, so I need to see what the three speedy decompressors really offer. What I need is the resultant file 'Results.txt' - it contains the console dump with the speed results. The benchmark is downloadable at: http://www.sanmayce.com/Downloads/Oniyanma_vs_Zstd_64-bits_v0.0.1.zip[^] The results from the Oniyanma_vs_Zstd_64-bits_v0.0.1.zip comparison on a laptop with Core 2 Q9550s @2.8GHz:
------------------------------------------------------------------------------------------------------------------------------------------------------------
| Testdatafile \ Decompressor                                                                    | Uncompressed | LZ4 1 thread            | Zstd 1 thread          | Nakamichi 'Oniyanma' |
| #1 Encyclopedia_of_Language_and_Linguistics.txt                                                |   59,416,161 | 21,118,907; 1033.1 MB/s | 19,435,030; 358.1 MB/s | 18,675,464; 320 MB/s |
| #2 Google_Books_corpus_All_Nodes_ripped_7,477,257_1gramlist_out_of_3,473,595_English_books.txt |   81,324,663 | 32,656,314;  987.4 MB/s | 27,657,847; 319.3 MB/s | 28,825,883; 256 MB/s |
| #3 enwik8                                                                                      |  100,000,000 | 42,283,793;  943.9 MB/s | 39,573,323; 325.7 MB/s | 35,425,948; 192 MB/s |
| #4 glibc-2.21.tar                                                                              |  146,759,168 | 32,362,346; 1364.9 MB/s | 25,428,301; 561.4 MB/s | 31,307,973; 704 MB/s |
------------------------------------------------------------------------------------------------------------------------------------------------------------
Note:
#1 testdatafile is ASCII TXT
#2 testdatafile is ASCII wordlist
#3 testdatafile is XML
#4 testdatafile is C sources
The Oniyanma_vs_Zstd_64-bits_v0.0.1.zip contains:
D:\WorkTemp\Oniyanma_vs_Zstd_64-bits_v0.0.1>dir
04/27/2015 12:33
-
Need assistance in running decompression Nakamichi vs LZ4 showdown

I need to know how the newer Intel generations (2nd/3rd/4th/5th) of CPUs execute textual decompression using XMM registers; the penalties on my Core 2 are discouraging, so I will be glad to see how CPU-RAM subsystems optimized for XMM perform. If you want to help me in benchmarking, here it is: Slowest LZSS Compressor in C[^] Just run Speed_Benchmark_MAKE_archives.bat and share the resulting Results.txt, that will do.
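For readers wanting to see what "using XMM registers" means concretely here, the hot copy in such a decoder reduces to unaligned 16-byte loads/stores. A minimal SSE2 sketch (an illustration only, not the actual Nakamichi inner loop; `CopyXMM` is my name for it):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Copy `len` bytes in 16-byte XMM chunks. The tail deliberately
   over-copies up to 15 bytes, as fast LZ decoders commonly do,
   so dst must provide that much slack. */
static void CopyXMM(unsigned char *dst, const unsigned char *src, size_t len) {
    size_t i;
    for (i = 0; i < len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), chunk);
    }
}
```

On Core 2 the unaligned `movdqu` forms carry a heavy penalty, which is exactly why results from Nehalem and later are so interesting to me.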
-
Algorithm for comparing word with randomly distributed substring

Okay, this is more serious: 31 patterns are needed to exhaust your request.

0 internal '*':
Step01: *ABCDE*

1 internal '*':
Step02: *ABCD*E*
Step03: *ABC*DE*
Step04: *AB*CDE*
Step05: *A*BCDE*

2 internal '*':
Step06: *A*B*CDE*
Step07: *A*BC*DE*
Step08: *A*BCD*E*
Step09: *AB*C*DE*
Step10: *AB*CD*E*
Step11: *ABC*D*E*

3 internal '*':
Step12: *A*B*C*DE*
Step13: *A*B*CD*E*
Step14: *A*BC*D*E*
Step15: *AB*C*D*E*

The above 15 patterns are optional; they would yield hits with 'high' rank earlier than all the 31 below.

Finding all substrings (without elision):
Step01+15: *A*B*C*D*E*

Finding all substrings (with elision):
C(5,4) equals the number of wildcard patterns, 5 choose 4 = 5!/(4!*(5-4)!) = 5*4*3*2*1/(4*3*2*1*1) = 5
Step02+15: *B*C*D*E*
Step03+15: *A*C*D*E*
Step04+15: *A*B*D*E*
Step05+15: *A*B*C*E*
Step06+15: *A*B*C*D*
C(5,3) equals the number of wildcard patterns, 5 choose 3 = 5!/(3!*(5-3)!) = 5*4*3*2*1/(3*2*1*2*1) = 10
Step07+15: *A*B*C*
Step08+15: *A*B*D*
Step09+15: *A*B*E*
Step10+15: *A*C*D*
Step11+15: *A*C*E*
Step12+15: *A*D*E*
Step13+15: *B*C*D*
Step14+15: *B*C*E*
Step15+15: *B*D*E*
Step16+15: *C*D*E*
C(5,2) equals the number of wildcard patterns, 5 choose 2 = 5!/(2!*(5-2)!) = 5*4*3*2*1/(2*1*3*2*1) = 10
Step17+15: *A*B*
Step18+15: *A*C*
Step19+15: *A*D*
Step20+15: *A*E*
Step21+15: *B*C*
Step22+15: *B*D*
Step23+15: *B*E*
Step24+15: *C*D*
Step25+15: *C*E*
Step26+15: *D*E*
C(5,1) equals the number of wildcard patterns, 5 choose 1 = 5!/(1!*(5-1)!) = 5*4*3*2*1/(1*4*3*2*1) = 5
Step27+15: *A*
Step28+15: *B*
Step29+15: *C*
Step30+15: *D*
Step31+15: *E*

In essence, a number such as C(5,4) equals all ways to order any four of five elements, which is 5!/(5-4)!, divided by all ways to order a group of 4, which is 4!. Thus the number of distinct ways to choose 4 out of 5 is 120 divided by 24, i.e. 5. By chance I wrote the fastest wildcard search function (iterative), so these 31 patterns won't be much of a problem:
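The C(n,k) counts above can be double-checked mechanically with a tiny helper (added purely for illustration; `binom` is my name for it and is separate from the matcher):

```c
#include <stdint.h>

/* n choose k via the multiplicative formula; each intermediate
   division is exact, so no overflow tricks are needed for small n.
   E.g. binom(5,3) yields 10. */
static uint64_t binom(unsigned n, unsigned k) {
    if (k > n) return 0;
    if (k > n - k) k = n - k;          /* symmetry: C(n,k) = C(n,n-k) */
    uint64_t result = 1;
    for (unsigned i = 1; i <= k; ++i)
        result = result * (n - k + i) / i;
    return result;
}
```

Summing 1 + binom(5,4) + binom(5,3) + binom(5,2) + binom(5,1) gives the 31 elision patterns counted above.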
int WildcardMatch_Iterative_Kaze(const char* mask, const char* name) {
	// Iterative wildcard matcher: '&' matches any run of characters,
	// '+' matches exactly one character.
	const char* maskSTACK;
	const char* nameSTACK;
	int BacktrackFlag = 0;
Backtrack:
	for (nameSTACK = name, maskSTACK = mask; *nameSTACK; ++nameSTACK, ++maskSTACK) {
		if (*maskSTACK == '&') {
			mask = maskSTACK+1;
			if (!*mask) return 1;       // trailing '&' matches the rest
			name = nameSTACK;
			BacktrackFlag = -1;
			goto Backtrack;
		}
		else if (*maskSTACK != '+') {
			if (*nameSTACK != *maskSTACK) {
				if (!BacktrackFlag) return 0;
				++name;                 // retry one character further in
				goto Backtrack;
			}
		}
	}
	while (*maskSTACK == '&') ++maskSTACK;  // skip trailing '&'s
	return !*maskSTACK;
}
-
Algorithm for comparing word with randomly distributed substring

Hi srikanth2321, there are probably better ways to do it, but my suggestion is the most simple:
Step1: Do wildcard search for *ABCDE*
Step2: Do wildcard search for *ABCD*E*
Step3: Do wildcard search for *ABC*DE*
Step4: Do wildcard search for *AB*CDE*
Step5: Do wildcard search for *A*BCDE*
My friend, I sense the problem: you should be more precise both in testing and explaining; for instance, your second example is buggy. If you want to play you can try my superb search tool Kazahana (downloadable here at CP). Let me know if you find any difficulties. The first example done with Step3:
D:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>copy con test.txt
ABC EFGH IJKL DE
EFGH UIOP ABC GHY JKLU EF
^Z
1 file(s) copied.

D:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>type test.txt
ABC EFGH IJKL DE
EFGH UIOP ABC GHY JKLU EF

D:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>Kazahana.exe "*ABC*DE*" test.txt 999
Kazahana, a superfast exact & wildcards & Levenshtein Distance (Wagner-Fischer) searcher, r. 1-++fix+nowait_critical_nixFIX_Wolfram+fixITER+EX+CS, copyleft Kaze 2014-Mar-25.
Enforcing Case Insensitive wildcard mode ...
Enforcing SLOW wildcard mode ...
Pattern: *abc*de*
omp_get_num_procs( ) = 2
omp_get_max_threads( ) = 2
Enforcing HEXADECAD i.e. hexadecuple-threads ...
Allocating Master-Buffer 999KB ... OK

Kazahana: Total/Checked/Dumped xgrams: 2/2/1
Kazahana: Performance: 0 KB/clock
Kazahana: Performance: 0 xgrams/clock
Kazahana: Performance: Total/fread() clocks: 17/0
Kazahana: Performance: I/O time, i.e. fread() time, is 0 percents
Kazahana: Performance: RDTSC I/O time, i.e. fread() time, is 0 ticks
Kazahana: Done.

D:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>type Kazahana.txt
ABC EFGH IJKL DE

D:\BiShowdown_Cubesort_Tcheburaschkasort_Intel_64bit>
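The stepwise search above can be reproduced in a few lines of C. Here is a self-contained sketch using a simple recursive matcher (for illustration only - Kazahana itself uses a much faster iterative routine; `Match` and `RunSteps` are names I introduce):

```c
/* Minimal recursive wildcard matcher: '*' matches any run of characters. */
static int Match(const char *mask, const char *name) {
    if (*mask == '\0') return *name == '\0';
    if (*mask == '*')
        return Match(mask + 1, name) || (*name && Match(mask, name + 1));
    return *name == *mask && Match(mask + 1, name + 1);
}

/* Run the five Step patterns against one line; returns the hit count. */
static int RunSteps(const char *line) {
    const char *steps[5] =
        { "*ABCDE*", "*ABCD*E*", "*ABC*DE*", "*AB*CDE*", "*A*BCDE*" };
    int hits = 0;
    for (int i = 0; i < 5; ++i)
        hits += Match(steps[i], line);
    return hits;
}
```

Against 'ABC EFGH IJKL DE' only Step3 (*ABC*DE*) hits, which agrees with the one line Kazahana dumped above.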
-
Cubesort

Cubesort is awesome, with big potential on top of that. The thing that impresses me most is its construction speed, not to mention the 'free' dump. I couldn't resist trying my 'secret weapon' - a frozen/unfinished sub-project called 'Tcheburaschkasort'. It is entirely based on the titaniumESQUEly strong Bayer-McCreight algorithm, which I implemented specifically and only as order 3 (three children maximum). It's been some time since I wrote it, enough to forget this-and-that, but the initial idea was to use the next dimension, i.e. to use millions of B-trees. My unfinished Tcheburaschkasort currently uses a 24bit pseudo-hash (in fact the first 1|2|3 byte(s) ARE the slot), giving a forest of 16,777,216 B-trees of order 3:
UINT PseudoHash(const char *str, unsigned int wrdlen)
{
UINT hash32 = 0;
const char *p = str;
// Use 256-based system for first 3 bytes i.e. left is more significant:
// Currently 3bytes long i.e. 24bit:
if (wrdlen>=1) {
hash32 = hash32 | (((unsigned char)*p)<<16); // cast avoids sign extension for bytes >= 0x80
p += 1;
}
if (wrdlen>=2) {
hash32 = hash32 | (((unsigned char)*p)<<8);
p += 1;
}
if (wrdlen>=3) {
hash32 = hash32 | (((unsigned char)*p)<<0);
p += 1;
}
return hash32;
}

Each slot houses the root of a tree. Without any hash, Tcheburaschkasort resembles a hypercube of order 0; using the hash makes it 1D, i.e. of order 1 - the lookup hashtable is the axis. Blah-blah; in short, the randomness of those prefixes (1..3 bytes long) decides the speed performance. For example, with KT10000000.txt all first 3 bytes are equal, 'A8C', so there is only one tree and performance is poor, because all the 10,000,000 lines start from board position A8. Knight-tours: each square represents a board position, from A1 (bottom-left) to H8 (top-right):
-----------------
|a8 |b8 |...|h8 |
|a7 |...|   |   |
|...|...|   |   |
|a1 |...|...|h1 |
When the data is made more diverse, e.g. by reversing the order of the tours, then for KT10000000.txt we will have 50 trees, see below. Let us see what boost comes from this diversity. So, Tcheburaschkasort vs Cubesort results for the KT10M filedataset (10,000,000 Knight-Tours from A8, reversed): NOTE: Tcheburaschkasort is based on my superfast multi-purpose text ripper Leprechaun. Tcheburaschkasort still DOES NOT make an in-order dump/traversal - the resultant file is unsorted, to be done. This DOES NOT affect the results, though. The full log is as follows:
E:\BiShowdown_Cubesort_Tcheburaschkasort_
-
Cubesort

Having benchmarked Cubesort with Knight-Tours, my amateurish evaluation is two words: Mutsi! Mutsunka! My Sandokan (r3fix) is inferior to Cubesort; 10,000,000 lines (128 bytes each) were sorted, and after that the sorted file was sorted once more:
------------------------------------------------------------------------------------------
| sorter \ filedataset | KT10000000.txt                      | KT10000000_SORTED.txt     |
| Cubesort             | 080 seconds                         | 045 seconds               |
| Windows' sort.exe    | 118 seconds                         | 115 seconds               |
| Sandokan             | 131 seconds; swappings: 36,826,242+ | 114 seconds; swappings: 0 |
------------------------------------------------------------------------------------------
The benchmark is downloadable at: www.sanmayce.com/Downloads/TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit.zip[^] The full log follows:
E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit>dir
Volume in drive E is SSD_Sanmayce
Volume Serial Number is 9CF6-FEA3

Directory of E:\TriShowdown_Sandokan_vs_Windows_vs_Cubesort_Intel_64bit
06/29/2014 11:45 PM
.
06/29/2014 11:45 PM
..
06/29/2014 11:45 PM 12,566 cubesort-1.0_non-original.c
06/29/2014 11:45 PM 86,016 cubesort-1.0_non-original.exe
06/29/2014 11:45 PM 202,880 Cubesort.zip
06/29/2014 11:45 PM 24,490 Knight-tour_r8dump.c
06/29/2014 11:45 PM 79,872 Knight-tour_r8dump.exe
06/29/2014 11:45 PM 116 Make_EXEs.bat
06/29/2014 11:45 PM 1,604 MokujIN prompt.lnk
06/29/2014 11:45 PM 301,406 Sandokan_Logo.pdf
06/29/2014 11:45 PM 170,433 Sandokan_QuickSortExternal_4+GB_r3fix.c
06/29/2014 11:45 PM 122,368 Sandokan_QuickSortExternal_4+GB_r3fix.exe
06/29/2014 11:45 PM 6,144 timer64.exe
06/29/2014 11:45 PM 1,111 TriShowdown_Sandokan_vs_Windows_vs_Cubesort.bat
12 File(s) 1,009,006 bytes
2 Dir(s) 27,537,633,280 bytes free

E:\TriShowdown_Sandokan
-
Cubesort

Hi Mr. Hoven, it looks nice and feels exciting; surely the showdowns are gonna be must-see. My suggestion is to make a simple console tool to ease the benchmarking; I will be the first to run some superheavy English phraselists/wordlists (up to 400,000,000 lines) on my laptop T7500. One very useful filedataset is sorted Knight-Tours (64x2=128 bytes long lines), because you can generate even billions of them in reasonable time; my package (which does that) can be downloaded freely at: www.sanmayce.com/Downloads/Sandokan_r3_vs_Windows-sort.zip[^] My old console sorter Sandokan outperformed (only for internal sorting) the built-in Windows 7 64bit sort.exe:
E:\Sandokan_r3-+++\Sandokan_r3_vs_Windows-sort>GENERATE_3million_Knight-Tours_and_SORT_them.bat
Microsoft Windows [Version 6.1.7601]
Generating 3,000,000 Knight-Tours and dumping them into a file of 390,000,000 bytes ...

E:\Sandokan_r3-+++\Sandokan_r3_vs_Windows-sort>Knight-tour_r8dump_Microsoft_V16_32bit_Ox.exe a8 3000000 1>KT3million.txt
Sorting these 3,000,000 Knight-Tours ...

E:\Sandokan_r3-+++\Sandokan_r3_vs_Windows-sort>"Sandokan_QuickSortExternal_4+GB_32bit_Microsoft.exe" KT3million.txt /fast /descend 768
Sandokan_QuickSortExternal_4+GB r.3, written by Kaze, using Bill Durango's Quicksort source.
Size of input file: 390,000,000
Counting lines ...
Lines encountered: 3,000,000
Longest line (including CR if present): 129
Allocated memory for pointers-to-lines in MB: 22
Assigning pointers ...
sizeof(int), sizeof(void*): 4, 4
Trying to allocate memory for the file itself in MB: 371 ... OK! Get on with fast internal accesses.
Uploading ...
Sorting 3,000,000 Pointers ...
Quicksort (Insertionsort for small blocks) commenced ...
| RightEnd: 000,001,500,225; NumberOfSplittings: 0,000,335,016; Done: 100% ...
NumberOfComparisons: 71,052,662
The time to sort 3,000,000 items via Quicksort+Insertionsort was 21,029 clocks.
Dumping the sorted data ...- Done 100% ...
Dumped 3,000,000 lines.
OK! Incoming and resultant file's sizes match.
Dump time: 1,560 clocks.
Total time: 27,144 clocks.
Performance: 14,366 bytes/clock.
Done successfully.
E:\Sandokan_r3-+++\Sandokan_r3_vs_Windows-sort>sha1sum_Microsoft_V16_32bit_Ox.exe "QuickSortExternal_4+GB.txt"
a053faa74ffb6cad41e61e273a1a0e0049cb25e7 QuickSortExternal_4+GB.txt - Done 100% ...
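The internal phase logged above - assign one pointer per fixed-length line, then sort the pointers - can be sketched portably with the C library's qsort (a toy stand-in for Sandokan's Quicksort+Insertionsort; `SortLines` is my name, and line contents are shortened for illustration):

```c
#include <stdlib.h>
#include <string.h>

/* Compare two lines through their pointers, as the sorter does after
   assigning one pointer per line in the uploaded file buffer. */
static int CmpLinePtr(const void *a, const void *b) {
    return strcmp(*(char *const *)a, *(char *const *)b);
}

/* Sort an array of pointers-to-lines in place; the underlying file
   buffer is never moved, only the pointers are permuted. */
static void SortLines(char **lines, size_t count) {
    qsort(lines, count, sizeof(char *), CmpLinePtr);
}
```

Sorting pointers rather than the 128-byte lines themselves is what keeps the swap cost low; the dump phase then walks the permuted pointer array.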
-
Fastest textual decompression in C

Mr. Losinger, have you seen Wikipedia's article on LZSS: http://en.wikipedia.org/wiki/LZSS[^] Now check my amateurish attempt to shed light on LZSS: http://www.sanmayce.com/Nakamichi/[^] LZSS is simple enough, yet writing an article is not easy; my wish is to contribute to the decompression part with one of the fastest (top 3) etudes, benchmarks, and C/Assembly snippets, not to define/explain the algorithm. For a start, I need a modern machine to test on; sadly I own only a Core 2, which is a nasty break for me since I use DWORD fetching (which sucks on Core 2), XMM and even ZMM - it is so interesting to see what the latter can bring to the table. My naive expectation was that some fellow coders would help me at least with the benchmarking; sadly I see no interested coders on this topic. One of my wishes is to show an etude, Nakamichi 'Sanagi' to be exact, targeted at 512bit registers; luckily the Intel C optimizer v14 (which I used) supports them, but still no computer in my reach can run the executable. "Tianhe-2, the world's fastest supercomputer according to the TOP500 list for June and November 2013, utilizes Xeon Phi accelerators based on Knights Corner." /Wikipedia/ In my view such a benchmark is both important and interesting; thus the incoming Knights Corner/Landing processors will show the power of ZMMWORD moves. "... Far more interesting is that Intel's Knights Landing not only will be an expansion card but also will serve as a standalone platform or processor ..." /http://www.admin-magazine.com/Articles/Exploring-the-Xeon-Phi/ Simply, the near future will bring monstrous bandwidths; not utilizing them is just ... lame. I see how some 30-year-old algorithms are/have to be 'pimped' and pumped. >it's unlikely that many people are going to be familiar with the LZSS algorithm. Agreed, but it is like that with everything, isn't it! >i'd suggest including a simpler compressor with a description of the algorithm. Sure.
>you have to at least get the readers up to the point where they can follow your discussion of the decompressor. otherwise, you're starting the story in the middle. Of course, I am not some villain wanting to obscure information, but as I said my focus is entirely on the textual decompression speed boosts; they alone took a lot of my time.
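For readers who want the 'middle of the story' anyway, the decompression side of LZSS reduces to one small loop: each token is either a literal byte or an (offset, length) pair copied from the already-decoded output. A bare-bones byte-oriented sketch (the generic scheme only - Nakamichi's actual token format differs; `DecodeLZSS` is a name I introduce):

```c
#include <stddef.h>

/* Decode a toy LZSS stream: each token starts with a flag byte,
   0 = the next byte is a literal, nonzero = the next two bytes are
   (offset, length) referring back into the output produced so far.
   Returns the number of bytes written to dst. */
static size_t DecodeLZSS(unsigned char *dst, const unsigned char *src, size_t srcSize) {
    size_t ip = 0, op = 0;
    while (ip < srcSize) {
        if (src[ip++] == 0) {                 /* literal byte */
            dst[op++] = src[ip++];
        } else {                              /* back-reference */
            size_t offset = src[ip++];
            size_t length = src[ip++];
            while (length--) { dst[op] = dst[op - offset]; ++op; }
        }
    }
    return op;
}
```

The byte-at-a-time back-reference copy is what real decoders replace with DWORD/XMM/ZMM moves - that substitution is where all my speed boosts live.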
-
Fastest textual decompression in C

Thanks, I checked all the links you gave me, and also searched for LZ, LZSS, compression in article names. Strange, not a single article on LZ/LZSS, not to mention a fast one, let alone the fastest. You see Mr. MacCutchan, I want to share a stand-alone file-to-file compressor written in C featuring the fastest decompression on textual data. I don't see it as a utility but rather as an illustration/sample of how to decompress some needed text-like data at speeds 18%-40% of memcpy(). I want the focus to be precisely on DECOMPRESSION; I don't want to be dragged into mumbo-jumbisms such as "How do I compress? The LZSS is not explained. This is not an article. Poor description" and similar complaints. Seeing some fellow members' articles on LZW and Huffman, I see my approach in a different light: I like benchmarks - real-world ones - disassembly listings and comparisons with the best decompressors of today. That's why I ask for some preemptive feedback/directions. Personally, it is annoying to want to share some code etude and, instead of making some interesting place/topic for sharing ideas/benchmarks with fellow coders, to receive complaints that "THIS ARTICLE IS NOT GOOD". I am not a journalist writing articles; after all, the CP logo says 'For those who code', not 'for those who write articles'. My point: new ideas/improvements should be easily shareable, I think.
-
Fastest textual decompression in C

Guys, for 25 or so days I have been wrestling with the most simple approach in LZSS. Seeing how demanding (criticizing) some fellow members are, here I would like to hear from all programmers interested in this topic what one good article about LZSS should feature. So please share your view on what has to be covered, what should be emphasized, and other must-haves in the article. Maybe I will write an article; the thing that stops me is the unappreciativeness. My way of doing things is all about enjoying the speeds of small etudes in C; that's why I ask preemptively: does CODEPROJECT need the article 'Fastest textual decompression in C'? Here you can see my etude heavily benchmarked against the best in the world: http://www.sanmayce.com/Nakamichi/[^]
-
Need help to run AVX2 console benchmark

Whoever wants to run my superfast (maybe the fastest) LZSS decompression benchmark will help me a lot to learn more about the supremacy of 512bit registers. Seeing how Haswell boasts 1TB/s L1 cache speeds made me curious how close to that amazing bandwidth one well-written memory etude can come. The benchmark package includes 2 executables: the first compiled as 64bit using 64bit GP registers, the second as 32bit using 512bit ZMM registers. The 807MB file included in the test is compressed down to 249MB (ZIP's maximum mode gives 77MB); the decompression speed is 956MB/s on my laptop with a Core 2 T7500. Given that my memcpy() works at 1950MB/s and the i7-4770K's at 13211MB/s, I expect Haswell speeds exceeding 6x956MB/s (for one thread) - is my estimation correct? The package: Fastest strstr-like function in C!?[^] I will be glad to see how both Intel & AMD, i.e. Haswell & Excavator, perform.
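The memcpy() figures quoted above come from a simple probe. A rough clock()-based sketch (`MemcpyMBps` is my name for it; buffer size and repeat count are arbitrary picks, so treat the numbers as coarse):

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Copy `size` bytes `reps` times and report MB/s; coarse, but the
   figures are comparable across machines once size exceeds the caches. */
static double MemcpyMBps(size_t size, int reps) {
    char *src = malloc(size), *dst = malloc(size);
    if (!src || !dst) { free(src); free(dst); return 0.0; }
    memset(src, 0x55, size);
    clock_t start = clock();
    for (int r = 0; r < reps; ++r) memcpy(dst, src, size);
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    if (seconds <= 0) seconds = 1.0 / CLOCKS_PER_SEC; /* avoid div-by-zero */
    double mbps = ((double)size * reps / (1024.0 * 1024.0)) / seconds;
    free(src); free(dst);
    return mbps;
}
```

Comparing the decompressor's MB/s against this memcpy() figure on the same machine is how I get the 18%-40%-of-memcpy ratios I quote.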
-
Rick York share your dissatisfaction

>I've told you what the problem is: this is not an article: it is a blog post, at best. Okay, I am not arguing; my point was and still is that a shared etude must be useful in the first place and hopefully well described, where I fail too many times, I guess. >Explain the why's and wherefore's of what you are doing. See, this is a simple 20-line C code doing the most well-known task among all general etudes - returning a 32bit value out of some key. I didn't want to repeat how to bake bread, just to share the source and stats on different machines in order to give the reader an idea of whether it is worth downloading. I just looked at the article "Sorting Algorithms In C#" by Mr. Clifton, and in my humble (I am not a programmer) opinion it appears to me useless; simply, it doesn't cover the most interesting and useful area of sorting - the external sorts. E.g. I have had some drafts in C that dealt with 1,000,000,000 keys VERY FAST; by mentioning this I just want to show how different opinions exist. >Good luck. Thank you; obviously my articles are not even articles, sorry, but blame me not - that's me, a C amateur trying to test/share interesting etudes.
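For context, the '20-line C code returning a 32bit value out of some key' builds on the classic FNV-1a; here is the textbook 32-bit variant (the baseline only, not the tweaked variant from the article):

```c
#include <stddef.h>
#include <stdint.h>

/* Classic 32-bit FNV-1a: xor each byte into the hash, then multiply
   by the FNV prime. A solid baseline for hash-table lookups. */
static uint32_t FNV1A(const char *key, size_t len) {
    uint32_t hash = 2166136261u;          /* FNV offset basis */
    for (size_t i = 0; i < len; ++i) {
        hash ^= (unsigned char)key[i];
        hash *= 16777619u;                /* FNV prime */
    }
    return hash;
}
```

The article's variants keep this xor-multiply core but fetch more than one byte per step, which is where the table-lookup speedups come from.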
Get down get down get down get it on show love and give it up What are you waiting on?
-
Rick York share your dissatisfaction

>... not telling you what was the actual problem or because the results are so damn clever that nothing needs explaining (doubtful). I see no problem(s) at all; that's why I am asking. >It should have been a blog post: not an article. Are you sure? I am not - something as general and important as hashing for lookups deserves a special page.
-
Rick York share your dissatisfaction

Probably someone with enough rights deleted his vote, but I don't care about votes; I would just like to know what Rick saw that was so bad as to give a 1 - I guess if there were a lower vote he would use it.
-
Rick York share your dissatisfaction

Rick York has posted a new comment at "article "Fastest hash function for table lookups in C?!"": "This is a very poor excuse for an article." What's the problem, Mr. York? For the third time you downvote me without pointing out what you don't like, or rather hate - your last vote being 1.
-
People cannot download from the links??????

Ask somebody, I can't help you; maybe it is better for you to create a new thread in this forum in order to get noticed.