Hash string function that keeps compare properties

Kochise

Is there a hash string function that preserves the capability to compare two strings just by comparing their respective hash values ? Currently, I have to perform a full Str.Compare("other string") in order to sort them. I had also calculated their hash value, but the sorting (Quick Sort, Bubble Sort, whatever...) fails using the hash value. The hash value is mostly random, especially on long strings. Comparing the hash values (int) would have been quite faster, I have to perform the quickest sort possible on text files larger than 100 MB ! So please don't come with CString as a solution. It works, but asks hours to do the job. CString is perfect to manipulate UI strings, but bad for heavy duty purposes... So, to sum up, it there a good hash function that will allow : "Minus string" < "Plus string" HASH("Minus string") < HASH("Plus string") Kochise In Code we trust !

Chris Losinger

Kochise wrote:

it there a good hash function that will allow ...

my gut reaction is No. the amount of information required to accurately compare two strings is all of the information in each string. but hashes for arbitrary input grind the information in the source data down into a package that's large enough that two different input strings will have a small (but still non-zero) chance of generating the same hash, but small enough to be passed around and used quickly. but if you have less data than you started with and the possibilty that two source values will produce the same hash, then there's no way to go from the hash back to the source value, and so there's no way to guarantee an accurate sort using just the hash. that's not to say hash values can't play a role in sorting, of course (they do, and there's a lot of info in Google about this very large topic), just that you can't generate a small numeric reversible hash for arbitrary character strings. Cleek | Image Toolkits | Thumbnail maker -- modified at 16:29 Thursday 27th October, 2005

Kochise

Nadanada, I'm not willing to 'compress' a string in an integer, and restore the string from the integer afterward. What I just want is to help my sorting routines by feeding it with fixed size integer the CPU could handle very easily, not variable length strings... Currently, I'm feeding a custom string tree. All strings are stacked as-is, and I used the HASH value as node's key. Even if they are mostly 'random', strings are inserted 'as-is' from their hash value. It works. I can even search for a particular string, it's hash value is found in the tree. The problem occurs when I'm trying to sort the tree, or at least walk it in a ascendant or descendant way. What I get is hash values from the minus one to the maximum one. But their corresponding string is non-sense, they don't follow the ascending or descending order I'm trying to get. Yeap, I found some pages on Google about hash functions : http://www.partow.net/programming/hashfunctions/ http://www.isthe.com/chongo/tech/comp/fnv/ http://www.cs.rice.edu/~scrosby/hash/ But none helped me in what I was trying to get. I can accept a little level of collision, because I know and control the input. And even if I get the same hash match, I can always control the two strings with a good old compare. At least I saved the whole tree walking which was processed in hash integer comparaison. So thanks at least to having helped me to loose focus on the 'hash trick', as this won't evolve in a success. I will try another method to sort out my strings using an integer. Kochise In Code we trust ! -- modified at 3:18 Friday 28th October, 2005

Chris Losinger

Kochise wrote:

But their corresponding string is non-sense, they don't follow the ascending or descending order I'm trying to get.

then just take the first four characters and stuff them into a long: long hash = (str[0] << 24) | (str[1] << 16) | (str[2] << 8) | (str[3]); that will allow you to very quickly compare the first four characters of the strings. you'll get a collission any time the first four characters of the two strings match. Cleek | Image Toolkits | Thumbnail maker

Kochise

I already tried this way, but I have many strings that starts with the same characters. So I'm trying to create a kinda dictionary of the whole strings, and find the good one the most quickly possible through the use of integer. I could have used tries as well, I think I'm going to test this. The following code match "TEST" and "TEST TEST" as being the same string, even if I stuff things of 16 characters. Maybe I'm wrong...

void CStringTree::_EncodeString
( sMSP& o_rsStringPointer
, const char* i_panStringBuffer
, int i_nStringSize // = -1
)
{
double l_nShifter = 79228162514264337593543950336.0; // 64^16
int l_nLoop = 16;
int l_nSize;
unsigned char l_nChar;
if
(
(i_nStringSize < 0)
|| (i_nStringSize > 16)
)
{
l_nSize = 16;
}
else
{
l_nSize = i_nStringSize;
}

o_rsStringPointer.nHashRaw = 0.0;
o_rsStringPointer.nHashCase = 0.0;
o_rsStringPointer.nHashNoCase = 0.0;

while(l_nLoop > 0)
{
l_nChar = *i_panStringBuffer;

if
(
     (l\_nSize == 0)
  || (l\_nChar == 0)
)
{ // Stuffing
  o\_rsStringPointer.nHashRaw    += ' ' \* l\_nShifter;
  o\_rsStringPointer.nHashCase   += ' ' \* l\_nShifter;
  o\_rsStringPointer.nHashNoCase += ' ' \* l\_nShifter;
}
else
{
  o\_rsStringPointer.nHashRaw    +=             l\_nChar  \* l\_nShifter;
  o\_rsStringPointer.nHashCase   += g\_panNormal\[l\_nChar\] \* l\_nShifter;
  o\_rsStringPointer.nHashNoCase += g\_panNoCase\[l\_nChar\] \* l\_nShifter;

  i\_panStringBuffer += 1;
}

l\_nShifter /= 64;
l\_nLoop    -= 1;
l\_nSize    -= 1;

}
}

Kochise In Code we trust ! -- modified at 9:06 Friday 28th October, 2005