Using HASH Tables. Looking for general discussion on this topic.
-
I have been using HASH Tables for many applications. 1. Keyword lookup for command line processing 2. Generic name lookup tables of names, etc. 3. Substitution for binary tree name lookup that do not require a minimum guaranteed lookup time I like HASH tables because they are easy to implement, but the key question is what HASH function does one use. Here is one I use: unsigned int HASH_Value( char *name ) { unsigned long int hashval; int i; hashval = 0; for( i = 0; i < HASH_MAX_NAME_SIZE; i++ ) { if( name[i] == '\0' ) break; hashval += name[i] * i + 1; } return( (unsigned int)(hashval)%HASH_MAX_TABLE_SIZE ); /* traditional hash function for( hashval = 0; *name != '\0'; name++ ) hashval = *name + 31 * hashval; return ( hashval % HASH_MAX_TABLE_SIZE ); */ } It works for me, what works for you? Please ignore any typos. Just looking for discussion on the topic.
"A little time, a little trouble, your better day" Badfinger
-
I have been using HASH Tables for many applications. 1. Keyword lookup for command line processing 2. Generic name lookup tables of names, etc. 3. Substitution for binary tree name lookup that do not require a minimum guaranteed lookup time I like HASH tables because they are easy to implement, but the key question is what HASH function does one use. Here is one I use: unsigned int HASH_Value( char *name ) { unsigned long int hashval; int i; hashval = 0; for( i = 0; i < HASH_MAX_NAME_SIZE; i++ ) { if( name[i] == '\0' ) break; hashval += name[i] * i + 1; } return( (unsigned int)(hashval)%HASH_MAX_TABLE_SIZE ); /* traditional hash function for( hashval = 0; *name != '\0'; name++ ) hashval = *name + 31 * hashval; return ( hashval % HASH_MAX_TABLE_SIZE ); */ } It works for me, what works for you? Please ignore any typos. Just looking for discussion on the topic.
"A little time, a little trouble, your better day" Badfinger
The important point is that there is no universal "best" hash function. What makes a "good" hash (i.e. minimal collisions, not too costly to compute) depends very much on the nature of the data you are hashing. A good hash for peoples' full names may not do too well on their SSNs or phone numbers for example. And of course, the size of your hash table has a huge influence on performance. There are lots of good discussions (and some not so good) to be found if you search things like "best hash function".
Software rusts. Simon Stephenson, ca 1994. So does this signature. me, 2012
-
The important point is that there is no universal "best" hash function. What makes a "good" hash (i.e. minimal collisions, not too costly to compute) depends very much on the nature of the data you are hashing. A good hash for peoples' full names may not do too well on their SSNs or phone numbers for example. And of course, the size of your hash table has a huge influence on performance. There are lots of good discussions (and some not so good) to be found if you search things like "best hash function".
Software rusts. Simon Stephenson, ca 1994. So does this signature. me, 2012
Thanx for quick response. This is a good start. Agreed: "No universal "best" hash function" "Minimize collisions" "lots of discussions on "best hash function", etc "What is nature of the data to be hashed?", etc However, I suspect that google uses proprietary hashing techniques caching searches for future searches, "who knows", collision mitigation, daisy chaining (nested hash tables, etc. my terminology) I suspect this subject maybe a lot deeper than the search for the standard "best hash function". hence my proposal for a discussion. It may come to nothing, but it's out there.
"A little time, a little trouble, your better day" Badfinger
-
I have been using HASH Tables for many applications. 1. Keyword lookup for command line processing 2. Generic name lookup tables of names, etc. 3. Substitution for binary tree name lookup that do not require a minimum guaranteed lookup time I like HASH tables because they are easy to implement, but the key question is what HASH function does one use. Here is one I use: unsigned int HASH_Value( char *name ) { unsigned long int hashval; int i; hashval = 0; for( i = 0; i < HASH_MAX_NAME_SIZE; i++ ) { if( name[i] == '\0' ) break; hashval += name[i] * i + 1; } return( (unsigned int)(hashval)%HASH_MAX_TABLE_SIZE ); /* traditional hash function for( hashval = 0; *name != '\0'; name++ ) hashval = *name + 31 * hashval; return ( hashval % HASH_MAX_TABLE_SIZE ); */ } It works for me, what works for you? Please ignore any typos. Just looking for discussion on the topic.
"A little time, a little trouble, your better day" Badfinger
After searching for a good hash for strings, I settled on the following:
uint32_t string_hash(const char* s)
{
uint64_t hash = 0;
auto size = strlen(s);for(size_t i = 0; i < size; ++i)
{
hash = s[i] + (hash << 16) + (hash << 6) - hash;
}return hash;
}And then you truncate the result to be a valid index into your hash table.
Robust Services Core | Software Techniques for Lemmings | Articles
The fox knows many things, but the hedgehog knows one big thing. -
I have been using HASH Tables for many applications. 1. Keyword lookup for command line processing 2. Generic name lookup tables of names, etc. 3. Substitution for binary tree name lookup that do not require a minimum guaranteed lookup time I like HASH tables because they are easy to implement, but the key question is what HASH function does one use. Here is one I use: unsigned int HASH_Value( char *name ) { unsigned long int hashval; int i; hashval = 0; for( i = 0; i < HASH_MAX_NAME_SIZE; i++ ) { if( name[i] == '\0' ) break; hashval += name[i] * i + 1; } return( (unsigned int)(hashval)%HASH_MAX_TABLE_SIZE ); /* traditional hash function for( hashval = 0; *name != '\0'; name++ ) hashval = *name + 31 * hashval; return ( hashval % HASH_MAX_TABLE_SIZE ); */ } It works for me, what works for you? Please ignore any typos. Just looking for discussion on the topic.
"A little time, a little trouble, your better day" Badfinger
-
I say: use libraries. Unless, maybe, the domain is embedded with required extremely small footprint.
"If we don't change direction, we'll end up where we're going"
megaadam wrote:
Unless, maybe, the domain is embedded with required extremely small footprint.
Welcome to my world.
GCS/GE d--(d) s-/+ a C+++ U+++ P-- L+@ E-- W+++ N+ o+ K- w+++ O? M-- V? PS+ PE Y+ PGP t+ 5? X R+++ tv-- b+(+++) DI+++ D++ G e++ h--- r+++ y+++* Weapons extension: ma- k++ F+2 X The shortest horror story: On Error Resume Next
-
I have been using HASH Tables for many applications. 1. Keyword lookup for command line processing 2. Generic name lookup tables of names, etc. 3. Substitution for binary tree name lookup that do not require a minimum guaranteed lookup time I like HASH tables because they are easy to implement, but the key question is what HASH function does one use. Here is one I use: unsigned int HASH_Value( char *name ) { unsigned long int hashval; int i; hashval = 0; for( i = 0; i < HASH_MAX_NAME_SIZE; i++ ) { if( name[i] == '\0' ) break; hashval += name[i] * i + 1; } return( (unsigned int)(hashval)%HASH_MAX_TABLE_SIZE ); /* traditional hash function for( hashval = 0; *name != '\0'; name++ ) hashval = *name + 31 * hashval; return ( hashval % HASH_MAX_TABLE_SIZE ); */ } It works for me, what works for you? Please ignore any typos. Just looking for discussion on the topic.
"A little time, a little trouble, your better day" Badfinger
Use code tags.
jmaida wrote:
what works for you?
Two of your examples use strings as the key. However the first would appear to be a fixed set. You could attempt to optimize based on that set. I have done so in the past to achieve zero collisions. However micro optimizations based on guessing is a waste of time. Optimize based on profiling the application using realistic data. (My example above for zero collisions was in fact a waste of time.) If I was using C or C++ I would use an existing library. Your code example is mixing the hash value with the hash table which works for very limited cases but in general the two should be distinct (thus the library.) Recalculating the hash every single time might not be ideal. But avoiding that means using a more complex structure.
jmaida wrote:
Substitution for binary tree name lookup that do not require a minimum guaranteed lookup time
I do not understand that statement. Hash table and binary tree are distinct data structures. You can replace one with the other but there are considerations for both which your statement does not make clear to me. I do know that I replaced a complex tree (not a normal binary tree) with a hash table and gained about a 30% speed improvement so perhaps you are referring to something like that.
-
I have been using HASH Tables for many applications. 1. Keyword lookup for command line processing 2. Generic name lookup tables of names, etc. 3. Substitution for binary tree name lookup that do not require a minimum guaranteed lookup time I like HASH tables because they are easy to implement, but the key question is what HASH function does one use. Here is one I use: unsigned int HASH_Value( char *name ) { unsigned long int hashval; int i; hashval = 0; for( i = 0; i < HASH_MAX_NAME_SIZE; i++ ) { if( name[i] == '\0' ) break; hashval += name[i] * i + 1; } return( (unsigned int)(hashval)%HASH_MAX_TABLE_SIZE ); /* traditional hash function for( hashval = 0; *name != '\0'; name++ ) hashval = *name + 31 * hashval; return ( hashval % HASH_MAX_TABLE_SIZE ); */ } It works for me, what works for you? Please ignore any typos. Just looking for discussion on the topic.
"A little time, a little trouble, your better day" Badfinger
Greetings and Kind Regards May I please inquire is there a reason you do not utilize any of these? std::hash - cppreference.com[^]
-
Use code tags.
jmaida wrote:
what works for you?
Two of your examples use strings as the key. However the first would appear to be a fixed set. You could attempt to optimize based on that set. I have done so in the past to achieve zero collisions. However micro optimizations based on guessing is a waste of time. Optimize based on profiling the application using realistic data. (My example above for zero collisions was in fact a waste of time.) If I was using C or C++ I would use an existing library. Your code example is mixing the hash value with the hash table which works for very limited cases but in general the two should be distinct (thus the library.) Recalculating the hash every single time might not be ideal. But avoiding that means using a more complex structure.
jmaida wrote:
Substitution for binary tree name lookup that do not require a minimum guaranteed lookup time
I do not understand that statement. Hash table and binary tree are distinct data structures. You can replace one with the other but there are considerations for both which your statement does not make clear to me. I do know that I replaced a complex tree (not a normal binary tree) with a hash table and gained about a 30% speed improvement so perhaps you are referring to something like that.
What I meant to say is If one does not required minimum lookup time (it's my understanding though may be wrong, that a balanced binary tree can provide a minimum lookup time), then hashing is an inexpensive alternative.
"A little time, a little trouble, your better day" Badfinger
-
Greetings and Kind Regards May I please inquire is there a reason you do not utilize any of these? std::hash - cppreference.com[^]
-
Greetings and Kind Regards May I please inquire is there a reason you do not utilize any of these? std::hash - cppreference.com[^]
-
After searching for a good hash for strings, I settled on the following:
uint32_t string_hash(const char* s)
{
uint64_t hash = 0;
auto size = strlen(s);for(size_t i = 0; i < size; ++i)
{
hash = s[i] + (hash << 16) + (hash << 6) - hash;
}return hash;
}And then you truncate the result to be a valid index into your hash table.
Robust Services Core | Software Techniques for Lemmings | Articles
The fox knows many things, but the hedgehog knows one big thing. -
Worked the best using random generated names. Generated 100 hash values with only 2 collisions. not bad. I'll call it the GREG UTAS HASH
"A little time, a little trouble, your better day" Badfinger
It's not mine! I found it somewhere on the net but don't recall where. EDIT: Sorry for just saying "After searching...". Now I see how it can be misinterpreted.
Robust Services Core | Software Techniques for Lemmings | Articles
The fox knows many things, but the hedgehog knows one big thing. -
Greetings and Kind Regards May I please inquire is there a reason you do not utilize any of these? std::hash - cppreference.com[^]
-
It's not mine! I found it somewhere on the net but don't recall where. EDIT: Sorry for just saying "After searching...". Now I see how it can be misinterpreted.
Robust Services Core | Software Techniques for Lemmings | Articles
The fox knows many things, but the hedgehog knows one big thing.I am giving you credit for funding it. I have 3 variations of hash functions and it's the best so far. One day I will post them, but too much going on here. Our weather here has gone frigid (for us) going into the teens. Trying to protect plants and such.
"A little time, a little trouble, your better day" Badfinger