hashing algorithms

Daniel Turini

James Simpson wrote: I see the uses behind it, but I dont fully understand how the algorithms can take any peice of data and effectivly shrink or grow it to a fixed size (say 128 bit) and still keep it unique! You are right: one can do it only for finite and small sets of data. James Simpson wrote: I think from what I have read that it can not gaurentee its uniqueness, but reduce the chances of two items having the same hash value to a very very very small value! Yes, in this case we have what we call a "collision". So, normally your hash table entries are not strings, like in my sample, but linked lists (or arrays) of strings. So, you still can deal with collisions, and a good hash function will keep it at a minimum. If you want to look a program that generates perfect hash functions for limited sets of data take a look at gperf[^] Trying to make bits uncopyable is like trying to make water not wet. -- Bruce Schneier

markkuk

Data validation is the use for cryptographically secure, "one way" hashing algorithms. There are other uses for hash functions that have different requirements, e.g. quick searching of data.

l a u r e n

for example... a user signs up with a web site giving a username and password .. everytime they log in u dont want to be sending the password over the wire so u create a hash of the password and send that instead and u store a hash of the same password on the server no-one can get the password from packet sniffing the wire cos it isnt sent :)

"there is no spoon"
biz stuff about me

ColinDavies

l a u r e n wrote: no-one can get the password from packet sniffing the wire cos it isnt sent True but the hash was sent. And the dealers will be happy. :-) Regardz Colin J Davies

*** WARNING *
This could be addictive
**The minion's version of "Catch :bob: "

It's a real shame that people as stupid as you can work out how to use a computer. said by Christian Graus in the Soapbox

Shog9 0

:laugh::laugh:

Your sincerity about keeping the soapbox organized and civilized is so obvious. I solute your effort. -- Anonymous, 10/18/03

Jorgen Sigvardsson

Ooooh! Tomorrow I'll be pushing SHA-1 and MD5 to kids on school yards. The first 32 bits are free... -- Your life as it has been is over. From this time forward you will service us.

Daniel Turini

That's why I love CP: today I've learned a new meaning for "hash" :) Trying to make bits uncopyable is like trying to make water not wet. -- Bruce Schneier

peterchen

one use that has not been mentioned is speeding up lookup e.g. in an associative map. Imagine you have a dictionary [key, value] with quite a many long words as keys. Instead of using string comparisons, you just compare the hashes. The hashes of the strings in the dicitonary can be precalculated, and the dicitonary indexed by the hash instead of the string. of course there's a slight chance of two keys have the same hash - this must be treated separately, so you end up with a dictionary [key-hash, [ vector([key,value]) ] ] However, with a well-chosen hash this can be much faster.

"Vierteile den, der sie Hure schimpft mit einem türkischen Säbel."
sighist | Agile Programming | doxygen

Paul Oss

Daniel Turini wrote: When you need to search for "test", you only compute its hash value (5) and look at a[5], without need to search through all the array. Assuming that your hash algorithm is truly generating unique #'s for each unique string. Ie, if your algorithm takes two completely different strings, and due to weaknesses in your math, return the same hash value, you're screwed. ;) Paul

David Crow

In the context of hashing, this "uniqueness" you speak of is called collision. It's not always a bad thing. In some instances, the normalization (spreading out) of data is all that's required (i.e., collision is expected). If your specific implementation will not tolerate collision, it must be dealt with using a variety of methods (e.g., separate chaining and open addressing).

Five birds are sitting on a fence. Three of them decide to fly off. How many are left?

BrainJar

That's because hashing algorithms don't grow or shrink the input data, and they don't generate unique values. If a particular hash algorithm generates a 128 bit value, it generally means that it has a 16 byte array (or vector) of initial values. It takes the first 16 bytes of the input data and performs some addition and bit shifting and/or other operations to combine those values with the internal 16 bytes. (These 16 bytes are generally processed as four, 32-bit integers rather than as 16, 8-bit integers, but you get the point.) This changes data in the interal 16 byte vector. The program then repeats the operations on it with the next 16 bytes of input, then the next 16 bytes and so on. In the end, the program just outputs whatever result is left in the internal vector. In fact, most hashes must perform some type of padding on the input to get an even multiple of 16 (or whatever the vector length is) bytes. It doesn't matter how large or small the input length is, only 16 bytes are done at a time and only 16 bytes will be outputed. Increasing the input length just increases the number of times thru the computation loop. The trick to a good hash, like MD4, MD5 or SHA1, is to use computations that result in an avalanche effect: a small change in the input, like a single character in a 10MB file, radically alters the output. It's possible to have two different inputs hash to the same 128-bit value, it fact it's inevitable. But your chances of finding two out of the 2^128 possibilities are pretty slim.

Nitron

:omg: You mean you never knew that people used this to identify data? ;P - Nitron

"Those that say a task is impossible shouldn't interrupt the ones who are doing it." - Chinese Proverb

Terry Denham

This is exactly what I did with an ADO Recordset class that was generated using the #import directive. One of my peers was needing to pull back a fairly large set of data but do to some requirements we weren't able to build the data with a set of joins so we had to have this data represented in the recordset. Some of the records would be linked to other records in the same set. The process was taking about 14 hours to process about 140000 records due to the large number of loops that it had. I had them remove one of the loop and wrote a CAdoRecordsetIndex class that would be an associative map on the records in the CAdoRecordset class based on what what columns you told it to build the index on. Then when you needed to find the values you would pass in the array of values that you wanted to search for, the index class would turn this into a key, find the bookmark to the record that had this key, I used the Vector as the value incase there were multiple records that had the same nonunique key. Just this change alone took the processing from 14 Hours to 15 Minutes just by using the associative hash. This could have been improved some more if I would have changed the hash bucket size. I used the default of 17 (which is usually a prime number). If I would have used a larger prime I would have reduced the amount of time looking in the vector.

ColinDavies

peterchen wrote: of course there's a slight chance of two keys have the same hash - I remmeber writing a routine to see if this would happen once with some data. Probability theory says that it is possible, but with my data it never occured. Just by increasing the Hash a few bytes the probability drops of astronomically. Regardz Colin J Davies

*** WARNING *
This could be addictive
**The minion's version of "Catch :bob: "

It's a real shame that people as stupid as you can work out how to use a computer. said by Christian Graus in the Soapbox

leppie

peterchen wrote: However, with a well-chosen hash this can be much faster. Dont you mean longer? :laugh: leppie::AllocCPArticle("Zee blog");
Seen on my Campus BBS: Linux is free...coz no-one wants to pay for it.

Jorgen Sigvardsson

You need this book[^] :) Buy it and read it. You'll love it :) -- Here I am now, I'm your saviour. You can see I'm the one!

Gary R Wheeler

The point of the hash is not to be able to reproduce the original data from the hash. The point is to produce a smaller value that can be used to validate a similar or equivalent message at a later time. At one point in time, suppose you have a message A1, which has hash value H1. At some later time, suppose you receive message A2. For message A2 you calculate hash H2. If H2 == H1, then you can conclude that messages A1 and A2 are at least similar (if not equivalent), depending upon the hash algorithm.

Software Zen: delete this;

Daniel Turini

I explained how to deal with collisions (a bit simplistic, for didactic reasons) on the next post Trying to make bits uncopyable is like trying to make water not wet. -- Bruce Schneier

James Simpson

I have this book - bought it a couple of years ago. And yes - I did explain things :) James Simpson Web Developer imebgo@hotmail.com P S - This is what part of the alphabet would look like if Q and R were eliminated
Mitch Hedberg