Each of these has involved searching the data structure from a fixed starting point, until eventually we find what we're looking for or determine it must not be present.
An alternative approach is to use a hash table.
In this approach, the key value is used to uniquely determine the location of the data item in the table: we take the key value and apply a hashing function which tells us what row of the table the data must be in, and we jump straight to that row of the table.
This means that instead of taking linear or logarithmic time to search the data structure we can find what we want in constant time (the time taken to apply the hash function).
So what are the complications?
For instance, suppose we are storing information about students, and each student has an integer student number that we'll use as our key.
We could create a struct to hold information for individual students, and our hash table could simply be an array of these structs.
If we knew our student numbers were in the range 000000000 to 999999999 then we could simply create an array of size 1000000000 and then use the student number as an index into the array, i.e.:

    struct student {
        int id;
        string name;
        float gpa;
    };

    student hashtable[1000000000];

    // the hash function takes a student struct and
    // tells us what table position it must be in
    int hash_function(student s)
    {
        return s.id;
    }

This works perfectly, but if we only have a few thousand students then the hashtable is much bigger than it really needs to be.
Suppose that we decide to use a different hash function, where we only look at the last four digits of the student number:

    // keep only the last four decimal digits of the id
    int hash_function(student s)
    {
        return (s.id % 10000);
    }

Now we can get away with an array of just 10,000 structs, which might be very reasonable if we're expecting several thousand students.
Unfortunately, we can't be certain the last four digits uniquely identify students. Suppose one student had id 123456789 and another student had id 111116789: the hash function would tell us they both used array position 6789. (This is called a collision.)
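The collision described above can be checked directly. The following is only a minimal sketch: the helper names hash_last4 and collides are made up for illustration, and it assumes student numbers are non-negative integers that fit in an int.

```cpp
#include <cassert>

// Hypothetical helper (not from the notes): keep only the last four
// decimal digits of the student number.
int hash_last4(int id)
{
    assert(id >= 0);   // assumption: student numbers are non-negative
    return id % 10000;
}

// True when two different ids are sent to the same table position.
bool collides(int id_a, int id_b)
{
    return (id_a != id_b) && (hash_last4(id_a) == hash_last4(id_b));
}
```

Calling collides(123456789, 111116789) returns true, since both ids end in 6789.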
The goal is to come up with a hash function that allows us to use a very compact table (not much bigger than the expected amount of data) and yet minimizes the likelihood that different entries collide. (In fact, we usually allow for a small probability of collision, and include some means of figuring out what to do when a collision does occur.)
A perfect hash function is one which takes the key, identifies the table row, has no chance of collision, and allows us to use a table whose size exactly matches the number of data entries. Establishing such functions is sometimes possible when you know (or can predict) exactly the set of key values you'll be working with.
Here are a few simple examples of hash functions and table sizes:
Prime numbers are integers greater than 1 that are only divisible by 1 and themselves (e.g. 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, etc). They have properties that make them particularly useful in a variety of areas, including hash functions, data compression, and data encryption. For now, let us simply take it for granted that dividing by a prime number might give a lower probability of collision than dividing by a non-prime. Hence we could pick a prime number around 10,000 (e.g. 9511) and use that as both our table size and the mod value in the hash function.

    part1 = first 3 digits of the key
    part2 = next 3 digits of the key
    part3 = last 3 digits of the key
    P = some prime number (e.g. 9511)
    hashvalue = ((part1 * part2) + part3) % P;

Note that here we could run into collision problems if many people had 000 as either the first or middle 3 digits of their student number, since all those people would have (part1 * part2) == 0.
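The pseudocode above could be written out roughly as follows. This is a sketch under the assumption that the key is a nine-digit student number stored in an int; the function name fold_hash is invented for the example, and P is the prime suggested in the text.

```cpp
#include <cassert>

const int P = 9511;   // prime table size, as suggested in the text

// Hypothetical name: split a nine-digit id into three 3-digit parts
// and combine them as ((part1 * part2) + part3) % P.
int fold_hash(int id)
{
    assert(id >= 0 && id <= 999999999);   // assumed key range
    int part1 = id / 1000000;             // first 3 digits
    int part2 = (id / 1000) % 1000;       // next 3 digits
    int part3 = id % 1000;                // last 3 digits
    return ((part1 * part2) + part3) % P;
}
```

For the id 123456789 this gives ((123 * 456) + 789) % 9511, which is 9322. The ids 000456789 and 123000789 both hash to 789, illustrating the weakness mentioned above when either of the first two parts is 000.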
One of the most common techniques is simply to have each entry of the hash table be a pointer to the start of a linked list: with the linked list containing all the elements that mapped to that particular location.
The lists will (hopefully) be quite small (as long as our hash function doesn't create too many collisions). Thus we can apply the hash function (O(1)) and quickly search the small linked list to find the specific entry we're interested in.
A drawback with this is the overhead for the secondary data structure (lists or whatever), since we've got one of these for every table entry whether it's used or not and whether there are collisions or not.
Another possibility is, when a collision occurs, simply scan down the table to find the next untaken spot.
When inserting a new entry we simply scan down until a free spot is found, and when searching for a previously inserted entry we simply scan down until we find one with the right id or until we find an empty spot (in which case we assume the thing we're looking for isn't in the table).
I.e. if the hash function says position 6, and position 6 is already filled with some other entry, then check position 7, then 8, then 9, and so on until a free spot is reached.
An advantage of this is that we don't need a secondary data structure (chaining required us to use pointers to lists or trees or whatever), so if no collisions occur it is quite efficient.
One problem with this is that if many collisions occur we often wind up with big clusters of taken spots, making the collision resolution slower. For example, in the table below a hit anywhere in 2-12 will result in the next position filled being 13, then a hit anywhere in 2-13 will result in the next position being 14, etc. Once clusters start to form they can quickly cause performance degradation.

    Position  Status
        0     free
        1     free
        2     taken
        3     taken
        4     taken
        5     taken
        6     taken
        7     taken
        8     taken
        9     taken
       10     taken
       11     taken
       12     taken
       13     free
       14     free
       15     free
       16     free
       17     free
       18     free
       19     free
       20     free
       21     free
       22     free
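Linear probing as described above might be sketched like this. It is a simplified illustration, not a full implementation: the table stores bare integer ids, the names ProbeTable and EMPTY_SLOT are invented, the hash is simply id % table size, and the scan is assumed to wrap around to position 0 when it runs off the end of the table (the text does not say what happens there).

```cpp
#include <cassert>
#include <vector>

const int EMPTY_SLOT = -1;   // marks an unused position (ids assumed >= 0)

struct ProbeTable {
    std::vector<int> slots;

    explicit ProbeTable(int size) : slots(size, EMPTY_SLOT) {}

    int hash(int id) const { return id % static_cast<int>(slots.size()); }

    // Scan down from the hashed position until a free spot is found.
    // Returns the position used, or -1 if the table is full.
    int insert(int id)
    {
        assert(id >= 0);
        int n = static_cast<int>(slots.size());
        int start = hash(id);
        for (int step = 0; step < n; step++) {
            int pos = (start + step) % n;   // assumed wrap-around
            if (slots[pos] == EMPTY_SLOT || slots[pos] == id) {
                slots[pos] = id;
                return pos;
            }
        }
        return -1;
    }

    // Scan down until the id or an empty spot is found.
    // Returns the position of the id, or -1 if it is not in the table.
    int find(int id) const
    {
        int n = static_cast<int>(slots.size());
        int start = hash(id);
        for (int step = 0; step < n; step++) {
            int pos = (start + step) % n;
            if (slots[pos] == id) return pos;
            if (slots[pos] == EMPTY_SLOT) return -1;
        }
        return -1;
    }
};
```

With a table of 23 positions, inserting 6, 29 and 52 (which all hash to 6) fills positions 6, 7 and 8, and a later insert of 7 is pushed along to position 9: a small version of the clustering effect shown in the table.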
To get around the clustering problem, we could have a series of hash functions, h1(), h2(), h3(), etc. If h1(key) gives us a position occupied by some other key then we try h2(key); if that doesn't work we try h3(key), etc.
If the hash functions distribute the values quite differently around the table then clustering isn't as likely, but it does require having (and sometimes applying) a variety of different hash functions.
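One way to picture a series of hash functions is the sketch below. The three functions h1, h2 and h3 are arbitrary choices made up for this example (they are not recommended hash functions), the table size of 11 is likewise an assumption, and the sketch simply gives up if all three candidate positions are taken.

```cpp
#include <cassert>
#include <vector>

const int TABLE_SIZE = 11;   // assumed small prime table size
const int UNUSED = -1;       // marks an unused position (keys assumed >= 0)

// Three illustrative hash functions that spread keys differently.
int h1(int key) { return key % TABLE_SIZE; }
int h2(int key) { return (key / TABLE_SIZE) % TABLE_SIZE; }
int h3(int key) { return (7 * key + 3) % TABLE_SIZE; }

typedef int (*HashFn)(int);
const HashFn HASHES[] = { h1, h2, h3 };
const int NUM_HASHES = 3;

// Try h1, then h2, then h3. Returns the position used,
// or -1 if every candidate position is occupied by another key.
int rehash_insert(std::vector<int> &table, int key)
{
    assert(key >= 0);
    assert(static_cast<int>(table.size()) == TABLE_SIZE);
    for (int i = 0; i < NUM_HASHES; i++) {
        int pos = HASHES[i](key);
        if (table[pos] == UNUSED || table[pos] == key) {
            table[pos] = key;
            return pos;
        }
    }
    return -1;
}

// Follow the same sequence of positions when searching.
int rehash_find(const std::vector<int> &table, int key)
{
    for (int i = 0; i < NUM_HASHES; i++) {
        int pos = HASHES[i](key);
        if (table[pos] == key) return pos;
        if (table[pos] == UNUSED) return -1;
    }
    return -1;
}
```

Starting from a table of 11 unused slots, key 5 lands in position 5 via h1, key 16 collides there and moves to position 1 via h2, and key 12 collides under both h1 and h2 and ends up in position 10 via h3.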
If chaining is used for collision resolution then deletion is also no problem: simply delete the desired element from the list of elements in the appropriate table row.
However, if linear probes or secondary hash functions or other similar techniques are used for collision resolution then deleting values creates a problem.
Suppose hash(key1) == hash(key2) == hash(key3), and we're using linear probing.
In the hash table we might see key1 in position h, key2 in position h+1, and key3 in position h+2.
Now suppose we delete key2.
The next time we go looking for key3, we apply the hash function, look in position h, realize what we want isn't there, and look in position h+1. We see it's empty, therefore we conclude that key3 isn't in the table!
This means when we delete key2 we also need to move key3 up into key2's position, but how would we know key3 wasn't in position h+2 because it belonged there, instead of as a result of collision?
The typical solution is, when collision occurs, keep pointers to the elements that have moved on because of the collision. (Or use chaining!)
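To make the deletion problem concrete, here is a sketch of a linear-probing table that marks a removed entry as "deleted" rather than "free", so that searches keep scanning past it. Note that this marker technique (often called a tombstone) is a different approach from the pointer-based solution mentioned above; it is shown only because it is short. The names TombstoneTable, Slot and SlotState are invented, and the sketch assumes non-negative integer keys that are never inserted twice.

```cpp
#include <cassert>
#include <vector>

enum SlotState { SLOT_FREE, SLOT_TAKEN, SLOT_DELETED };

struct Slot {
    SlotState state;
    int key;
    Slot() : state(SLOT_FREE), key(0) {}
};

struct TombstoneTable {
    std::vector<Slot> slots;

    explicit TombstoneTable(int size) : slots(size) {}

    int hash(int key) const { return key % static_cast<int>(slots.size()); }

    // Linear probe for the first position that is not currently taken
    // (free or previously deleted). Returns false if the table is full.
    bool insert(int key)
    {
        assert(key >= 0);
        int n = static_cast<int>(slots.size());
        for (int step = 0; step < n; step++) {
            int pos = (hash(key) + step) % n;
            if (slots[pos].state != SLOT_TAKEN) {
                slots[pos].state = SLOT_TAKEN;
                slots[pos].key = key;
                return true;
            }
        }
        return false;
    }

    // Stop only at a truly free slot; deleted slots are skipped over.
    int find(int key) const
    {
        int n = static_cast<int>(slots.size());
        for (int step = 0; step < n; step++) {
            int pos = (hash(key) + step) % n;
            if (slots[pos].state == SLOT_FREE) return -1;
            if (slots[pos].state == SLOT_TAKEN && slots[pos].key == key)
                return pos;
        }
        return -1;
    }

    // Mark the slot as deleted instead of free.
    bool remove(int key)
    {
        int pos = find(key);
        if (pos < 0) return false;
        slots[pos].state = SLOT_DELETED;
        return true;
    }
};
```

In a table of size 10, keys 3, 13 and 23 play the roles of key1, key2 and key3: they occupy positions 3, 4 and 5. After removing 13, a search for 23 still passes over position 4 and finds it in position 5, which is exactly the lookup that fails if position 4 is simply marked empty.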
#include <iostream>
#include <iomanip>
#include <string>
#include <cstdlib>   // atoi, plus random/srandom on POSIX systems
#include <ctime>
#include <list>
using namespace std;

// print some extra info while in debugging mode
const bool DEBUG = false;

// can change keytype and datatype to anything
//    as long as the generators and hash functions
//    are updated appropriately
typedef string keytype;
typedef float datatype;

// define what the contents of a record look like
struct record {
   keytype key;
   datatype data;
};

// define a hash table of record pointers,
//    with methods to insert a new record into the table
//    and lookup a record based on its key
class hashtable {
   private:
      list<record*> *table;   // array of tsize chains
      int tsize;
      int hash(keytype k);
   public:
      hashtable(int sz = 0);
      ~hashtable();
      bool insert(record *r);
      record *lookup(keytype k);
      keytype randomkey();
};

// ------------- methods dependent on the key type --------------

// the hash function relies on knowledge of the keytype,
//    here assumed to be a string
//
// we're using a rotating hash function,
//    where, on processing each character, we:
//       make a copy of the current hash value
//          and shift it 12 bits
//       make another copy of the hash value
//          and shift it 6 bits
//       take the exclusive-or of the two shifted
//          values and the next key character
int hashtable::hash(keytype k)
{
   // an empty table has no valid positions
   if (tsize < 1) return -1;

   int length = k.length();
   int hash = length;
   for (int i = 0; i < length; i++) {
      if (DEBUG) {
         // note: the value printed after "=" is the hash
         //    before the current character is mixed in
         int h1 = hash << 12;
         int h2 = hash << 6;
         int h3 = k[i];
         cout << h1 << "^" << h2 << "^" << h3 << "=" << hash << endl;
      }
      hash = (hash << 12) ^ (hash << 6) ^ k[i];
   }
   if (hash < 0) hash = -hash;
   return (hash % tsize);
}

// generate random key
keytype hashtable::randomkey()
{
   keytype k;
   // here we rely on the actual datatype of the key,
   //    currently known to be a string,
   // we'll generate a random string of length 4..11
   int length = 4 + (random() % 8);
   for (int i = 0; i < length; i++) {
      char c = (random() % 26) + 'a';
      k += c;
   }
   return k;
}

// ------------- methods independent of the key type ------------

// the constructor allocates a table with sz lists of records,
//    and remembers the size of the table
hashtable::hashtable(int sz)
{
   if (sz < 0) sz = 0;
   table = new list<record*>[sz];
   if (!table) tsize = 0;
   else tsize = sz;

   // initialize the random number generator
   srandom((unsigned int)(time(NULL)));
}

// the destructor deallocates each record in each list
//    in the table
// and also computes the number of collisions in the
//    hash table and the length of the largest chain
hashtable::~hashtable()
{
   int collisions = 0;
   int largest = 0;
   int entries = 0;
   if (table) {
      list<record*>::iterator iter;
      for (int i = 0; i < tsize; i++) {
         int pos = 0;
         for (iter = table[i].begin(); iter != table[i].end(); iter++) {
            record *r = *iter;
            if (r) delete r;
            // every entry after the first in a chain is a collision
            if (pos > 0) collisions++;
            pos++;
            entries++;
         }
         // pos now holds the length of this chain
         if (pos > largest) largest = pos;
      }
      delete[] table;
   }
   cout << "Total collisions: " << collisions;
   cout << " out of " << entries << " entries";
   cout << ", largest chain: " << largest << endl;
}

// insert calls the hash function to find where to insert the
//    record, and pushes the record into the back of that list
bool hashtable::insert(record *r)
{
   if (!r) return false;
   if (!table) return false;
   int pos = hash(r->key);
   if ((pos < 0) || (pos >= tsize)) {
      cout << "Hash generated position " << pos << " on " << r->key << endl;
      return false;
   }
   table[pos].push_back(r);
   cout << "inserting " << setw(2) << r->key << ":" << r->data;
   cout << " in hash row " << pos << endl;
   return true;
}

// lookup calls the hash function to find which list should
//    contain the record with the specified key,
// then searches that list and returns the record found
//    (or null if no matching record is found)
record *hashtable::lookup(keytype k)
{
   if (!table) return NULL;
   list<record*>::iterator iter;
   int pos = hash(k);
   if ((pos < 0) || (pos >= tsize)) return NULL;
   for (iter = table[pos].begin(); iter != table[pos].end(); iter++) {
      record *r = *iter;
      if (!r) continue;
      if (r->key == k) return r;
   }
   return NULL;
}

// the main routine generates a bunch of records with
//    random key values, makes note of what key values they had,
//    and inserts them in the hash table
// it then goes through its list of key values and tests the
//    hash table to see if it can find them
int main()
{
   int size = 0;
   int numtests = 0;
   string entry;

   // allow the user to select the size of the table
   cout << "How large a table would you like to work with?" << endl;
   cout << "(e.g. a prime number about the size of ";
   cout << "your planned number of data entries)" << endl;
   do {
      cin >> entry;
      if (atoi(entry.c_str()) < 1) {
         cout << entry << " is not a positive integer value, ";
         cout << endl << "please try again" << endl;
      } else size = atoi(entry.c_str());
   } while (size < 1);

   // allow the user to select the number of test records
   cout << "How many test values would you like to insert?" << endl;
   do {
      cin >> entry;
      if (atoi(entry.c_str()) < 1) {
         cout << entry << " is not a positive integer value, ";
         cout << endl << "please try again" << endl;
      } else numtests = atoi(entry.c_str());
   } while (numtests < 1);

   // allocate the hash table, quit if it fails
   hashtable *H = new hashtable(size);
   if (H == NULL) {
      cout << "unable to allocate sufficient table space, sorry!" << endl;
      return 1;
   }

   // allocate space for the remembered keys (one per test record),
   //    quit if it fails
   keytype *keyvals = new keytype[numtests];
   if (keyvals == NULL) {
      cout << "unable to allocate sufficient test data, sorry!" << endl;
      delete H;
      return 2;
   }

   // create the desired number of test records,
   //    each with a random key,
   // remember their key values in the keyvals array,
   //    and insert them in the hash table
   cout << "Creating records with random keys and inserting in hash table" << endl;
   for (int i = 0; i < numtests; i++) {
      keyvals[i] = H->randomkey();
      record *r = new record;
      if (!r) continue;
      r->key = keyvals[i];
      r->data = i;
      H->insert(r);
   }

   // go through the list of remembered keys and try to
   //    retrieve each of them from the hash table
   cout << endl << "Looking for the records we created" << endl;
   for (int j = 0; j < numtests; j++) {
      record *s = H->lookup(keyvals[j]);
      if (!s) cout << "Could not find record " << keyvals[j] << endl;
      else {
         cout << setw(2) << s->key << ":" << s->data;
         cout << " found successfully" << endl;
      }
   }

   // deallocate the hash table and the storage
   //    for remembered keys
   delete H;
   delete[] keyvals;
   return 0;
}

/************************************************************
RESULTING OUTPUT:
(using table size of 1009 and 10 entries, DEBUG off)

How large a table would you like to work with?
(e.g. a prime number about 20% larger than your planned number of data entries)
How many test values would you like to insert?
Creating records with random keys and inserting in hash table
inserting cacvwg:0 in hash row 109
inserting incwg:1 in hash row 909
inserting mzatvxhzwi:2 in hash row 521
inserting pyymfxuha:3 in hash row 274
inserting dwfeesgcaso:4 in hash row 341
inserting vjxuswrang:5 in hash row 153
inserting muejpmjorht:6 in hash row 556
inserting oocqowhqr:7 in hash row 687
inserting nmjmflzn:8 in hash row 705
inserting nirx:9 in hash row 659

Looking for the records we created
cacvwg:0 found successfully
incwg:1 found successfully
mzatvxhzwi:2 found successfully
pyymfxuha:3 found successfully
dwfeesgcaso:4 found successfully
vjxuswrang:5 found successfully
muejpmjorht:6 found successfully
oocqowhqr:7 found successfully
nmjmflzn:8 found successfully
nirx:9 found successfully

--------- RESULTING OUTPUT WITH DEBUG TURNED ON ----------------
(i.e. showing the computation of the hash function
 as it processed each character in the current key)

How large a table would you like to work with?
(e.g. a prime number about 20% larger than your planned number of data entries)
How many test values would you like to insert?
Creating records with random keys and inserting in hash table
16384^256^119=4
68644864^1072576^121=16759
1926991872^30109248^105=67579321
266506240^-1002468800^106=1930493481
inserting wyij:0 in hash row 194
24576^384^103=6
102658048^1604032^120=25063
1620803584^-2122158592^105=101059000
-1629057024^1182505536^114=-518394263
-794615808^658672768^103=-660796878
449736704^-127190592^101=-136205081
inserting gxirge:1 in hash row 368
24576^384^115=6
102707200^1604800^107=25075
1959440384^-2116867392^104=101141675
-1716879360^1181133312^111=-182871384
-1029246976^-217408576^106=-540267921
-960847872^1796926080^104=833383338
inserting skhojh:2 in hash row 908
36864^576^102=9
153247744^2394496^122=37414
1855954944^1102741120^113=151448058
1575948288^-914899904^97=791011057
-1966993408^640354368^102=-1801933791
-1853726720^977668480^116=-1394010074
1184841728^-249922304^99=-1413191180
1406545920^-514893632^117=-1216004765
-1769254912^-1504039616^101=-1298569035
inserting fzqaftcue:3 in hash row 868
20480^320^104=5
85098496^1329664^113=20776
1822887936^1102224448^114=84331121
1489182720^1164119168^105=756386866
47091712^1745566272^98=497036521
inserting hqrib:4 in hash row 802
28672^448^103=7
119173120^1862080^104=29095
1637515264^-1048155648^97=117840296
-827977728^121280576^113=-1608717727
-930934784^1730284608^97=-912488399
297930752^-397998016^105=-1348395999
1367511040^1631980096^117=-108718039
inserting ghaqaiu:5 in hash row 873
40960^640^97=10
170790912^2668608^103=41697
1786933248^-2119562816^103=168208423
-2019921920^-903976512^114=-349668953
1100685312^1627810944^114=1300502962
-724623360^659766400^110=547179762
-1043406848^-418956416^119=-207872786
-1648398336^-1300824640^119=650763255
-1730449408^-228364864^109=801738167
-2116366336^-1375245504^112=1790451117
inserting aggrrnwwmp:6 in hash row 455
16384^256^114=4
68624384^1072256^105=16754
2010025984^31406656^98=67599593
446832640^-2073393024^111=1980869154
inserting ribo:7 in hash row 509
32768^512^119=8
136802304^2137536^118=33399
2128306176^33254784^102=134737334
14573568^-939296384^113=2132807142
-778104832^927366208^102=-925033999
-1765646336^-1235547776^103=-421958618
-1768001536^643463616^116=546925031
1083916288^822242560^108=-1329329740
inserting wvfqfgtl:8 in hash row 268
45056^704^118=11
187392000^2928000^99=45750
2094936064^-1041008448^113=185060835
-729083904^1263676480^101=-1121105743
-1815982080^-162592448^121=-1613153243
-711749632^1666600512^111=1703762233
1575153664^-1317565504^101=-1228546513
1471827968^626977088^102=-325747803
1267884032^-1188148864^117=1927592230
-1029746688^-1358267072^118=-222549515
1389584384^1699433856^121=1838492982
inserting vcqeyoefuvy:9 in hash row 521

Looking for the records we created
16384^256^119=4
68644864^1072576^121=16759
1926991872^30109248^105=67579321
266506240^-1002468800^106=1930493481
wyij:0 found successfully
24576^384^103=6
102658048^1604032^120=25063
1620803584^-2122158592^105=101059000
-1629057024^1182505536^114=-518394263
-794615808^658672768^103=-660796878
449736704^-127190592^101=-136205081
gxirge:1 found successfully
24576^384^115=6
102707200^1604800^107=25075
1959440384^-2116867392^104=101141675
-1716879360^1181133312^111=-182871384
-1029246976^-217408576^106=-540267921
-960847872^1796926080^104=833383338
skhojh:2 found successfully
36864^576^102=9
153247744^2394496^122=37414
1855954944^1102741120^113=151448058
1575948288^-914899904^97=791011057
-1966993408^640354368^102=-1801933791
-1853726720^977668480^116=-1394010074
1184841728^-249922304^99=-1413191180
1406545920^-514893632^117=-1216004765
-1769254912^-1504039616^101=-1298569035
fzqaftcue:3 found successfully
20480^320^104=5
85098496^1329664^113=20776
1822887936^1102224448^114=84331121
1489182720^1164119168^105=756386866
47091712^1745566272^98=497036521
hqrib:4 found successfully
28672^448^103=7
119173120^1862080^104=29095
1637515264^-1048155648^97=117840296
-827977728^121280576^113=-1608717727
-930934784^1730284608^97=-912488399
297930752^-397998016^105=-1348395999
1367511040^1631980096^117=-108718039
ghaqaiu:5 found successfully
40960^640^97=10
170790912^2668608^103=41697
1786933248^-2119562816^103=168208423
-2019921920^-903976512^114=-349668953
1100685312^1627810944^114=1300502962
-724623360^659766400^110=547179762
-1043406848^-418956416^119=-207872786
-1648398336^-1300824640^119=650763255
-1730449408^-228364864^109=801738167
-2116366336^-1375245504^112=1790451117
aggrrnwwmp:6 found successfully
16384^256^114=4
68624384^1072256^105=16754
2010025984^31406656^98=67599593
446832640^-2073393024^111=1980869154
ribo:7 found successfully
32768^512^119=8
136802304^2137536^118=33399
2128306176^33254784^102=134737334
14573568^-939296384^113=2132807142
-778104832^927366208^102=-925033999
-1765646336^-1235547776^103=-421958618
-1768001536^643463616^116=546925031
1083916288^822242560^108=-1329329740
wvfqfgtl:8 found successfully
45056^704^118=11
187392000^2928000^99=45750
2094936064^-1041008448^113=185060835
-729083904^1263676480^101=-1121105743
-1815982080^-162592448^121=-1613153243
-711749632^1666600512^111=1703762233
1575153664^-1317565504^101=-1228546513
1471827968^626977088^102=-325747803
1267884032^-1188148864^117=1927592230
-1029746688^-1358267072^118=-222549515
1389584384^1699433856^121=1838492982
vcqeyoefuvy:9 found successfully
*/