CSCI 260 Fall 2010: Notes

CSCI 260 Fall 2010: Hash tables, functions, and collisions

So far, we have considered a variety of methods for storing data in a searchable format (lists, binary search trees, AVL trees, etc.).

Each of these has involved searching the data structure from a fixed starting point, until eventually we find what we're looking for or determine it must not be present.

An alternative approach is to use a hash table.

In this approach, the key value is used to uniquely determine the location of the data item in the table: we take the key value and apply a hashing function which tells us what row of the table the data must be in, and we jump straight to that row of the table.

This means that instead of taking linear or logarithmic time to search the data structure we can find what we want in constant time (the time taken to apply the hash function).

So what are the complications?

The hash table must be big enough to hold all the data items we want.
The hash function must be able to take the key and tell us, in constant time, which table entry the data belongs in.
The function must not map any two keys to the same table position (we'll drop this restriction later when we talk about collision resolution).

For instance, suppose we are storing information about students, and each student has an integer student number that we'll use as our key.

We could create a struct to hold information for individual students, and our hash table could simply be an array of these structs.

If we knew our student numbers were in the range 000000000 to 999999999 then we could simply create an array of size 1000000000 and then use the student number as an index into the array, i.e.:

struct student {
   int    id;
   string name;
   float  gpa;
};

student hashtable[1000000000];

// the hash function takes a student struct and
//     tells us what table position it must be in
int hash_function(student s)
{
   return s.id;
}

This works perfectly, but if we only have a few thousand students then the hashtable is much bigger than it really needs to be.

Suppose that we decide to use a different hash function, where we only look at the last four digits of the student number:

int hash_function(student s)
{
   return (s.id % 10000);
}

Now we can get away with an array of just 10,000 structs - which might be very reasonable if we're expecting several thousand students.

Unfortunately, we can't be certain the last four digits uniquely identify students. Suppose one student had id 123456789 and another student had id 111116789: the hash function would tell us they both used array position 6789. (This is called a collision.)
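
As a quick check of the collision described above, here is a minimal sketch that applies the last-four-digits idea to plain integer ids. The names last_four_digits and collides are made up for this illustration, and ids are assumed to be non-negative.

```cpp
// hash on the last four decimal digits of a student id
// (assumes the id is non-negative)
int last_four_digits(int id)
{
   return id % 10000;
}

// two ids collide when they are different ids
//     that hash to the same table position
bool collides(int id1, int id2)
{
   return (id1 != id2) && (last_four_digits(id1) == last_four_digits(id2));
}
```

Here last_four_digits(123456789) and last_four_digits(111116789) both give 6789, so collides(123456789, 111116789) is true.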

The goal is to come up with a hash function that allows us to use a very compact table (not much bigger than the expected amount of data) and yet minimizes the likelihood that different entries collide. (In fact, we usually allow for a small probability of collision, and include some means of figuring out what to do when a collision does occur.)

A perfect hash function is one which takes the key, identifies the table row, has no chance of collision, and allows us to use a table whose size exactly matches the number of data entries. Establishing such functions is sometimes possible when you know (or can predict) exactly the set of key values you'll be working with.

Here are a few simple examples of hash functions and table sizes:

Divide by a large prime number
One approach is to take the desired table size and simply divide the key value (e.g. student id) by the size of the table, using the remainder as the table position. This is essentially what we did in the 4-digit version discussed above, where the table size was 10,000.

Prime numbers are integers greater than 1 that are only divisible by 1 and themselves (e.g. 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, etc). They have properties that make them particularly useful in a variety of areas, including hash functions, data compression, and data encryption. For now, let us simply take it for granted that dividing by a prime number might give a lower probability of collision than dividing by a non-prime. Hence we could pick a prime number around 10,000 (e.g. 9511) and use that as our table size and mod value in the hash function.
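
One situation where a prime table size helps is when the keys share a common factor with the table size. The sketch below is only an illustration of that one case, not a general proof. The helper names (hash_mod, multiples_of, rows_used) and the sample keys are invented for the example.

```cpp
#include <set>
#include <vector>

// hash by remainder: the table size doubles as the mod value
int hash_mod(int key, int table_size)
{
   return key % table_size;
}

// build sample keys 0, step, 2*step, ... (count of them)
std::vector<int> multiples_of(int step, int count)
{
   std::vector<int> keys;
   for (int i = 0; i < count; i++) keys.push_back(i * step);
   return keys;
}

// count how many different table rows a set of keys lands in
int rows_used(const std::vector<int> &keys, int table_size)
{
   std::set<int> rows;
   for (int i = 0; i < (int)keys.size(); i++) {
      rows.insert(hash_mod(keys[i], table_size));
   }
   return (int)rows.size();
}
```

With the 1000 keys 0, 100, 200, ..., 99900, a table of size 10000 only ever uses 100 different rows, while a table of size 9511 spreads the same keys over 1000 different rows.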

Extract a portion of the key
Another approach is to take selected digits (or even selected bits) from the id and string those together to form the hash value, e.g. take the 2nd, 4th, 6th, and 8th digits from the student number and concatenate those together to form a hash value in the range 0000 to 9999.
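
A minimal sketch of the digit-extraction idea, assuming the id is treated as a 9-digit number with leading zeros and digits are counted from the left. The function names digit_at and hash_even_digits are made up for this example.

```cpp
// return the digit in position pos (1..9, counted from the left)
//     of a 9-digit id, treating missing leading digits as 0
int digit_at(int id, int pos)
{
   int divisor = 1;
   for (int i = pos; i < 9; i++) divisor *= 10;
   return (id / divisor) % 10;
}

// concatenate the 2nd, 4th, 6th, and 8th digits
//     to form a hash value in the range 0..9999
int hash_even_digits(int id)
{
   int h = 0;
   for (int pos = 2; pos <= 8; pos += 2) {
      h = h * 10 + digit_at(id, pos);
   }
   return h;
}
```

For example, hash_even_digits(123456789) gives 2468.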
Manipulate portions of the key
We could create a function that divides the key into two or more components and then builds a formula from that, e.g.:
```
part1 = first 3 digits of the key
part2 = next 3 digits of the key
part3 = last 3 digits of the key
P = some prime number (e.g. 9511)
hashvalue = ((part1 * part2) + part3) % P;
```
Note that here we could run into collision problems if many people had 000 as either the first or middle 3 digits of their student number, since all those people would have (part1 * part2) == 0.
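
The formula above might be written in C++ roughly as follows, assuming a 9-digit integer id. The name hash_parts is invented for this sketch, and P is the example prime from the pseudocode.

```cpp
const int P = 9511;   // example prime used as the mod value

// split a 9-digit id into three 3-digit parts and combine them
int hash_parts(int id)
{
   int part1 = id / 1000000;         // first 3 digits
   int part2 = (id / 1000) % 1000;   // next 3 digits
   int part3 = id % 1000;            // last 3 digits
   return ((part1 * part2) + part3) % P;
}
```

Here hash_parts(123456789) gives 9322, while the ids 000456789 and 123000789 both hash to 789, which is the kind of collision the note above warns about.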

Collision Resolution

Most of the time we have to live with the fact that our hash function will be fast but not perfect - it might map some sets of keys to the same entry in the table. As a result, we need to find some way to handle collisions.

Chaining
One of the most common techniques is simply to have each entry of the hash table be a pointer to the start of a linked list, with the linked list containing all the elements that mapped to that particular location.
The lists will (hopefully) be quite small (as long as our hash function doesn't create too many collisions). Thus we can apply the hash function (O(1)) and quickly search the small linked list to find the specific entry we're interested in.
A drawback with this is the overhead for the secondary data structure (lists or whatever), since we've got one of these for every table entry whether it's used or not and whether there are collisions or not.
Linear probing
Another possibility is, when a collision occurs, simply scan down the table to find the next untaken spot.
When inserting a new entry we simply scan down until a free spot is found, and when searching for a previously inserted entry we simply scan down until we find one with the right id or until we find an empty spot (in which case we assume the thing we're looking for isn't in the table).
I.e. if the hash function says position 6, and position 6 is already filled with some other entry, then check position 7, then 8, then 9 etc etc until a free spot is reached.
An advantage of this is that we don't need a secondary data structure (chaining required us to use pointers to lists or trees or whatever), so if no collisions occur it is quite efficient.
One problem with this is that if many collisions occur we often wind up with big clusters of taken spots, making the collision resolution slower. For example, in the example below a hit anywhere in 2-12 will result in the next position filled being 13, then a hit anywhere in 2-13 will result in the next position being 14, etc. Once clusters start to form they can quickly cause performance degradation.
```
Position Status
    0      free
    1      free
    2      taken
    3      taken
    4      taken
    5      taken
    6      taken
    7      taken
    8      taken
    9      taken
   10      taken
   11      taken
   12      taken
   13      free
   14      free
   15      free
   16      free
   17      free
   18      free
   19      free
   20      free
   21      free
   22      free
```
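Linear probing as described above can be sketched as follows. This is a simplified illustration using non-negative integer keys: the names (probe_table, probe_insert, probe_lookup), the EMPTY marker, and the choice to wrap around to position 0 at the end of the table are all assumptions made for the example.

```cpp
#include <vector>

const int EMPTY = -1;   // marks an unused slot (assumes keys are never negative)

// a table of integer keys; size is assumed to be at least 1
struct probe_table {
   std::vector<int> slots;
   probe_table(int size) : slots(size, EMPTY) { }
};

// insert a key, scanning down from its hash position to the next
//     free spot; returns the position used, or -1 if the table is full
int probe_insert(probe_table &t, int key)
{
   int size = (int)t.slots.size();
   int start = key % size;
   for (int i = 0; i < size; i++) {
      int pos = (start + i) % size;   // wrap around at the end of the table
      if (t.slots[pos] == EMPTY || t.slots[pos] == key) {
         t.slots[pos] = key;
         return pos;
      }
   }
   return -1;
}

// look up a key, scanning down until we find it or hit an empty spot;
//     returns its position, or -1 if it isn't in the table
int probe_lookup(const probe_table &t, int key)
{
   int size = (int)t.slots.size();
   int start = key % size;
   for (int i = 0; i < size; i++) {
      int pos = (start + i) % size;
      if (t.slots[pos] == key) return pos;
      if (t.slots[pos] == EMPTY) return -1;
   }
   return -1;
}
```

In a table of size 23, inserting 6, 29, and 52 (which all hash to position 6) places them in positions 6, 7, and 8, which is how clusters like the one shown above begin to form.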
Secondary hash functions
To get around the clustering problem, we could have a series of hash functions, h1(), h2(), h3(), etc. If h1(Key) gives us a position occupied by some other key then try h2(key), if that doesn't work try h3(key), etc.
If the hash functions distribute the values quite differently around the table then clustering isn't as likely, but it does require having (and sometimes applying) a variety of different hash functions.
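
A rough sketch of trying a series of hash functions in turn is shown below. The three functions h1, h2, h3 are arbitrary choices made up for this example (they are not recommended hash functions). Giving up when all three positions are taken is also an assumption, since a real table would need some further fallback.

```cpp
#include <vector>

const int EMPTY = -1;   // marks an unused slot (assumes keys are never negative)

// three example hash functions that spread keys differently
int h1(int key, int size) { return key % size; }
int h2(int key, int size) { return (key / size) % size; }
int h3(int key, int size) { return ((key / 1000) + (key % 1000)) % size; }

typedef int (*hash_fn)(int key, int size);
const hash_fn HASHES[] = { h1, h2, h3 };
const int NUM_HASHES = 3;

// try h1, then h2, then h3; returns the position used,
//     or -1 if every candidate position holds some other key
int multi_insert(std::vector<int> &table, int key)
{
   int size = (int)table.size();
   for (int i = 0; i < NUM_HASHES; i++) {
      int pos = HASHES[i](key, size);
      if (table[pos] == EMPTY || table[pos] == key) {
         table[pos] = key;
         return pos;
      }
   }
   return -1;
}

// follow the same sequence of positions when searching;
//     an empty spot means the key was never inserted
//     (this only holds if nothing has been deleted)
int multi_lookup(const std::vector<int> &table, int key)
{
   int size = (int)table.size();
   for (int i = 0; i < NUM_HASHES; i++) {
      int pos = HASHES[i](key, size);
      if (table[pos] == key) return pos;
      if (table[pos] == EMPTY) return -1;
   }
   return -1;
}
```

In a table of size 11, the keys 6, 17, and 1106 all have h1 equal to 6, but they end up in positions 6, 1, and 8 respectively rather than in one contiguous cluster.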

Deleting from hash tables

In a collision-free hash table, deletion is no problem: simply delete the desired element and proceed as normal.

If chaining is used for collision resolution then deletion is also no problem: simply delete the desired element from the list of elements in the appropriate table row.

However, if linear probes or secondary hash functions or other similar techniques are used for collision resolution then deleting values creates a problem.

Suppose hash(key1) == hash(key2) == hash(key3), and we're using linear probing.

In the hash table we might see key1 in position h, key2 in position h+1, and key3 in position h+2.

Now suppose we delete key2.

The next time we go looking for key3, we apply the hash function, look in position h, realize what we want isn't there, and look in position h+1. We see it's empty, therefore we conclude that key3 isn't in the table!

This means when we delete key2 we also need to move key3 up into key2's position, but how would we know key3 wasn't in position h+2 because it belonged there, instead of as a result of collision?

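One solution is, when a collision occurs, to keep pointers to the elements that have moved on because of the collision. Another common approach is to mark a deleted position as "deleted" rather than "empty", so that later searches keep scanning past it. (Or use chaining!)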
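
A different technique from the pointer approach is lazy deletion: instead of emptying the slot, we overwrite it with a special "deleted" marker (sometimes called a tombstone). Searches scan past the marker, and inserts may reuse it. The sketch below assumes linear probing with non-negative integer keys. The names find_slot, insert_key, remove_key and the EMPTY/DELETED markers are invented for the example.

```cpp
#include <vector>

const int EMPTY   = -1;   // slot has never held a key
const int DELETED = -2;   // slot held a key that was later removed

// scan down from the hash position; stop at a truly empty slot,
//     but keep going past deleted ones
int find_slot(const std::vector<int> &table, int key)
{
   int size = (int)table.size();
   int start = key % size;
   for (int i = 0; i < size; i++) {
      int pos = (start + i) % size;
      if (table[pos] == key) return pos;
      if (table[pos] == EMPTY) return -1;
   }
   return -1;
}

// insert into the first empty or deleted slot (unless already present)
bool insert_key(std::vector<int> &table, int key)
{
   if (find_slot(table, key) >= 0) return true;
   int size = (int)table.size();
   int start = key % size;
   for (int i = 0; i < size; i++) {
      int pos = (start + i) % size;
      if (table[pos] == EMPTY || table[pos] == DELETED) {
         table[pos] = key;
         return true;
      }
   }
   return false;   // table is full
}

// mark the slot as deleted rather than empty
bool remove_key(std::vector<int> &table, int key)
{
   int pos = find_slot(table, key);
   if (pos < 0) return false;
   table[pos] = DELETED;
   return true;
}
```

With a table of size 10, inserting 3, 13, and 23 fills positions 3, 4, and 5. After removing 13, a search for 23 still succeeds because the scan passes over the marker in position 4.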

SAMPLE CODE

#include <iostream>
#include <iomanip>
#include <string>
#include <ctime>
#include <list>
#include <cstdlib>   // atoi, random, srandom
using namespace std;

// print some extra info while in debugging mode
const bool DEBUG = false;

// can change keytype and datatype to anything 
//     as long as the generators and hash functions
//     are updated appropriately
typedef string keytype;
typedef float datatype;

// define what the contents of a record look like
struct record {
   keytype key;
   datatype data;
};

// define a hash table of record pointers,
//    with methods to insert a new record into the table
//    and lookup a record based on its key
class hashtable {
   private:
      list<record*> *table;
      int tsize;
      int hash(keytype k);
   public:
      hashtable(int sz = 0);
      ~hashtable();
      bool insert(record *r);
      record *lookup(keytype k);  
      keytype randomkey();
};

// ------------- methods dependent on the key type --------------

// the hash function relies on knowledge of the keytype,
//     here assumed to be a string
//
// we're using a rotating hash function, 
// where, on processing each character, we:
//    make a copy of the current hash value
//         and shift it 12 bits
//    make another copy of the hash value
//         and shift it 6 bits
//    take the exclusive-or of the two shifted
//         values and the next key character
int hashtable::hash(keytype k)
{
   int length = k.length();
   int hash = length;
   for (int i = 0; i < length; i++) {
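       // note: the value printed after "=" is the hash value
       //       *before* the current character is mixed in
       if (DEBUG) {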
          int h1 = hash << 12;
          int h2 = hash << 6;
          int h3 = k[i];
          cout << h1 << "^" << h2 << "^" << h3 << "=" << hash << endl;
       }
       hash = (hash << 12) ^ (hash <<  6) ^ k[i];
   }
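   if (tsize < 1) return -1;   // no table rows to hash into
   hash = hash % tsize;        // take the remainder first so negating cannot overflow
   if (hash < 0) hash = -hash;
   return hash;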
}

// generate random key
keytype hashtable::randomkey()
{
   keytype k;
   // here we rely on the actual datatype of the key,
   //      currently known to be a string,
   // we'll generate a random string of length 4..12
   int length = 4 + (random() % 8);
   for (int i = 0; i < length; i++) {
       char c = (random() % 26) + 'a';
       k += c;
   }
   return k;
}

// ------------- methods independent of the key type ------------

// the constructor allocates a table with sz lists of records,
//     and remembers the size of the table
hashtable::hashtable(int sz)
{
   if (sz < 0) sz = 0;
   table = new list<record*>[sz];
   if (!table) tsize = 0;
   else tsize = sz;
   // initialize the random number generator
   srandom((unsigned int)(time(NULL)));
}

// the destructor deallocates each record in each list
//     in the table
// and also computes the number of collisions in the 
//     hash table and the length of the largest chain   
hashtable::~hashtable()
{
   int collisions = 0;
   int largest = 0;
   int entries = 0;
   if (table) {
      list<record*>::iterator iter;
      for (int i = 0; i < tsize; i++) {
         int pos = 0;
         for (iter = table[i].begin(); iter != table[i].end(); iter++) {
             record *r = *iter;
             if (r) delete r;
             if (pos > 0) collisions++;
             pos++;
             // pos now counts the records seen so far in this chain
             if (pos > largest) largest = pos;
             entries++;
         }
      }
      delete [] table;
   }
   cout << "Total collisions: " << collisions;
   cout << " out of " << entries << " entries";
   cout << ", largest chain: " << largest << endl;
}

// insert calls the hash function to find where to insert the 
//    record, and pushes the record into the back of that list
bool hashtable::insert(record *r)
{
   if (!r) return false;
   if (!table) return false;
   int pos = hash(r->key);
   if ((pos < 0) || (pos >= tsize)) {
      cout << "Hash generated position " << pos << " on " << r->key << endl;
      return false;
   }
   table[pos].push_back(r);
   cout << "inserting " << setw(2) << r->key << ":" << r->data;
   cout << " in hash row " << pos << endl;
   return true;
}

// lookup calls the hash function to find which list should 
//    contain the record with the specified key,
//    then searches that list and returns the record found
// (or null if no matching record is found)
record *hashtable::lookup(keytype k)
{
   if (!table) return NULL;
   list<record*>::iterator iter;
   int pos = hash(k);
   if ((pos < 0) || (pos >= tsize)) return NULL;
   for (iter = table[pos].begin(); iter != table[pos].end(); iter++) {
       record *r = *iter;
       if (!r) continue;
       if (r->key == k) return r;
   }
   return NULL;
}

// the main routine generates a bunch of records with
//     random key values, makes note of what key values they had,
//     and inserts them in the hash table
// it then goes through its list of key values and tests the
//     hash table to see if it can find them
int main()
{
   int size = 0;
   int numtests = 0;
   string entry;

   // allow the user to select the size of the table
   cout << "How large a table would you like to work with?" << endl;
   cout << "(e.g. a prime number about the size of ";
   cout << "your planned number of data entries)" << endl;
   do {
      cin >> entry;
      if (atoi(entry.c_str()) < 1) {
         cout << entry << " is not a positive integer value, ";
         cout << endl << "please try again" << endl;
      } else size = atoi(entry.c_str());
   } while (size < 1);

   // allow the user to select the number of test records
   cout << "How many test values would you like to insert?" << endl;
   do {
      cin >> entry;
      if (atoi(entry.c_str()) < 1) {
         cout << entry << " is not a positive integer value, ";
         cout << endl << "please try again" << endl;
      } else numtests = atoi(entry.c_str());
   } while (numtests < 1);

   // allocate the hash table, quit if it fails
   hashtable *H = new hashtable(size);
   if (H == NULL) {
      cout << "unable to allocate sufficient table space, sorry!" << endl;
      return 1;
   }

   // allocate space for the test records, quit if it fails
   keytype *keyvals = new keytype[numtests];
   if (keyvals == NULL) {
      cout << "unable to allocate sufficient test data, sorry!" << endl;
      delete H;
      return 2;
   }

   // create the desired number of test records,
   //    each with a random key,
   // remember their key values in the keyvals array,
   //    and insert them in the hash table
   cout << "Creating records with random keys and inserting in hash table" << endl;
   for (int i = 0; i < numtests; i++) {
       keyvals[i] = H->randomkey();
       record *r = new record;
       if (!r) continue;
       r->key = keyvals[i];
       r->data = i;
       H->insert(r);
   }

   // go through the list of remembered keys and try to
   //    retrieve each of them from the hash table
   cout << endl << "Looking for the records we created" << endl;
   for (int j = 0; j < numtests; j++) {
       record *s = H->lookup(keyvals[j]);
       if (!s) cout << "Could not find record " << keyvals[j] << endl;
       else {
            cout << setw(2) << s->key << ":" << s->data;
            cout << " found successfully" << endl;
       }
   }

   // deallocate the hash table and the storage
   //    for remembered keys
   delete H;
   delete [] keyvals;
}

/************************************************************
     RESULTING OUTPUT:
  (using table size of 1009 and 10 entries, DEBUG off)

How large a table would you like to work with?
(e.g. a prime number about 20% larger than your planned number of data entries)
How many test values would you like to insert?
Creating records with random keys and inserting in hash table
inserting cacvwg:0 in hash row 109
inserting incwg:1 in hash row 909
inserting mzatvxhzwi:2 in hash row 521
inserting pyymfxuha:3 in hash row 274
inserting dwfeesgcaso:4 in hash row 341
inserting vjxuswrang:5 in hash row 153
inserting muejpmjorht:6 in hash row 556
inserting oocqowhqr:7 in hash row 687
inserting nmjmflzn:8 in hash row 705
inserting nirx:9 in hash row 659

Looking for the records we created
cacvwg:0 found successfully
incwg:1 found successfully
mzatvxhzwi:2 found successfully
pyymfxuha:3 found successfully
dwfeesgcaso:4 found successfully
vjxuswrang:5 found successfully
muejpmjorht:6 found successfully
oocqowhqr:7 found successfully
nmjmflzn:8 found successfully
nirx:9 found successfully

 --------- RESULTING OUTPUT WITH DEBUG TURNED ON ----------------
 (i.e. showing the computation of the hash function as it processed
       each character in the current key)

How large a table would you like to work with?
(e.g. a prime number about 20% larger than your planned number of data entries)
How many test values would you like to insert?
Creating records with random keys and inserting in hash table
16384^256^119=4
68644864^1072576^121=16759
1926991872^30109248^105=67579321
266506240^-1002468800^106=1930493481
inserting wyij:0 in hash row 194
24576^384^103=6
102658048^1604032^120=25063
1620803584^-2122158592^105=101059000
-1629057024^1182505536^114=-518394263
-794615808^658672768^103=-660796878
449736704^-127190592^101=-136205081
inserting gxirge:1 in hash row 368
24576^384^115=6
102707200^1604800^107=25075
1959440384^-2116867392^104=101141675
-1716879360^1181133312^111=-182871384
-1029246976^-217408576^106=-540267921
-960847872^1796926080^104=833383338
inserting skhojh:2 in hash row 908
36864^576^102=9
153247744^2394496^122=37414
1855954944^1102741120^113=151448058
1575948288^-914899904^97=791011057
-1966993408^640354368^102=-1801933791
-1853726720^977668480^116=-1394010074
1184841728^-249922304^99=-1413191180
1406545920^-514893632^117=-1216004765
-1769254912^-1504039616^101=-1298569035
inserting fzqaftcue:3 in hash row 868
20480^320^104=5
85098496^1329664^113=20776
1822887936^1102224448^114=84331121
1489182720^1164119168^105=756386866
47091712^1745566272^98=497036521
inserting hqrib:4 in hash row 802
28672^448^103=7
119173120^1862080^104=29095
1637515264^-1048155648^97=117840296
-827977728^121280576^113=-1608717727
-930934784^1730284608^97=-912488399
297930752^-397998016^105=-1348395999
1367511040^1631980096^117=-108718039
inserting ghaqaiu:5 in hash row 873
40960^640^97=10
170790912^2668608^103=41697
1786933248^-2119562816^103=168208423
-2019921920^-903976512^114=-349668953
1100685312^1627810944^114=1300502962
-724623360^659766400^110=547179762
-1043406848^-418956416^119=-207872786
-1648398336^-1300824640^119=650763255
-1730449408^-228364864^109=801738167
-2116366336^-1375245504^112=1790451117
inserting aggrrnwwmp:6 in hash row 455
16384^256^114=4
68624384^1072256^105=16754
2010025984^31406656^98=67599593
446832640^-2073393024^111=1980869154
inserting ribo:7 in hash row 509
32768^512^119=8
136802304^2137536^118=33399
2128306176^33254784^102=134737334
14573568^-939296384^113=2132807142
-778104832^927366208^102=-925033999
-1765646336^-1235547776^103=-421958618
-1768001536^643463616^116=546925031
1083916288^822242560^108=-1329329740
inserting wvfqfgtl:8 in hash row 268
45056^704^118=11
187392000^2928000^99=45750
2094936064^-1041008448^113=185060835
-729083904^1263676480^101=-1121105743
-1815982080^-162592448^121=-1613153243
-711749632^1666600512^111=1703762233
1575153664^-1317565504^101=-1228546513
1471827968^626977088^102=-325747803
1267884032^-1188148864^117=1927592230
-1029746688^-1358267072^118=-222549515
1389584384^1699433856^121=1838492982
inserting vcqeyoefuvy:9 in hash row 521

Looking for the records we created
16384^256^119=4
68644864^1072576^121=16759
1926991872^30109248^105=67579321
266506240^-1002468800^106=1930493481
wyij:0 found successfully
24576^384^103=6
102658048^1604032^120=25063
1620803584^-2122158592^105=101059000
-1629057024^1182505536^114=-518394263
-794615808^658672768^103=-660796878
449736704^-127190592^101=-136205081
gxirge:1 found successfully
24576^384^115=6
102707200^1604800^107=25075
1959440384^-2116867392^104=101141675
-1716879360^1181133312^111=-182871384
-1029246976^-217408576^106=-540267921
-960847872^1796926080^104=833383338
skhojh:2 found successfully
36864^576^102=9
153247744^2394496^122=37414
1855954944^1102741120^113=151448058
1575948288^-914899904^97=791011057
-1966993408^640354368^102=-1801933791
-1853726720^977668480^116=-1394010074
1184841728^-249922304^99=-1413191180
1406545920^-514893632^117=-1216004765
-1769254912^-1504039616^101=-1298569035
fzqaftcue:3 found successfully
20480^320^104=5
85098496^1329664^113=20776
1822887936^1102224448^114=84331121
1489182720^1164119168^105=756386866
47091712^1745566272^98=497036521
hqrib:4 found successfully
28672^448^103=7
119173120^1862080^104=29095
1637515264^-1048155648^97=117840296
-827977728^121280576^113=-1608717727
-930934784^1730284608^97=-912488399
297930752^-397998016^105=-1348395999
1367511040^1631980096^117=-108718039
ghaqaiu:5 found successfully
40960^640^97=10
170790912^2668608^103=41697
1786933248^-2119562816^103=168208423
-2019921920^-903976512^114=-349668953
1100685312^1627810944^114=1300502962
-724623360^659766400^110=547179762
-1043406848^-418956416^119=-207872786
-1648398336^-1300824640^119=650763255
-1730449408^-228364864^109=801738167
-2116366336^-1375245504^112=1790451117
aggrrnwwmp:6 found successfully
16384^256^114=4
68624384^1072256^105=16754
2010025984^31406656^98=67599593
446832640^-2073393024^111=1980869154
ribo:7 found successfully
32768^512^119=8
136802304^2137536^118=33399
2128306176^33254784^102=134737334
14573568^-939296384^113=2132807142
-778104832^927366208^102=-925033999
-1765646336^-1235547776^103=-421958618
-1768001536^643463616^116=546925031
1083916288^822242560^108=-1329329740
wvfqfgtl:8 found successfully
45056^704^118=11
187392000^2928000^99=45750
2094936064^-1041008448^113=185060835
-729083904^1263676480^101=-1121105743
-1815982080^-162592448^121=-1613153243
-711749632^1666600512^111=1703762233
1575153664^-1317565504^101=-1228546513
1471827968^626977088^102=-325747803
1267884032^-1188148864^117=1927592230
-1029746688^-1358267072^118=-222549515
1389584384^1699433856^121=1838492982
vcqeyoefuvy:9 found successfully

*/