CSCI 260 Fall 2010: Hash tables, functions, and collisions

So far, we have considered a variety of methods for storing data in a searchable format (lists, binary search trees, AVL trees, etc.).

Each of these has involved searching the data structure from a fixed starting point, until eventually we find what we're looking for or determine it must not be present.

An alternative approach is to use a hash table.

In this approach, the key value is used to uniquely determine the location of the data item in the table: we take the key value and apply a hashing function which tells us what row of the table the data must be in, and we jump straight to that row of the table.

This means that instead of taking linear or logarithmic time to search the data structure we can find what we want in constant time (the time taken to apply the hash function).

So what are the complications?

For instance, suppose we are storing information about students, and each student has an integer student number that we'll use as our key.

We could create a struct to hold information for individual students, and our hash table could simply be an array of these structs.

If we knew our student numbers were in the range 000000000 to 999999999 then we could simply create an array of size 1000000000 and then use the student number as an index into the array, i.e.:

struct student {
   int    id;
   string name;
   float  gpa;
};

student hashtable[1000000000];

// the hash function takes a student struct and
//     tells us what table position it must be in
int hash_function(student s)
{
   return s.id;
}

This works perfectly (no two students can ever share a row), but if we only have a few thousand students then the hashtable is much bigger than it really needs to be.

Suppose that we decide to use a different hash function, where we only look at the last four digits of the student number:

int hash_function(student s)
{
   return (s.id % 10000);
}

Now we can get away with an array of just 10,000 structs - which might be very reasonable if we're expecting several thousand students.

Unfortunately, we can't be certain the last four digits uniquely identify students. Suppose one student had id 123456789 and another student had id 111116789: the hash function would tell us they both used array position 6789. (This is called a collision.)
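
The collision described above can be checked directly. The following is a minimal sketch; the helper names last_four_digits and collides are made up for this example and are not part of the course code, and ids are assumed to be non-negative.

```cpp
#include <cassert>

// hypothetical helper: keep only the last four decimal digits of an id
//    (assumes the id is non-negative)
int last_four_digits(int id)
{
   assert(id >= 0);
   return id % 10000;
}

// true if two different ids would be sent to the same table row
bool collides(int id1, int id2)
{
   return (id1 != id2) && (last_four_digits(id1) == last_four_digits(id2));
}
```

With these helpers, collides(123456789, 111116789) is true, since both ids are sent to row 6789.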

The goal is to come up with a hash function that allows us to use a very compact table (not much bigger than the expected amount of data) and yet minimizes the likelihood that different entries collide. (In fact, we usually allow for a small probability of collision, and include some means of figuring out what to do when a collision does occur.)

A perfect hash function is one which takes the key, identifies the table row, has no chance of collision, and allows us to use a table whose size exactly matches the number of data entries. Establishing such functions is sometimes possible when you know (or can predict) exactly the set of key values you'll be working with.
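
As a small illustration, suppose the complete key set is known in advance to be 10, 21, 32, and 43 (a made-up set chosen for this sketch). For that particular set, taking the key modulo 4 happens to give each key its own row in a table of exactly four rows. The names NUM_KEYS, KNOWN_KEYS, perfect_hash, and is_collision_free are assumptions for the example.

```cpp
#include <cassert>

// assumed, made-up key set that is fully known in advance
const int NUM_KEYS = 4;
const int KNOWN_KEYS[NUM_KEYS] = { 10, 21, 32, 43 };

// for this particular key set, key % 4 gives rows 2, 1, 0, 3:
//    every key gets its own row and no row is wasted
int perfect_hash(int key)
{
   assert(key >= 0);
   return key % NUM_KEYS;
}

// returns true if every known key maps to a distinct row
bool is_collision_free()
{
   bool used[NUM_KEYS] = { false, false, false, false };
   for (int i = 0; i < NUM_KEYS; i++) {
      int row = perfect_hash(KNOWN_KEYS[i]);
      if (used[row]) return false;
      used[row] = true;
   }
   return true;
}
```

The same function is not perfect for other key sets: adding the key 14, for instance, would collide with 10 in row 2.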

Here are a few simple examples of hash functions and table sizes:
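
As an illustration, two simple hash functions might be sketched as follows. The function names mod_hash and sum_hash are assumptions chosen for this example, and the table size is passed in as a parameter (a prime such as 1009 is a typical choice).

```cpp
#include <cassert>
#include <string>

// integer keys: take the remainder modulo the table size
//    (assumes a non-negative key and a positive table size)
int mod_hash(int key, int tsize)
{
   assert(key >= 0 && tsize > 0);
   return key % tsize;
}

// string keys: add up the character codes, then take the remainder
//    (simple, but anagrams such as "abc" and "cab" always collide)
int sum_hash(const std::string &key, int tsize)
{
   assert(tsize > 0);
   unsigned int sum = 0;
   for (std::string::size_type i = 0; i < key.length(); i++) {
      sum += (unsigned char)key[i];
   }
   return (int)(sum % (unsigned int)tsize);
}
```

The additive string hash shows why the choice of function matters: it ignores the order of the characters, so any two keys made of the same letters land in the same row.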

Collision Resolution

Most of the time we have to live with the fact that our hash function will be fast but not perfect - it might map some sets of keys to the same entry in the table. As a result, we need to find some way to handle collisions.
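
One common way to handle collisions is linear probing: if the hashed row is already occupied, try the next row, and so on, wrapping around at the end of the table. The following is a minimal sketch under stated assumptions: non-negative integer keys, a fixed table of 7 rows, key % 7 as the hash function, and -1 marking an unused row. The names TSIZE, EMPTY, init_table, probe_insert, and probe_lookup are made up for this example and are not part of the course code.

```cpp
#include <cassert>

const int TSIZE = 7;      // assumed small table size for the example
const int EMPTY = -1;     // assumed marker for an unused row (keys must be >= 0)

// the table must be filled with EMPTY before use
void init_table(int table[])
{
   for (int i = 0; i < TSIZE; i++) table[i] = EMPTY;
}

// insert: start at the hashed row and step forward (wrapping around)
//    until an empty row, or the key itself, is found;
//    returns the row used, or -1 if the table is full
int probe_insert(int table[], int key)
{
   assert(key >= 0);
   int start = key % TSIZE;
   for (int step = 0; step < TSIZE; step++) {
      int row = (start + step) % TSIZE;
      if (table[row] == EMPTY || table[row] == key) {
         table[row] = key;
         return row;
      }
   }
   return -1;
}

// lookup: follow the same sequence of rows; reaching an empty row
//    means the key cannot be in the table
int probe_lookup(const int table[], int key)
{
   assert(key >= 0);
   int start = key % TSIZE;
   for (int step = 0; step < TSIZE; step++) {
      int row = (start + step) % TSIZE;
      if (table[row] == key) return row;
      if (table[row] == EMPTY) return -1;
   }
   return -1;
}
```

For example, the keys 10, 17, and 24 all hash to row 3, so inserting them in that order places them in rows 3, 4, and 5. The alternative used in the sample code at the end of these notes is chaining, where each row holds a list of all the entries that hash to it.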

Deleting from hash tables

In a collision-free hash table, deletion is no problem: simply delete the desired element and proceed as normal.

If chaining is used for collision resolution then deletion is also no problem: simply delete the desired element from the list of elements in the appropriate table row.

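However, if linear probes, secondary hash functions, or other similar techniques are used for collision resolution then deleting values creates a problem: a lookup stops probing when it reaches an empty row, so simply emptying a row can hide entries that were placed further along the same probe sequence.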
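
One widely used remedy is to mark a removed row as "deleted" rather than "empty", so that lookups keep probing past it while inserts may reuse it. The following is a minimal sketch of that idea for linear probing, assuming non-negative integer keys, a table of 7 rows, and key % 7 as the hash function. The names TSIZE, EMPTY, DELETED, init_table, probe_lookup, probe_insert, and probe_remove are made up for this example.

```cpp
#include <cassert>

const int TSIZE = 7;       // assumed small table size for the example
const int EMPTY = -1;      // assumed marker: row has never been used
const int DELETED = -2;    // assumed marker: row held a key that was removed

void init_table(int table[])
{
   for (int i = 0; i < TSIZE; i++) table[i] = EMPTY;
}

// lookup skips over DELETED rows and stops only at an EMPTY row
int probe_lookup(const int table[], int key)
{
   assert(key >= 0);
   for (int step = 0; step < TSIZE; step++) {
      int row = (key % TSIZE + step) % TSIZE;
      if (table[row] == key) return row;
      if (table[row] == EMPTY) return -1;
   }
   return -1;
}

// insert first checks whether the key is already present, then
//    takes the first EMPTY or DELETED row on the probe path
int probe_insert(int table[], int key)
{
   int found = probe_lookup(table, key);
   if (found >= 0) return found;
   for (int step = 0; step < TSIZE; step++) {
      int row = (key % TSIZE + step) % TSIZE;
      if (table[row] == EMPTY || table[row] == DELETED) {
         table[row] = key;
         return row;
      }
   }
   return -1;
}

// remove marks the row DELETED instead of EMPTY, so keys stored
//    further along the same probe path can still be found
bool probe_remove(int table[], int key)
{
   int row = probe_lookup(table, key);
   if (row < 0) return false;
   table[row] = DELETED;
   return true;
}
```

For example, after inserting 10 and 17 (rows 3 and 4) and then removing 10, a lookup for 17 still succeeds because row 3 is marked DELETED rather than EMPTY. The cost of this approach is that deleted markers accumulate and lengthen probe sequences until the table is rebuilt.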


SAMPLE CODE
#include <iostream>
#include <iomanip>
#include <string>
#include <ctime>
#include <cstdlib>   // atoi, random, srandom
#include <list>
using namespace std;

// print some extra info while in debugging mode
const bool DEBUG = false;

// can change keytype and datatype to anything 
//     as long as the generators and hash functions
//     are updated appropriately
typedef string keytype;
typedef float datatype;

// define what the contents of a record look like
struct record {
   keytype key;
   datatype data;
};

// define a hash table of record pointers,
//    with methods to insert a new record into the table
//    and lookup a record based on its key
class hashtable {
   private:
      list<record*> *table;
      int tsize;
      int hash(keytype k);
   public:
      hashtable(int sz = 0);
      ~hashtable();
      bool insert(record *r);
      record *lookup(keytype k);  
      keytype randomkey();
};

// ------------- methods dependent on the key type --------------

// the hash function relies on knowledge of the keytype,
//     here assumed to be a string
//
// we're using a rotating hash function, 
// where, on processing each character, we:
//    make a copy of the current hash value
//         and shift it 12 bits
//    make another copy of the hash value
//         and shift it 6 bits
//    take the exclusive-or of the two shifted
//         values and the next key character
int hashtable::hash(keytype k)
{
   int length = k.length();
   int hash = length;
   for (int i = 0; i < length; i++) {
       if (DEBUG) {
          int h1 = hash << 12;
          int h2 = hash << 6;
          int h3 = k[i];
          cout << h1 << "^" << h2 << "^" << h3 << "=" << hash << endl;
       }
       hash = (hash << 12) ^ (hash <<  6) ^ k[i];
   }
   if (hash < 0) hash = -hash;
   return (hash % tsize);
}

// generate random key
keytype hashtable::randomkey()
{
   keytype k;
   // here we rely on the actual datatype of the key,
   //      currently known to be a string,
   // we'll generate a random string of length 4..11
   int length = 4 + (random() % 8);
   for (int i = 0; i < length; i++) {
       char c = (random() % 26) + 'a';
       k += c;
   }
   return k;
}

// ------------- methods independent of the key type ------------

// the constructor allocates a table with sz lists of records,
//     and remembers the size of the table
hashtable::hashtable(int sz)
{
   if (sz < 0) sz = 0;
   table = new list<record*>[sz];
   if (!table) tsize = 0;
   else tsize = sz;
   // initialize the random number generator
   srandom((unsigned int)(time(NULL)));
}

// the destructor deallocates each record in each list
//     in the table
// and also computes the number of collisions in the 
//     hash table and the length of the largest chain   
hashtable::~hashtable()
{
   int collisions = 0;
   int largest = 0;
   int entries = 0;
   if (table) {
      list<record*>::iterator iter;
      for (int i = 0; i < tsize; i++) {
         int pos = 0;
         for (iter = table[i].begin(); iter != table[i].end(); iter++) {
             record *r = *iter;
             if (r) delete r;
             if (pos > 0) collisions++;
             // pos counts from 0, so a chain's length is pos + 1
             if (pos + 1 > largest) largest = pos + 1;
             pos++;
             entries++;
         }
      }
      delete [] table;
   }
   cout << "Total collisions: " << collisions;
   cout << " out of " << entries << " entries";
   cout << ", largest chain: " << largest << endl;
}

// insert calls the hash function to find where to insert the 
//    record, and pushes the record into the back of that list
bool hashtable::insert(record *r)
{
   if (!r) return false;
   if (!table) return false;
   int pos = hash(r->key);
   if ((pos < 0) || (pos >= tsize)) {
      cout << "Hash generated position " << pos << " on " << r->key << endl;
      return false;
   }
   table[pos].push_back(r);
   cout << "inserting " << setw(2) << r->key << ":" << r->data;
   cout << " in hash row " << pos << endl;
   return true;
}

// lookup calls the hash function to find which list should 
//    contain the record with the specified key,
//    then searches that list and returns the record found
// (or null if no matching record is found)
record *hashtable::lookup(keytype k)
{
   if (!table) return NULL;
   list<record*>::iterator iter;
   int pos = hash(k);
   if ((pos < 0) || (pos >= tsize)) return NULL;
   for (iter = table[pos].begin(); iter != table[pos].end(); iter++) {
       record *r = *iter;
       if (!r) continue;
       if (r->key == k) return r;
   }
   return NULL;
}

// the main routine generates a bunch of records with
//     random key values, makes note of what key values they had,
//     and inserts them in the hash table
// it then goes through its list of key values and tests the
//     hash table to see if it can find them
int main()
{
   int size = 0;
   int numtests = 0;
   string entry;

   // allow the user to select the size of the table
   cout << "How large a table would you like to work with?" << endl;
   cout << "(e.g. a prime number about 20% larger than ";
   cout << "your planned number of data entries)" << endl;
   do {
      cin >> entry;
      if (atoi(entry.c_str()) < 1) {
         cout << entry << " is not a positive integer value, ";
         cout << endl << "please try again" << endl;
      } else size = atoi(entry.c_str());
   } while (size < 1);

   // allow the user to select the number of test records
   cout << "How many test values would you like to insert?" << endl;
   do {
      cin >> entry;
      if (atoi(entry.c_str()) < 1) {
         cout << entry << " is not a positive integer value, ";
         cout << endl << "please try again" << endl;
      } else numtests = atoi(entry.c_str());
   } while (numtests < 1);

   // allocate the hash table, quit if it fails
   hashtable *H = new hashtable(size);
   if (H == NULL) {
      cout << "unable to allocate sufficient table space, sorry!" << endl;
      return 1;
   }

   // allocate space for the test records, quit if it fails
   keytype *keyvals = new keytype[numtests];
   if (keyvals == NULL) {
      cout << "unable to allocate sufficient test data, sorry!" << endl;
      delete H;
      return 2;
   }

   // create the desired number of test records,
   //    each with a random key,
   // remember their key values in the keyvals array,
   //    and insert them in the hash table
   cout << "Creating records with random keys and inserting in hash table" << endl;
   for (int i = 0; i < numtests; i++) {
       keyvals[i] = H->randomkey();
       record *r = new record;
       if (!r) continue;
       r->key = keyvals[i];
       r->data = i;
       H->insert(r);
   }

   // go through the list of remembered keys and try to
   //    retrieve each of them from the hash table
   cout << endl << "Looking for the records we created" << endl;
   for (int j = 0; j < numtests; j++) {
       record *s = H->lookup(keyvals[j]);
       if (!s) cout << "Could not find record " << keyvals[j] << endl;
       else {
            cout << setw(2) << s->key << ":" << s->data;
            cout << " found successfully" << endl;
       }
   }

   // deallocate the hash table and the storage
   //    for remembered keys
   delete H;
   delete [] keyvals;
}

/************************************************************
     RESULTING OUTPUT:
  (using table size of 1009 and 10 entries, DEBUG off)

How large a table would you like to work with?
(e.g. a prime number about 20% larger than your planned number of data entries)
How many test values would you like to insert?
Creating records with random keys and inserting in hash table
inserting cacvwg:0 in hash row 109
inserting incwg:1 in hash row 909
inserting mzatvxhzwi:2 in hash row 521
inserting pyymfxuha:3 in hash row 274
inserting dwfeesgcaso:4 in hash row 341
inserting vjxuswrang:5 in hash row 153
inserting muejpmjorht:6 in hash row 556
inserting oocqowhqr:7 in hash row 687
inserting nmjmflzn:8 in hash row 705
inserting nirx:9 in hash row 659

Looking for the records we created
cacvwg:0 found successfully
incwg:1 found successfully
mzatvxhzwi:2 found successfully
pyymfxuha:3 found successfully
dwfeesgcaso:4 found successfully
vjxuswrang:5 found successfully
muejpmjorht:6 found successfully
oocqowhqr:7 found successfully
nmjmflzn:8 found successfully
nirx:9 found successfully

 --------- RESULTING OUTPUT WITH DEBUG TURNED ON ----------------
 (i.e. showing the computation of the hash function as it processed
       each character in the current key)

How large a table would you like to work with?
(e.g. a prime number about 20% larger than your planned number of data entries)
How many test values would you like to insert?
Creating records with random keys and inserting in hash table
16384^256^119=4
68644864^1072576^121=16759
1926991872^30109248^105=67579321
266506240^-1002468800^106=1930493481
inserting wyij:0 in hash row 194
24576^384^103=6
102658048^1604032^120=25063
1620803584^-2122158592^105=101059000
-1629057024^1182505536^114=-518394263
-794615808^658672768^103=-660796878
449736704^-127190592^101=-136205081
inserting gxirge:1 in hash row 368
24576^384^115=6
102707200^1604800^107=25075
1959440384^-2116867392^104=101141675
-1716879360^1181133312^111=-182871384
-1029246976^-217408576^106=-540267921
-960847872^1796926080^104=833383338
inserting skhojh:2 in hash row 908
36864^576^102=9
153247744^2394496^122=37414
1855954944^1102741120^113=151448058
1575948288^-914899904^97=791011057
-1966993408^640354368^102=-1801933791
-1853726720^977668480^116=-1394010074
1184841728^-249922304^99=-1413191180
1406545920^-514893632^117=-1216004765
-1769254912^-1504039616^101=-1298569035
inserting fzqaftcue:3 in hash row 868
20480^320^104=5
85098496^1329664^113=20776
1822887936^1102224448^114=84331121
1489182720^1164119168^105=756386866
47091712^1745566272^98=497036521
inserting hqrib:4 in hash row 802
28672^448^103=7
119173120^1862080^104=29095
1637515264^-1048155648^97=117840296
-827977728^121280576^113=-1608717727
-930934784^1730284608^97=-912488399
297930752^-397998016^105=-1348395999
1367511040^1631980096^117=-108718039
inserting ghaqaiu:5 in hash row 873
40960^640^97=10
170790912^2668608^103=41697
1786933248^-2119562816^103=168208423
-2019921920^-903976512^114=-349668953
1100685312^1627810944^114=1300502962
-724623360^659766400^110=547179762
-1043406848^-418956416^119=-207872786
-1648398336^-1300824640^119=650763255
-1730449408^-228364864^109=801738167
-2116366336^-1375245504^112=1790451117
inserting aggrrnwwmp:6 in hash row 455
16384^256^114=4
68624384^1072256^105=16754
2010025984^31406656^98=67599593
446832640^-2073393024^111=1980869154
inserting ribo:7 in hash row 509
32768^512^119=8
136802304^2137536^118=33399
2128306176^33254784^102=134737334
14573568^-939296384^113=2132807142
-778104832^927366208^102=-925033999
-1765646336^-1235547776^103=-421958618
-1768001536^643463616^116=546925031
1083916288^822242560^108=-1329329740
inserting wvfqfgtl:8 in hash row 268
45056^704^118=11
187392000^2928000^99=45750
2094936064^-1041008448^113=185060835
-729083904^1263676480^101=-1121105743
-1815982080^-162592448^121=-1613153243
-711749632^1666600512^111=1703762233
1575153664^-1317565504^101=-1228546513
1471827968^626977088^102=-325747803
1267884032^-1188148864^117=1927592230
-1029746688^-1358267072^118=-222549515
1389584384^1699433856^121=1838492982
inserting vcqeyoefuvy:9 in hash row 521

Looking for the records we created
16384^256^119=4
68644864^1072576^121=16759
1926991872^30109248^105=67579321
266506240^-1002468800^106=1930493481
wyij:0 found successfully
24576^384^103=6
102658048^1604032^120=25063
1620803584^-2122158592^105=101059000
-1629057024^1182505536^114=-518394263
-794615808^658672768^103=-660796878
449736704^-127190592^101=-136205081
gxirge:1 found successfully
24576^384^115=6
102707200^1604800^107=25075
1959440384^-2116867392^104=101141675
-1716879360^1181133312^111=-182871384
-1029246976^-217408576^106=-540267921
-960847872^1796926080^104=833383338
skhojh:2 found successfully
36864^576^102=9
153247744^2394496^122=37414
1855954944^1102741120^113=151448058
1575948288^-914899904^97=791011057
-1966993408^640354368^102=-1801933791
-1853726720^977668480^116=-1394010074
1184841728^-249922304^99=-1413191180
1406545920^-514893632^117=-1216004765
-1769254912^-1504039616^101=-1298569035
fzqaftcue:3 found successfully
20480^320^104=5
85098496^1329664^113=20776
1822887936^1102224448^114=84331121
1489182720^1164119168^105=756386866
47091712^1745566272^98=497036521
hqrib:4 found successfully
28672^448^103=7
119173120^1862080^104=29095
1637515264^-1048155648^97=117840296
-827977728^121280576^113=-1608717727
-930934784^1730284608^97=-912488399
297930752^-397998016^105=-1348395999
1367511040^1631980096^117=-108718039
ghaqaiu:5 found successfully
40960^640^97=10
170790912^2668608^103=41697
1786933248^-2119562816^103=168208423
-2019921920^-903976512^114=-349668953
1100685312^1627810944^114=1300502962
-724623360^659766400^110=547179762
-1043406848^-418956416^119=-207872786
-1648398336^-1300824640^119=650763255
-1730449408^-228364864^109=801738167
-2116366336^-1375245504^112=1790451117
aggrrnwwmp:6 found successfully
16384^256^114=4
68624384^1072256^105=16754
2010025984^31406656^98=67599593
446832640^-2073393024^111=1980869154
ribo:7 found successfully
32768^512^119=8
136802304^2137536^118=33399
2128306176^33254784^102=134737334
14573568^-939296384^113=2132807142
-778104832^927366208^102=-925033999
-1765646336^-1235547776^103=-421958618
-1768001536^643463616^116=546925031
1083916288^822242560^108=-1329329740
wvfqfgtl:8 found successfully
45056^704^118=11
187392000^2928000^99=45750
2094936064^-1041008448^113=185060835
-729083904^1263676480^101=-1121105743
-1815982080^-162592448^121=-1613153243
-711749632^1666600512^111=1703762233
1575153664^-1317565504^101=-1228546513
1471827968^626977088^102=-325747803
1267884032^-1188148864^117=1927592230
-1029746688^-1358267072^118=-222549515
1389584384^1699433856^121=1838492982
vcqeyoefuvy:9 found successfully

*/