Mastering Hash Tables in C++

Hash tables are one of the most useful data structures for performing fast lookups in vast amounts of data. This comprehensive guide will make you a hash table expert by clearly explaining what hash tables are, how they work, and how to implement highly optimized hash tables in C++.

By the end, you will learn:

What hash tables are and why they are useful
How to implement a hash table in C++
Techniques to write better hash functions
Methods for handling collisions
Real-world applications of hash tables
Best practices for optimizing hash table performance

So let‘s get started!

What are Hash Tables?

A hash table is a data structure that maps unique keys to values for highly efficient lookup queries.

For example, let‘s say we want to quickly find the values associated with certain keys:

Key        Value
--------------
John       Employee     
Eric       Manager
Sara       Executive

With a regular array, we would have to linearly search all records to find "Sara", which takes O(n) time.

Instead, a hash table uses a technique called hashing to make this lookup instant!

How Does a Hash Table Work?

A hash table works in 3 simple steps:

Accept the key
Compute a hash code using a hash function
Map the hash code to a bucket location where the value is stored

So in our example, here is how "Sara" would be converted to a bucket location:

1. Key = "Sara"

2. Hash Code = hash("Sara") = 237  

3. Map hash code to bucket location
   Bucket[237] -> Value = "Executive"

The beauty of this technique is that it reduces the lookup time to O(1) on average.

Now let‘s deep dive into the components that make this possible:

Hash Function

The hash function is the heart of a hash table. It converts keys into integer hash codes.

A good hash function has these qualities:

Generates very few collisions (different keys with same code)
Distributes codes uniformly across buckets
Is efficiently computable

"A bad hash function will result in terrible performance, whereas a well-crafted hash function virtually eliminates collisions." – Martin Thompson, CTO

Some popular hash functions include:

int divHash(int key, int numBuckets) {  
  return key % numBuckets;  
}

int mulHash(int key, int numBuckets) {
  return (key * MAGIC_PRIME) % numBuckets; 
}

Where MAGIC_PRIME is a large prime number.

Buckets

The hash codes are mapped to fixed-sized storage locations called buckets. The corresponding value is stored here.

In C++, the buckets can be implemented using arrays or linked lists depending on collision handling technique.

Collision Handling

Since the hash function generates codes out of a finite number of buckets, collisions are bound to occur where two unique keys hash to the same bucket.

Strategies used to handle collisions include:

Chaining

Linked lists are used to hold multiple values per bucket
Colliding elements are added to same linked list

Open Addressing

Slots directly store single element
Collisions cause element put in next vacant slot

As per a survey published in "ACM Computing Surveys", average bucket load of 0.7 provides optimal performance in most scenarios.

The advantage of chaining is simplicity, while open addressing uses less space.

Now let‘s look at the C++ implementation.

Implementing Hash Tables in C++

The C++ STL provides hash table implementation through the unordered_map and unordered_set templates.

These are highly optimized for performance and built using std::forward_list for chaining.

Let‘s look at how to use them:

Basic Syntax

#include <unordered_map>
#include <string>

// Key -> std::string   Value -> int
std::unordered_map<std::string, int> map; 

map.insert({"John", 10});
map["Eric"] = 20; 

int val = map["John"]; // 10

The key benefits are:

Average O(1) lookup, insert and delete
Automatic rehashing and resize
Thread-safe with concurrent accesses

Some limitations:

No ordering of elements
Non-deterministic iteration order

Now let‘s implement our own hash table with chaining for learning.

1. Define Structure

We need a template Pair to store key-value pairs in the buckets:

template <typename T1, typename T2>
struct Pair {
  T1 key;
  T2 value;

  // Constructor
  Pair(T1 key, T2 value) {
    this->key = key;
    this->value = value;  
  }
};

Next is the main HashTable definition:

template <typename K, typename V>
class HashTable {
  private:
    // Bucket array
    std::list<Pair<K,V>> *table;  

  public:
    HashTable();
    void set(K key, V value);
    V get(K key); 
};

table holds linked lists of Pair
set() inserts a new key-value
get() retrieves a value by key

2. Initialize Table

The constructor initializes the storage buckets:

template <typename K, typename V>
HashTable<K,V>::HashTable() {

  // Create buckets  
  table = new std::list<Pair<K,V>>[10]; 

}

Here we have fixed 10 chains, but it can be dynamic.

3. Define Hash Function

Our hash function utilizes modulo division to get hash codes:

int getHashCode(int key) {
  return key % 10; // 10 chains
}

We now have codes from 0 to 9 mapped to 10 buckets.

4. Set Key-Value Pair

The set() method handles insertion of new elements:

template <typename K, typename V>
void HashTable<K,V>::set(K key, V value) {

  // Get hash code  
  int code = getHashCode(key); 

  // Get linked list bucket 
  auto &list = table[code];

  // Insert value in chain  
  list.push_back(Pair<K,V>(key, value));         
}

The key steps are:

Hash key to calculate code
Fetch bucket linked list
Insert Pair into this list

Now the value can be retrieved instantly in O(1) time!

Let‘s analyze the performance.

Hash Table vs other Data Structures

Operation	Array	Linked List	Binary Search Tree	Hash Table
Search	O(n)	O(n)	O(log n)	O(1)
Insert	O(1)	O(1)	O(log n)	O(1)
Delete	O(n)	O(1)	O(log n)	O(1)

Clearly, hash tables provide optimal theoretical performance.

In practice, the difference is less pronounced for low number of elements, but quickly surpasses trees and lists as data grows.

As per research by Oracle labs, hash tables outperformed tree structures by up to 80% for database workloads.

Now that we have understood hash table internals, let‘s apply them to real-world systems!

Applications of Hash Tables

Database Indexing

Hash tables are used extensively to optimize database performance through indexing structures. Popular examples include:

B+ Tree – Variant of B-tree used in MySQL, PostgreSQL for indexing.

Trie – Retrieval tree providing key lookups in O(k) using hashing.

Log Structured Merge Trees – Used in Apache Cassandra NoSQL store.

They allow fast key-based access without scanning entire tables.

Network Routing

Cisco routers maintain IP routing tables using hash tables. Destinations are hashed to indexes which store next hop address and interface.

This allows near instant lookup for packet forwarding without any delay.

Object Caching

In-memory object caches like Memcached use hash tables to provide super fast access to frequently used data.

For example, Facebook uses Memcached to store object metadata, graphs and media – cutting database loads by almost 70%.

Programming Language Compilers

Hash tables called symbol tables store variable names mapped to attributes like type and scope. They allow compilers to quickly lookup identifiers in a single access during parsing.

Hash Table Based Data Structures

Unordered containers like std::unordered_set, std::unordered_map in C++ and dict in Python are powered by hash tables internally.

The key benefit is O(1) lookup speed versus O(log n) for ordered maps and sets.

When ordering is not required, prefer hashing for performance.

Optimizing Hash Table Performance

Here are some tips for writing high performance hash tables:

Choose a good hash function that minimizes collisions
Use prime table sizes to distribute keys uniformly
Load factor of < 0.7 gives optimal performance
Resize table when load exceeds 70%
Chaining is better for insert-heavy workloads
Open addressing better optimizes lookups

Following these best practices will result in great speedups and efficent storage utilization.

In Summary

We have deeply explored hash tables – including concepts, C++ implementations, real-world systems based on hashing and best practices.

Key learning:

Instant O(1) access makes hash tables indispensible
Hash function and collision handling crucial
Heavily used for database indexing and caching
Outperforms trees and arrays as datasets grow

I hope you‘ve enjoyed this comprehensive guide! Please drop any feedback or questions.