Double Hashing | Open Addressing | Hash Tables

To build our own spatial hash table, we will need to understand how to resolve the hash collisions we encounter when adding elements with double hashing.

Keywords: javascript, tutorial, spatial hash table, hash map, data structures, leetcode

By Carmen Cincotti  

Double hashing applies a second hash function to the key when a collision occurs in a hash table.

Hash tables

We’ll take a closer look at double hashing as well as how we can use it to resolve collisions when filling a hash table. We will also see that double hashing avoids problems encountered by other collision techniques, such as clustering.

Recall that last week we talked about quadratic probing, and before that linear probing, which are different methods used to resolve hash collisions in order to find and place items in a hash table.

That said, let’s dive into it by learning more about double hashing.

Hey! Before continuing on - if you haven’t already read the previous articles on open addressing and hash tables, I really recommend that you take a look so that you can more easily follow along.

What is Double Hashing?

Double Hashing is a way to resolve hash collisions by using a second hash function.

Inserting a new key into the hash table

The formula

The double hashing formula for finding an open bucket or a particular element already placed in the hash table is the following:

i = (H(k) + x * J(k)) mod S

where i is the index of the underlying array, H is the hash function, k is the key of the element to be hashed, and S is the size of the table… and what about J?

J is a second hash function (the double hash, eh?). It is given the key of the element to hash. We multiply this number by x, which is the x-th iteration of the sequence.
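As a quick sketch, the formula translates directly to code. The hash functions passed in below are just illustrative placeholders, not ones prescribed by this article:

```javascript
// Index of the x-th probe for key k: i = (H(k) + x * J(k)) mod S.
function doubleHashIndex(H, J, k, x, S) {
  return (H(k) + x * J(k)) % S;
}

// Illustrative hash functions (assumptions for demonstration only):
const H = (k) => k % 4;
const J = (k) => 3 - (k % 3);
```

Calling `doubleHashIndex(H, J, k, 0, S)` gives the home bucket; increasing `x` walks the probe sequence.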

Let’s take a look at what this means with an example.

An example

Consider an example using the hash function J(k) = 3 - (k mod 3), and a table size S = 4. Assume k is an integer.

I also have a rather full hash table, where indices 0 and 3 in the underlying storage array are already occupied:

A hash table with some elements

I would like to insert an element with key k = 10, which is hashed to index 0 - so H(10) = 0.

During our first iteration of the double hash sequence, we will see the formula filled in with the following values: H(10) = 0, J(10) = 3 - (10 mod 3) = 2, S = 4, x = 0:

i = (0 + 0 * 2) mod 4 = 0

Our formula returns index 0, but we have already seen that this location in the hash table is already occupied!

Inserting a new key into the hash table - occupied bucket found

It’s necessary to try again to find an open bucket. Before continuing, we first increment x by 1. Doing this, we calculate x * J(k) like this: 1 * J(10) = 1 * 2 = 2. Our next calculation of i is as follows:

i = (0 + 1 * 2) mod 4 = 2

That’s it! The probe found an open bucket at index 2:

Inserting a new key into the hash table - open bucket found

We’re done!
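The walk-through above fits in a few lines of JavaScript. Here H is hardcoded to return 0 so that H(10) = 0 as stated in the example; everything else follows the values given:

```javascript
// Worked example: J(k) = 3 - (k mod 3), S = 4, slots 0 and 3 occupied.
const H = (_k) => 0;          // stand-in so that H(10) = 0, per the example
const J = (k) => 3 - (k % 3);
const S = 4;
const occupied = new Set([0, 3]);

function probeForOpenBucket(k) {
  for (let x = 0; x < S; x++) {
    const i = (H(k) + x * J(k)) % S; // i = (H(k) + x * J(k)) mod S
    if (!occupied.has(i)) return i;  // open bucket found
  }
  return -1;                         // gave up after S probes
}
```

`probeForOpenBucket(10)` first tries index 0 (occupied), then lands on index 2, matching the walk-through.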

Cycles

Remember that a cycle is a probe sequence that repeats forever without terminating. This phenomenon occurs when the same path of buckets is taken over and over endlessly when searching for an open bucket or an item already placed in the hash table.

To see how a cycle occurs I recommend you look at this example from my quadratic probing article.

Is double hashing susceptible to cycles?

Yes, double hashing is susceptible to cycles.
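We can see this with the example values from above: with S = 4 and a step of J(10) = 2, the probe sequence starting from H(10) = 0 only ever touches indices 0 and 2. If both were occupied, the probe would loop forever without ever reaching buckets 1 or 3. A quick sketch:

```javascript
// Collect every index reachable by the probe sequence for the
// example values H(10) = 0, J(10) = 2, S = 4.
const S = 4;
const Hk = 0; // H(10) from the example
const Jk = 2; // J(10) = 3 - (10 mod 3)

const visited = new Set();
for (let x = 0; x < 2 * S; x++) {
  visited.add((Hk + x * Jk) % S); // 0, 2, 0, 2, ...
}
```

Even after twice as many probes as there are buckets, `visited` only contains indices 0 and 2.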

How can we avoid them?

A good way to avoid cycles is to carefully choose the second hash function J(k) and the table size S. TopCoder recommends that we use the following conditions to do this:

  • S is a prime number
  • J(k) = H₂(k) mod S
  • J(k) ≥ 1 and J(k) < S

where H₂(k) is a second hash function.

The trick is that if those conditions are satisfied, the probing sequence is guaranteed to have order S, which makes us hit every bucket in our hash table.
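We can check this property with a quick sketch. The values of S and H₂ below are illustrative choices that satisfy the conditions (prime S, and J(k) always between 1 and S - 1), not values from this article:

```javascript
const S = 5;                         // prime table size
const H2 = (k) => 1 + (k % (S - 1)); // illustrative second hash, in [1, S-1]
const J = (k) => H2(k) % S;          // J(k) >= 1 and J(k) < S

// Set of indices reached when probing for key k from home bucket hk.
function probeSequence(hk, k) {
  const seen = new Set();
  for (let x = 0; x < S; x++) {
    seen.add((hk + x * J(k)) % S);
  }
  return seen;
}
```

Because S is prime, every step size in [1, S - 1] is coprime to S, so `probeSequence` covers all 5 buckets no matter the key.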

Common operations

The common operations of a hash table that implements double hashing are similar to those of a hash table that implements other open addressing techniques, such as linear or quadratic probing.

We will see what this means in the following sections.

Insertion

Inserting an item into a hash table using double hashing to resolve hash collisions is an easy task!

Consider the following example where we have an underlying array already populated with a few elements:

Hash table with elements at certain positions

If we would like to add another element to the table with key k₄ which is hashed to the same location as another element, we only need to iterate through the underlying array until we find an open bucket in which to place the element:

Inserting a new key into the hash table

To do this, recall the formula i = (H(k) + x * J(k)) mod S. We increment x in order to traverse the array.
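A minimal insertion sketch might look like this. The table is a plain array where null marks an empty bucket; S, H, and J are illustrative choices, not prescribed by the article:

```javascript
const S = 7;                        // prime table size (assumption)
const H = (k) => k % S;             // illustrative primary hash
const J = (k) => 1 + (k % (S - 1)); // illustrative second hash, never 0

// Insert key/value, probing with i = (H(k) + x * J(k)) mod S.
function insert(table, k, value) {
  for (let x = 0; x < S; x++) {
    const i = (H(k) + x * J(k)) % S;
    if (table[i] === null) {        // open bucket: place the element here
      table[i] = { key: k, value };
      return i;
    }
  }
  throw new Error("hash table is full");
}
```

On a collision, the loop simply moves to the next index in the probe sequence until an open bucket appears.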

Search

The sequence defined by the double hash formula works like the probes we have already discussed. The underlying array is traversed looking for the element of interest:

Searching for an element

All is well if the item of interest is found!

However, if an empty bucket is found during the search, the search stops there. Nothing was found:

Searching for an element and hitting an empty location
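A search sketch follows the same probe sequence as insertion, and gives up as soon as it hits an empty bucket. As before, S, H, and J are illustrative choices:

```javascript
const S = 7;
const H = (k) => k % S;
const J = (k) => 1 + (k % (S - 1));

function search(table, k) {
  for (let x = 0; x < S; x++) {
    const i = (H(k) + x * J(k)) % S;
    if (table[i] === null) return undefined;       // empty bucket: stop searching
    if (table[i].key === k) return table[i].value; // found it
  }
  return undefined;                                // probed every bucket
}
```

Stopping at the first empty bucket is what keeps unsuccessful searches fast, and it is exactly the behavior that makes naive deletion dangerous, as the next section shows.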

Which raises the question - what will happen if we delete an element? 🤔

Deletion

Deleting an item while using double hashing is a bit tricky. Consider this example where I naively remove an item from a hash table:

Delete an element from hash table

We just deleted an item from the hash table. Now suppose we search for another item that hashes to the same location as the deleted one, but that was placed earlier further along the probe sequence (to the right of the element that has just been deleted):

Search for an element from hash table that has been deleted incorrectly

We won’t find it! Why? 🤔

Because the double hash sequence stops when it encounters an empty bucket.

To avoid such unpredictable behavior, we must place what is called a tombstone 🪦 at the location of the deleted element.

If the double hash sequence encounters a tombstone while searching the array, it will be ignored and the sequence will continue.

Let’s see an example with this idea in mind:

Delete an element from hash table and place a tombstone

Then, we search for an element like before, however this time we are able to find it thanks to the tombstone:

Search and find the element after deleting correctly with a tombstone
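The tombstone idea can be sketched like this. The TOMBSTONE marker and the hash functions are illustrative; the key point is that search treats a tombstone as "keep probing" rather than "stop":

```javascript
const S = 7;
const H = (k) => k % S;
const J = (k) => 1 + (k % (S - 1));
const TOMBSTONE = Symbol("tombstone"); // marks a deleted slot

function remove(table, k) {
  for (let x = 0; x < S; x++) {
    const i = (H(k) + x * J(k)) % S;
    if (table[i] === null) return false; // hit empty: key is not present
    if (table[i] !== TOMBSTONE && table[i].key === k) {
      table[i] = TOMBSTONE;              // leave a tombstone behind
      return true;
    }
  }
  return false;
}

function search(table, k) {
  for (let x = 0; x < S; x++) {
    const i = (H(k) + x * J(k)) % S;
    if (table[i] === null) return undefined; // stop only at a truly empty slot
    if (table[i] !== TOMBSTONE && table[i].key === k) return table[i].value;
    // a tombstone (or a different key) just means: keep probing
  }
  return undefined;
}
```

With this in place, deleting an element that sits earlier in a probe sequence no longer hides the elements placed after it.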

Disadvantages

Although double hashing avoids clustering problems, it pays for this luxury by suffering from cache misses.

Cache performance

From the point of view of the CPU, a double hash probe sequence behaves like traversing a linked list: the indices it visits are almost pseudo-random, so consecutive probes land far apart in memory. This makes the cache almost useless…

Before we continue, it must be said that research suggests that double hashing generally outperforms linear probing, and even quadratic probing, because it entirely avoids clustering, which gives double hashing a real performance boost.

Advantages

Avoidance of clustering

Primary clustering

Fortunately, double hashing avoids primary clustering that we have already seen during our discussion of linear probing.

To summarize, primary clustering is a phenomenon that occurs as elements are added to a hash table. Elements tend to clump together into clusters, which over time has a significant impact on the performance of finding and adding elements, as we approach the worst case O(n) time complexity.

Secondary clustering

To sum up, secondary clustering happens when every new element that hashes to the same bucket in the table has to follow the same probe path each time, appending itself to the end of that path when it finds an empty space… which makes the path grow over time and therefore further aggravates the problem.

Time complexity

To summarize, a hash table gives us more efficient time complexity compared to an array or a linked list when we would like to search, insert or delete an element (on average).

Structure     Access  Search  Insertion  Deletion
Array         O(1)    O(n)    O(n)       O(n)
Linked List   O(n)    O(n)    O(1)       O(1)
Hash Table    O(1)    O(1)    O(1)       O(1)

And in the worst case scenario, we would see these time complexities:

Structure     Access  Search  Insertion  Deletion
Array         O(1)    O(n)    O(n)       O(n)
Linked List   O(n)    O(n)    O(1)       O(1)
Hash Table    O(n)    O(n)    O(n)       O(n)

Why is there a worst case? This is due to hash collisions. One should choose an underlying array that is large enough, as well as maintain a rather moderate load factor.
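One way to keep operations near the average case is to track the load factor and grow the array before it fills up. A small sketch (the 0.5 threshold is an illustrative choice for open addressing, not a value from this article):

```javascript
// Fraction of occupied buckets in the underlying array.
function loadFactor(table) {
  const used = table.filter((slot) => slot !== null).length;
  return used / table.length;
}

// Signal that the table should be resized (e.g. to a prime roughly
// twice the current size) before probing sequences get long.
function needsResize(table, maxLoad = 0.5) {
  return loadFactor(table) >= maxLoad;
}
```

Checking `needsResize` on every insertion keeps probe sequences short, which is what keeps search, insertion, and deletion O(1) on average.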
