Implementation Rationale
Closed-addressing Containers
boost::unordered_[multi]set
and boost::unordered_[multi]map
adhere to the standard requirements for unordered associative
containers, so the interface was fixed. But there are
still some implementation decisions to make. The priorities are
conformance to the standard and portability.
The Wikipedia article on hash tables has a good summary of the implementation issues for hash tables in general.
Data Structure
By specifying an interface for accessing the buckets of the container the standard pretty much requires that the hash table uses closed addressing.
It would be conceivable to write a hash table that uses another method. For example, it could use open addressing, and use the lookup chain to act as a bucket but there are some serious problems with this:
-
The standard requires that pointers to elements aren’t invalidated, so the elements can’t be stored in one array, but will need a layer of indirection instead - losing the efficiency and most of the memory gain, the main advantages of open addressing.
-
Local iterators would be very inefficient and may not be able to meet the complexity requirements.
-
There are also the restrictions on when iterators can be invalidated. Since open addressing degrades badly when there are a high number of collisions the restrictions could prevent a rehash when it’s really needed. The maximum load factor could be set to a fairly low value to work around this - but the standard requires that it is initially set to 1.0.
-
And since the standard is written with a eye towards closed addressing, users will be surprised if the performance doesn’t reflect that.
So closed addressing is used.
Number of Buckets
There are two popular methods for choosing the number of buckets in a hash table. One is to have a prime number of buckets, another is to use a power of 2.
Using a prime number of buckets, and choosing a bucket by using the modulus of the hash function’s result will usually give a good result. The downside is that the required modulus operation is fairly expensive. This is what the containers used to do in most cases.
Using a power of 2 allows for much quicker selection of the bucket to use, but at the expense of losing the upper bits of the hash value. For some specially designed hash functions it is possible to do this and still get a good result but as the containers can take arbitrary hash functions this can’t be relied on.
To avoid this a transformation could be applied to the hash function, for an
example see
Thomas Wang’s article on integer hash functions.
Unfortunately, a transformation like Wang’s requires knowledge of the number
of bits in the hash value, so it was only used when size_t
was 64 bit.
Since release 1.79.0, Fibonacci hashing
is used instead. With this implementation, the bucket number is determined
by using (h * m) >> (w - k)
, where h
is the hash value, m
is 2^w
divided
by the golden ratio, w
is the word size (32 or 64), and 2^k
is the
number of buckets. This provides a good compromise between speed and
distribution.
Since release 1.80.0, prime numbers are chosen for the number of buckets in tandem with sophisticated modulo arithmetic. This removes the need for "mixing" the result of the user’s hash function as was used for release 1.79.0.
Open-addresing Containers
The C++ standard specification of unordered associative containers impose severe limitations on permissible implementations, the most important being that closed addressing is implicitly assumed. Slightly relaxing this specification opens up the possibility of providing container variations taking full advantage of open-addressing techniques.
The design of boost::unordered_flat_set
/unordered_node_set
and boost::unordered_flat_map
/unordered_node_map
has been
guided by Peter Dimov’s Development Plan for Boost.Unordered.
We discuss here the most relevant principles.
Hash Function
Given its rich functionality and cross-platform interoperability,
boost::hash
remains the default hash function of open-addressing containers.
As it happens, boost::hash
for integral and other basic types does not possess
the statistical properties required by open addressing; to cope with this,
we implement a post-mixing stage:
a ← h mul C,
h ← high(a) xor low(a),
where mul is an extended multiplication (128 bits in 64-bit architectures, 64 bits in 32-bit environments), and high and low are the upper and lower halves of an extended word, respectively. In 64-bit architectures, C is the integer part of 264∕φ, whereas in 32 bits C = 0xE817FB2Du has been obtained from Steele and Vigna (2021).
When using a hash function directly suitable for open addressing, post-mixing can be opted out of via a dedicated hash_is_avalanching
trait.
boost::hash
specializations for string types are marked as avalanching.
Platform Interoperability
The observable behavior of boost::unordered_flat_set
/unordered_node_set
and boost::unordered_flat_map
/unordered_node_map
is deterministically
identical across different compilers as long as their std::size_t
s are the same size and the user-provided
hash function and equality predicate are also interoperable
—this includes elements being ordered in exactly the same way for the same sequence of
operations.
Concurrent Containers
The same data structure used by Boost.Unordered open-addressing containers has been chosen
also as the foundation of boost::concurrent_flat_set
/boost::concurrent_node_set
and
boost::concurrent_flat_map
/boost::concurrent_node_map
:
-
Open-addressing is faster than closed-addressing alternatives, both in non-concurrent and concurrent scenarios.
-
Open-addressing layouts are eminently suitable for concurrent access and modification with minimal locking. In particular, the metadata array can be used for implementations of lookup that are lock-free up to the last step of actual element comparison.
-
Layout compatibility with Boost.Unordered flat containers allows for fast transfer of all elements between a concurrent container and its non-concurrent counterpart, and vice versa.
Hash Function and Platform Interoperability
Concurrent containers make the same decisions and provide the same guarantees as Boost.Unordered open-addressing containers with regards to hash function defaults and platform interoperability.