..

unordered_map的实现【更新中】

std::unordered_map::rehash 方法的描述【0】:

void rehash( size_type n );
Set number of buckets
Sets the number of buckets in the container to n or more.

If n is greater than the current number of buckets in the container (bucket_count), a rehash is forced. The new bucket count can either be equal or greater than n.

If n is lower than the current number of buckets in the container (bucket_count), the function may have no effect on the bucket count and may not force a rehash.

A rehash is the reconstruction of the hash table: All the elements in the container are rearranged according to their hash value into the new set of buckets. This may alter the order of iteration of elements within the container.

Rehashes are automatically performed by the container whenever its load factor is going to surpass its max_load_factor in an operation.

Notice that this function expects the number of buckets as argument. A similar function exists, unordered_map::reserve, that expects the number of elements in the container as argument.
If a rehash happens, all iterators are invalidated, but references and pointers to individual elements remain valid.
If no actual rehash happens, no changes.


下面是可能会触发 std::unordered_map rehash 的几个方法

std::unordered_map::operator[]
std::unordered_map::emplace
std::unordered_map::emplace_hint
std::unordered_map::insert
std::unordered_map::reserve

关于 Iterator validity的描述如下:

On most cases, all iterators in the container remain valid after the insertion. The only exception being when the growth of the container forces a rehash. In this case, all iterators in the container are invalidated.

A rehash is forced if the new container size after the insertion operation would increase above its capacity threshold (calculated as the container 's bucket_count multiplied by its max_load_factor).

References to elements in the unordered_map container remain valid in all cases, even after a rehash.

有下面这个计算公式

capacity threshold =  bucket_count * max_load_factor

上面提到的 bucket_countmax_load_factor 是什么意思呢? 分别看下解释。

bucket

std::unordered_map::bucket_count

size_type bucket_count() const noexcept;
Return number of buckets
Returns the number of buckets in the unordered_map container.

A bucket is a slot in the container's internal hash table to which elements are assigned based on the hash value of their key.

The number of buckets influences directly the load factor of the container's hash table (and thus the probability of collision). The container automatically increases the number of buckets to keep the load factor below a specific threshold (its max_load_factor), causing a rehash each time the number of buckets needs to be increased.

从网上找了个图来展示下bucket。
image

上面的 buckets vector 就是bucket,std::unordered_map 内部有一个 hashtable. 数据结构为

template <typename Types>
struct table : boost::unordered::detail::functions<typename Types::hasher, typename Types::key_equal>

每个 key 会通过哈希计算映射到一个位置,hash 计算有存在 hash 碰撞的可能性,则在同一个位置的元素会按顺序连接起来,导致了有的bucket没有数据,而其他bucket有多个数据的情况出现。
目前位置了解了 bucket 的概念,再看 std::unordered_mapbucket 大小。

#include <iostream>
#include <string>
#include <unordered_map>

// typedef std::unordered_map<std::string,std::string> stringmap;
using stringmap = std::unordered_map<std::string,std::string>;

int main ()
{
    /*
    explicit unordered_map ( size_type n = ,// see below
                         const hasher& hf = hasher(),
                         const key_equal& eql = key_equal(),
                         const allocator_type& alloc = allocator_type() );
    */
    /*
        使用 default constructor, 创建一个 empty unordered_map。

        C++11 版本,有5个构造方法。
        此处使用第一个也就是,default constructor来验证。
    */
    stringmap use_default_constructor_obj;
    std::cout << "use_default_constructor_obj's default_bucket_count = " << use_default_constructor_obj.bucket_count() << std::endl;

    return 0;
}

输出可能结果为:

use_default_constructor_obj's default_bucket_count: 11

如果使用 boost::unordered_map 的话【1】输出的结果为:

use_default_constructor_obj's default_bucket_count: 16

再看下 default constructor 的实现。

template <class K, class T, class H, class P, class A>
unordered_map<K, T, H, P, A>::unordered_map()
    : table_(
            boost::unordered::detail::default_bucket_count, 
            hasher(),
            key_equal(), 
            allocator_type()
        )
{
}
static const std::size_t default_bucket_count = 11; // 为什么是 11 呢???

同时也可以看出来, default contructor 中 的 size_type n 也不是 unordered_map 的大小设置, 而是 bucket 的个数值, 并且是最小 bucket 个数设置。

Minimum number of initial buckets.
This is not the number of elements in the container, but the minimum number of slots desired for the internal hash table on construction.
If this argument is not specified, the constructor determines this automatically (in a way that depends on the particular library implementation).
Member type size_type is an unsigned integral type.

看一下 range constructorbucket 默认值是多少。

  unordered_map ( InputIterator first, InputIterator last,
                  size_type n = /* see below */,
                  const hasher& hf = hasher(),
                  const key_equal& eql = key_equal(),
                  const allocator_type& alloc = allocator_type() );
// range constructor
// 此处 n 同样是 bucket 个数设置,可以不写。原因见下。
template <class K, class T, class H, class P, class A>
template <class InputIt>
unordered_map<K, T, H, P, A>::unordered_map(InputIt f, InputIt l, size_type n,
    const hasher& hf, const key_equal& eql, const allocator_type& a)
    : table_(boost::unordered::detail::initial_size(f, l, n), hf, eql, a)
{
    this->insert(f, l);
}

// initial_size 实现
template <class I>
inline std::size_t initial_size(I i, I j,
    // n 的 默认同样是 default_bucket_count
    std::size_t num_buckets = boost::unordered::detail::default_bucket_count)
{
    // TODO: Why +1?
    // 计算 bucket count, 此处 在个数的基础上 + 1 操作, why?
    return (std::max)(boost::unordered::detail::insert_size(i, j) + 1, num_buckets);
}

// 计算 iterator begin, last的距离,及 range 的个数
template <class I>
inline std::size_t insert_size(I i, I j,
    typename boost::unordered::detail::enable_if_forward<I, void*>::type = 0)
{
    return static_cast<std::size_t>(std::distance(i, j));
}

max_load_factor

当插入一个元素后,container size 大于 capacity threshold 会触发 rehash。 特别提到 unordered_map 中 元素 的 reference 依然有效, 怎么实现的。
看下源码。 (取自 boost_1_65_0)

template <class K, class T, class H, class P, class A>
typename unordered_map<K, T, H, P, A>::mapped_type&
    unordered_map<K, T, H, P, A>::operator[](const key_type& k)
{ 
    return table_.try_emplace_unique(k).first->second;
}

template <typename Key>
emplace_return try_emplace_unique(BOOST_FWD_REF(Key) k)
{
    std::size_t key_hash = this->hash(k);
    node_pointer pos = this->find_node(key_hash, k);
    if (pos) {
        return emplace_return(iterator(pos), false);
    } else {
        return emplace_return(
        	// 此处 iterator 会修改。
            iterator(this->resize_and_add_node_unique(
                boost::unordered::detail::func::construct_node_pair(
                    this->node_alloc(), boost::forward<Key>(k)),
                key_hash)),
            true);
    }
}

inline node_pointer resize_and_add_node_unique(
    node_pointer n, std::size_t key_hash)
{
    node_tmp b(n, this->node_alloc());
    this->reserve_for_insert(this->size_ + 1);
    return this->add_node_unique(b.release(), key_hash);
}

template <typename Types>
inline void table<Types>::reserve_for_insert(std::size_t size)
{
    if (!buckets_) {
        create_buckets((std::max)(bucket_count_, min_buckets_for_size(size)));
    } else if (size > max_load_) {
        std::size_t num_buckets =
            min_buckets_for_size((std::max)(size, size_ + (size_ >> 1)));

        if (num_buckets != bucket_count_)
            this->rehash_impl(num_buckets);
    }
}

rehash 方法的具体实现。

template <typename Types>
inline void table<Types>::rehash_impl(std::size_t num_buckets)
{
    BOOST_ASSERT(this->buckets_);

    this->create_buckets(num_buckets);
    link_pointer prev = this->get_previous_start();
    BOOST_TRY
    {
        while (prev->next_) {
            node_pointer n = next_node(prev);
            std::size_t key_hash = this->hash(this->get_key(n));
            std::size_t bucket_index = this->hash_to_bucket(key_hash);

            n->bucket_info_ = bucket_index;
            n->set_first_in_group();

            // Iterator through the rest of the group of equal nodes,
            // setting the bucket.
            for (;;) {
                node_pointer next = next_node(n);
                if (!next || next->is_first_in_group()) {
                    break;
                }
                n = next;
                n->bucket_info_ = bucket_index;
                n->reset_first_in_group();
            }

            // n is now the last node in the group
            bucket_pointer b = this->get_bucket(bucket_index);
            if (!b->next_) {
                b->next_ = prev;
                prev = n;
            } else {
                link_pointer next = n->next_;
                n->next_ = b->next_->next_;
                b->next_->next_ = prev->next_;
                prev->next_ = next;
            }
        }
    }
    BOOST_CATCH(...)
    {
        node_pointer n = next_node(prev);
        prev->next_ = node_pointer();
        while (n) {
            node_pointer next = next_node(n);
            destroy_node(n);
            --size_;
            n = next;
        }
        BOOST_RETHROW
    }
    BOOST_CATCH_END
}

参考

【0】rehash
【1】如果没有特别说明,则该博客上面所有的源码参考默认都是 boost_1_65_0 版。
【2】该文章中代码采用的在线编译器
【3】std::pair
【4】std::map
【5】std::unordered_map
【6】operator []