哈希字典Dict

Dict的组成

1.hash表的节点

2.hash表结构

3.字典

4.Dict结构

hash算法

哈希函数

什么情况下使用

rehash

rehash主要的函数dictExpand

怎么判断要进行扩容还是收缩

什么时候会用到扩展和收缩

渐进式rehash

渐进式rehash主要的函数dictRehash

字典API

字典的创建dictCreate

增加键值对dictAdd

查找给定键的值dictFetchValue

删除给定的键dictDelete

添加或覆盖键值对(更新键值对）dictReplace

Redis是一个键值型(Key - Value Pair)的数据库，我们可以根据键实现快速的增删改查。而键与值的映射关系正是通过Dict来实现的。

Dict的组成

Dict由三部分组成，分别是：哈希表（DictHashTable）、哈希节点（DictEntry）、字典（Dict）

1.hash表的节点

//dict.h
typedef struct dictEntry {
    void *key;
    union {
        void *val;
        uint64_t u64;
        int64_t s64;
        double d;
    } v;
    struct dictEntry *next;
} dictEntry;

hash表的节点, 主要是包含了key, val键值对和下一个节点的指针,。从该结构体我们可以看出, 这些节点会组成一个单向链表。
hash表的key是 Void * 指针类型,可以存储多种类型的数据，值 v 是一个共用体, 也可以是uint64整数, 也可以是int64整数, 还可以是浮点数。

2.hash表结构

typedef struct dictht {
    dictEntry **table;    //hash表数组，元素是 dictEntry*指针类型的
    unsigned long size;    //数组大小
    unsigned long sizemask;     //hash数组长度掩码, sizemask =  size - 1
    unsigned long used;    //hash表的已有的entry的个数
} dictht;

由hash表结构, 可知redis是通过数组做hash散列的.
hash表数组的 size 长度一般为 2^n 的大小, 保证这样的大小可以通&操作来计算出hash槽的位置。因为如果数组容量不是2^n的大小, 那么就需要通过取模来获取hash表的槽索引, 取模操作相对&操作性能会差一些.
szimmask 表示数组掩码, 一般是 size -1, 如: 假如数组长度为8, 那么掩码就为7, 掩码转换成二进制就为 ob00000111

3.字典

type 和 privdata 字段是针对不同类型的键值对，为创建多态字典而设计的。
type 字段是一个指向 dictType 结构的指针，每个 dictType 结构保存了一簇用于操作特定类型键值对的函数，Redis 会为用途不同的字典设置不同的类型特定函数。
privdata 字段则保存了需要传给那些类型特定函数的可选参数。
ht 字段是一个包含两个项的数组，数组中的每个项都是一个 dictht 哈希表。一般情况下，只使用ht[0]，只有当哈希表的键值对数量超过负载(元素过多)时，才会将键值对迁移到ht[1]，这一步迁移被称为 rehash (重哈希)。
rehashidx 字段也是和 rehash 有关，用于记录目前 rehash 的进度，如果没有 rehash，值就是 -1。
pauserehash代表rehash是否暂停。

typedef struct dict {
    dictType *type;    //内置一些不同类型的hash函数
    void *privdata;    //携带的私有数据
    dictht ht[2];     //hash字典的数组, 长度为2, 主要是用于渐进式hash时使用
    long rehashidx;    //rehash的进度, -1表示未进行
    int16_t pauserehash;    //rehash是否暂停， 1则暂停，0则继续
} dict;

//字典类型    该结构体内部是一些函数指针
typedef struct dictType {
    uint64_t (*hashFunction)(const void *key);    //计算hash值的函数
    void *(*keyDup)(void *privdata, const void *key);
    void *(*valDup)(void *privdata, const void *obj);
    int (*keyCompare)(void *privdata, const void *key1, const void *key2);
    void (*keyDestructor)(void *privdata, void *key);
    void (*valDestructor)(void *privdata, void *obj);
    int (*expandAllowed)(size_t moreMem, double usedRatio);
} dictType;

4.Dict结构

了解了上面的三个结构体，就可以知道Dict的结构了。

Dict使用了哈希表，其实就是一个数组，数组的每个元素称为一个哈希桶。所以，我们常说，一个哈希表是由多个哈希桶组成的，每个哈希桶中保存了键值对数据。

hash表一般的实现都是数组+链表来实现。数组来做hash散列, 链表用来解决hash冲突的问题

hash算法

哈希函数

类型处理函数中的第一个函数 hashFunction 就是计算某个键的哈希值的函数，对于不同类型的 key，哈希值的计算是不同的，所以在字典进行创建的时候，需要指定哈希函数。

redis默认使用的是dictGenHashFunction函数。dictGenHashFunction内部就只是调用了siphash函数。siphash是复杂的，但我们只需要知道有这么些函数就行了。

uint64_t dictGenHashFunction(const void *key, int len) {
    return siphash(key,len,dict_hash_function_seed);
}

uint64_t siphash(const uint8_t *in, const size_t inlen, const uint8_t *k) {
#ifndef UNALIGNED_LE_CPU
    uint64_t hash;
    uint8_t *out = (uint8_t*) &hash;
#endif
    uint64_t v0 = 0x736f6d6570736575ULL;
    uint64_t v1 = 0x646f72616e646f6dULL;
    uint64_t v2 = 0x6c7967656e657261ULL;
    uint64_t v3 = 0x7465646279746573ULL;
    uint64_t k0 = U8TO64_LE(k);
    uint64_t k1 = U8TO64_LE(k + 8);
    uint64_t m;
    const uint8_t *end = in + inlen - (inlen % sizeof(uint64_t));
    const int left = inlen & 7;
    uint64_t b = ((uint64_t)inlen) << 56;
    v3 ^= k1;
    v2 ^= k0;
    v1 ^= k1;
    v0 ^= k0;

    for (; in != end; in += 8) {
        m = U8TO64_LE(in);
        v3 ^= m;

        SIPROUND;

        v0 ^= m;
    }

    switch (left) {
    case 7: b |= ((uint64_t)in[6]) << 48; /* fall-thru */
    case 6: b |= ((uint64_t)in[5]) << 40; /* fall-thru */
    case 5: b |= ((uint64_t)in[4]) << 32; /* fall-thru */
    case 4: b |= ((uint64_t)in[3]) << 24; /* fall-thru */
    case 3: b |= ((uint64_t)in[2]) << 16; /* fall-thru */
    case 2: b |= ((uint64_t)in[1]) << 8; /* fall-thru */
    case 1: b |= ((uint64_t)in[0]); break;
    case 0: break;
    }

    v3 ^= b;

    SIPROUND;

    v0 ^= b;
    v2 ^= 0xff;

    SIPROUND;
    SIPROUND;

    b = v0 ^ v1 ^ v2 ^ v3;
#ifndef UNALIGNED_LE_CPU
    U64TO8_LE(out, b);
    return hash;
#else
    return b;
#endif
}

什么情况下使用

在当要将一个新的键值对添加到字典里面或者通过键查找值的时候
出现哈希冲突的时候使用

那么在插入或者查找dictEntry 时候，我们就需要找到其所在下标的索引。(哈希表是个数组）具体算法如下：

1.通过宏 dictHashKey 计算得到该键对应的哈希值

#define dictHashKey(d, key) (d)->type->hashFunction(key)

2.将哈希值和哈希表的 sizemask 属性做位与(相当于取模运算)，得到索引值 index，其中 ht[x] 可以是 ht[0] 或者 ht[1]。

index = dictHashKey(d, key) & d->ht[x].sizemask;

3.哈希冲突

redis字典实现是使用链地址法。

哈希冲突那肯定是在键值对插入的时候。看看键值对插入函数dictAdd。

int dictAdd(dict *d, void *key, void *val)
{
    dictEntry *entry = dictAddRaw(d,key,NULL);

    if (!entry) return DICT_ERR;
    dictSetVal(d, entry, val);
    return DICT_OK;
}

#define dictSetVal(d, entry, _val_) do { \
    if ((d)->type->valDup) \
        (entry)->v.val = (d)->type->valDup((d)->privdata, _val_); \
    else \
        (entry)->v.val = (_val_); \
} while(0)

dictAdd会调用dictAddRaw生成一个新的哈希节点，然后调用dictSetVal去设置这个哈希节点的val值,那么剩下key还没有设置，设置key在dictAddRaw函数内。

dictEntry *dictAddRaw(dict *d, void *key, dictEntry **existing)
{
    long index;
    dictEntry *entry;
    dictht *ht;

    if (dictIsRehashing(d)) _dictRehashStep(d);//若符合，则执行rehash

    if ((index = _dictKeyIndex(d, key, dictHashKey(d,key), existing)) == -1)//1.获取索引
        return NULL;

    //2.根据是否rehash状态获取ht[1]还是获取ht[0]，如果正在refash,则用哈希表ht[1]进行插入
    ht = dictIsRehashing(d) ? &d->ht[1] : &d->ht[0];
    //3.分配内存生成新的hash节点，然后在链表头部插入这个hash节点,
    entry = zmalloc(sizeof(*entry));
    entry->next = ht->table[index];
    ht->table[index] = entry;
    ht->used++;

    //4.设置键
    dictSetKey(d, entry, key);
    return entry;
}

rehash

随着字典操作的不断执行，哈希表保存的键值对会不断增多或减少。当哈希表保存的键值对数量太多或者太少时，需要对哈希表大小进行扩容或者收缩。

而不管是扩容还是收缩，必定是会创建新的哈希表这会导致哈希表的size和sizemask发生变化，而key的查询是于sizemask相关的。所以必须对哈希表中的每一个key进行重新计算索引，然后插入新的哈希表，这个过程就是rehash。

过程如下：

计算出新hash表的realsize
按照新的realsize申请内存空间，创建dictht,并赋值给dicth.ht[1]
设置dict.rehashidx=0,表示开始rehash

rehash主要的函数dictExpand

int dictExpand(dict *d, unsigned long size) {
    return _dictExpand(d, size, NULL);
}

int _dictExpand(dict *d, unsigned long size, int* malloc_failed)
{
    if (malloc_failed) *malloc_failed = 0;

    /* the size is invalid if it is smaller than the number of
     * elements already inside the hash table */
    if (dictIsRehashing(d) || d->ht[0].used > size)
        return DICT_ERR;
    
    //步骤1，计算出新的realsize
    dictht n; /* the new hash table */
    unsigned long realsize = _dictNextPower(size);

    /* Detect overflows */
    if (realsize < size || realsize * sizeof(dictEntry*) < realsize)
        return DICT_ERR;

    /* Rehashing to the same table size is not useful. */
    if (realsize == d->ht[0].size) return DICT_ERR;
    
    //步骤2,申请内存空间，创建dictht
    /* Allocate the new hash table and initialize all pointers to NULL */
    n.size = realsize;
    n.sizemask = realsize-1;
    if (malloc_failed) {
        n.table = ztrycalloc(realsize*sizeof(dictEntry*));
        *malloc_failed = n.table == NULL;
        if (*malloc_failed)
            return DICT_ERR;
    } else
        n.table = zcalloc(realsize*sizeof(dictEntry*));

    n.used = 0;

    /* Is this the first initialization? If so it's not really a rehashing
     * we just set the first hash table so that it can accept keys. */
    if (d->ht[0].table == NULL) {
        d->ht[0] = n;
        return DICT_OK;
    }
    
    //步骤3，设置rehashidx=0,标识开始rehash
    /* Prepare a second hash table for incremental rehashing */
    d->ht[1] = n;
    d->rehashidx = 0;
    return DICT_OK;
}

//哈希表大小是二次幂
/* Our hash table capability is a power of two */
static unsigned long _dictNextPower(unsigned long size)
{
    unsigned long i = DICT_HT_INITIAL_SIZE;

    if (size >= LONG_MAX) return LONG_MAX + 1LU;
    while(1) {
        if (i >= size)
            return i;
        i *= 2;
    }
}

怎么判断要进行扩容还是收缩

触发扩容的条件，负载因子>5

负载因子 = 保存的节点数（used）/ 哈希表大小（size）。

即是load_factor = ht[0].used / ht[0].size。

static int dict_can_resize = 1;
static unsigned int dict_force_resize_ratio = 5;

static int _dictExpandIfNeeded(dict *d)
{
    /* Incremental rehashing already in progress. Return. */
    if (dictIsRehashing(d)) return DICT_OK;
    
    // 大小为0需要设置初始哈希表大小为4
    if (d->ht[0].size == 0) return dictExpand(d, DICT_HT_INITIAL_SIZE);

    /* If we reached the 1:1 ratio, and we are allowed to resize the hash
     * table (global setting) or we should avoid it but the ratio between
     * elements/buckets is over the "safe" threshold, we resize doubling
     * the number of buckets. */
    if (d->ht[0].used >= d->ht[0].size &&
        (dict_can_resize ||
         d->ht[0].used/d->ht[0].size > dict_force_resize_ratio) &&
        dictTypeExpandAllowed(d))    // 负载因子超过5，执行 dictExpand
    {
        return dictExpand(d, d->ht[0].used + 1);
    }
    return DICT_OK;
}

触发收缩的条件:哈希表的负载因子小于 0.1 时

//dict.c
int dictResize(dict *d)
{
    unsigned long minimal;

    if (!dict_can_resize || dictIsRehashing(d)) return DICT_ERR;
    minimal = d->ht[0].used;
    if (minimal < DICT_HT_INITIAL_SIZE)
        minimal = DICT_HT_INITIAL_SIZE;
    return dictExpand(d, minimal);
}


//dictResize函数的使用场景
//server.h
#define HASHTABLE_MIN_FILL        10      /* Minimal hash table fill 10% */

//server.c
int htNeedsResize(dict *dict) {
    long long size, used;

    size = dictSlots(dict);
    used = dictSize(dict);
    return (size > DICT_HT_INITIAL_SIZE &&
            (used*100/size < HASHTABLE_MIN_FILL));
}
void tryResizeHashTables(int dbid) {
    if (htNeedsResize(server.db[dbid].dict))
        dictResize(server.db[dbid].dict);
    ....................
}

扩展或收缩哈希表都是用到了函数dictExpand，而新的哈希表最终大小都是 2 的 n 次幂

什么时候会用到扩展和收缩

扩展是在每次执行添加键值对或者修改键值对时候进行判断的，即是每次添加键值对的时候都会执行_dictExpandIfNeeded。
判断是否需要收缩是可以认为是使用定时器，定时去对比判断的。

dictExpand只是扩展了哈希表，还没有把哈希表内的键值对保存到ht[1]中，这个操作在其他地方进行的。这就需要到渐进式rehash。

渐进式rehash

扩展和收缩哈希表都需要将 ht[0] 里面的所有键值对 rehash 到 ht[1] 里面。但是，rehash 动作并不是一次性、集中式完成的，而是分多次、渐进式地完成的。因为一次性将所有键值对全部 rehash 到 ht[1] 的话，庞大的计算量可能会导致服务器在一段时间内停止服务。

渐进式rehash主要的函数dictRehash

每次进行新增、查询、修改、删除时，都会检查下dict.rehashidx是否大于-1,如果是则将dict.ht[0].table[rehashidx]的entry链表赋值给dict.ht[1]中，并将rehashidx++。直至dict.ht[0]的所有数据都rehash到dict.ht[1]中。
将dict.ht[1]赋值给dict.ht[0],将dict.ht[1]初始化为空哈希表，并释放原来的dict.ht[0]内存。
将rehashidx赋值为-1，代表rehash结束。

注意:在rehash过程中，新增操作的话，是直接写入到ht[1]，而查询、修改和删除则是会在dict.ht[0]和dict.ht[1]依次查找并执行。这样可以确保ht[0]的数据只减不增，随着rehash最终变为空。

渐进式rehash做的优化：为了防止 ht[0] 是个稀疏表（遍历很久遇到的都是NULL），从而导致函数阻塞时间太长，rehash做了个优化，引入“最大空格访问数”，也即代码中的 enmty_visits，初始值为 n*10。当遇到NULL的数量超过这个初始值直接返回。

//dictRehash被_dictRehashStep调用
static void _dictRehashStep(dict *d) {
    if (d->pauserehash == 0) dictRehash(d,1);
}
//每次执行增加、删除等操作时候，都会判断是否要执行_dictRehashStep

int dictRehash(dict *d, int n) {
    //最多有10*n个空桶
    int empty_visits = n*10; /* Max number of empty buckets to visit. */
    if (!dictIsRehashing(d)) return 0;

    while(n-- && d->ht[0].used != 0) {
        dictEntry *de, *nextde;

        /* Note that rehashidx can't overflow as we are sure there are more
         * elements because ht[0].used != 0 */
        assert(d->ht[0].size > (unsigned long)d->rehashidx);
        //是空桶的话，且超过10*n个空桶，就跳过
        while(d->ht[0].table[d->rehashidx] == NULL) {
            d->rehashidx++;
            if (--empty_visits == 0) return 1;
        }
        de = d->ht[0].table[d->rehashidx];
        /* Move all the keys in this bucket from the old to the new hash HT */
        while(de) {
            uint64_t h;

            nextde = de->next;
            /* Get the index in the new hash table */
            h = dictHashKey(d, de->key) & d->ht[1].sizemask;
            de->next = d->ht[1].table[h];
            d->ht[1].table[h] = de;
            d->ht[0].used--;
            d->ht[1].used++;
            de = nextde;
        }
        d->ht[0].table[d->rehashidx] = NULL;
        d->rehashidx++;
    }

    //判断是否已经把ht[0]的所有数据都rehash到ht[1]，若是则赋值ht[0]
    if (d->ht[0].used == 0) {
        zfree(d->ht[0].table);
        d->ht[0] = d->ht[1];
        _dictReset(&d->ht[1]);
        d->rehashidx = -1;
        return 0;
    }

    /* More to rehash... */
    return 1;
}

渐进式 rehash 的好处在于它采取分而治之的方式，将 rehash 键值对所需的计算工作均摊到对字典的每个添加、删除、查找和更新操作上，从而避免了集中式 rehash 而带来的庞大计算量。

字典API

字典的创建dictCreate

主要是分配空间，初始化一些数据。

static void _dictReset(dict *ht) {
    ht->table = NULL;
    ht->size = 0;
    ht->sizemask = 0;
    ht->used = 0;
}

/* Create a new hash table */
static dict *dictCreate(dictType *type, void *privDataPtr) {
    dict *ht = malloc(sizeof(*ht));
    _dictInit(ht,type,privDataPtr);
    return ht;
}

/* Initialize the hash table */
static int _dictInit(dict *ht, dictType *type, void *privDataPtr) {
    _dictReset(ht);
    ht->type = type;
    ht->privdata = privDataPtr;
    return DICT_OK;
}

增加键值对dictAdd

判断是否需要进行rehash的。
通过哈希函数找到要插入的key的槽位置(通过函数_dictKeyIndex)
判断要插入哪个哈希表，并插入，最终设置value。

/* Add an element to the target hash table */
int dictAdd(dict *d, void *key, void *val)
{
    dictEntry *entry = dictAddRaw(d,key,NULL);

    if (!entry) return DICT_ERR;
    dictSetVal(d, entry, val);
    return DICT_OK;
}

dictEntry *dictAddRaw(dict *d, void *key, dictEntry **existing)
{
    long index;
    dictEntry *entry;
    dictht *ht;

    if (dictIsRehashing(d)) _dictRehashStep(d);//判断是否需要进行rehash

    /* Get the index of the new element, or -1 if
     * the element already exists. */
    if ((index = _dictKeyIndex(d, key, dictHashKey(d,key), existing)) == -1)
        return NULL;

    /* Allocate the memory and store the new entry.
     * Insert the element in top, with the assumption that in a database
     * system it is more likely that recently added entries are accessed
     * more frequently. */
    ht = dictIsRehashing(d) ? &d->ht[1] : &d->ht[0];
    entry = zmalloc(sizeof(*entry));
    entry->next = ht->table[index];
    ht->table[index] = entry;
    ht->used++;

    /* Set the hash entry fields. */
    dictSetKey(d, entry, key);
    return entry;
}

static long _dictKeyIndex(dict *d, const void *key, uint64_t hash, dictEntry **existing)
{
    unsigned long idx, table;
    dictEntry *he;
    if (existing) *existing = NULL;

    /* Expand the hash table if needed */
    if (_dictExpandIfNeeded(d) == DICT_ERR)
        return -1;
    for (table = 0; table <= 1; table++) {
        idx = hash & d->ht[table].sizemask;
        /* Search if this slot does not already contain the given key */
        he = d->ht[table].table[idx];
        while(he) {
            if (key==he->key || dictCompareKeys(d, key, he->key)) {
                if (existing) *existing = he;
                return -1;
            }
            he = he->next;
        }
        if (!dictIsRehashing(d)) break;
    }
    return idx;
}

查找给定键的值dictFetchValue

利用 dictFind 找到给定键的 dictEntry，然后获得值，值的类型不确定，所以返回一个万能指针。

其主要是调用了dictFind。该操作内部有执行rehash工作，说明了渐进式rehash是分摊到查找、删除、更新等操作里面，从而避免了集中式 rehash 而带来的庞大计算量。

步骤：

计算指定key的哈希值
遍历hash表，根据hash值和sizemask计算出table的索引(是遍历Dict两个hash表，因为渐进式rehash中，不能确定该key是在哪个哈希表的）
取得hash节点链表信息并进行遍历

void *dictFetchValue(dict *d, const void *key) {
    dictEntry *he;

    he = dictFind(d,key);
    return he ? dictGetVal(he) : NULL;
}

dictEntry *dictFind(dict *d, const void *key)
{
    dictEntry *he;
    uint64_t h, idx, table;

    if (dictSize(d) == 0) return NULL; /* dict is empty */
    if (dictIsRehashing(d)) _dictRehashStep(d);    //符合就执行rehash
    h = dictHashKey(d, key);    //通过给定的哈希函数和去计算key的哈希值
    for (table = 0; table <= 1; table++) {
        idx = h & d->ht[table].sizemask;    //hash值和sizemask计算出table的索引
        he = d->ht[table].table[idx];        //所在的哈希表节点
        while(he) {
            if (key==he->key || dictCompareKeys(d, key, he->key))
                return he;
            he = he->next;
        }
        if (!dictIsRehashing(d)) return NULL;
    }
    return NULL;
}

删除给定的键dictDelete

通过哈希算法找到对应的键，从对应链表移除

int dictDelete(dict *ht, const void *key) {
    return dictGenericDelete(ht,key,0) ? DICT_OK : DICT_ERR;
}

static dictEntry *dictGenericDelete(dict *d, const void *key, int nofree) {
    uint64_t h, idx;
    dictEntry *he, *prevHe;
    int table;

    if (d->ht[0].used == 0 && d->ht[1].used == 0) return NULL;

    if (dictIsRehashing(d)) _dictRehashStep(d);
    h = dictHashKey(d, key);    //计算出key的哈希值

    for (table = 0; table <= 1; table++) {
        idx = h & d->ht[table].sizemask;    //通过哈希值和sizemask找到该索引
        he = d->ht[table].table[idx];       //找到该链表结点
        prevHe = NULL;
        //通过链表里寻找
        while(he) {
            if (key==he->key || dictCompareKeys(d, key, he->key)) {
                /* Unlink the element from the list */
                if (prevHe)
                    prevHe->next = he->next;
                else
                    d->ht[table].table[idx] = he->next;
                if (!nofree) {
                    dictFreeKey(d, he);
                    dictFreeVal(d, he);
                    zfree(he);
                }
                d->ht[table].used--;
                return he;
            }
            prevHe = he;
            he = he->next;
        }
        if (!dictIsRehashing(d)) break;
    }
    return NULL; /* not found */
}

添加或覆盖键值对(更新键值对）dictReplace

添加一个元素，如果键已经存在，则丢弃旧值。

该函数也主要是调用dictAddRaw。

如果键是从头开始添加的，则返回1；如果已经有一个元素具有这样的键，并且dictReplace()刚刚执行了值更新操作，则返回0

int dictReplace(dict *d, void *key, void *val)
{
    dictEntry *entry, *existing, auxentry;

    entry = dictAddRaw(d,key,&existing);
    if (entry) {
        dictSetVal(d, entry, val);
        return 1;
    }

    auxentry = *existing;
    dictSetVal(d, existing, val);
    dictFreeVal(d, &auxentry);
    return 0;
}