Redis源码阅读(九) – db

Redis是kv内存数据库，它的db是dict实现的，结构如下：

/* Redis database representation. There are multiple databases identified
 * by integers from 0 (the default database) up to the max configured
 * database. The database number is the 'id' field in the structure. */
typedef struct redisDb {
    dict *dict;                 /* The keyspace for this DB */
    dict *expires;              /* Timeout of keys with a timeout set */
    dict *blocking_keys;        /* Keys with clients waiting for data (BLPOP)*/
    dict *ready_keys;           /* Blocked keys that received a PUSH */
    dict *watched_keys;         /* WATCHED keys for MULTI/EXEC CAS */
    int id;                     /* Database ID */
    long long avg_ttl;          /* Average TTL, just for stats */
} redisDb;

db不只用来存储键值对，还有一些其他的功能:

dict: 存放键值对
expires: 存放键和过期时间的映射
blocking_keys: 存放block操作的key与client的映射
ready_keys: 接收到push操作的blocked key
watched_keys: WATCHED keys

object

Redis使用object封装底层的各种数据结构，结构如下:

typedef struct redisObject {
    unsigned type:4;
    unsigned encoding:4;
    unsigned lru:LRU_BITS; /* LRU time (relative to global lru_clock) or
                            * LFU data (least significant 8 bits frequency
                            * and most significant 16 bits decreas time). */
    int refcount;
    void *ptr;
} robj;

object实现了简单的引用计数refcount，不同的地方可能会使用相同的object，当refcount为0时会自动释放。这样做既能节省内存，也简化了生命周期管理:

void decrRefCount(robj *o) {
    if (o->refcount == 1) {
        switch(o->type) {
        case OBJ_STRING: freeStringObject(o); break;
        case OBJ_LIST: freeListObject(o); break;
        case OBJ_SET: freeSetObject(o); break;
        case OBJ_ZSET: freeZsetObject(o); break;
        case OBJ_HASH: freeHashObject(o); break;
        case OBJ_MODULE: freeModuleObject(o); break;
        default: serverPanic("Unknown object type"); break;
        }
        zfree(o);
    } else {
        if (o->refcount <= 0) serverPanic("decrRefCount against refcount <= 0");
        if (o->refcount != OBJ_SHARED_REFCOUNT) o->refcount--;
    }
}

同一type的object会有不同的底层结构encoding，在t_*.c文件中，为每种类型封装了一套api。具体类型如下：

type	encoding	limit
string	embedded string	len <= 44
string	sds	-
list	quicklist	-
set	intset	all members are numbers
set	hash table	-
zset	ziplist	length <= zset-max-ziplist-entries && element size <= zset-max-ziplist-value
zset	skiplist	-
hash	ziplist	number of entries <= hash-max-ziplist-entries && value size <= hash-max-ziplist-value
hash	hash table	-

Redis对string做了一些优化，包括使用embedded string：

/* Create a string object with EMBSTR encoding if it is smaller than
 * OBJ_ENCODING_EMBSTR_SIZE_LIMIT, otherwise the RAW encoding is
 * used.
 *
 * The current limit of 39 is chosen so that the biggest string object
 * we allocate as EMBSTR will still fit into the 64 byte arena of jemalloc. */
#define OBJ_ENCODING_EMBSTR_SIZE_LIMIT 44
robj *createStringObject(const char *ptr, size_t len) {
    if (len <= OBJ_ENCODING_EMBSTR_SIZE_LIMIT)
        return createEmbeddedStringObject(ptr,len);
    else
        return createRawStringObject(ptr,len);
}

embedded string将obj header和sds存放在连续的内存，从注释中可以看出，embedded string为了fit进jemalloc 64byte arena，只有当len <= 44时，才会使用: 减去robj头部和sdshdr8头部还有sds结尾的\0: 64 - 16 - 3 - 1 = 44。

db

db的dictType如下:

/* Db->dict, keys are sds strings, vals are Redis objects. */
dictType dbDictType = {
    dictSdsHash,                /* hash function */
    NULL,                       /* key dup */
    NULL,                       /* val dup */
    dictSdsKeyCompare,          /* key compare */
    dictSdsDestructor,          /* key destructor */
    dictObjectDestructor   /* val destructor */
};

Redis的键以sds保存，值为object。键不用object的原因主要是键的类型固定，一定是string，而且键唯一，所以指向相同对象的情况很少，使用object会浪费空间。命令的调用过程很简单:

查表lookupCommand()
调用命令
查找db中有无该键，判断类型
执行相关的操作

expire

Redis支持键过期操作，db->expires中存放键和过期时间的映射，主要有2种方式处理过期：

在查找key的相关命令中会调用expireIfNeeded()，处理相关key是否过期
因为有些key一直不被访问，所以会定时进行key过期扫描:
- beforeSleep()中调用activeExpireCycle(ACTIVE_EXPIRE_CYCLE_FAST)
- databaseCron()中调用activeExpireCycle(ACTIVE_EXPIRE_CYCLE_SLOW)

为了避免处理过期key阻塞太久，activeExpireCycle()有几个限制条件，当不满足后会退出：

每次会随机选择ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP 20个key，查看是否过期。每16次循环会判断是否超时:
- FAST的时间限制为ACTIVE_EXPIRE_CYCLE_FAST_DURATION 1ms
- SlOW的时间限制为25%CPU时间
每次随机扫描到的已过期key数量，低于ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP/4会退出

这些都是为了将耗时长的操作分摊到多次操作中，避免阻塞服务器。

evict

Redis是内存数据库，maxmemory限制了能够使用的内存空间，在每次调用命令之前都会调用freeMemoryIfNeeded()检查内存够不够用，不够会进行evict，当无法释放足够的内存会返回错误，拒绝写操作。 Redis支持下列策略:

volatile-lru -> Evict using approximated LRU among the keys with an expire set.
allkeys-lru -> Evict any key using approximated LRU.
volatile-lfu -> Evict using approximated LFU among the keys with an expire set.
allkeys-lfu -> Evict any key using approximated LFU.
volatile-random -> Remove a random key among the ones with an expire set.
allkeys-random -> Remove a random key, any key.
volatile-ttl -> Remove the key with the nearest expire time (minor TTL)
noeviction -> Don’t evict anything, just return an error on write operations.

Redis首先创建一个pool，默认大小为16，存放待evict的候选元素:

#define EVPOOL_SIZE 16
#define EVPOOL_CACHED_SDS_SIZE 255
struct evictionPoolEntry {
    unsigned long long idle;    /* Object idle time (inverse frequency for LFU) */
    sds key;                    /* Key name. */
    sds cached;                 /* Cached SDS object for key name. */
    int dbid;                   /* Key DB number. */
};

每次需要evict时，Redis从每个db都随机挑选maxmemory-sample个key加入到evictionPool中。evictionPool中的元素按照idle升序排列，最右的为待evict的key。 idle按照策略的不同，有不同的计算方式:

TTL: idle = ULLONG_MAX - TTL
LRU: object->lru保存了最后一次访问该key的时间，idle计算方式为currentTime - object->lru

LFU: object->lru格式为：

         16 bits      8 bits
    +----------------+--------+
    + Last decr time | LOG_C  |
    +----------------+--------+

LOG_C为对数的访问频率，每次访问该key时有几率增加，随着count值变大，增加的概率也变小，代码如下：

/* Logarithmically increment a counter. The greater is the current counter value
 * the less likely is that it gets really implemented. Saturate it at 255. */
uint8_t LFULogIncr(uint8_t counter) {
    if (counter == 255) return 255;
    double r = (double)rand()/RAND_MAX;
    double baseval = counter - LFU_INIT_VAL;
    if (baseval < 0) baseval = 0;
    double p = 1.0/(baseval*server.lfu_log_factor+1);
    if (r < p) counter++;
    return counter;
}

Last decr time保存着上次降低频率的时间，以分钟计。该值用于降低频率，因为随着时间的增加，频率需要降低，代码如下：

/* If the object decrement time is reached, decrement the LFU counter and
 * update the decrement time field. Return the object frequency counter.
 *
 * This function is used in order to scan the dataset for the best object
 * to fit: as we check for the candidate, we incrementally decrement the
 * counter of the scanned objects if needed. */
#define LFU_DECR_INTERVAL 1
unsigned long LFUDecrAndReturn(robj *o) {
    unsigned long ldt = o->lru >> 8;
    unsigned long counter = o->lru & 255;
    if (LFUTimeElapsed(ldt) >= server.lfu_decay_time && counter) {
        if (counter > LFU_INIT_VAL*2) {
            counter /= 2;
            if (counter < LFU_INIT_VAL*2) counter = LFU_INIT_VAL*2;
        } else {
            counter--;
        }
        o->lru = (LFUGetTimeInMinutes()<<8) | counter;
    }
    return counter;
}

最后的idle就是255 - LFUDecrAndReturn()。

会按照idle从大到小的顺序进行淘汰，直到内存释放到 maxmemory 以下。evictionPool中的key可能不会全部释放，这会带来问题:

之前的 key 被添加到 evictionPool 但没有被淘汰
这个 key 被访问到了，idle 变小，但是不会从 evictionPool 中移除
这个 key 在之后被淘汰了

不过本身淘汰也是随机选择 key，影响也不大，最多导致命中率下降。

bug?

看代码感觉在freeMemoryIfNeeded()最后有点问题，代码如下:

cant_free:
    /* We are here if we are not able to reclaim memory. There is only one
     * last thing we can try: check if the lazyfree thread has jobs in queue
     * and wait... */
    while(bioPendingJobsOfType(BIO_LAZY_FREE)) {
        if (((mem_reported - zmalloc_used_memory()) + mem_freed) >= mem_tofree)
            break;
        usleep(1000);
    }
    return C_ERR;

这是当释放不了需要的内存空间时，等待lasyfree完成。这个代码主要有2个问题:

mem_reported值为开始evict前分配的内存，mem_reported - zmalloc_used_memory()值为总共释放的空间，再加上mem_freed不知道代表着什么
永远返回C_ERR，为什么还要等待lazyfree完成呢?

lazyfree

Redis4.0新增了lazyfree功能，因为Redis是单线程的，某些操作会阻塞住服务，现在可以放到另一个线程去执行，目前只有3种操作：

BIO_CLOSE_FILE
BIO_AOF_FSYNC
BIO_LAZY_FREE

在Redis启动时会调用bioInit()初始化相关锁和条件变量，并为每种操作分别创建一个线程。bio的实现很简单，传统的单生产者单消费者模型，用list作为队列，互斥锁提供互斥，条件变量来通知变化。

object

db

expire

evict

bug?

lazyfree

留下评论