Redis源码阅读(九) – db
Redis是kv内存数据库,它的db是dict实现的,结构如下:
/* Redis database representation. There are multiple databases identified
* by integers from 0 (the default database) up to the max configured
* database. The database number is the 'id' field in the structure. */
typedef struct redisDb {
dict *dict; /* The keyspace for this DB */
dict *expires; /* Timeout of keys with a timeout set */
dict *blocking_keys; /* Keys with clients waiting for data (BLPOP)*/
dict *ready_keys; /* Blocked keys that received a PUSH */
dict *watched_keys; /* WATCHED keys for MULTI/EXEC CAS */
int id; /* Database ID */
long long avg_ttl; /* Average TTL, just for stats */
} redisDb;
db不只用来存储键值对,还有一些其他的功能:
dict: 存放键值对expires: 存放键和过期时间的映射blocking_keys: 存放block操作的key与client的映射ready_keys: 接收到push操作的blocked keywatched_keys:WATCHED keys
object
Redis使用object封装底层的各种数据结构,结构如下:
typedef struct redisObject {
unsigned type:4;
unsigned encoding:4;
unsigned lru:LRU_BITS; /* LRU time (relative to global lru_clock) or
* LFU data (least significant 8 bits frequency
* and most significant 16 bits decreas time). */
int refcount;
void *ptr;
} robj;
object实现了简单的引用计数refcount,不同的地方可能会使用相同的object,当refcount为0时会自动释放。这样做既能节省内存,也简化了生命周期管理:
void decrRefCount(robj *o) {
if (o->refcount == 1) {
switch(o->type) {
case OBJ_STRING: freeStringObject(o); break;
case OBJ_LIST: freeListObject(o); break;
case OBJ_SET: freeSetObject(o); break;
case OBJ_ZSET: freeZsetObject(o); break;
case OBJ_HASH: freeHashObject(o); break;
case OBJ_MODULE: freeModuleObject(o); break;
default: serverPanic("Unknown object type"); break;
}
zfree(o);
} else {
if (o->refcount <= 0) serverPanic("decrRefCount against refcount <= 0");
if (o->refcount != OBJ_SHARED_REFCOUNT) o->refcount--;
}
}
同一type的object会有不同的底层结构encoding,在t_*.c文件中,为每种类型封装了一套api。具体类型如下:
| type | encoding | limit |
|---|---|---|
| string | embedded string | len <= 44 |
| string | sds | - |
| list | quicklist | - |
| set | intset | all members are numbers |
| set | hash table | - |
| zset | ziplist | length <= zset-max-ziplist-entries && element size <= zset-max-ziplist-value |
| zset | skiplist | - |
| hash | ziplist | number of entries <= hash-max-ziplist-entries && value size <= hash-max-ziplist-value |
| hash | hash table | - |
Redis对string做了一些优化,包括使用embedded string:
/* Create a string object with EMBSTR encoding if it is smaller than
* OBJ_ENCODING_EMBSTR_SIZE_LIMIT, otherwise the RAW encoding is
* used.
*
* The current limit of 39 is chosen so that the biggest string object
* we allocate as EMBSTR will still fit into the 64 byte arena of jemalloc. */
#define OBJ_ENCODING_EMBSTR_SIZE_LIMIT 44
robj *createStringObject(const char *ptr, size_t len) {
if (len <= OBJ_ENCODING_EMBSTR_SIZE_LIMIT)
return createEmbeddedStringObject(ptr,len);
else
return createRawStringObject(ptr,len);
}
embedded string将obj header和sds存放在连续的内存,从注释中可以看出,embedded string为了fit进jemalloc 64byte arena,只有当len <= 44时,才会使用:
减去robj头部和sdshdr8头部还有sds结尾的\0: 64 - 16 - 3 - 1 = 44。
db
db的dictType如下:
/* Db->dict, keys are sds strings, vals are Redis objects. */
dictType dbDictType = {
dictSdsHash, /* hash function */
NULL, /* key dup */
NULL, /* val dup */
dictSdsKeyCompare, /* key compare */
dictSdsDestructor, /* key destructor */
dictObjectDestructor /* val destructor */
};
Redis的键以sds保存,值为object。键不用object的原因主要是键的类型固定,一定是string,而且键唯一,所以指向相同对象的情况很少,使用object会浪费空间。
命令的调用过程很简单:
- 查表
lookupCommand() - 调用命令
- 查找
db中有无该键,判断类型 - 执行相关的操作
expire
Redis支持键过期操作,db->expires中存放键和过期时间的映射,主要有2种方式处理过期:
- 在查找
key的相关命令中会调用expireIfNeeded(),处理相关key是否过期 - 因为有些
key一直不被访问,所以会定时进行key过期扫描:beforeSleep()中调用activeExpireCycle(ACTIVE_EXPIRE_CYCLE_FAST)databaseCron()中调用activeExpireCycle(ACTIVE_EXPIRE_CYCLE_SLOW)
为了避免处理过期key阻塞太久,activeExpireCycle()有几个限制条件,当不满足后会退出:
- 每次会随机选择
ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP 20个key,查看是否过期。每16次循环会判断是否超时:FAST的时间限制为ACTIVE_EXPIRE_CYCLE_FAST_DURATION 1msSlOW的时间限制为25%CPU时间
- 每次随机扫描到的已过期
key数量,低于ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP/4会退出
这些都是为了将耗时长的操作分摊到多次操作中,避免阻塞服务器。
evict
Redis是内存数据库,maxmemory限制了能够使用的内存空间,在每次调用命令之前都会调用freeMemoryIfNeeded()检查内存够不够用,不够会进行evict,当无法释放足够的内存会返回错误,拒绝写操作。
Redis支持下列策略:
- volatile-lru -> Evict using approximated LRU among the keys with an expire set.
- allkeys-lru -> Evict any key using approximated LRU.
- volatile-lfu -> Evict using approximated LFU among the keys with an expire set.
- allkeys-lfu -> Evict any key using approximated LFU.
- volatile-random -> Remove a random key among the ones with an expire set.
- allkeys-random -> Remove a random key, any key.
- volatile-ttl -> Remove the key with the nearest expire time (minor TTL)
- noeviction -> Don’t evict anything, just return an error on write operations.
Redis首先创建一个pool,默认大小为16,存放待evict的候选元素:
#define EVPOOL_SIZE 16
#define EVPOOL_CACHED_SDS_SIZE 255
struct evictionPoolEntry {
unsigned long long idle; /* Object idle time (inverse frequency for LFU) */
sds key; /* Key name. */
sds cached; /* Cached SDS object for key name. */
int dbid; /* Key DB number. */
};
每次需要evict时,Redis从每个db都随机挑选maxmemory-sample个key加入到evictionPool中。evictionPool中的元素按照idle升序排列,最右的为待evict的key。
idle按照策略的不同,有不同的计算方式:
TTL:idle = ULLONG_MAX - TTLLRU:object->lru保存了最后一次访问该key的时间,idle计算方式为currentTime - object->lruLFU:object->lru格式为:16 bits 8 bits +----------------+--------+ + Last decr time | LOG_C | +----------------+--------+
LOG_C为对数的访问频率,每次访问该key时有几率增加,随着count值变大,增加的概率也变小,代码如下:
/* Logarithmically increment a counter. The greater is the current counter value
* the less likely is that it gets really implemented. Saturate it at 255. */
uint8_t LFULogIncr(uint8_t counter) {
if (counter == 255) return 255;
double r = (double)rand()/RAND_MAX;
double baseval = counter - LFU_INIT_VAL;
if (baseval < 0) baseval = 0;
double p = 1.0/(baseval*server.lfu_log_factor+1);
if (r < p) counter++;
return counter;
}
Last decr time保存着上次降低频率的时间,以分钟计。该值用于降低频率,因为随着时间的增加,频率需要降低,代码如下:
/* If the object decrement time is reached, decrement the LFU counter and
* update the decrement time field. Return the object frequency counter.
*
* This function is used in order to scan the dataset for the best object
* to fit: as we check for the candidate, we incrementally decrement the
* counter of the scanned objects if needed. */
#define LFU_DECR_INTERVAL 1
unsigned long LFUDecrAndReturn(robj *o) {
unsigned long ldt = o->lru >> 8;
unsigned long counter = o->lru & 255;
if (LFUTimeElapsed(ldt) >= server.lfu_decay_time && counter) {
if (counter > LFU_INIT_VAL*2) {
counter /= 2;
if (counter < LFU_INIT_VAL*2) counter = LFU_INIT_VAL*2;
} else {
counter--;
}
o->lru = (LFUGetTimeInMinutes()<<8) | counter;
}
return counter;
}
最后的idle就是255 - LFUDecrAndReturn()。
会按照idle从大到小的顺序进行淘汰,直到内存释放到 maxmemory 以下。evictionPool中的key可能不会全部释放,这会带来问题:
- 之前的
key被添加到evictionPool但没有被淘汰 - 这个
key被访问到了,idle变小,但是不会从evictionPool中移除 - 这个
key在之后被淘汰了
不过本身淘汰也是随机选择 key,影响也不大,最多导致命中率下降。
bug?
看代码感觉在freeMemoryIfNeeded()最后有点问题,代码如下:
cant_free:
/* We are here if we are not able to reclaim memory. There is only one
* last thing we can try: check if the lazyfree thread has jobs in queue
* and wait... */
while(bioPendingJobsOfType(BIO_LAZY_FREE)) {
if (((mem_reported - zmalloc_used_memory()) + mem_freed) >= mem_tofree)
break;
usleep(1000);
}
return C_ERR;
这是当释放不了需要的内存空间时,等待lasyfree完成。这个代码主要有2个问题:
mem_reported值为开始evict前分配的内存,mem_reported - zmalloc_used_memory()值为总共释放的空间,再加上mem_freed不知道代表着什么- 永远返回
C_ERR,为什么还要等待lazyfree完成呢?
lazyfree
Redis4.0新增了lazyfree功能,因为Redis是单线程的,某些操作会阻塞住服务,现在可以放到另一个线程去执行,
目前只有3种操作:
BIO_CLOSE_FILEBIO_AOF_FSYNCBIO_LAZY_FREE
在Redis启动时会调用bioInit()初始化相关锁和条件变量,并为每种操作分别创建一个线程。bio的实现很简单,
传统的单生产者单消费者模型,用list作为队列,互斥锁提供互斥,条件变量来通知变化。
留下评论