The Single-Threaded Redis Is Already Fast Enough. Why Does Redis 6.0 Introduce Multithreading, and What Are Its Advantages?

As a memory-based caching system, Redis has always been known for its high performance. With no context-switching or lock-contention overhead, a single-threaded Redis instance can still reach roughly 110,000 reads/s and 81,000 writes/s.

However, the single-threaded design also brings some problems to Redis:

1. Only one CPU core can be used;

2. Deleting a large key (for example, a Set containing millions of objects) blocks the server for several seconds;

3. QPS is difficult to improve further.

To solve these problems, Redis introduced Lazy Free in version 4.0 and multi-threaded I/O in version 6.0, gradually moving toward a multi-threaded design. Both are introduced in detail below.

The principle of single-threaded Redis

Redis is widely known as a single-threaded service. What does that mean in practice, and how does a single thread serve many concurrent clients? To answer these questions, let's first look at how Redis works.

The Redis server is an event-driven program, and the server needs to handle the following two types of events:

  1. File events: The Redis server connects with clients (or other Redis servers) through sockets, and a file event is the abstraction of the server's operations on those sockets. Communication between the server and a client generates the corresponding file events, and the server completes a series of network operations by monitoring and handling them, such as connect, accept, read, write, and close;
  2. Time events: Some operations of the Redis server (such as the serverCron function) need to be executed at a given time, and a time event is the abstraction of this kind of scheduled operation, such as cleaning up expired keys and collecting server statistics.

![Redis event model: file events and time events](261.png)

As shown in the figure above, Redis abstracts its work into file events and time events. The event loop monitors the registered I/O events; once a file event is ready, Redis processes it first and then handles any due time events. All of this happens on one thread, which is why Redis is called single-threaded. In addition, as shown in the figure below, Redis implements its own I/O event handler, the file event handler, based on the Reactor pattern. It uses I/O multiplexing to monitor multiple sockets at once and associates each socket with a handler function, so a single thread can serve many clients concurrently.

![The file event handler, built on the Reactor pattern](262.png)
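To make the model concrete, here is a minimal sketch of a Reactor-style event loop built on Linux epoll. This is illustrative only, not Redis's actual code: Redis's loop lives in ae.c and abstracts over epoll, select, kqueue, and evport, and the handler steps below are hypothetical.

```c
#include <sys/epoll.h>

#define MAX_EVENTS 64

/* Single-threaded event loop: poll for ready sockets, dispatch each
 * to its handler, then run any due time events. */
void event_loop(int epfd, int listen_fd) {
    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        /* Block until sockets are ready; the timeout lets time
         * events (serverCron-style housekeeping) run periodically. */
        int n = epoll_wait(epfd, events, MAX_EVENTS, 100 /* ms */);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {
                /* accept handler: accept(), then register the new
                 * client socket via epoll_ctl(EPOLL_CTL_ADD). */
            } else if (events[i].events & EPOLLIN) {
                /* read handler: read the request, execute the
                 * command, queue the reply. */
            } else if (events[i].events & EPOLLOUT) {
                /* write handler: flush the queued reply. */
            }
        }
        /* process time events here (expired-key cleanup, stats, ...) */
    }
}
```

One thread runs the whole loop, so no handler ever races another: this is the property that lets Redis skip locking entirely.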

This design needs no extra locking during data processing, which keeps it simple and guarantees high performance. Of course, "single-threaded" only describes event processing; the Redis process as a whole is not strictly single-threaded. For example, when creating an RDB file, Redis forks a child process to do the work. That, however, is beyond the scope of this article.

Lazy Free Mechanism

As described above, Redis processes client commands on a single thread, and while it is very fast, it cannot respond to other clients while a command is running. If a client sends a command that takes a long time to process, such as deleting a Set key containing millions of objects or running FLUSHDB or FLUSHALL, the server has to reclaim a large amount of memory and may block for several seconds, which is a disaster for a heavily loaded caching system. To solve this problem, Redis 4.0 introduced Lazy Free to make such slow operations asynchronous. This was also a first step toward multithreading in event processing.

As the Redis author stated in his blog, one way to cope with slow operations is progressive processing via a time event: when deleting a Set key with millions of objects, delete only part of the data on each tick until the whole key is gone. The drawback is that reclamation may fail to keep up with allocation and eventually exhaust memory. The final implementation therefore deletes large keys asynchronously (the non-blocking UNLINK command): the main thread only unlinks the key from the keyspace, which returns quickly so it can keep processing other events, while a separate thread reclaims the key's memory in the background.

Taking deletion (the DEL command) as an example, let's see how Redis completes the operation. Below is the entry point of the delete function; lazyfree_lazy_user_del is the configuration option that controls the default behavior of DEL, and when it is enabled DEL is executed as if it were UNLINK.

```c
void delCommand(client *c) {
    delGenericCommand(c,server.lazyfree_lazy_user_del);
}

/* This command implements DEL and LAZYDEL. */
void delGenericCommand(client *c, int lazy) {
    int numdel = 0, j;

    for (j = 1; j < c->argc; j++) {
        expireIfNeeded(c->db,c->argv[j]);
        // Determine whether DEL is executed in lazy form according to the configuration
        int deleted = lazy ? dbAsyncDelete(c->db,c->argv[j]) :
                             dbSyncDelete(c->db,c->argv[j]);
        if (deleted) {
            signalModifiedKey(c,c->db,c->argv[j]);
            notifyKeyspaceEvent(NOTIFY_GENERIC,
                "del",c->argv[j],c->db->id);
            server.dirty++;
            numdel++;
        }
    }
    addReplyLongLong(c,numdel);
}
```

Synchronous deletion simply removes the key and the value, recursing through any nested references; we won't go into it here. Asynchronous deletion is more interesting. When Redis reclaims an object, it first estimates the reclaiming cost (the "free effort"). Only when this cost exceeds a threshold is the object wrapped into a job and pushed onto an asynchronous processing queue; otherwise it is freed synchronously on the spot, which is cheaper. The estimate itself is simple: for a String the effort is 1, and for a Set it is the number of elements in the set.
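For reference, the estimate comes from lazyfreeGetFreeEffort() in lazyfree.c. The following is an abridged sketch of its shape (the exact set of cases varies by Redis version): collection types encoded as real data structures report their element count, everything else reports 1.

```c
/* Abridged sketch of lazyfreeGetFreeEffort(): the "effort" is roughly
 * the number of allocations that freeing the value would touch. */
size_t lazyfreeGetFreeEffort(robj *obj) {
    if (obj->type == OBJ_LIST) {
        quicklist *ql = obj->ptr;
        return ql->len;                      /* number of quicklist nodes */
    } else if (obj->type == OBJ_SET && obj->encoding == OBJ_ENCODING_HT) {
        dict *ht = obj->ptr;
        return dictSize(ht);                 /* number of set members */
    } else if (obj->type == OBJ_ZSET && obj->encoding == OBJ_ENCODING_SKIPLIST) {
        zset *zs = obj->ptr;
        return zs->zsl->length;              /* number of skiplist nodes */
    } else if (obj->type == OBJ_HASH && obj->encoding == OBJ_ENCODING_HT) {
        dict *ht = obj->ptr;
        return dictSize(ht);                 /* number of hash fields */
    } else {
        return 1;                            /* strings, ziplists, ... */
    }
}
```

With this estimate in hand, dbAsyncDelete() chooses between the synchronous and asynchronous paths: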

```c
/* Delete a key, value, and associated expiration entry if any, from the DB.
 * If there are enough allocations to free the value object may be put into
 * a lazy free list instead of being freed synchronously. The lazy free list
 * will be reclaimed in a different bio.c thread. */
#define LAZYFREE_THRESHOLD 64
int dbAsyncDelete(redisDb *db, robj *key) {
    /* Deleting an entry from the expires dict will not free the sds of
     * the key, because it is shared with the main dictionary. */
    if (dictSize(db->expires) > 0) dictDelete(db->expires,key->ptr);

    /* If the value is composed of a few allocations, to free in a lazy way
     * is actually just slower... So under a certain limit we just free
     * the object synchronously. */
    dictEntry *de = dictUnlink(db->dict,key->ptr);
    if (de) {
        robj *val = dictGetVal(de);
        // Calculate the reclaiming benefit (free effort) of the value
        size_t free_effort = lazyfreeGetFreeEffort(val);

        /* If releasing the object is too much work, do it in the background
         * by adding the object to the lazy free list.
         * Note that if the object is shared, to reclaim it now it is not
         * possible. This rarely happens, however sometimes the implementation
         * of parts of the Redis core may call incrRefCount() to protect
         * objects, and then call dbDelete(). In this case we'll fall
         * through and reach the dictFreeUnlinkedEntry() call, that will be
         * equivalent to just calling decrRefCount(). */
        // Delete asynchronously only when the reclaiming benefit exceeds the
        // threshold; otherwise degrade to synchronous deletion
        if (free_effort > LAZYFREE_THRESHOLD && val->refcount == 1) {
            atomicIncr(lazyfree_objects,1);
            bioCreateBackgroundJob(BIO_LAZY_FREE,val,NULL,NULL);
            dictSetVal(db->dict,de,NULL);
        }
    }

    /* Release the key-val pair, or just the key if we set the val
     * field to NULL in order to lazy free it later. */
    if (de) {
        dictFreeUnlinkedEntry(db->dict,de);
        if (server.cluster_enabled) slotToKeyDel(key->ptr);
        return 1;
    } else {
        return 0;
    }
}
```

By introducing a lazy-free thread, Redis makes these slow reclamation operations lazy, so the server no longer blocks on large-key deletion, FLUSHALL, or FLUSHDB.
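The background side of lazy free is a classic producer/consumer handoff: the main thread enqueues a job and a dedicated thread dequeues and frees it (in Redis this lives in bio.c as the BIO_LAZY_FREE job type). A minimal, self-contained sketch of the pattern, not Redis's actual code, could look like:

```c
#include <pthread.h>
#include <stdlib.h>

/* A simple job queue protected by a mutex + condition variable. */
typedef struct job { void *val; struct job *next; } job;
static job *head = NULL, *tail = NULL;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

/* Main thread: unlink the key, hand the value off, return fast. */
void submit_lazy_free(void *val) {
    job *j = malloc(sizeof(*j));
    j->val = val; j->next = NULL;
    pthread_mutex_lock(&lock);
    if (tail) tail->next = j; else head = j;
    tail = j;
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&lock);
}

/* Background thread: block until a job arrives, then do the slow free. */
void *lazy_free_worker(void *unused) {
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (head == NULL) pthread_cond_wait(&nonempty, &lock);
        job *j = head;
        head = j->next;
        if (head == NULL) tail = NULL;
        pthread_mutex_unlock(&lock);
        free(j->val);   /* stands in for freeing a huge Set/Hash/ZSet */
        free(j);
    }
    return NULL;
}
```

The main thread only pays the cost of an enqueue, which is why UNLINK returns quickly no matter how large the key is.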

Of course, implementing this feature took more than a new lazy-free thread: the storage structures of Redis's aggregate types were also reworked, because Redis used many shared objects internally, for example in the client output buffers. Rather than avoiding thread conflicts with locks, whose contention would degrade performance, Redis removed the shared objects and copies data directly. The different implementations of the ZSet node value in 3.x and 6.x show the change:

```c
// ZSet node in version 3.2.5: the value is a shared object, robj *obj
/* ZSETs use a specialized version of Skiplists */
typedef struct zskiplistNode {
    robj *obj;
    double score;
    struct zskiplistNode *backward;
    struct zskiplistLevel {
        struct zskiplistNode *forward;
        unsigned int span;
    } level[];
} zskiplistNode;

// ZSet node in version 6.0.10: the value is an unshared string, sds ele
/* ZSETs use a specialized version of Skiplists */
typedef struct zskiplistNode {
    sds ele;
    double score;
    struct zskiplistNode *backward;
    struct zskiplistLevel {
        struct zskiplistNode *forward;
        unsigned long span;
    } level[];
} zskiplistNode;
```

Removing shared objects not only made lazy free possible but also opened the door for Redis to move toward multithreading. As the author said:

Now that values of aggregated data types are fully unshared, and client output buffers don't contain shared objects as well, there is a lot to exploit. For example it is finally possible to implement threaded I/O in Redis, so that different clients are served by different threads. This means that we'll have a global lock only when accessing the database, but the clients read/write syscalls and even the parsing of the command the client is sending, can happen in different threads.

Multithreaded I/O and its limitations

Redis introduced Lazy Free in version 4.0. Since then, Redis has had a dedicated lazy-free thread for reclaiming large keys, and shared objects have been removed from the aggregate types, which made multithreading possible. Redis did not disappoint its users: version 6.0 implements multi-threaded I/O.

Implementation principle

As the Redis team has stated before, the performance bottleneck of Redis is not the CPU but memory and the network. Therefore the multithreading released in version 6.0 applies to I/O, not to event processing. Making event processing multi-threaded would introduce lock contention and frequent context switching; even with segmented locks to reduce contention, it would mean major changes to the Redis core with no guarantee of a significant performance gain.

![Redis 6.0 multi-threaded I/O model](263.png)

The red part of the figure above is what Redis runs on multiple threads, using several cores to share the I/O read/write load. Each time the event-processing thread has readable events, it distributes all ready read events across the I/O threads and waits; once every I/O thread has finished reading, the event-processing thread executes the commands. Afterwards, write events are likewise distributed to the I/O threads, and the event-processing thread again waits for all of them to finish writing.

Taking read-event handling as an example, here is how tasks are allocated:

```c
int handleClientsWithPendingReadsUsingThreads(void) {
    ...
    /* Distribute the clients across N different lists. */
    listIter li;
    listNode *ln;
    listRewind(server.clients_pending_read,&li);
    int item_id = 0;
    // Assign each waiting client to an I/O thread, round-robin
    while((ln = listNext(&li))) {
        client *c = listNodeValue(ln);
        int target_id = item_id % server.io_threads_num;
        listAddNodeTail(io_threads_list[target_id],c);
        item_id++;
    }
    ...
    /* Wait for all the other threads to end their work. */
    // Spin until every I/O thread has finished its pending reads
    while(1) {
        unsigned long pending = 0;
        for (int j = 1; j < server.io_threads_num; j++)
            pending += io_threads_pending[j];
        if (pending == 0) break;
    }
    ...
    return processed;
}
```

I/O thread processing flow:

```c
void *IOThreadMain(void *myid) {
    ...
    while(1) {
        ...
        // The I/O thread performs the actual read or write operations
        while((ln = listNext(&li))) {
            client *c = listNodeValue(ln);
            // io_threads_op determines whether this pass handles read or write events
            if (io_threads_op == IO_THREADS_OP_WRITE) {
                writeToClient(c,0);
            } else if (io_threads_op == IO_THREADS_OP_READ) {
                readQueryFromClient(c->conn);
            } else {
                serverPanic("io_threads_op value is unknown");
            }
        }
        listEmpty(io_threads_list[id]);
        io_threads_pending[id] = 0;
        if (tio_debug) printf("[%ld] Done\n", id);
    }
}
```

Limitations

As the implementation above shows, the multithreading in version 6.0 is not a complete multi-threaded design: at any given moment the I/O threads either all read or all write, and the event-processing thread spins in a waiting state in between. It is not a pipelined model, and the round-trip waiting adds considerable overhead.
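Also worth noting: threaded I/O is disabled by default in Redis 6.0 and is turned on in redis.conf. A minimal configuration (directive names as shipped in the 6.0 redis.conf) looks like:

```
# Enable additional I/O threads (1 = classic single-threaded I/O).
io-threads 4
# By default only writes are threaded; this also threads reads and parsing.
io-threads-do-reads yes
```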

The principle of Tair's multithreading implementation

Compared with the version 6.0 multithreading, Tair's implementation is more refined. As shown in the figure below, Tair's Main Thread is responsible for establishing client connections, the IO Threads handle request reading, response sending, and command parsing, and the Worker Threads are dedicated to command processing. An IO Thread reads and parses a user request, then passes the parsed command through a queue to a Worker Thread; the Worker Thread executes the command, generates a response, and sends it back to the IO Thread through another queue. To increase parallelism, the IO Threads and Worker Threads exchange data through lock-free queues and channels, so overall performance is better.

![Tair multi-threading model](264.png)
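The article does not show Tair's queue code, but the key ingredient, a lock-free handoff between one IO thread and one worker, can be sketched as a single-producer/single-consumer ring buffer using C11 atomics. This is a simplified illustration of the technique, not Tair's implementation:

```c
#include <stdatomic.h>
#include <stddef.h>

#define QCAP 1024   /* capacity, must be a power of two */

/* Single-producer/single-consumer lock-free ring buffer:
 * one IO thread pushes parsed commands, one worker pops them. */
typedef struct {
    void *slots[QCAP];
    _Atomic size_t head;   /* consumer position */
    _Atomic size_t tail;   /* producer position */
} spsc_queue;

/* Producer side (IO thread): returns 0 if the queue is full. */
int spsc_push(spsc_queue *q, void *cmd) {
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == QCAP) return 0;            /* full */
    q->slots[tail & (QCAP - 1)] = cmd;
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return 1;
}

/* Consumer side (worker thread): returns NULL if the queue is empty. */
void *spsc_pop(spsc_queue *q) {
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail) return NULL;                /* empty */
    void *cmd = q->slots[head & (QCAP - 1)];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return cmd;
}
```

Because each index is written by exactly one thread, a release store on one side paired with an acquire load on the other is enough to hand a command over safely, with no mutex on the hot path.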

Conclusion

Redis 4.0 introduced the lazy-free thread, which solves the server-blocking problem caused by deleting large keys. Version 6.0 introduced I/O threads, formally bringing in multithreading, but compared with Tair the design is less refined and the gain is smaller: according to pressure tests, the multi-threaded version is about twice as fast as the single-threaded version, while Tair's multi-threaded version is about three times as fast. In the Redis author's opinion, multithreading for Redis comes down to two ideas, I/O threading and slow-commands threading. As he said in his blog:

I/O threading is not going to happen in Redis AFAIK, because after much consideration I think it's a lot of complexity without a good reason. Many Redis setups are network or memory bound actually. Additionally I really believe in a share-nothing setup, so the way I want to scale Redis is by improving the support for multiple Redis instances to be executed in the same host, especially via Redis Cluster.

What instead I really want a lot is slow operations threading, and with the Redis modules system we already are in the right direction. However in the future (not sure if in Redis 6 or 7) we'll get key-level locking in the module system so that threads can completely acquire control of a key to process slow operations. Now modules can implement commands and can create a reply for the client in a completely separated way, but still to access the shared data set a global lock is needed: this will go away.

The Redis author prefers to address I/O scaling with Cluster, especially now that version 6.0 ships Redis Cluster Proxy, which makes Cluster easier to use.

In addition, the author prefers slow-operations threading (such as the Lazy Free released in version 4.0) as the way to apply multithreading. Whether future versions will make the I/O threads more complete and use the module system to optimize slow operations is well worth looking forward to.