Redis Replication Issues

July 5, 2018 6-minute read

Redis master-slave replication is one way to make our data more robust and available, but everything has its own issues. In this document, we walk through the common replication problems and find a proper solution for each by tuning the related configuration.

Truth or Dare?!

Bad beginning :D. What if you set up your Redis server in a virtual environment with slow SAN storage and a fast fiber-optic network, I mean 10 Gb/s of network bandwidth and 7200 RPM hard drives? In that case you have dared, so I'll tell you the truth: you will suffer multiple master-slave disconnections during every BGSAVE, under heavy slave requests, or under heavy load on the master! So what is the solution? OK, meet diskless replication!

Here is how Redis defines diskless replication: the Redis master creates a new process that writes the RDB file directly to the slave sockets, without touching the disk at all. With disk-backed replication, while the RDB file is being generated, more slaves can be queued and served with the RDB file as soon as the current child producing it finishes its work. With diskless replication, once the transfer starts, newly arriving slaves are queued and a new transfer starts only when the current one terminates.

When diskless replication is used, the master waits a configurable amount of time (in seconds) before starting the transfer in the hope that multiple slaves will arrive and the transfer can be parallelized.

With slow disks and fast (large bandwidth) networks, diskless replication works better.

repl-diskless-sync no

When diskless replication is enabled, it is possible to configure the delay the server waits in order to spawn the child that transfers the RDB via socket to the slaves.

This is important since once the transfer starts, it is not possible to serve new slaves arriving, that will be queued for the next RDB transfer, so the server waits for a delay in order to let more slaves arrive.

The delay is specified in seconds, and by default is 5 seconds. To disable it entirely just set it to 0 seconds and the transfer will start ASAP.

repl-diskless-sync-delay 5
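To get a feeling for what the delay buys you, here is a minimal sketch of the queueing behaviour described above. It is a simplified model and not how Redis is implemented internally: the function name plan_diskless_transfers and the 30-second transfer time are assumptions made up for illustration.

```python
def plan_diskless_transfers(arrivals, delay=5.0, transfer_secs=30.0):
    """Group slave arrival times (in seconds) into RDB transfers.

    Simplified model (an assumption, not Redis internals): a transfer starts
    once the earliest waiting slave has waited `delay` seconds and no other
    transfer is running. Every slave that has arrived by then joins it;
    later arrivals are queued for the next transfer.
    """
    pending = sorted(arrivals)
    transfers = []
    free_at = 0.0  # time at which the previous transfer finishes
    while pending:
        start = max(pending[0] + delay, free_at)
        batch = [t for t in pending if t <= start]
        pending = pending[len(batch):]
        transfers.append((start, batch))
        free_at = start + transfer_secs
    return transfers


# Three slaves arrive within the 5-second delay and share one transfer;
# the fourth arrives too late and waits for the next one.
for start, batch in plan_diskless_transfers([0, 2, 4, 10]):
    print(f"transfer at t={start}s serves slaves that arrived at {batch}")
```

With the delay set to 0, the first slave in this model gets its own transfer immediately and everyone else waits for the second one, which is the trade-off the delay is meant to soften.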

OK, in the last section we looked at diskless replication, but here is a problem: this feature is currently EXPERIMENTAL, so if you are running Redis in production it's not a good idea to turn it on (for now). So we must tune the Redis server without this awesome feature, which brings us here.

Rules are meant to be broken and default values are meant to be changed!

When you install Redis and start it, you are running with default settings for things like SNAPSHOTTING, LOGGING and REPLICATION. Everything will be happily ever after until you face a loaded gun! Yep, using Redis with the default configuration in a production environment is like a loaded gun. Most of these options must be changed depending on your environment, so let's walk through some of the options related to replication.

repl-ping-slave-period 10

The master sends PINGs to its slaves at a predefined interval. You can change this interval with the repl-ping-slave-period option; the default value is 10 seconds. Make sure you set it properly, because a bad value can cause broken master-slave connections in a high-load environment. If the master has a weak CPU or a slow hard drive, setting this value a little higher could be a good decision.

repl-timeout 60

Timeout, what??? Yes, everything has a beginning and an end, remember? The master pings the slave at the defined interval and the slave reports back with REPLCONF ACK, so what if the master doesn't send its ping or the slave doesn't send its ACK? That means the connection is broken, but maybe it's just a short network issue that lasts 2 minutes. What do we have to do? Simple: the replication timeout defaults to 60 seconds, which means that when the slave or the master cannot hear from the other side for 1 minute, the connection is considered broken. It is important to make sure that this value is greater than the value specified for repl-ping-slave-period, otherwise a timeout will be detected every time there is low traffic between the master and the slave.

This option interacts directly with the backlog and the ping interval. If the master is in bad shape and has so many requests that it cannot keep up with the ping/ACK exchange, raising this timeout might be a good idea.

The timeout is also tied to the backlog. When the timeout is reached and the master and slave still cannot reach each other, the connection is broken, and the changes made since the last synchronisation keep accumulating in the backlog. If the slave reconnects before the backlog fills up, a partial sync is run; if the disconnection lasts long enough for the changes to overflow the backlog, a full sync is required. Look at these pictures, ring any bells?

redis-master:

[Image: Redis Master]

redis-slave:

[Image: Redis Slave]

The repl-timeout option sets the replication timeout for:

Bulk transfer I/O during SYNC, from the point of view of a slave.
Master timeout from the point of view of slaves (data, pings).
Slave timeout from the point of view of masters (REPLCONF ACK pings).
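The relationship between the ping period and the timeout can be checked with a few lines of code. This is only a sketch: the helper names pings_before_timeout and outage_breaks_link are made up for illustration, and the model ignores everything except the two settings discussed above.

```python
def pings_before_timeout(ping_period=10, timeout=60):
    """Return how many ping intervals fit into one timeout window (seconds).

    Raises ValueError when repl-timeout is not greater than
    repl-ping-slave-period, the misconfiguration warned about above.
    """
    if ping_period <= 0 or timeout <= 0:
        raise ValueError("both values must be positive")
    if timeout <= ping_period:
        raise ValueError(
            "repl-timeout must be greater than repl-ping-slave-period"
        )
    return timeout // ping_period


def outage_breaks_link(outage_secs, timeout=60):
    """True if a silent period of `outage_secs` reaches the timeout."""
    return outage_secs >= timeout


print(pings_before_timeout())    # default settings: 10 s pings, 60 s timeout
print(outage_breaks_link(120))   # the 2-minute network issue from the text
```

With the defaults, six ping intervals fit into one timeout window, and a 2-minute outage is long enough to break the link.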

repl-backlog-size 1mb

Redis Default Backlog

Set the replication backlog size. The backlog is a buffer that accumulates slave data when slaves are disconnected for some time, so that when a slave wants to reconnect again, often a full resync is not needed, but a partial resync is enough, just passing the portion of data the slave missed while disconnected.

The bigger the replication backlog, the longer the time the slave can be disconnected and later be able to perform a partial re-synchronization. The backlog is only allocated once there is at least one slave connected. After a master has had no connected slaves for some time, the backlog is freed.

A larger replication backlog is necessary for a Redis server that serves many connections, or that holds more than 2GB of data on a slow disk and a weak server. When a flood of queries hits the server, the main thread may be unable to send or answer the replication pings, so the connection breaks. On the next attempt the slave requests a partial sync, and if the amount of changes applied to the master in the meantime is greater than the backlog size (1mb by default), the backlog no longer covers the gap and a full sync is performed.
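A rough way to size the backlog is to multiply the write traffic by the longest disconnection you want to survive. The sketch below is a simplified model: Redis itself compares replication offsets against the backlog window, while here we only compare byte counts. The function names and the 50,000 bytes per second write rate are assumptions for illustration, not measurements.

```python
# Size suffixes as documented at the top of redis.conf.
UNITS = {
    "k": 1000, "kb": 1024,
    "m": 1000 ** 2, "mb": 1024 ** 2,
    "g": 1000 ** 3, "gb": 1024 ** 3,
}


def parse_size(text):
    """Convert a redis.conf style size such as '1mb' into bytes."""
    text = text.strip().lower()
    # Try two-letter suffixes first so that 'mb' is not mistaken for 'm'.
    for suffix in sorted(UNITS, key=len, reverse=True):
        if text.endswith(suffix):
            return int(text[:-len(suffix)]) * UNITS[suffix]
    return int(text)


def backlog_for_outage(write_bytes_per_sec, outage_secs):
    """Bytes of changes the master accumulates during an outage."""
    return write_bytes_per_sec * outage_secs


def resync_kind(missed_bytes, backlog_size="1mb"):
    """'partial' if the missed changes still fit in the backlog, else 'full'."""
    return "partial" if missed_bytes <= parse_size(backlog_size) else "full"


# Hypothetical workload: 50,000 bytes/s of writes, 60 s disconnection.
missed = backlog_for_outage(50_000, 60)
print(missed, resync_kind(missed), resync_kind(missed, "4mb"))
```

In this made-up workload a one-minute disconnection already outgrows the default 1mb backlog, while a 4mb backlog would still allow a partial sync.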

The backlog is only created once at least one slave has connected. When the slaves become unavailable, the master keeps the backlog for a period of time, which by default is 1 hour. To avoid wasting resources, it can be a good idea to set repl-backlog-ttl to a lower number, so that the backlog is freed sooner when slaves stay unavailable for too long.

repl-backlog-ttl 3600
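The TTL rule is small enough to write down as a sketch. The function name backlog_is_freed is made up for illustration. One detail worth knowing: a value of 0 means the backlog is never released.

```python
def backlog_is_freed(secs_since_last_slave_left, ttl=3600):
    """Model of repl-backlog-ttl: is the backlog released yet?

    `ttl` is in seconds; 0 means the backlog is never released.
    """
    if ttl == 0:
        return False
    return secs_since_last_slave_left > ttl


print(backlog_is_freed(600))            # 10 minutes without slaves
print(backlog_is_freed(7200))           # 2 hours without slaves
print(backlog_is_freed(7200, ttl=0))    # TTL disabled
```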