Database replication

These are notes from reading A primer on database replication.

Two main types: synchronous and asynchronous replication.

Pros of asynchronous replication:

Performance (latency)
Availability (replicas may be down in synchronous replication)

Cons of asynchronous replication:

Weakening durability guarantees
Exposed to replication lags

Semi-synchronous replication: replicate to some followers synchronously, and others asynchronously, e.g. you can define synchronous_standby_names in Postgres to specify which replicas will receive updates synchronously.

Replication topologies

Single leader replication

Only write to leader, read from followers.

Avoid conflicts caused by concurrent writes
Not good for write intensive applications, since all writes will be going to a single server.

What is the failover strategy if the leader dies? Automatic failover challenges:

We can’t be sure that the leader is dead. It’s impossible to distinguish a slow-to-answer from a dead node.
- Usually databases use timeout to decide.
How to make all nodes agree who is the new leader? (the consensus problem)
- Have predefined successor nodes
- Choose the node with the most recent update
Client needs to start sending writes to new leader
- Request routing on client / routing layer that handles the redirect
If asynchronous replication is used, the new leader may not have all the data from the previous leader. If the old leader resurrects, there will be conflicts.
- One can discard these conflicts and take a last-write-win approach.
Previous leader may come back up and think it’s still a leader, leading to a split-brain situation.
- If leaders start accepting writes, then we may have conflicts and lose data
- Some systems have fencing systems such as STONITH (Shoot The Other Node In The Head), forcing one node to shut down if multiple leaders are detected.

Therefore, sometimes it is better to have a human perform this procedure.

Multi leader replication

More than one node handling writes.

Good for applications with high volumes of writes. You could have one leader in each location for latency gains.
Support for offline clients, that might be writing to their own leaders, and synchronize with the rest of the database once it’s online.
Need a strategy to resolve conflicts

Conflict resolving strategies:

Avoid conflicts by partitioning your data, e.g. American writes go to North American leader, European writes go to European leader, etc. Writes to the projects will be sent to the same leader.
Attach timestamps to each write, and apply writes with the highest value (Last Write Wins LWW). We cannot rely on the order in which the writes arrive, because writes may arrive out of order in a distributed system.
- This approach may still lose some data
- Physical clocks are not reliable, and you will need some sort of clock synchronization to use the timestamps. (In my impression, Spanner uses atomic clocks)
Record conflicts, then write application code to allow the user to manually resolve them.
- May not be feasible in some cases like uniqueness constraints, but in other cases, it may be just showing two values and letting the user choose the correct value to use.
Some databases allow us to write custom conflict resolution code, to be executed on write or read time. e.g. Bucardo, BDR.
Some databases e.g. CouchDB stores all the conflicting writes, and returns all of them when client tries to read that value, and returns the responsibility of resolving conflicts to the client. The client can decide what to do with it and write back the value.
Conflict-free replicated data type (CRDT) is a data structure that provides automatic conflict resolution, but are limited on their use.

Multi leader topology

This topology defines the communication patterns between your nodes.

One way is to have each leader send its updates to every other leader.

Messages can arrive out of order.
- A inserts, B updates, C gets update before insert, then we will ahve a causality problem. Need to make sure that the database or replication tool is handling this problem (e.g. using a logical clock).

Star topology: one node is responsible for receiving the update and sends them to everyone else.

Avoids the causality problem
Introduces a single point of failure.

Leaderless (or leaderful?) replication

Have no leaders, every replica accepts writes. This idea was popularized by DynamoDB. Client sends writes concurrently to several replicas, and waits for a confirmation for some of them, then it is considered a success. This tolerates node failures more easily, no need for complicated failover plans.

When reading, client reads data from multiple replicas concurrently, and client decides which version to use. If one of them is stale, it sends a write request with the correct value (read repair). Or have a background job to fix the data.

To guarantee that we get the most updated value, the number of successful writes (w) + number of nodes to read from (w) must be greater than number of replicas (w + r > number of replicas)

Handling consistency

Consistency problems usually come from replication lags. Most distributed databases are eventually consistent.

Read your writes consistency
- Read from the leader if you know something might have changed. e.g. for facebook, read your own timeline data from leader, read other people’s timelines from replica
Monotonic reads consistency: Clients shouldn’t see time moving backwards.
- e.g. you don’t want user to make a comment, refresh (read from a different stale replica), see comment disappear
Bounded staleness consistency: set a limit on how stale the data can be.
- This can be defined in terms of time, or e.g. number of missing updates.

Replication under the hood

This goes one level deeper, and considers how the data is actually sent between nodes.

Statement-based replication: send the same statement it receives to its replicas
- Not every statement is deterministic, e.g. statements with CURRENT_TIME() or RANDOM(). Most databases that use statement based replication will replace these non-deterministic function calls with fixed values before replicating. However, hard to guarantee determinism for user-defined functions, or triggers after updates.
Log shipping replication (or physical replication)
- Most databases use a log to provide atomicity and durability. Describes changes at a very low level, about which bytes have changed and where in the disk.
- Cannot replicate a log generated by a different version of the database, because it’s so low level, e.g. the way the data is physically stored may have changed.
- Multi-master replication cannot work, because there s no way to unify multiple logs into one.
Row-based replication (or logical replication)
- Ships a different kind of log used for replication instead of just shipping the internal log. This log can be decoupled from the storage engine and be used in most cases to replicate data across different versions.
- Uniquely identify a row and a set of changes to be performed to that row.
- Allows us to upgrade database version with zero downtime.
- Need to log a lot more data, e.g. statements that modify a lot of rows.