Replica Health Check

This article explains how to check the health of a replica to decide whether to initiate a replica rebuild.

POD Check

Check for any DN PODs that are not in the Running state.

kubectl get pods -l xstore/name --show-labels | grep -v Running

Check whether the unhealthy POD in the xstore is undergoing an expected routine operation, such as an upgrade or downgrade. If it is not, and the POD cannot recover to a ready state, consider starting a replica rebuild task.
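The filtering step above can also be scripted. The following is a minimal sketch that parses the text printed by the kubectl command and returns the PODs whose STATUS column is not Running. The function name and the POD names in the sample are hypothetical, and the sample assumes the default kubectl column layout (NAME, READY, STATUS, ...).

```python
def non_running_pods(kubectl_output: str) -> list[str]:
    """Return the names of PODs whose STATUS column is not 'Running'.

    Expects the plain-text table printed by `kubectl get pods`,
    including its header line.
    """
    lines = [ln for ln in kubectl_output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    name_idx = header.index("NAME")
    status_idx = header.index("STATUS")
    unhealthy = []
    for ln in lines[1:]:
        cols = ln.split()
        if cols[status_idx] != "Running":
            unhealthy.append(cols[name_idx])
    return unhealthy


# Hypothetical sample output; the POD names are made up for illustration.
SAMPLE = """\
NAME               READY   STATUS             RESTARTS   AGE   LABELS
demo-dn-0-cand-0   3/3     Running            0          5d    xstore/name=demo-dn-0
demo-dn-0-cand-1   0/3     CrashLoopBackOff   12         5d    xstore/name=demo-dn-0
demo-dn-0-log-0    3/3     Running            0          5d    xstore/name=demo-dn-0
"""

print(non_running_pods(SAMPLE))
```

Unlike grep -v Running, this sketch compares only the STATUS column, so a label or POD name that happens to contain the word Running does not hide an unhealthy POD.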

Replication Thread and Lag Check

Due to software bugs, it's possible for the replication thread on the replica to be interrupted. Execute the following statement on the replica to view the replication status.

show slave status

A Slave_SQL_Running status of 'No' together with a non-empty Last_Error indicates that replication has been interrupted. In this case, first clarify the cause of the replication interruption, and then initiate a replica rebuild to recover the replica.
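As an illustration of this rule, the sketch below applies it to one row of show slave status that has already been fetched into a dictionary keyed by column name. How the row is fetched is left out. The function name and the sample error text are hypothetical.

```python
def replication_interrupted(status: dict) -> bool:
    """Apply the rule from the text: Slave_SQL_Running is 'No'
    and Last_Error is non-empty."""
    sql_running = str(status.get("Slave_SQL_Running", "")).strip().lower()
    last_error = str(status.get("Last_Error") or "").strip()
    return sql_running == "no" and last_error != ""


# Hypothetical rows; the error text is made up for illustration.
broken = {"Slave_SQL_Running": "No", "Last_Error": "example error text"}
healthy = {"Slave_SQL_Running": "Yes", "Last_Error": ""}

print(replication_interrupted(broken))
print(replication_interrupted(healthy))
```

A row where the thread is stopped but Last_Error is empty is not flagged by this sketch, because the text requires both conditions. Such a row still deserves a manual look.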

For various reasons (such as the replica being down for a long time or issues with the replica host machine), the replica lag might become so large that catching up with the primary would take a long time, such as several hours. In these cases, a replica rebuild can be considered.

How to Check Replica Lag?

  1. Execute show slave status on the replica and review the Seconds_Behind_Master attribute;
  2. Execute select * from information_schema.alisql_cluster_global on the primary and compare the APPLIED_INDEX attribute value of the replica with that of the primary. The rates at which the APPLIED_INDEX values increase on the primary and on the replica can be used to estimate how long the replica will take to catch up with the primary's logs.
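The estimate described in step 2 can be sketched as follows, assuming APPLIED_INDEX is sampled twice on both nodes, a known number of seconds apart, and that both rates stay roughly constant. The function name, parameter names, and sample numbers are hypothetical.

```python
from typing import Optional


def estimate_catch_up_seconds(
    primary_t0: int,
    replica_t0: int,
    primary_t1: int,
    replica_t1: int,
    interval_seconds: float,
) -> Optional[float]:
    """Estimate the seconds until the replica's APPLIED_INDEX reaches the primary's.

    Returns 0.0 if the replica is already caught up, and None if the
    replica is not closing the gap (it applies no faster than the primary).
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    gap = primary_t1 - replica_t1
    if gap <= 0:
        return 0.0
    primary_rate = (primary_t1 - primary_t0) / interval_seconds
    replica_rate = (replica_t1 - replica_t0) / interval_seconds
    closing_rate = replica_rate - primary_rate
    if closing_rate <= 0:
        return None
    return gap / closing_rate


# Hypothetical samples taken 60 seconds apart.
eta = estimate_catch_up_seconds(
    primary_t0=1_000_000,
    replica_t0=400_000,
    primary_t1=1_006_000,
    replica_t1=430_000,
    interval_seconds=60,
)
print(eta)  # 1440.0 seconds, i.e. 24 minutes
```

A result of None means the gap is not shrinking at the sampled rates, and an estimate of several hours matches the situation described above. In both cases a replica rebuild can be considered.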
