Replica Health Check
This article explains how to check the health of a replica to decide whether to initiate a replica rebuild.
POD Check
Check for any DN PODs that are not in the Running state.
kubectl get pods -l xstore/name --show-labels | grep -v Running
Check whether the unhealthy POD in the xstore is going through an expected routine operation, such as an upgrade or a downgrade. If it is not, and the POD cannot recover to a ready state, consider starting a replica rebuild task.
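The grep filter above matches on the whole line, so it also lets the header row through and would hide a POD whose labels happen to contain the word Running. As a minimal sketch, assuming the default `kubectl get pods` column order (NAME, READY, STATUS, ...), the same check can be done on the STATUS column only. The function name and the sample POD names are hypothetical and used for illustration only.

```python
def non_running_pods(kubectl_output: str) -> list[tuple[str, str]]:
    """Return (pod name, status) pairs for pods whose STATUS is not Running.

    Assumes the text comes from `kubectl get pods ... --show-labels` with the
    default columns, where STATUS is the third whitespace-separated field.
    """
    unhealthy = []
    # The first line is the column header, so skip it.
    for line in kubectl_output.strip().splitlines()[1:]:
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore blank or malformed lines
        name, status = fields[0], fields[2]
        if status != "Running":
            unhealthy.append((name, status))
    return unhealthy


# Hypothetical sample output; the pod names are made up.
SAMPLE_OUTPUT = """\
NAME               READY   STATUS             RESTARTS   AGE   LABELS
demo-dn-0-cand-0   3/3     Running            0          2d    xstore/name=demo-dn-0
demo-dn-0-cand-1   2/3     CrashLoopBackOff   12         2d    xstore/name=demo-dn-0
demo-dn-0-log-0    0/3     Pending            0          5m    xstore/name=demo-dn-0
"""
```

Calling `non_running_pods(SAMPLE_OUTPUT)` lists only the two PODs that need a closer look. Whether they warrant a rebuild still depends on the manual check described above.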
Replication Thread and Lag Check
Due to software bugs, it's possible for the replication thread on the replica to be interrupted. Execute the following statement on the replica to view the replication status.
show slave status
A Slave_SQL_Running value of 'No' together with a non-empty Last_Error indicates that replication has been interrupted. In this case, first identify the cause of the interruption, and then initiate a replica rebuild to recover the replica.
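The rule above can be written as a small predicate. This is only a sketch: it assumes one row of the show slave status output has already been loaded into a dictionary keyed by column name (how the row is fetched is left out), and the function name and sample values are hypothetical.

```python
def replication_interrupted(slave_status: dict) -> bool:
    """Apply the rule from the text to one row of `show slave status`.

    Returns True when Slave_SQL_Running is 'No' and Last_Error is non-empty.
    """
    sql_running = str(slave_status.get("Slave_SQL_Running", "")).strip()
    last_error = str(slave_status.get("Last_Error") or "").strip()
    return sql_running.lower() == "no" and last_error != ""


# Hypothetical rows; the error text is a placeholder, not real server output.
healthy_row = {"Slave_SQL_Running": "Yes", "Last_Error": ""}
broken_row = {"Slave_SQL_Running": "No", "Last_Error": "placeholder error text"}
```

A True result only flags the replica. The cause in Last_Error should still be understood before a rebuild is started.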
For various reasons (such as the replica being down for a long time or problems with the replica's host machine), the replica lag can grow so large that catching up with the primary would take a long time, for example several hours. In such cases, consider a replica rebuild.
How to Check Replica Lag?
- Execute show slave status on the replica and review the Seconds_Behind_Master attribute;
- Execute select * from information_schema.alisql_cluster_global on the primary and compare the APPLIED_INDEX attribute value of the replica with that of the primary. The rate at which APPLIED_INDEX increases on the primary and on the replica can be used to estimate how long the replica will take to catch up with the primary's logs.
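The estimate in the last item can be sketched as simple arithmetic. The sketch assumes APPLIED_INDEX was read twice for both the primary and the replica, a known number of seconds apart, and that both rates stay roughly constant. The function name, parameter names, and sample numbers are hypothetical.

```python
from typing import Optional


def estimate_catch_up_seconds(
    primary_start: int,
    primary_end: int,
    replica_start: int,
    replica_end: int,
    interval_seconds: float,
) -> Optional[float]:
    """Estimate the seconds until the replica's APPLIED_INDEX reaches the primary's.

    Each *_start / *_end pair is an APPLIED_INDEX reading taken at the start
    and end of the same sampling interval. Returns None when the replica is
    not closing the gap, and 0.0 when there is no gap.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")

    gap = primary_end - replica_end
    if gap <= 0:
        return 0.0

    primary_rate = (primary_end - primary_start) / interval_seconds
    replica_rate = (replica_end - replica_start) / interval_seconds
    closing_rate = replica_rate - primary_rate  # index entries gained per second
    if closing_rate <= 0:
        return None  # the replica is falling further behind or holding steady

    return gap / closing_rate


# Hypothetical readings taken 60 seconds apart.
eta = estimate_catch_up_seconds(1_000_000, 1_006_000, 400_000, 430_000, 60)
```

With these sample numbers the primary advances 100 entries per second and the replica 500, so the gap of 576,000 entries closes in about 1440 seconds. A result of None, or an estimate of several hours, is the situation in which a replica rebuild is worth considering.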