CephFS MDS Slow Recovery Issue
MDS Logs
The logs of each MDS are available for download: mds001, mds002, mds003, mds004, mds005.
MDS Metrics
The time taken by each recovery step is measured and calculated from the output of the ceph mds stat command. Some states are missing because they are very short or do not occur in this recovery scenario. In the figure, the x-axis is the MDS restart count. In this experiment, the MDS daemons are restarted four times with systemctl restart ceph-mds.target. Unexpectedly, in the second recovery, the total time is 695 seconds more than in the other cases. The next figure shows that the recovery process is retried, passing through up:replay, up:resolve, up:rejoin, and so on several times. This means that the MDS cannot transition from up:rejoin to up:active. In the MDS log, "MDS internal heartbeat is not healthy" messages appear during the up:rejoin stage, and the MDS is finally respawned.
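As a rough illustration of how per-state durations can be derived from periodically sampled MDS states, here is a minimal Python sketch. It assumes the states have already been collected as (timestamp, state) pairs, for example by polling ceph mds stat; the function name state_durations and the sample values are hypothetical and are not data from this experiment.

```python
from collections import defaultdict


def state_durations(samples):
    """Sum the seconds spent in each state.

    samples: list of (unix_seconds, state) tuples sorted by time.
    Each sample opens an interval that is closed by the next sample,
    so the final sample contributes no duration of its own.
    """
    totals = defaultdict(float)
    for (t0, state), (t1, _next_state) in zip(samples, samples[1:]):
        totals[state] += t1 - t0
    return dict(totals)


# Hypothetical samples: one recovery that retries after a respawn.
samples = [
    (0, "up:replay"),
    (4, "up:resolve"),
    (5, "up:rejoin"),
    (65, "up:replay"),   # recovery starts over
    (70, "up:rejoin"),
    (90, "up:active"),
]

durations = state_durations(samples)
# Total recovery time excludes time spent in up:active.
recovery_total = sum(v for s, v in durations.items() if s != "up:active")
print(durations, recovery_total)
```

With a coarse polling interval, states shorter than the interval may never be sampled, which is consistent with some states being missing from the figure.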
The y-axis represents how many times each state is entered during each MDS restart.
There are two ranks: blue is rank0 and orange is rank1.
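The per-rank state counts plotted on the y-axis could be computed along the following lines. This is only a sketch: it assumes each rank's observed states are available as an ordered list, and the helper count_state_entries and the example sequences are hypothetical.

```python
from collections import Counter
from itertools import groupby


def count_state_entries(states):
    """Count how many times each state is entered.

    Consecutive repeats of the same state (several polls while the MDS
    stays in one state) are collapsed into a single entry.
    """
    return Counter(state for state, _run in groupby(states))


# Hypothetical observations for one restart, keyed by rank.
observed = {
    0: ["up:replay", "up:replay", "up:resolve", "up:rejoin", "up:rejoin",
        "up:replay", "up:resolve", "up:rejoin", "up:active"],
    1: ["up:replay", "up:resolve", "up:rejoin", "up:active"],
}

entries = {rank: count_state_entries(states) for rank, states in observed.items()}
for rank, counts in entries.items():
    print(f"rank{rank}: {dict(counts)}")
```

A count greater than one for up:replay or up:rejoin within a single restart indicates that the recovery was retried.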
The num caps values are measured via session ls. Due to some technical issues, our metric DB collects information only for the top 10 sessions.
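For reference, selecting the top 10 sessions by caps from session ls output might look like the sketch below. It assumes the command's JSON output is a list of session objects with "id" and "num_caps" fields; the function top_sessions_by_caps and the sample data are hypothetical and do not reflect how our metric DB is actually populated.

```python
import json


def top_sessions_by_caps(session_ls_json, limit=10):
    """Return (session id, num_caps) pairs for the sessions holding the most caps.

    session_ls_json: JSON text assumed to be a list of session objects.
    Sessions without a num_caps field are treated as holding zero caps.
    """
    sessions = json.loads(session_ls_json)
    ranked = sorted(sessions, key=lambda s: s.get("num_caps", 0), reverse=True)
    return [(s.get("id"), s.get("num_caps", 0)) for s in ranked[:limit]]


# Hypothetical input: 12 sessions with increasing cap counts.
sample = json.dumps([{"id": i, "num_caps": i * 100} for i in range(1, 13)])

top = top_sessions_by_caps(sample)
for session_id, num_caps in top:
    print(session_id, num_caps)
```

Because only the top 10 sessions are kept, the sum of the collected num caps is a lower bound on the total caps held by all clients.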