❓ What?

Failing drives may report errors in one of the parameters of the SMART log, which can be obtained with:

smartctl -a <drive>

Those errors are visible in the SMART Error Log, in the OSD logs (obtainable with journalctl -fu ceph-osd@<osd-num>.service), in the kernel log (dmesg -T), or in a combination of the above. A short SMART self-test (smartctl -t short <drive>) can also be run to further confirm that the drive is dying, but this is often unnecessary.
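For instance, the checks above could look like the following, assuming the suspect drive is /dev/sdb and it backs osd.12 (both names are placeholders, substitute your own):

smartctl -l error /dev/sdb                     # print only the SMART Error Log
smartctl -t short /dev/sdb                     # start a short self-test
smartctl -l selftest /dev/sdb                  # read the self-test results afterwards
journalctl -u ceph-osd@12.service | grep -iE 'error|fail'   # scan the OSD daemon log
dmesg -T | grep -i sdb                         # look for kernel I/O errors on the device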

❔ Why?

Over time, a Ceph cluster can have drives that are either completely non-functional or emitting read/write errors because of sustained use or manufacturing defects. In that case, the drive must be replaced to preserve the configured data redundancy and sustained performance.

🎤 How?

When there’s a failed drive/OSD, there are two situations:

  • The drive is dying but still has data available.

  • The disk is dead and no data can be retrieved from it.

  • In both cases, the disk can be marked out of the Ceph cluster with ceph osd out <osd.num> or ceph osd reweight <osd.num> 0 (both operations are equivalent), as shown in the command sketches after this list.

  • Ceph then drains the dying OSD and moves its data, or, if the disk is dead, rebuilds the data from replicas or erasure-code parity onto other OSDs on the same node. This may cause a nearfull OSD situation.

  • To prevent such a situation, the trick here is to zero the CRUSH weight with ceph osd crush reweight <osd.num> 0. This makes sure that the data is redistributed to OSDs on all the nodes, i.e. throughout the CRUSH map. See Difference Between OSD Reweight and CRUSH Reweight.

  • Wait until the OSD has been drained (0 PGs) if the disk is still alive. The subcommands ceph osd ok-to-stop and ceph osd safe-to-destroy can be run to make sure that the OSD can be stopped and destroyed without affecting data redundancy. If the disk is dead, it can be replaced immediately.

  • The OSD can be destroyed and recreated thereafter.
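
As a minimal sketch of the steps above, still assuming the failing disk backs osd.12: marking the OSD out and zeroing its CRUSH weight looks like this.

ceph osd out osd.12                  # equivalent to: ceph osd reweight 12 0
ceph osd crush reweight osd.12 0     # redistribute data across the whole CRUSH map, not just this node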
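
Watching the drain and verifying that the OSD can be stopped and destroyed safely:

ceph osd df tree | grep osd.12       # the PGS column should reach 0 once drained
ceph osd ok-to-stop osd.12           # the daemon can be stopped without losing availability
ceph osd safe-to-destroy osd.12      # the OSD can be destroyed without affecting redundancy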
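
Finally, one possible destroy-and-recreate sequence using ceph-volume (the device, OSD id, and exact tooling are assumptions that depend on the deployment):

systemctl stop ceph-osd@12.service                  # stop the daemon once it is safe
ceph osd destroy osd.12 --yes-i-really-mean-it      # keeps the OSD id free for reuse
ceph-volume lvm zap /dev/sdb --destroy              # wipe the old disk (skip if physically replacing it)
ceph-volume lvm create --osd-id 12 --data /dev/sdb  # recreate the OSD on the new disk with the same id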

👓 References