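Handling "objects unfound" on a Ceph PG
Ceph version: ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)
Symptoms
One evening, 4 OSDs on a single node were removed and re-added. The OSDs were probably removed too quickly one after another, and the cluster ended up with 4 unfound objects.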
# ceph health detail
[WRN] OBJECT_UNFOUND: 4/25819251 objects unfound (0.000%)
pg 13.e has 4 unfound objects
[ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound
pg 13.e is active+recovery_unfound+undersized+degraded+remapped, acting [0,29], 4 unfound
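On a cluster with many PGs it can help to pull the affected PG IDs out of the health output automatically. The following is a minimal sketch, assuming the line format shown above ("pg <pgid> has <n> unfound objects"); the function name `unfound_pgs` is made up for this example.

```shell
#!/bin/sh
# Sample text copied from the `ceph health detail` output above.
health_detail='[WRN] OBJECT_UNFOUND: 4/25819251 objects unfound (0.000%)
pg 13.e has 4 unfound objects
[ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound
pg 13.e is active+recovery_unfound+undersized+degraded+remapped, acting [0,29], 4 unfound'

# Print "<pgid> <count>" for every "pg X has N unfound objects" line.
# awk ignores leading whitespace, so indented output works as well.
unfound_pgs() {
    awk '$1 == "pg" && $3 == "has" && $5 == "unfound" { print $2, $4 }'
}

printf '%s\n' "$health_detail" | unfound_pgs
```

On a live cluster the same function could be fed directly: `ceph health detail | unfound_pgs`.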
Troubleshooting
Check the PG mapping. osd.7 is one of the OSDs that was removed and then re-added to the cluster.
# ceph pg map 13.e
osdmap e24621 pg 13.e (13.e) -> up [7,0,29] acting [29,0]
# ceph pg dump_stuck
PG_STAT STATE UP UP_PRIMARY ACTING ACTING_PRIMARY
13.e active+recovery_unfound+undersized+degraded+remapped [7,0,29] 7 [29,0] 29
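The PG is stuck: the acting set ([29,0]) is still missing osd.7 from the up set ([7,0,29]).
Running repair, scrub, deep-scrub, force-backfill, and force-recovery on the PG had no effect.
Restarting the OSDs that host the PG did not help either.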
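For reference, the per-PG operations named above correspond to `ceph pg <action> <pgid>` subcommands. The sketch below only prints the commands so they can be reviewed first; the helper name `pg_kick_commands` is hypothetical. How an OSD is restarted depends on the deployment, which the post does not state, so that step is left as a comment.

```shell
#!/bin/sh
# Print the per-PG commands that were tried, in order.
# Nothing is executed here; pipe the output to `sh` to actually run them.
pg_kick_commands() {
    pgid="$1"
    for action in repair scrub deep-scrub force-backfill force-recovery; do
        echo "ceph pg $action $pgid"
    done
}

pg_kick_commands 13.e

# Restarting an OSD (assumption: pick the one matching your deployment):
#   systemctl restart ceph-osd@29        # package-based hosts
#   ceph orch daemon restart osd.29      # cephadm-managed clusters
```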
Check which objects are unfound. Note that available_might_have_unfound is true in the output.
# ceph pg 13.e list_unfound
{
"num_missing": 4,
"num_unfound": 4,
"objects": [
{
"oid": {
"oid": "100575be62f.00000000",
"key": "",
"snapid": -2,
"hash": 3207365198,
"max": 0,
"pool": 13,
"namespace": ""
},
"need": "24536'146248222",
"have": "23645'146071715",
"flags": "none",
"clean_regions": "clean_offsets: [0~18446744073709551615], clean_omap: 1, new_object: 0",
"locations": []
},
{
"oid": {
"oid": "100569c40c6.00000000",
"key": "",
"snapid": -2,
"hash": 2354217550,
"max": 0,
"pool": 13,
"namespace": ""
},
"need": "24536'146248223",
"have": "21100'145780008",
"flags": "none",
"clean_regions": "clean_offsets: [0~18446744073709551615], clean_omap: 1, new_object: 0",
"locations": []
},
{
"oid": {
"oid": "100575b106f.00000000",
"key": "",
"snapid": -2,
"hash": 4113728078,
"max": 0,
"pool": 13,
"namespace": ""
},
"need": "24536'146248224",
"have": "22502'145951850",
"flags": "none",
"clean_regions": "clean_offsets: [0~18446744073709551615], clean_omap: 1, new_object: 0",
"locations": []
},
{
"oid": {
"oid": "100576b25a6.00000000",
"key": "",
"snapid": -2,
"hash": 3364520526,
"max": 0,
"pool": 13,
"namespace": ""
},
"need": "24536'146248225",
"have": "23645'146071729",
"flags": "none",
"clean_regions": "clean_offsets: [0~18446744073709551615], clean_omap: 1, new_object: 0",
"locations": []
}
],
"state": "NotRecovering",
"available_might_have_unfound": true,
"might_have_unfound": [],
"more": false
}
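The JSON above is easier to scan as one line per object. The following is a small sketch using jq (which this post already uses elsewhere); the function name `summarise_unfound` is an assumption, and the sample input is a trimmed copy of the first object from the output above.

```shell
#!/bin/sh
# Summarise `ceph pg <pgid> list_unfound` JSON read from stdin:
# object name, version needed, version still available, and the
# number of known locations (0 means no OSD is known to hold it).
summarise_unfound() {
    jq -r '.objects[]
           | "\(.oid.oid) need=\(.need) have=\(.have) locations=\(.locations | length)"'
}

# Trimmed copy of the first object from the output above.
summarise_unfound <<'EOF'
{
  "num_missing": 4,
  "num_unfound": 4,
  "objects": [
    {
      "oid": { "oid": "100575be62f.00000000", "pool": 13 },
      "need": "24536'146248222",
      "have": "23645'146071715",
      "locations": []
    }
  ]
}
EOF
```

On a live cluster: `ceph pg 13.e list_unfound | summarise_unfound`. The gap between `need` and `have` is what a later revert would discard.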
Check the recovery state. Every OSD listed under might_have_unfound has already been probed, but the objects still cannot be found.
# ceph pg 13.e query | jq .recovery_state
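The query output itself is not reproduced in the post. As a rough sketch of how the probe status of each candidate OSD could be extracted, the snippet below filters the `recovery_state` array for entries that carry a `might_have_unfound` list. The sample JSON is hypothetical and heavily trimmed, not the author's actual output, and the function name `probe_status` is invented for this example.

```shell
#!/bin/sh
# From `ceph pg <pgid> query` JSON on stdin, print one "osd.<id> <status>"
# line for every peer the PG considered as a possible source of unfound objects.
probe_status() {
    jq -r '.recovery_state[]
           | select(has("might_have_unfound"))
           | .might_have_unfound[]
           | "osd.\(.osd) \(.status)"'
}

# Hypothetical sample input (NOT taken from the original post).
probe_status <<'EOF'
{
  "recovery_state": [
    {
      "name": "Started/Primary/Active",
      "might_have_unfound": [
        { "osd": "0", "status": "already probed" },
        { "osd": "7", "status": "already probed" }
      ]
    },
    { "name": "Started" }
  ]
}
EOF
```

On a live cluster: `ceph pg 13.e query | probe_status`. A status such as "osd is down" or "not queried" would suggest there is still a copy worth waiting for, whereas "already probed" on every entry matches the situation described here.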
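Resolution
All OSDs are up and running normally, yet the objects still cannot be found, so the next step is to revert the unfound objects.
After the objects were reverted, the PG started backfilling.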
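The post does not show the exact command used for the revert. In Ceph this is normally done with `ceph pg <pgid> mark_unfound_lost revert`, which rolls each unfound object back to the last version the cluster still has (the `have` version in list_unfound) or forgets the object if it was newly created. Any writes made after that version are lost, so this is a last resort. The wrapper below is a cautious sketch: the function name `revert_unfound` and the `RUN` switch are assumptions for this example, and by default it only prints the command.

```shell
#!/bin/sh
# Revert unfound objects in a PG to their previous version.
# Prints the command by default; set RUN=1 to actually execute it.
revert_unfound() {
    pgid="$1"
    if [ "${RUN:-0}" = "1" ]; then
        ceph pg "$pgid" mark_unfound_lost revert
    else
        echo "ceph pg $pgid mark_unfound_lost revert"
    fi
}

revert_unfound 13.e
```

Run it as `RUN=1 revert_unfound 13.e` once the printed command looks right, then watch `ceph health detail` until the unfound count drops to zero. On erasure-coded pools `revert` is not available and `delete` is the only option.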
Created: 2024-06-12 08:29:00
Last updated: 2024-06-12 09:57