Handling Ceph "pg objects unfound" Errors

Ceph version: ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)

Symptoms

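One evening I removed and re-added the 4 OSDs on a single node. The interval between the OSD removals was probably too short, and the cluster ended up with 4 unfound objects.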

# ceph health detail
[WRN] OBJECT_UNFOUND: 4/25819251 objects unfound (0.000%)
    pg 13.e has 4 unfound objects
[ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound
    pg 13.e is active+recovery_unfound+undersized+degraded+remapped, acting [0,29], 4 unfound
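
When several PGs are affected, it can help to pull the PG ids and unfound counts out of the health report mechanically. The following is a minimal sketch, assuming the text format shown above; the function name `unfound_pgs` and the regular expression are illustrative, not part of Ceph.

```python
import re

# Matches lines such as "    pg 13.e has 4 unfound objects".
UNFOUND_LINE = re.compile(r"^\s*pg\s+(\S+)\s+has\s+(\d+)\s+unfound objects")

# Output of `ceph health detail`, copied from the report above.
SAMPLE = """\
[WRN] OBJECT_UNFOUND: 4/25819251 objects unfound (0.000%)
    pg 13.e has 4 unfound objects
[ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound
    pg 13.e is active+recovery_unfound+undersized+degraded+remapped, acting [0,29], 4 unfound
"""


def unfound_pgs(health_detail):
    """Map each PG id to the number of unfound objects reported for it."""
    result = {}
    for line in health_detail.splitlines():
        match = UNFOUND_LINE.match(line)
        if match:
            result[match.group(1)] = int(match.group(2))
    return result


print(unfound_pgs(SAMPLE))  # {'13.e': 4}
```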

Investigation

Check the PG mapping. osd.7 is one of the OSDs that was removed and then added back to the cluster.

# ceph pg map 13.e
osdmap e24621 pg 13.e (13.e) -> up [7,0,29] acting [29,0]

# ceph pg dump_stuck
PG_STAT  STATE                                                 UP        UP_PRIMARY  ACTING  ACTING_PRIMARY
13.e     active+recovery_unfound+undersized+degraded+remapped  [7,0,29]           7  [29,0]              29

The PG is stuck. Running repair, scrub, deep-scrub, force-backfill, and force-recovery against the PG had no effect, and restarting the OSDs that serve the PG did not help either.

Check which objects are unfound. Note that available_might_have_unfound is true.

# ceph pg 13.e list_unfound
{
    "num_missing": 4,
    "num_unfound": 4,
    "objects": [
        {
            "oid": {
                "oid": "100575be62f.00000000",
                "key": "",
                "snapid": -2,
                "hash": 3207365198,
                "max": 0,
                "pool": 13,
                "namespace": ""
            },
            "need": "24536'146248222",
            "have": "23645'146071715",
            "flags": "none",
            "clean_regions": "clean_offsets: [0~18446744073709551615], clean_omap: 1, new_object: 0",
            "locations": []
        },
        {
            "oid": {
                "oid": "100569c40c6.00000000",
                "key": "",
                "snapid": -2,
                "hash": 2354217550,
                "max": 0,
                "pool": 13,
                "namespace": ""
            },
            "need": "24536'146248223",
            "have": "21100'145780008",
            "flags": "none",
            "clean_regions": "clean_offsets: [0~18446744073709551615], clean_omap: 1, new_object: 0",
            "locations": []
        },
        {
            "oid": {
                "oid": "100575b106f.00000000",
                "key": "",
                "snapid": -2,
                "hash": 4113728078,
                "max": 0,
                "pool": 13,
                "namespace": ""
            },
            "need": "24536'146248224",
            "have": "22502'145951850",
            "flags": "none",
            "clean_regions": "clean_offsets: [0~18446744073709551615], clean_omap: 1, new_object: 0",
            "locations": []
        },
        {
            "oid": {
                "oid": "100576b25a6.00000000",
                "key": "",
                "snapid": -2,
                "hash": 3364520526,
                "max": 0,
                "pool": 13,
                "namespace": ""
            },
            "need": "24536'146248225",
            "have": "23645'146071729",
            "flags": "none",
            "clean_regions": "clean_offsets: [0~18446744073709551615], clean_omap: 1, new_object: 0",
            "locations": []
        }
    ],
    "state": "NotRecovering",
    "available_might_have_unfound": true,
    "might_have_unfound": [],
    "more": false
}
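
The fields that matter most in this output are each object's name, the version the PG needs (`need`), the version it currently has (`have`), and the list of known `locations`. The sketch below condenses the JSON into one line per object. It assumes the JSON shape shown above; the sample is trimmed to two of the four objects, and the helper name `summarize_unfound` is illustrative.

```python
import json

# Output of `ceph pg 13.e list_unfound`, trimmed to two of the four objects
# and to the fields used below.
SAMPLE = """
{
    "num_missing": 4,
    "num_unfound": 4,
    "objects": [
        {
            "oid": {"oid": "100575be62f.00000000", "snapid": -2, "pool": 13},
            "need": "24536'146248222",
            "have": "23645'146071715",
            "locations": []
        },
        {
            "oid": {"oid": "100569c40c6.00000000", "snapid": -2, "pool": 13},
            "need": "24536'146248223",
            "have": "21100'145780008",
            "locations": []
        }
    ],
    "state": "NotRecovering",
    "more": false
}
"""


def summarize_unfound(list_unfound_json):
    """Return one (object name, need, have, location count) tuple per object."""
    report = json.loads(list_unfound_json)
    return [
        (entry["oid"]["oid"], entry["need"], entry["have"], len(entry["locations"]))
        for entry in report["objects"]
    ]


for name, need, have, locations in summarize_unfound(SAMPLE):
    print(f"{name}: need {need}, have {have}, known locations: {locations}")
```

In this incident every object reports zero known locations, which matches the PG being unable to recover them on its own.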

Check the recovery state. Every OSD listed in might_have_unfound has already been probed, yet the objects still cannot be found.

# ceph pg 13.e query | jq .recovery_state
[
  {
    "name": "Started/Primary/Active",
    "enter_time": "2024-06-11T23:22:07.423938+0000",
    "might_have_unfound": [
      {
        "osd": "2",
        "status": "already probed"
      },
      {
        "osd": "7",
        "status": "already probed"
      },
      {
        "osd": "21",
        "status": "already probed"
      },
      {
        "osd": "24",
        "status": "already probed"
      },
      {
        "osd": "26",
        "status": "already probed"
      },
      {
        "osd": "29",
        "status": "already probed"
      }
    ],
    "recovery_progress": {
      "backfill_targets": [
        "7"
      ],
      "waiting_on_backfill": [],
      "last_backfill_started": "MIN",
      "backfill_info": {
        "begin": "MIN",
        "end": "MIN",
        "objects": []
      },
      "peer_backfill_info": [],
      "backfills_in_flight": [],
      "recovering": [],
      "pg_backend": {
        "pull_from_peer": [],
        "pushing": []
      }
    }
  },
  {
    "name": "Started",
    "enter_time": "2024-06-11T23:22:06.308313+0000"
  }
]
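
Before giving up on the objects, it is worth confirming that no candidate OSD is still waiting to be queried. The sketch below lists every `might_have_unfound` entry whose status is anything other than "already probed". It assumes the JSON shape shown above (trimmed to the relevant fields); the helper name `unprobed_osds` is illustrative.

```python
import json

# Output of `ceph pg 13.e query | jq .recovery_state`, trimmed to the
# fields used below.
SAMPLE = """
[
  {
    "name": "Started/Primary/Active",
    "might_have_unfound": [
      {"osd": "2", "status": "already probed"},
      {"osd": "7", "status": "already probed"},
      {"osd": "21", "status": "already probed"},
      {"osd": "24", "status": "already probed"},
      {"osd": "26", "status": "already probed"},
      {"osd": "29", "status": "already probed"}
    ]
  },
  {"name": "Started"}
]
"""


def unprobed_osds(recovery_state_json):
    """Return (osd id, status) pairs for OSDs that have not been probed yet."""
    pending = []
    for state in json.loads(recovery_state_json):
        # States without a might_have_unfound list contribute nothing.
        for peer in state.get("might_have_unfound", []):
            if peer["status"] != "already probed":
                pending.append((peer["osd"], peer["status"]))
    return pending


pending = unprobed_osds(SAMPLE)
if pending:
    print("still worth waiting for:", pending)
else:
    print("every candidate OSD has been probed")
```

An empty result, as here, means that bringing more OSDs back online or waiting longer is unlikely to locate the objects.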

Resolution

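All OSDs are up and running, but the objects still cannot be found, so the remaining option is to mark them as lost. With revert, Ceph rolls each object back to its previous version, or forgets it entirely if it was a new object.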

# ceph pg 13.e mark_unfound_lost revert
pg has 4 objects unfound and apparently lost marking

After the revert, the PG started backfilling.
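
Because mark_unfound_lost gives up on data, a script that issues it may want to validate its arguments first. The following is a minimal sketch of such a guard: it only builds the argument list and does not run anything. The function name `mark_unfound_lost_argv` and the PG id pattern are assumptions made for illustration; `revert` and `delete` are the two modes the Ceph documentation describes for this command.

```python
import re

# A PG id is "<pool number>.<hex seed>", for example "13.e".
# This pattern is an assumption used for a basic sanity check.
PGID = re.compile(r"\d+\.[0-9a-f]+")


def mark_unfound_lost_argv(pgid, action="revert"):
    """Build the argv for `ceph pg <pgid> mark_unfound_lost <action>`."""
    if not PGID.fullmatch(pgid):
        raise ValueError(f"not a valid PG id: {pgid!r}")
    if action not in ("revert", "delete"):
        raise ValueError(f"action must be 'revert' or 'delete', got {action!r}")
    return ["ceph", "pg", pgid, "mark_unfound_lost", action]


print(" ".join(mark_unfound_lost_argv("13.e")))
# ceph pg 13.e mark_unfound_lost revert
```

The returned list could be passed to `subprocess.run` once the checks from the investigation steps have confirmed that every candidate OSD was probed.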

References

  1. Troubleshooting PGs — Ceph Documentation
Created: 2024-06-12 08:29:00. Last updated: 2024-06-12 09:57.