On Wed, May 9, 2018 at 8:21 AM Nikos Kormpakis <nk...@noc.grnet.gr> wrote:

> Hello,
>
> we operate a Ceph cluster running v12.2.4 on top of Debian Stretch,
> deployed with ceph-volume lvm, using the default crushmap and a quite
> vanilla ceph.conf. OSDs
> live on single disks in JBOD mode, with a separate block.db LV on a
> shared SSD.
> We have a single pool (min_size=2, size=3) on this cluster used for RBD
> volumes.
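>
> (For reference, the pool settings can be double-checked with something
> like the following sketch:)
>
> ```
> # ceph osd pool get rbd_vima size
> # ceph osd pool get rbd_vima min_size
> ```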
>
> During deep-scrubbing, we ran across an inconsistent PG. Specifically:
>
> ```
> root@rd2-0427:~# rados list-inconsistent-obj 10.193c -f json-pretty
> {
>      "epoch": 10218,
>      "inconsistents": [
>          {
>              "object": {
>                  "name": "rbd_data.a0a7c12ae8944a.0000000000008a89",
>                  "nspace": "",
>                  "locator": "",
>                  "snap": "head",
>                  "version": 8068
>              },
>              "errors": [],
>              "union_shard_errors": [
>                  "read_error"
>              ],
>              "selected_object_info":
> "10:3c9c88a6:::rbd_data.a0a7c12ae8944a.0000000000008a89:head(11523'8068
> client.10490700.0:168994 dirty|omap_digest s 4194304 uv 8068 od ffffffff
> alloc_hint [4194304 4194304 0])",
>              "shards": [
>                  {
>                      "osd": 64,
>                      "primary": false,
>                      "errors": [],
>                      "size": 4194304,
>                      "omap_digest": "0xffffffff",
>                      "data_digest": "0xfd46c017"
>                  },
>                  {
>                      "osd": 84,
>                      "primary": true,
>                      "errors": [
>                          "read_error"
>                      ],
>                      "size": 4194304
>                  },
>                  {
>                      "osd": 122,
>                      "primary": false,
>                      "errors": [],
>                      "size": 4194304,
>                      "omap_digest": "0xffffffff",
>                      "data_digest": "0xfd46c017"
>                  }
>              ]
>          }
>      ]
> }
> ```
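>
> (We spot such PGs via the cluster health warnings; something along these
> lines should list them, as a sketch:)
>
> ```
> # ceph health detail | grep -i inconsistent
> # rados list-inconsistent-pg rbd_vima
> ```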
>
> This kind of failure is quite common for us: I/O errors happen often due
> to faulty disks, and in this case we can confirm the disk fault from the
> kernel and OSD logs:
>
> ```
> Apr 25 10:45:56 rd2-0530 kernel: [486531.949072] sd 1:0:6:0: [sdm] tag#1
> FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Apr 25 10:45:56 rd2-0530 kernel: [486531.964262] sd 1:0:6:0: [sdm] tag#1
> Sense Key : Medium Error [current]
> Apr 25 10:45:56 rd2-0530 kernel: [486531.977582] sd 1:0:6:0: [sdm] tag#1
> Add. Sense: Read retries exhausted
> Apr 25 10:45:56 rd2-0530 kernel: [486531.990798] sd 1:0:6:0: [sdm] tag#1
> CDB: Read(16) 88 00 00 00 00 00 0c b1 be 80 00 00 02 00 00 00
> Apr 25 10:45:56 rd2-0530 kernel: [486532.006759] blk_update_request:
> critical medium error, dev sdm, sector 212975501
> Apr 25 10:45:56 rd2-0530 ceph-osd[6537]: 2018-04-25 10:45:56.857351
> 7f396a798700 -1 bdev(0x561b7fd14d80 /var/lib/ceph/osd/ceph-84/block)
> _aio_thread got (5) Input/output error
> Apr 25 10:45:58 rd2-0530 ceph-osd[6537]: 2018-04-25 10:45:58.730731
> 7f395af79700 -1 log_channel(cluster) log [ERR] : 10.193c shard 84: soid
> 10:3c9c88a6:::rbd_data.a0a7c12ae8944a.0000000000008a89:head candidate
> had a read error
> Apr 25 10:45:59 rd2-0530 ceph-osd[6537]: 2018-04-25 10:45:59.709004
> 7f395af79700 -1 log_channel(cluster) log [ERR] : 10.193c deep-scrub 0
> missing, 1 inconsistent objects
> Apr 25 10:45:59 rd2-0530 ceph-osd[6537]: 2018-04-25 10:45:59.709013
> 7f395af79700 -1 log_channel(cluster) log [ERR] : 10.193c deep-scrub 1
> errors
> ```
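>
> (The disk itself can also be checked independently of Ceph; a rough
> sketch, assuming smartmontools is installed and using the failing sector
> reported by blk_update_request above:)
>
> ```
> # smartctl -H -A /dev/sdm
> # dd if=/dev/sdm of=/dev/null bs=512 skip=212975501 count=1024 iflag=direct
> ```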
>
> Also, osd.84, the one with the faulty shard, appears to be the
> up_primary+acting_primary osd of this pg (10.193c).
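>
> (We confirmed this with something like the following sketch; for a
> replicated pool the primary should appear at the head of the acting set:)
>
> ```
> # ceph pg map 10.193c
> # ceph osd map rbd_vima rbd_data.a0a7c12ae8944a.0000000000008a89
> ```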
>
> Before issuing `ceph pg repair` or setting the problematic OSD (osd.84)
> out, we wanted to observe how Ceph behaves when reading an object whose
> primary shard is problematic. So, we issued the following from a cluster
> node:
>
> ```
> # rados -p rbd_vima get rbd_data.a0a7c12ae8944a.0000000000008a89
> foobartest
> ```
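>
> (For reference, the remediation we are deferring here would be roughly
> the following sketch, not yet executed:)
>
> ```
> # ceph pg repair 10.193c     # let the primary repair the inconsistent object
> # ceph osd out 84            # or: take the OSD on the failing disk out
> ```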
>
> The command completed successfully after 30-35 seconds and wrote the
> object correctly to the destination file. During those 30-35 seconds we
> observed the following:
>
> ```
> 2018-05-02 16:31:37.934812 mon.rd2-0427 mon.0
> [2001:648:2ffa:198::68]:6789/0 247500 : cluster [WRN] Health check
> update: 188 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-05-02 16:31:39.772436 mon.rd2-0427 mon.0
> [2001:648:2ffa:198::68]:6789/0 247501 : cluster [WRN] Health check
> failed: Degraded data redundancy: 1/3227247 objects degraded (0.000%), 1
> pg degraded (PG_DEGRADED)
> 2018-05-02 16:31:42.934839 mon.rd2-0427 mon.0
> [2001:648:2ffa:198::68]:6789/0 247504 : cluster [WRN] Health check
> update: 198 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-05-02 16:31:42.934893 mon.rd2-0427 mon.0
> [2001:648:2ffa:198::68]:6789/0 247505 : cluster [INF] Health check
> cleared: PG_DEGRADED (was: Degraded data redundancy: 1/3227247 objects
> degraded (0.000%), 1 pg degraded)
> 2018-05-02 16:31:37.384611 osd.84 osd.84
> [2001:648:2ffa:198::77]:6834/1031919 462 : cluster [ERR] 10.193c missing
> primary copy of
> 10:3c9c88a6:::rbd_data.a0a7c12ae8944a.0000000000008a89:head, will try
> copies on 64,122
> 2018-05-02 16:31:47.935923 mon.rd2-0427 mon.0
> [2001:648:2ffa:198::68]:6789/0 247510 : cluster [WRN] Health check
> update: 142 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-05-02 16:31:48.980015 mon.rd2-0427 mon.0
> [2001:648:2ffa:198::68]:6789/0 247511 : cluster [INF] Health check
> cleared: REQUEST_SLOW (was: 142 slow requests are blocked > 32 sec)
> ```
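>
> (In hindsight, we could probably have inspected the slow requests on the
> OSD itself while they were happening; a sketch via the admin socket on
> the OSD host:)
>
> ```
> rd2-0530 # ceph daemon osd.84 dump_ops_in_flight
> rd2-0530 # ceph daemon osd.84 dump_historic_ops
> ```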
>
> After that, we tried to read the problematic object again, which
> succeeded instantly, without any delay. Unfortunately, we did not
> investigate these slow requests, but our cluster currently has no load
> at all and no failures happened at the same time. So, we suspect that
> our rados command somehow caused these requests. We also ran a
> deep-scrub on pg 10.193c, which did not report any errors at all:
>
> ```
> # ceph pg deep-scrub 10.193c
> instructing pg 10.193c on osd.84 to deep-scrub
>
> rd2-0530 # tail -f /var/log/ceph/ceph-osd.84.log
> 2018-05-02 16:38:23.070692 7fa3213cf700  0 log_channel(cluster) log
> [DBG] : 10.193c deep-scrub starts
> 2018-05-02 16:38:29.327847 7fa3213cf700  0 log_channel(cluster) log
> [DBG] : 10.193c deep-scrub ok
> ```
>
> After deep-scrubbing, our cluster returned to the HEALTH_OK state. We
> also checked whether something else had happened and the problematic PG
> had somehow been remapped (e.g., due to an OSD failure), but we did not
> observe anything like that.
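>
> (The checks we ran for that were along these lines, as a sketch:)
>
> ```
> # ceph pg map 10.193c                                 # acting set unchanged?
> # rados list-inconsistent-obj 10.193c -f json-pretty  # inconsistents now empty
> # ceph -s
> ```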
>
> So, at this point, we are stuck with a lot of questions:
>
> 1) After how much time does RADOS try to read from a secondary replica?
>    Is this timeout configurable?
> 2) If a primary shard is missing, does Ceph try to recreate it
>    automatically somehow?
> 3) If Ceph recreates the primary shard (either automatically or with
>    `ceph pg repair`), why did we not observe I/O errors again? Does
>    BlueStore know which disk blocks are bad and somehow avoid them, or
>    can the same object be stored on different blocks when recreated?
>    Unfortunately, I'm not familiar with its internals.
> 4) Is there any reason why the slow requests appeared? Can we somehow
>    correlate these requests with our problem?
>
> This behavior looks very confusing at first sight, and we'd really like
> to know what is happening and what Ceph is doing internally. I'd really
> appreciate any insights or pointers.
>

David and a few other people have been making a lot of changes around this
area lately to make Ceph handle failures more transparently, and I haven't
kept up with all of it. But I *believe* what happened is:
1) the scrub caused a read of the object, and BlueStore returned a read
error
2) the OSD would have previously treated this as a catastrophic failure and
crashed, but now it handles it by marking the object as missing and needing
recovery
I don't quite remember the exact trigger here; either 3') it does the
recovery on its own when there are resources available for it, or
3) the user requested an object the OSD had marked as missing, so
4) the recovery code kicked off and the OSD grabbed it from another replica.
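
Related to that: if you'd rather have scrub kick off the repair on its own
instead of waiting for a client read or a manual `ceph pg repair`, there is
an OSD option for it (off by default, if I remember correctly). Roughly:

```
# check the current value on one OSD via its admin socket
ceph daemon osd.84 config get osd_scrub_auto_repair
# enable it at runtime (a sketch; persist it in ceph.conf to survive restarts)
ceph tell osd.* injectargs '--osd_scrub_auto_repair=true'
```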

With particular reference to your questions:
1) It's not about time; a read error means the object is marked as gone
locally, and when that happens the OSD will try to recover the object from
elsewhere.
2) not a whole shard, but an object, sure. (I mean, it will also try to
recover a shard, but that's the normal peering, recovery, backfill sort of
thing...)
3) I don't know the BlueStore internals well enough to say for sure whether
it marks the blocks as bad, but most modern *disks* will do that transparently
to the upper layers, so BlueStore just needs to write the data out again.
To BlueStore, the write will look like a completely different object, so the
fact that a previous bit of the hard drive was bad won't matter. (See the
smartctl sketch after this list for one way to check what the drive is doing.)
4) Probably your cluster was already busy, and ops got backed up on either
the primary OSD or one of the others participating in recovery? I mean,
that generally shouldn't occur, but slow requests tend to happen if you
overload a cluster and maybe the recovery pushed it over the edge...
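
On point 3, one quick way to see whether the drive is quietly remapping
sectors underneath BlueStore is to watch its SMART counters; a rough sketch
(attribute names vary by vendor):

```
smartctl -A /dev/sdm | grep -Ei 'realloc|pending|uncorrect'
```
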
-Greg



>
> Best regards,
> --
> Nikos Kormpakis - nk...@noc.grnet.gr
> Network Operations Center, Greek Research & Technology Network
> Tel: +30 210 7475712 - http://www.grnet.gr
> 7, Kifisias Av., 115 23 Athens, Greece
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
