On Sun, May 16, 2021 at 4:18 PM Markus Kienast <m...@trickkiste.at> wrote:
>
> On Sun, May 16, 2021 at 3:36 PM Ilya Dryomov <idryo...@gmail.com> wrote:
>>
>> On Sun, May 16, 2021 at 12:54 PM Markus Kienast <m...@trickkiste.at> wrote:
>> >
>> > Hi Ilya,
>> >
>> > unfortunately I cannot find any "missing primary copy of ..." error in
>> > the logs of my 3 OSDs.
>> > The NVMe disks are also brand new and there is not much traffic on them.
>> >
>> > The only hits for the error keyword are the two messages in the osd.0 and
>> > osd.1 logs shown below.
>> >
>> > BTW the error posted before actually concerns osd1. The one I posted was
>> > copied from somebody else's bug report, which had similar errors. Here are
>> > my original error messages on LTSP boot:
>>
>> Hi Markus,
>>
>> Please don't ever paste log messages from other bug reports again.
>> Your email said "I am seeing these messages ..." and I spent a fair
>> amount of time staring at the code trying to understand how an issue
>> that was fixed several releases ago could resurface.
>>
>> The numbers in the log message mean specific things. For example it
>> is immediately obvious that
>>
>> get_reply osd1 tid 11 data 4164 > preallocated 4096, skipping
>>
>> is not related to
>>
>> get_reply osd2 tid 1459933 data 3248128 > preallocated 131072, skipping
>>
>> even though they probably look the same to you.
>
> Sorry, I was not aware of that.
>
>> > [ 10.331119] libceph: mon1 (1)10.101.0.27:6789 session established
>> > [ 10.331799] libceph: client175444 fsid b0f4a188-bd81-11ea-8849-97abe2843f29
>> > [ 10.336866] libceph: mon0 (1)10.101.0.25:6789 session established
>> > [ 10.337598] libceph: client175444 fsid b0f4a188-bd81-11ea-8849-97abe2843f29
>> > [ 10.349380] libceph: get_reply osd1 tid 11 data 4164 > preallocated 4096, skipping
>>
>> Please paste the entire boot log and "rbd info" output for the affected
>> image.
>
> elias@maas:~$ rbd info squashfs/ltsp-01
> rbd image 'ltsp-01':
>         size 3.5 GiB in 896 objects
>         order 22 (4 MiB objects)
>         snapshot_count: 0
>         id: 23faade1714
>         block_name_prefix: rbd_data.23faade1714
>         format: 2
>         features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
>         op_features:
>         flags:
>         create_timestamp: Mon Jan 11 12:09:22 2021
>         access_timestamp: Wed Feb 24 10:55:17 2021
>         modify_timestamp: Mon Jan 11 12:09:22 2021
>
> I don't have the boot log available right now, but you can watch a video of
> the boot process here: https://photos.app.goo.gl/S8PssYu2VAr4CSeg7
>
> It seems to be consistently "tid 11", while in this video it was "data 4288"
> rather than "data 4164" as above. But the image has been modified in the
> meantime, as far as I can recall, so that might be the reason.
>> >
>> > elias@maas:~$ juju ssh ceph-osd/2 sudo zgrep -i error /var/log/ceph/ceph-osd.0.log
>> > 2021-05-16T08:52:56.872+0000 7f0b262c2d80  4 rocksdb: Options.error_if_exists: 0
>> > 2021-05-16T08:52:59.872+0000 7f0b262c2d80  4 rocksdb: Options.error_if_exists: 0
>> > 2021-05-16T08:53:00.884+0000 7f0b262c2d80  1 osd.0 8599 warning: got an error loading one or more classes: (1) Operation not permitted
>> >
>> > elias@maas:~$ juju ssh ceph-osd/0 sudo zgrep -i error /var/log/ceph/ceph-osd.1.log
>> > 2021-05-16T08:49:52.971+0000 7fb6aa68ed80  4 rocksdb: Options.error_if_exists: 0
>> > 2021-05-16T08:49:55.979+0000 7fb6aa68ed80  4 rocksdb: Options.error_if_exists: 0
>> > 2021-05-16T08:49:56.828+0000 7fb6aa68ed80  1 osd.1 8589 warning: got an error loading one or more classes: (1) Operation not permitted
>> >
>> > How can I find out more about this bug? It keeps coming back every two
>> > weeks and I need to restart all OSDs to make it go away for another two
>> > weeks. Can I check "tid 11 data 4164" somehow? I can find no documentation
>> > on what a tid actually is or how I could perform a read test on it.
>>
>> So *just* restarting the three OSDs you have makes it go away?
>>
>> What is meant by restarting? Rebooting the node or simply restarting
>> the OSD process?
>
> I did reboot all OSD nodes, and since the MON and FS nodes run as LXD/juju
> instances on them, they were rebooted as well.
>
>> >
>> > Another interesting detail is that the problem only seems to affect
>> > booting from this RBD, not operation per se. The thin clients that have
>> > already booted from this RBD continue working.
>>
>> I take it that the affected image is mapped on multiple nodes? If so,
>> on how many?
>
> Currently "squashfs/ltsp-01" is mapped on 4 nodes.
> As the pool name indicates, the FS was converted to squashfs and is therefore
> mounted read-only, while the underlying device might actually not be mapped
> read-only, as there does not seem to be an option available to map RO via
> /sys/bus/rbd/add_single_major or /sys/bus/rbd/add.
>
> As far as I can tell, the only way to force RO is to map a snapshot instead.
Are you writing to /sys/bus/rbd/add_single_major directly instead of
using the rbd tool?

>> >
>> > All systems run:
>> > Ubuntu 20.04.2 LTS
>> > Kernel 5.8.0-53-generic
>> > ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)
>> >
>> > The cluster has been set up with Ubuntu MAAS/juju and consists of:
>> > * 1 MAAS server
>> > * with 1 virtual LXD juju controller
>> > * 3 OSD servers, each with one 2 TB NVMe SSD for Ceph and a 256 GB SATA SSD
>> >   for the operating system
>> > * each OSD hosts a virtualized LXD MON and an LXD FS server (set up
>> >   through juju, see the juju yaml file attached).
>>
>> Can you describe the client side a bit more? How many clients do you
>> have? How many of them are active at the same time?
>
> Currently there are only 4 active clients, but the system is intended to be
> able to sustain hundreds of clients. We are using an RBD as the boot device
> for PXE-booted thin clients; you might have heard of the Linux Terminal
> Server Project (ltsp.org). We adapted the stack to support booting from RBD.

How many active clients were there at the time the image couldn't be
mapped? I suspect between 60 and 70?

The next time it happens, check the output of "rbd status" for that
image. If you see around 65 watchers, that is it. With the
exclusive-lock feature enabled on the image, the current kernel
implementation can't handle more than that.

Watches are established if the image is mapped read-write. For your
squashfs + overlayfs use case, it's not only better to map read-only
just in case; you actually *need* to do that to avoid watches being
established.

If you are writing to /sys/bus/rbd/add_single_major directly, append
"ro" somewhere in the options part of the string:

  ip:port,... name=myuser,secret=mysecret rbd ltsp-01 -     # read-write
  ip:port,... name=myuser,secret=mysecret,ro rbd ltsp-01 -  # read-only

Thanks,

Ilya
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
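A minimal sketch of the rbd CLI equivalent, assuming the client node has the
rbd tool installed and a keyring for the "myuser" user from the example
string above (the user name is illustrative; the pool/image name is the one
from the thread):

  # Map the image read-only so that no watch is established.
  rbd map --id myuser --read-only squashfs/ltsp-01

  # Check how many clients currently hold a watch on the image.
  rbd status squashfs/ltsp-01

The second command is the check suggested above: the next time the image
fails to map, a watcher count of around 65 would point to the exclusive-lock
watch limit.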