On Sun, May 16, 2021 at 8:06 PM Markus Kienast <m...@trickkiste.at> wrote:
>
> On Sun, May 16, 2021 at 7:38 PM Ilya Dryomov <idryo...@gmail.com> wrote:
>>
>> On Sun, May 16, 2021 at 4:18 PM Markus Kienast <m...@trickkiste.at> wrote:
>> >
>> > On Sun, May 16, 2021 at 3:36 PM Ilya Dryomov <idryo...@gmail.com> wrote:
>> >>
>> >> On Sun, May 16, 2021 at 12:54 PM Markus Kienast <m...@trickkiste.at> wrote:
>> >> >
>> >> > Hi Ilya,
>> >> >
>> >> > unfortunately I cannot find any "missing primary copy of ..." error in
>> >> > the logs of my 3 OSDs.
>> >> > The NVMe disks are also brand new and there is not much traffic on them.
>> >> >
>> >> > The only matches for the keyword "error" are the two messages in the
>> >> > osd.0 and osd.1 logs shown below.
>> >> >
>> >> > BTW the error posted before actually concerns osd1. The one I posted
>> >> > was copied from somebody else's bug report, which had similar errors.
>> >> > Here are my original error messages on LTSP boot:
>> >>
>> >> Hi Markus,
>> >>
>> >> Please don't ever paste log messages from other bug reports again.
>> >> Your email said "I am seeing these messages ..." and I spent a fair
>> >> amount of time staring at the code trying to understand how an issue
>> >> that was fixed several releases ago could resurface.
>> >>
>> >> The numbers in the log message mean specific things. For example it
>> >> is immediately obvious that
>> >>
>> >>   get_reply osd1 tid 11 data 4164 > preallocated 4096, skipping
>> >>
>> >> is not related to
>> >>
>> >>   get_reply osd2 tid 1459933 data 3248128 > preallocated 131072, skipping
>> >>
>> >> even though they probably look the same to you.
>> >
>> > Sorry, I was not aware of that.
>> >
>> >> > [ 10.331119] libceph: mon1 (1)10.101.0.27:6789 session established
>> >> > [ 10.331799] libceph: client175444 fsid b0f4a188-bd81-11ea-8849-97abe2843f29
>> >> > [ 10.336866] libceph: mon0 (1)10.101.0.25:6789 session established
>> >> > [ 10.337598] libceph: client175444 fsid b0f4a188-bd81-11ea-8849-97abe2843f29
>> >> > [ 10.349380] libceph: get_reply osd1 tid 11 data 4164 > preallocated 4096, skipping
>> >>
>> >> Please paste the entire boot log and "rbd info" output for the affected
>> >> image.
>> >
>> > elias@maas:~$ rbd info squashfs/ltsp-01
>> > rbd image 'ltsp-01':
>> >         size 3.5 GiB in 896 objects
>> >         order 22 (4 MiB objects)
>> >         snapshot_count: 0
>> >         id: 23faade1714
>> >         block_name_prefix: rbd_data.23faade1714
>> >         format: 2
>> >         features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
>> >         op_features:
>> >         flags:
>> >         create_timestamp: Mon Jan 11 12:09:22 2021
>> >         access_timestamp: Wed Feb 24 10:55:17 2021
>> >         modify_timestamp: Mon Jan 11 12:09:22 2021
>> >
>> > I don't have the boot log available right now, but you can watch a video
>> > of the boot process right here: https://photos.app.goo.gl/S8PssYu2VAr4CSeg7
>> >
>> > It seems to be consistently "tid 11", while in this video it was
>> > "data 4288" rather than "data 4164" as above. But the image has been
>> > modified in the meantime, as far as I can recall, so that might explain
>> > the difference.
>> >>
>> >> >
>> >> > elias@maas:~$ juju ssh ceph-osd/2 sudo zgrep -i error /var/log/ceph/ceph-osd.0.log
>> >> > 2021-05-16T08:52:56.872+0000 7f0b262c2d80  4 rocksdb: Options.error_if_exists: 0
>> >> > 2021-05-16T08:52:59.872+0000 7f0b262c2d80  4 rocksdb: Options.error_if_exists: 0
>> >> > 2021-05-16T08:53:00.884+0000 7f0b262c2d80  1 osd.0 8599 warning: got an error loading one or more classes: (1) Operation not permitted
>> >> >
>> >> > elias@maas:~$ juju ssh ceph-osd/0 sudo zgrep -i error /var/log/ceph/ceph-osd.1.log
>> >> > 2021-05-16T08:49:52.971+0000 7fb6aa68ed80  4 rocksdb: Options.error_if_exists: 0
>> >> > 2021-05-16T08:49:55.979+0000 7fb6aa68ed80  4 rocksdb: Options.error_if_exists: 0
>> >> > 2021-05-16T08:49:56.828+0000 7fb6aa68ed80  1 osd.1 8589 warning: got an error loading one or more classes: (1) Operation not permitted
>> >> >
>> >> > How can I find out more about this bug? It keeps coming back every two
>> >> > weeks and I need to restart all OSDs to make it go away for another two
>> >> > weeks. Can I check "tid 11 data 4164" somehow? I can find no
>> >> > documentation on what a tid actually is or how I could perform a read
>> >> > test on it.
>> >>
>> >> So *just* restarting the three OSDs you have makes it go away?
>> >>
>> >> What is meant by restarting? Rebooting the node or simply restarting
>> >> the OSD process?
>> >
>> > I did reboot all OSD nodes, and since the MON and FS nodes run as LXD/juju
>> > instances on them, they were rebooted as well.
>> >
>> >> > Another interesting detail is that the problem only seems to
>> >> > affect booting up from this RBD but not operation per se. The thin
>> >> > clients already booted from this RBD continue working.
>> >>
>> >> I take it that the affected image is mapped on multiple nodes? If so,
>> >> on how many?
>> >
>> > Currently "squashfs/ltsp-01" is mapped on 4 nodes.
>> > As the pool name indicates, the FS was converted to squashfs and is
>> > therefore mounted read-only, while the underlying device might actually
>> > not be mapped read-only, as there does not seem to be an option available
>> > to map it read-only via /sys/bus/rbd/add_single_major or /sys/bus/rbd/add.
>> >
>> > As far as I can tell, the only way to force RO is to map a snapshot
>> > instead.
>>
>> Are you writing to /sys/bus/rbd/add_single_major directly instead of
>> using the rbd tool?
>
> Yes.
> Line 110
> https://github.com/trickkiste/ltsp/blob/feature-boot_method-rbd/debian/ltsp-rbd.initramfs-script
>
> echo "${mons} name=${user},secret=${key} ${pool} ${image} ${snap}" > ${rbd_bus}
>
>>
>> >
>> >>
>> >> >
>> >> > All systems run:
>> >> > Ubuntu 20.04.2 LTS
>> >> > Kernel 5.8.0-53-generic
>> >> > ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)
>> >> >
>> >> > The cluster has been set up with Ubuntu MAAS/juju and consists of
>> >> > * 1 MAAS server
>> >> > * with 1 virtual LXD juju controller
>> >> > * 3 OSD servers with one 2 TB NVMe SSD each for Ceph and a 256 GB SATA
>> >> >   SSD for the operating system.
>> >> > * each OSD contains a virtualized LXD MON and an LXD FS server (set up
>> >> >   through juju, see the juju yaml file attached).
>> >>
>> >> Can you describe the client side a bit more? How many clients do you
>> >> have? How many of them are active at the same time?
>> >
>> > Currently, there are only 4 active clients, but the system is intended
>> > to be able to sustain hundreds of clients.
>> > We are using an RBD as a boot device
>> > for PXE booted thin clients; you might have heard of the Linux Terminal
>> > Server Project (ltsp.org). We adapted the stack to support booting from
>> > RBD.
>>
>> How many active clients were there at the time when the image couldn't
>> be mapped? I suspect between 60 and 70?
>
> No, just 4.
> Most of the time 3 were still running and working correctly and one was
> stuck at reboot.
>
> Maybe the sum of all LTSP client reboots since I cleared the problem by
> rebooting the OSDs could amount to 60-70. I do not know, as we are not
> logging that currently.
>
>> The next time it happens, check the output of "rbd status" for that
>> image. If you see around 65 watchers, that is it. With the exclusive-lock
>> feature enabled on the image, the current kernel implementation can't
>> handle more than that.
>
> OK, currently I am seeing 5, which is one more than the number of clients
> we have. So it seems these watchers do not time out after a reboot or hard
> reset.
>
> Is there any way to make these watchers time out?
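[For reference: a minimal sketch of how the watchers mentioned above can be
inspected. The pool/image name and the image id are taken from the "rbd info"
output earlier in the thread; the header object name is assumed to follow the
usual rbd_header.<image id> convention for format 2 images.]

    # watchers as reported for the image
    rbd status squashfs/ltsp-01

    # or list watchers on the image header object directly
    # (rbd_header.<id>, id from "rbd info" above)
    rados -p squashfs listwatchers rbd_header.23faade1714
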
They are supposed to time out after 30 seconds. Does the IP address of
the rogue watch offer a clue?

Note that when the mapping gets stuck on that preallocated check, it
still maintains the watch, so it's not going to time out in that case.

>
>> Watches are established if the image is mapped read-write. For your
>> squashfs + overlayfs use case, it's not only better to map read-only
>> just in case, you actually *need* to do that to avoid watches being
>> established.
>>
>> If you are writing to /sys/bus/rbd/add_single_major directly, append
>> "ro" somewhere in the options part of the string:
>>
>>   ip:port,... name=myuser,secret=mysecret rbd ltsp-01 -     # read-write
>>
>>   ip:port,... name=myuser,secret=mysecret,ro rbd ltsp-01 -  # read-only
>
> Thank you, we will add this missing piece to our rbd initrd code.
>
> Are you a ceph dev?
> Could you make sure to add this to the kernel documentation too!
> https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-bus-rbd

Map options are documented in the rbd man page:

  https://docs.ceph.com/en/latest/man/8/rbd/#kernel-rbd-krbd-options

>
> There is no mention of that option currently.
> I might even have tried this but it might not have worked. Not sure, this
> was over a year ago.
>
> Also missing from the documentation is how one could mount a CephFS on boot!

Do you mean booting *from* CephFS, i.e. using it as a root filesystem?
Because mounting CephFS on boot after the root filesystem is mounted is
done through /etc/fstab, like you would mount any other filesystem,
whether local or network.

> We are thinking about switching to booting from CephFS in the future.
> But I would not have any idea and did not find any documentation on how we
> would approach that - which boot kernel option to use, which sysfs interface
> could be used, or which tools we must include in the initrd.
>
> Generally it would be great if you could include the proper initrd code for
> RBD and CephFS root filesystems in the Ceph project. You can happily use my
> code as a starting point.
>
> https://github.com/trickkiste/ltsp/blob/feature-boot_method-rbd/debian/ltsp-rbd.initramfs-script

I think booting from CephFS would require kernel patches. It looks like
NFS and CIFS are the only network filesystems supported by the init/root
infrastructure in the kernel.

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
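
[For reference: a minimal sketch of the two suggestions above, not an exact
transcript from this cluster. The read-only mapping line reuses the variable
names from the linked ltsp-rbd initramfs script and only appends the "ro"
option Ilya describes; the fstab line uses the monitor addresses from the
boot log earlier in the thread, while the CephFS user name, secret file path
and mount point are placeholders.]

    # initramfs: map the image read-only by adding ",ro" to the options part
    echo "${mons} name=${user},secret=${key},ro ${pool} ${image} ${snap}" > ${rbd_bus}

    # /etc/fstab: mount CephFS on boot, once the root filesystem is up
    10.101.0.25:6789,10.101.0.27:6789:/  /mnt/cephfs  ceph  name=ltsp,secretfile=/etc/ceph/ltsp.secret,noatime,_netdev  0  0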