On Sun, May 16, 2021 at 8:06 PM Markus Kienast <m...@trickkiste.at> wrote:
>
> Am So., 16. Mai 2021 um 19:38 Uhr schrieb Ilya Dryomov <idryo...@gmail.com>:
>>
>> On Sun, May 16, 2021 at 4:18 PM Markus Kienast <m...@trickkiste.at> wrote:
>> >
>> > Am So., 16. Mai 2021 um 15:36 Uhr schrieb Ilya Dryomov 
>> > <idryo...@gmail.com>:
>> >>
>> >> On Sun, May 16, 2021 at 12:54 PM Markus Kienast <m...@trickkiste.at> 
>> >> wrote:
>> >> >
>> >> > Hi Ilya,
>> >> >
>> >> > unfortunately I cannot find any "missing primary copy of ..." error in
>> >> > the logs of my 3 OSDs.
>> >> > The NVMe disks are also brand new and there is not much traffic on them.
>> >> >
>> >> > The only error-related messages I can find are the two shown below, from
>> >> > the osd.0 and osd.1 logs.
>> >> >
>> >> > BTW, the error posted before actually concerns osd1. The one I posted
>> >> > was copied from somebody else's bug report, which had similar errors.
>> >> > Here are my original error messages on LTSP boot:
>> >>
>> >> Hi Markus,
>> >>
>> >> Please don't ever paste log messages from other bug reports again.
>> >> Your email said "I am seeing these messages ..." and I spent a fair
>> >> amount of time staring at the code trying to understand how an issue
>> >> that was fixed several releases ago could resurface.
>> >>
>> >> The numbers in the log message mean specific things.  For example, it
>> >> is immediately obvious that
>> >>
>> >>   get_reply osd1 tid 11 data 4164 > preallocated 4096, skipping
>> >>
>> >> is not related to
>> >>
>> >>   get_reply osd2 tid 1459933 data 3248128 > preallocated 131072, skipping
>> >>
>> >> even though they probably look the same to you.
>> >
>> >
>> > Sorry, I was not aware of that.
>> >
>> >>
>> >> > [    10.331119] libceph: mon1 (1)10.101.0.27:6789 session established
>> >> > [    10.331799] libceph: client175444 fsid 
>> >> > b0f4a188-bd81-11ea-8849-97abe2843f29
>> >> > [    10.336866] libceph: mon0 (1)10.101.0.25:6789 session established
>> >> > [    10.337598] libceph: client175444 fsid 
>> >> > b0f4a188-bd81-11ea-8849-97abe2843f29
>> >> > [    10.349380] libceph: get_reply osd1 tid 11 data 4164 > preallocated
>> >> > 4096, skipping
>> >>
>> >> Please paste the entire boot log and "rbd info" output for the affected
>> >> image.
>> >
>> >
>> > elias@maas:~$ rbd info squashfs/ltsp-01
>> > rbd image 'ltsp-01':
>> > size 3.5 GiB in 896 objects
>> > order 22 (4 MiB objects)
>> > snapshot_count: 0
>> > id: 23faade1714
>> > block_name_prefix: rbd_data.23faade1714
>> > format: 2
>> > features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
>> > op_features:
>> > flags:
>> > create_timestamp: Mon Jan 11 12:09:22 2021
>> > access_timestamp: Wed Feb 24 10:55:17 2021
>> > modify_timestamp: Mon Jan 11 12:09:22 2021
>> >
>> > I don't have the boot log available right now, but you can watch a video 
>> > of the boot process right here: https://photos.app.goo.gl/S8PssYu2VAr4CSeg7
>> >
>> > It seems to be consistently "tid 11", while in this video it was
>> > "data 4288" rather than "data 4164" as above. But the image has been
>> > modified in the meantime, as far as I can recall, which might explain the
>> > difference.
>> >>
>> >>
>> >> >
>> >> > elias@maas:~$ juju ssh ceph-osd/2 sudo zgrep -i error 
>> >> > /var/log/ceph/ceph-osd.0.log
>> >> > 2021-05-16T08:52:56.872+0000 7f0b262c2d80  4 rocksdb:                   
>> >> >       Options.error_if_exists: 0
>> >> > 2021-05-16T08:52:59.872+0000 7f0b262c2d80  4 rocksdb:                   
>> >> >       Options.error_if_exists: 0
>> >> > 2021-05-16T08:53:00.884+0000 7f0b262c2d80  1 osd.0 8599 warning: got an 
>> >> > error loading one or more classes: (1) Operation not permitted
>> >> >
>> >> > elias@maas:~$ juju ssh ceph-osd/0 sudo zgrep -i error 
>> >> > /var/log/ceph/ceph-osd.1.log
>> >> > 2021-05-16T08:49:52.971+0000 7fb6aa68ed80  4 rocksdb:                   
>> >> >       Options.error_if_exists: 0
>> >> > 2021-05-16T08:49:55.979+0000 7fb6aa68ed80  4 rocksdb:                   
>> >> >       Options.error_if_exists: 0
>> >> > 2021-05-16T08:49:56.828+0000 7fb6aa68ed80  1 osd.1 8589 warning: got an 
>> >> > error loading one or more classes: (1) Operation not permitted
>> >> >
>> >> > How can I find out more about this bug? It keeps coming back every two
>> >> > weeks and I need to restart all OSDs to make it go away for another two
>> >> > weeks. Can I check "tid 11 data 4164" somehow? I can find no documentation
>> >> > on what a tid actually is or how I could perform a read test on it.
>> >>
>> >> So *just* restarting the three OSDs you have makes it go away?
>> >>
>> >> What is meant by restarting?  Rebooting the node or simply restarting
>> >> the OSD process?
>> >
>> >
>> > I did reboot all OSD nodes, and since the MON and FS nodes run as LXD/juju
>> > instances on them, they were rebooted as well.
>> >
>> >>
>> >> >
>> >> > Another interesting detail is that the problem only seems to affect
>> >> > booting from this RBD, not operation per se. The thin clients that have
>> >> > already booted from this RBD continue working.
>> >>
>> >> I take it that the affected image is mapped on multiple nodes?  If so,
>> >> on how many?
>> >
>> >
>> > Currently "squashfs/ltsp-01" is mapped on 4 nodes.
>> > As the pool name indicates, the FS was converted to squashfs and is
>> > therefore mounted read-only, while the underlying device might actually not
>> > be mapped read-only, as there does not seem to be an option available to
>> > request a read-only mapping via /sys/bus/rbd/add_single_major or
>> > /sys/bus/rbd/add.
>> >
>> > As far as I can tell, the only way to force a read-only mapping is to map
>> > a snapshot instead.
>>
>> Are you writing to /sys/bus/rbd/add_single_major directly instead of
>> using the rbd tool?
>
>
> Yes.
> Line 110 
> https://github.com/trickkiste/ltsp/blob/feature-boot_method-rbd/debian/ltsp-rbd.initramfs-script
>
> echo "${mons} name=${user},secret=${key} ${pool} ${image} ${snap}" > 
> ${rbd_bus}
>
>>
>> >
>> >>
>> >> >
>> >> > All systems run:
>> >> > Ubuntu 20.04.2 LTS
>> >> > Kernel 5.8.0-53-generic
>> >> > ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus 
>> >> > (stable)
>> >> >
>> >> > The cluster has been set up with Ubuntu MAAS/juju and consists of:
>> >> > * 1 MAAS server
>> >> > * with 1 virtual LXD juju controller
>> >> > * 3 OSD servers with one 2 TB NVMe SSD each for Ceph and a 256 GB SATA SSD
>> >> > for the operating system
>> >> > * each OSD contains a virtualized LXD MON and an LXD FS server (set up
>> >> > through juju, see the juju yaml file attached).
>> >>
>> >> Can you describe the client side a bit more?  How many clients do you
>> >> have?  How many of them are active at the same time?
>> >
>> >
>> > Currently, there are only 4 active clients, but the system is intended to
>> > be able to sustain hundreds of clients. We are using an RBD as the boot
>> > device for PXE-booted thin clients; you might have heard of the Linux
>> > Terminal Server Project (ltsp.org). We adapted the stack to support booting
>> > from RBD.
>>
>> How many active clients were there at the time when the image couldn't
>> be mapped?  I suspect between 60 and 70?
>
>
> No, just 4.
> Most of the time, 3 are still running and working correctly and one is stuck
> at reboot.
>
> Maybe the sum of all LTSP client reboots since I cleared the problem by 
> rebooting the OSDs could amount to 60-70. I do not know, as we are not 
> logging that currently.
>
>>
>>
>> The next time it happens, check the output of "rbd status" for that
>> image.  If you see around 65 watchers, that is it.  With the
>> exclusive-lock feature enabled on the image, the current kernel
>> implementation can't handle more than that.
>
>
> OK, currently I am seeing 5, which is one more than the number of clients we
> have. So it seems these watchers do not time out after a reboot or hard reset.
>
> Is there any way to make these watchers time out?

They are supposed to time out after 30 seconds.  Does the IP address
of the rogue watch offer a clue?

Note that when the mapping gets stuck on that preallocated check, it
still maintains the watch, so it's not going to time out in that case.
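
For reference, the next time it happens you can check who is holding those
watches; a quick sketch using the pool/image and image id from your
"rbd info" output above:

  # lists the watchers with their IP addresses
  rbd status squashfs/ltsp-01

  # lower-level view: watchers on the image header object
  # (rbd_header.<id>, with <id> taken from "rbd info")
  rados -p squashfs listwatchers rbd_header.23faade1714

The addresses should tell you whether the extra watch belongs to one of
the thin clients or to something else.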

>
>>
>>
>> Watches are established if the image is mapped read-write.  For your
>> squashfs + overlayfs use case, it's not only better to map read-only
>> just in case; you actually *need* to do that to avoid watches being
>> established.
>>
>> If you are writing to /sys/bus/rbd/add_single_major directly, append
>> "ro" somewhere in the options part of the string:
>>
>>   ip:port,... name=myuser,secret=mysecret rbd ltsp-01 -  # read-write
>>
>>   ip:port,... name=myuser,secret=mysecret,ro rbd ltsp-01 -  # read-only
>
>
> Thank you, we will add this missing piece to our rbd initrd code.
>
> Are you a ceph dev?
> Could you make sure to add this to kernel documentation too!
> https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-bus-rbd

Map options are documented in the rbd man page:

https://docs.ceph.com/en/latest/man/8/rbd/#kernel-rbd-krbd-options
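
In your initramfs script that just means appending ",ro" to the options
field of the string you echo into the sysfs file, roughly (untested, using
the variables from the line you quoted above):

  echo "${mons} name=${user},secret=${key},ro ${pool} ${image} ${snap}" > ${rbd_bus}

If you later switch to the rbd tool instead of writing to sysfs directly,
the equivalent read-only mapping would be something like ("myuser" standing
in for your CephX user):

  rbd map squashfs/ltsp-01 --id myuser --read-only    # or: -o ro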

>
> There is no mention of that option there currently.
> I might even have tried this, but it might not have worked. I am not sure;
> this was over a year ago.
>
> Also missing from the documentation is how one could mount a CephFS on boot!

Do you mean booting *from* CephFS, i.e. using it as a root filesystem?
Because mounting CephFS on boot after the root filesystem is mounted is
done through /etc/fstab, just as you would mount any other filesystem,
whether local or network.
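
For example, a kernel-client entry in /etc/fstab would look roughly like
this (monitor addresses taken from your boot log above; the user name and
secret file are placeholders):

  10.101.0.25:6789,10.101.0.27:6789:/  /mnt/cephfs  ceph  name=myuser,secretfile=/etc/ceph/myuser.secret,noatime,_netdev  0  0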

> We are thinking about switching to booting from CephFS in the future.
> But I have no idea how we would approach that and did not find any
> documentation on it - which kernel boot option to use, which sysfs
> interface could be used, or which tools we must include in the initrd.
>
> Generally, it would be great if you could include the proper initrd code for
> RBD and CephFS root filesystems in the Ceph project. You can happily use my
> code as a starting point.
>
> https://github.com/trickkiste/ltsp/blob/feature-boot_method-rbd/debian/ltsp-rbd.initramfs-script

I think booting from CephFS would require kernel patches.  It looks
like NFS and CIFS are the only network filesystems supported by the
init/root infrastructure in the kernel.
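
That said, your LTSP initramfs script already mounts the RBD root itself
instead of relying on the kernel's root= handling, so the same userspace
approach should in principle carry over to CephFS.  A minimal, untested
sketch (monitor addresses from your boot log; the user, key and the
initramfs-tools ${rootmnt} mount point are assumptions):

  # inside the initramfs, once the network is up
  modprobe ceph
  mount -t ceph 10.101.0.25:6789,10.101.0.27:6789:/ "${rootmnt}" \
        -o name=myuser,secret=${key},noatime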

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io