On Sun, May 16, 2021 at 4:18 PM Markus Kienast <m...@trickkiste.at> wrote:
>
> On Sun, May 16, 2021 at 3:36 PM Ilya Dryomov <idryo...@gmail.com> wrote:
>>
>> On Sun, May 16, 2021 at 12:54 PM Markus Kienast <m...@trickkiste.at> wrote:
>> >
>> > Hi Ilya,
>> >
>> > unfortunately I cannot find any "missing primary copy of ..." error in
>> > the logs of my 3 OSDs.
>> > The NVMe disks are also brand new and there is not much traffic on them.
>> >
>> > The only hits for the error keyword are the two messages in the osd.0 and
>> > osd.1 logs shown below.
>> >
>> > BTW the error posted before actually concerns osd1. The one I posted was
>> > copied from somebody else's bug report, which had similar errors. Here are
>> > my original error messages on LTSP boot:
>>
>> Hi Markus,
>>
>> Please don't ever paste log messages from other bug reports again.
>> Your email said "I am seeing these messages ..." and I spent a fair
>> amount of time staring at the code trying to understand how an issue
>> that was fixed several releases ago could resurface.
>>
>> The numbers in the log message mean specific things. For example it
>> is immediately obvious that
>>
>> get_reply osd1 tid 11 data 4164 > preallocated 4096, skipping
>>
>> is not related to
>>
>> get_reply osd2 tid 1459933 data 3248128 > preallocated 131072, skipping
>>
>> even though they probably look the same to you.
>
> Sorry, I was not aware of that.
>
>> > [ 10.331119] libceph: mon1 (1)10.101.0.27:6789 session established
>> > [ 10.331799] libceph: client175444 fsid b0f4a188-bd81-11ea-8849-97abe2843f29
>> > [ 10.336866] libceph: mon0 (1)10.101.0.25:6789 session established
>> > [ 10.337598] libceph: client175444 fsid b0f4a188-bd81-11ea-8849-97abe2843f29
>> > [ 10.349380] libceph: get_reply osd1 tid 11 data 4164 > preallocated 4096, skipping
>>
>> Please paste the entire boot log and "rbd info" output for the affected
>> image.
>
> elias@maas:~$ rbd info squashfs/ltsp-01
> rbd image 'ltsp-01':
>         size 3.5 GiB in 896 objects
>         order 22 (4 MiB objects)
>         snapshot_count: 0
>         id: 23faade1714
>         block_name_prefix: rbd_data.23faade1714
>         format: 2
>         features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
>         op_features:
>         flags:
>         create_timestamp: Mon Jan 11 12:09:22 2021
>         access_timestamp: Wed Feb 24 10:55:17 2021
>         modify_timestamp: Mon Jan 11 12:09:22 2021
>
> I don't have the boot log available right now, but you can watch a video of
> the boot process here: https://photos.app.goo.gl/S8PssYu2VAr4CSeg7
>
> It seems to be consistently "tid 11", while in this video it was "data 4288"
> rather than "data 4164" as above. But the image has been modified in the
> meantime, as far as I can recall, so that might be the reason.
>> >
>> > elias@maas:~$ juju ssh ceph-osd/2 sudo zgrep -i error /var/log/ceph/ceph-osd.0.log
>> > 2021-05-16T08:52:56.872+0000 7f0b262c2d80  4 rocksdb: Options.error_if_exists: 0
>> > 2021-05-16T08:52:59.872+0000 7f0b262c2d80  4 rocksdb: Options.error_if_exists: 0
>> > 2021-05-16T08:53:00.884+0000 7f0b262c2d80  1 osd.0 8599 warning: got an error loading one or more classes: (1) Operation not permitted
>> >
>> > elias@maas:~$ juju ssh ceph-osd/0 sudo zgrep -i error /var/log/ceph/ceph-osd.1.log
>> > 2021-05-16T08:49:52.971+0000 7fb6aa68ed80  4 rocksdb: Options.error_if_exists: 0
>> > 2021-05-16T08:49:55.979+0000 7fb6aa68ed80  4 rocksdb: Options.error_if_exists: 0
>> > 2021-05-16T08:49:56.828+0000 7fb6aa68ed80  1 osd.1 8589 warning: got an error loading one or more classes: (1) Operation not permitted
>> >
>> > How can I find out more about this bug? It keeps coming back every two
>> > weeks and I need to restart all OSDs to make it go away for another two
>> > weeks. Can I check "tid 11 data 4164" somehow? I can find no documentation
>> > on what a tid actually is or how I could perform a read test on it.
>>
>> So *just* restarting the three OSDs you have makes it go away?
>>
>> What is meant by restarting? Rebooting the node or simply restarting
>> the OSD process?
>
> I did reboot all OSD nodes, and since the MON and FS nodes run as LXD/juju
> instances on them, they were rebooted as well.
>
>> >
>> > Another interesting detail is that the problem only seems to affect
>> > booting from this RBD, not operation per se. The thin clients that have
>> > already booted from this RBD continue working.
>>
>> I take it that the affected image is mapped on multiple nodes? If so,
>> on how many?
>
> Currently "squashfs/ltsp-01" is mapped on 4 nodes.
> As the pool name indicates, the FS was converted to squashfs and is therefore
> mounted read-only, while the underlying device might actually not be mapped
> read-only, as there does not seem to be an option available to map RO via
> /sys/bus/rbd/add_single_major or /sys/bus/rbd/add.
>
> As far as I can tell, the only way to force RO is to map a snapshot instead.
Are you writing to /sys/bus/rbd/add_single_major directly instead of
using the rbd tool?

>> >
>> > All systems run:
>> > Ubuntu 20.04.2 LTS
>> > Kernel 5.8.0-53-generic
>> > ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)
>> >
>> > The cluster has been set up with Ubuntu MAAS/juju and consists of:
>> > * 1 MAAS server
>> > * with 1 virtual LXD juju controller
>> > * 3 OSD servers, each with one 2 TB NVMe SSD for Ceph and a 256 GB SATA SSD
>> >   for the operating system
>> > * each OSD hosts a virtualized LXD MON and an LXD FS server (set up
>> >   through juju, see the juju yaml file attached).
>>
>> Can you describe the client side a bit more? How many clients do you
>> have? How many of them are active at the same time?
>
> Currently there are only 4 active clients, but the system is intended to be
> able to sustain hundreds of clients. We are using an RBD as the boot device
> for PXE-booted thin clients; you might have heard of the Linux Terminal
> Server Project (ltsp.org). We adapted the stack to support booting from RBD.

How many active clients were there at the time the image couldn't be
mapped? I suspect between 60 and 70?

The next time it happens, check the output of "rbd status" for that
image. If you see around 65 watchers, that is it. With the
exclusive-lock feature enabled on the image, the current kernel
implementation can't handle more than that.

Watches are established if the image is mapped read-write. For your
squashfs + overlayfs use case, it's not only better to map read-only
just in case; you actually *need* to do that to avoid watches being
established.

If you are writing to /sys/bus/rbd/add_single_major directly, append
"ro" somewhere in the options part of the string:

  ip:port,... name=myuser,secret=mysecret rbd ltsp-01 -     # read-write
  ip:port,... name=myuser,secret=mysecret,ro rbd ltsp-01 -  # read-only

Thanks,

Ilya
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
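A minimal sketch of the rbd CLI equivalent, assuming the client node has the
rbd tool installed and a keyring for the "myuser" user from the example
string above (the user name is illustrative; the pool/image name is the one
from the thread):

  # Map the image read-only so that no watch is established.
  rbd map --id myuser --read-only squashfs/ltsp-01

  # Check how many clients currently hold a watch on the image.
  rbd status squashfs/ltsp-01

The second command is the check suggested above: the next time the image
fails to map, a watcher count of around 65 would point to the exclusive-lock
watch limit.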