Hi Reed,

It looks to me like your settings aren't taking effect. You might want to check the OSD log rather than the crash info and look at the assertion's backtrace.
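On a conventional (non-containerized) deployment with default logging, something like this should surface the backtrace (osd.12 is just used as an example; adjust the ID and log path for your setup):

$ grep -A 20 ceph_assert /var/log/ceph/ceph-osd.12.log

or, if the daemon logs to the journal:

$ journalctl -u ceph-osd@12 | grep -A 20 ceph_assert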

Does it mention RocksDBBlueFSVolumeSelector, like the one in https://tracker.ceph.com/issues/53906:

ceph version 17.0.0-10229-g7e035110 (7e035110784fba02ba81944e444be9a36932c6a3) quincy (dev)
 1: /lib64/libpthread.so.0(+0x12c20) [0x7f2beb318c20]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x56347eb33bec]
 5: /usr/bin/ceph-osd(+0x5d5daf) [0x56347eb33daf]
 6: (RocksDBBlueFSVolumeSelector::add_usage(void*, bluefs_fnode_t const&)+0) [0x56347f1f7d00]
 7: (BlueFS::_flush_range_F(BlueFS::FileWriter*, unsigned long, unsigned long)+0x735) [0x56347f295b45]


If so, then the parameter change still isn't being applied properly.
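To double-check what the running daemon is actually using, you could also query it over the admin socket on the OSD node (assuming the default admin socket setup), e.g.:

$ ceph daemon osd.12 config get bluestore_volume_selection_policy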

Thanks
Igor

On 10/01/2024 20:13, Reed Dier wrote:
Well, sadly, that setting doesn’t seem to resolve the issue.

I set the value in ceph.conf for the OSDs with small WAL/DB devices that keep 
running into the issue,

$  ceph tell osd.12 config show | grep bluestore_volume_selection_policy
     "bluestore_volume_selection_policy": "rocksdb_original",
$ ceph crash info 2024-01-10T16:39:05.925534Z_f0c57ca3-b7e6-4511-b7ae-5834541d6c67 | egrep "(assert_condition|entity_name)"
     "assert_condition": "cur >= p.length",
     "entity_name": "osd.12",

So, I guess that configuration item doesn’t in fact prevent the crash as was 
purported.
Looks like I may need to fast-track moving to Quincy…

Reed

On Jan 8, 2024, at 9:47 AM, Reed Dier <reed.d...@focusvq.com> wrote:

I ended up setting it in ceph.conf which appears to have worked (as far as I 
can tell).

[osd]
bluestore_volume_selection_policy = rocksdb_original
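The daemons presumably need a restart to pick up the ceph.conf change; on a systemd-based (non-cephadm) install that would be something like the following, per affected OSD:

$ sudo systemctl restart ceph-osd@0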
$ ceph config show osd.0 | grep bluestore_volume_selection_policy
bluestore_volume_selection_policy   rocksdb_original                  file      (mon[rocksdb_original])
So far so good…

Reed

On Jan 8, 2024, at 2:04 AM, Eugen Block <ebl...@nde.ag> wrote:

Hi,

I just did the same in my lab environment and the config got applied to the 
daemon after a restart:

pacific:~ # ceph tell osd.0 config show | grep bluestore_volume_selection_policy
    "bluestore_volume_selection_policy": "rocksdb_original",

This is also a (tiny single-node) cluster running 16.2.14. Maybe there was a typo or something in the loop you used? Have you tried setting it for just one OSD and checking whether it starts with the config applied?
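A minimal test could look like this (assuming osd.0 as the test OSD and a systemd-based deployment; the restart step will differ for cephadm/containerized setups):

$ ceph config set osd.0 bluestore_volume_selection_policy rocksdb_original
$ ssh <osd-host> sudo systemctl restart ceph-osd@0
$ ceph tell osd.0 config show | grep bluestore_volume_selection_policy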


Quoting Reed Dier <reed.d...@focusvq.com>:

After ~3 uneventful weeks since upgrading from 15.2.17 to 16.2.14, I've started seeing OSD crashes with "cur >= fnode.size" and "cur >= p.length" assertions. These seem to be resolved in the next Pacific point release later this month, but until then I'd love to keep the OSDs from flapping.

$ for crash in $(ceph crash ls | grep osd | awk '{print $1}') ; do ceph crash info $crash | egrep "(assert_condition|crash_id)" ; done
    "assert_condition": "cur >= fnode.size",
    "crash_id": "2024-01-03T09:07:55.698213Z_348af2d3-d4a7-4c27-9f71-70e6dc7c1af7",
    "assert_condition": "cur >= p.length",
    "crash_id": "2024-01-03T14:21:55.794692Z_4557c416-ffca-4165-aa91-d63698d41454",
    "assert_condition": "cur >= fnode.size",
    "crash_id": "2024-01-03T22:53:43.010010Z_15dc2b2a-30fb-4355-84b9-2f9560f08ea7",
    "assert_condition": "cur >= p.length",
    "crash_id": "2024-01-04T02:34:34.408976Z_2954a2c2-25d2-478e-92ad-d79c42d3ba43",
    "assert_condition": "cur2 >= p.length",
    "crash_id": "2024-01-04T21:57:07.100877Z_12f89c2c-4209-4f5a-b243-f0445ba629d2",
    "assert_condition": "cur >= p.length",
    "crash_id": "2024-01-05T00:35:08.561753Z_a189d967-ab02-4c61-bf68-1229222fd259",
    "assert_condition": "cur >= fnode.size",
    "crash_id": "2024-01-05T04:11:48.625086Z_a598cbaf-2c4f-4824-9939-1271eeba13ea",
    "assert_condition": "cur >= p.length",
    "crash_id": "2024-01-05T13:49:34.911210Z_953e38b9-8ae4-4cfe-8f22-d4b7cdf65cea",
    "assert_condition": "cur >= p.length",
    "crash_id": "2024-01-05T13:54:25.732770Z_4924b1c0-309c-4471-8c5d-c3aaea49166c",
    "assert_condition": "cur >= p.length",
    "crash_id": "2024-01-05T16:35:16.485416Z_0bca3d2a-2451-4275-a049-a65c58c1aff1",
As noted in https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/YNJ35HXN4HXF4XWB6IOZ2RKXX7EQCEIY/

You can apparently work around the issue by setting the 'bluestore_volume_selection_policy' config parameter to rocksdb_original.
However, after trying to set that parameter with `ceph config set osd.$osd bluestore_volume_selection_policy rocksdb_original`, it doesn't appear to take effect.

$ ceph config show-with-defaults osd.0  | grep bluestore_volume_selection_policy
bluestore_volume_selection_policy                           use_some_extra
$ ceph config set osd.0 bluestore_volume_selection_policy rocksdb_original
$ ceph config show osd.0  | grep bluestore_volume_selection_policy
bluestore_volume_selection_policy   use_some_extra                    default                 mon
This, I assume, should reflect the new setting; however, it still shows the default "use_some_extra" value.

But then this seems to imply that the config is set?
$ ceph config dump | grep bluestore_volume_selection_policy
    osd.0                dev       bluestore_volume_selection_policy       rocksdb_original                                              *
[snip]
    osd.9                dev       bluestore_volume_selection_policy       rocksdb_original                                              *
Does this need to be set in ceph.conf or is there another setting that also 
needs to be set?
Even after bouncing the OSD daemon, `ceph config show` still reports "use_some_extra".
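If it helps narrow things down, the value stored in the mon config database can also be read back directly (which should presumably match the `ceph config dump` output above):

$ ceph config get osd.0 bluestore_volume_selection_policy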

I'd appreciate any help anyone can offer to point me in the right direction and bridge the gap between now and the next point release.

Thanks,
Reed
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
