[ceph-users] Re: ceph rbox test on passive compressed pool
Hi David, Just to let you know, this hint is being set. What is the reason for ceph compressing only half of the objects? Can it be that there is some issue with my osd's? Like some maybe have an old fs (still deployed with ceph-disk, not ceph-volume)? Is this still to be expected, or does ceph drop compressing when under pressure? https://github.com/ceph-dovecot/dovecot-ceph-plugin/blob/56d6c900cc9ec07dfb98ef2abac07aae466b7610/src/librmb/rados-storage-impl.cpp#L75 Thanks, Marc

-----Original Message-----
Cc: jan.radon
Subject: Re: [ceph-users] ceph rbox test on passive compressed pool

The hints have to be given from the client side as far as I understand, can you share the client code too? Also, note that it seems there is no guarantee that it will actually do anything (best effort I guess): https://docs.ceph.com/docs/mimic/rados/api/librados/#c.rados_set_alloc_hint Cheers

On 6 September 2020 15:59:01 BST, Marc Roos wrote:

I have been inserting 10790 copies of exactly the same 64kb text message into a pool with passive compression enabled. I am still counting, but it looks like only half the objects are compressed.

mail/b08c3218dbf1545ff43052412a8e mtime 2020-09-06 16:27:39.00, size 63580
mail/00f6043775f1545ff43052412a8e mtime 2020-09-06 16:25:57.00, size 525
mail/b875f40571f1545ff43052412a8e mtime 2020-09-06 16:25:53.00, size 63580
mail/e87c120b19f1545ff43052412a8e mtime 2020-09-06 16:24:25.00, size 525

I am not sure if this should be expected from passive; these docs[1] hint that passive means 'compress if hinted COMPRESSIBLE'. From that I would conclude that all text messages should be compressed. A previous test with a 64kb gzip attachment seemed not to compress, although I did not look at all object sizes.

on 14.2.11

[1] https://documentation.suse.com/ses/5.5/html/ses-all/ceph-pools.html#sec-ceph-pool-compression https://docs.ceph.com/docs/mimic/rados/operations/pools/

ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io -- Sent from my Android device with K-9 Mail. Please excuse my brevity. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
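For reference, these read-only commands show the settings that decide whether passive compression kicks in for a pool and its OSDs. This is only a sketch: the pool name 'mail' is inferred from the object listing above, and the option names are the standard pool/bluestore ones, so adjust to your setup.

```
# Pool-level compression options (mode, algorithm, required ratio):
ceph osd pool get mail compression_mode
ceph osd pool get mail compression_algorithm
ceph osd pool get mail compression_required_ratio

# Bluestore only compresses blobs above the per-device-class minimum, and
# only keeps the result if it shrinks below the required ratio:
ceph config get osd bluestore_compression_min_blob_size_hdd
ceph config get osd bluestore_compression_required_ratio
```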
[ceph-users] Re: slow "rados ls"
Hi Stefan, I can't recall that that was the case, and unfortunately we do not have enough history in our performance measurements to look back. We are on nautilus. Please let me know your findings when you do your pg expansion on nautilus. Grtz Marcel

> OK, I'm really curious if you observed the following behaviour:
>
> During, or shortly after the rebalance, did you see high CPU usage of
> the OSDs? In particular the ones that hosted the PGs before they were
> moved to the new nodes? As in ~300% CPU per OSD (increasing from a few
> percent to 300% non stop)? RocksDB is doing housekeeping, and we
> observed before, and today again, on Mimic 13.2.8, that with a lot of
> OMAP/META data the OSDs that have to clean up consume a ridiculous
> amount of CPU (for hours on end), triggering loads of slow ops and
> latency spikes of sometimes (tens of) seconds.
>
> Are you running nautilus? If you haven't seen this behaviour this might
> have been fixed in Nautilus. Or your cluster is different from ours. We
> will do PG expansion after we have upgraded to Nautilus, so we'll
> definitely know by then.
>
> Thanks,
>
> Stefan
> ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
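For anyone hitting this, one way to take the RocksDB housekeeping hit at a time of your choosing is to compact the OSD's DB manually and watch its RocksDB counters. A rough sketch (osd.12 is just an example id; `ceph daemon` has to be run on the node hosting that OSD, and the compaction itself costs CPU and I/O):

```
# Online RocksDB compaction via the admin socket:
ceph daemon osd.12 compact

# RocksDB perf counters before/after, to see compaction and read latencies:
ceph daemon osd.12 perf dump rocksdb
```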
[ceph-users] Re: ceph-osd performance on ram disk
On 10/09/2020 19:37, Mark Nelson wrote: On 9/10/20 11:03 AM, George Shuklin wrote: ... Are there any knobs to tweak to see higher performance for ceph-osd? I'm pretty sure it's not any kind of leveling, GC or other 'iops-related' issue (brd has performance two orders of magnitude higher).

So as you've seen, Ceph does a lot more than just write a chunk of data out to a block on disk. There's tons of encoding/decoding happening, crc checksums, crush calculations, onode lookups, write-ahead-logging, and other work involved that all adds latency. You can overcome some of that through parallelism, but 30K IOPs per OSD is probably pretty on-point for a nautilus era OSD. For octopus+ the cache refactor in bluestore should get you farther (40-50k+ for an OSD in isolation). The maximum performance we've seen in-house is around 70-80K IOPs on a single OSD using very fast NVMe and highly tuned settings. A couple of things you can try:

- upgrade to octopus+ for the cache refactor
- Make sure you are using the equivalent of the latency-performance or latency-network tuned profile. The most important part is disabling CPU cstate transitions.
- increase osd_memory_target if you have a larger dataset (onode cache misses in bluestore add a lot of latency)
- enable turbo if it's disabled (higher clock speed generally helps)

On the write path you are correct that there is a limitation regarding a single kv sync thread. Over the years we've made this less of a bottleneck but it's possible you still could be hitting it. In our test lab we've managed to utilize up to around 12-14 cores on a single OSD in isolation with 16 tp_osd_tp worker threads and on a larger cluster about 6-7 cores per OSD. There are probably multiple factors at play, including context switching, cache thrashing, memory throughput, object creation/destruction, etc. If you decide to look into it further you may want to try wallclock profiling the OSD under load and seeing where it is spending its time.

Thank you for the feedback. I forgot to mention this: it's Octopus, a fresh installation. I've disabled CSTATE (governor=performance), it makes no difference - same iops, same CPU use by ceph-osd. I just can't force Ceph to consume more than 330% of CPU. I can force reads up to 150k IOPS (both network and local), hitting the CPU limit, but write is somewhat restricted by ceph itself. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
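For reference, Mark's tuning suggestions translate roughly into the following commands. This is only a sketch: the memory target value is an arbitrary example, and tuned/cpupower availability depends on the distribution.

```
# Low-latency host profile (keeps CPUs out of deep C-states):
tuned-adm profile latency-performance

# Larger bluestore onode cache; 8 GiB is an example, size it to your RAM:
ceph config set osd osd_memory_target 8589934592

# Verify C-states and clock behaviour:
cpupower idle-info
cpupower frequency-info
```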
[ceph-users] Re: cephadm - How to deploy ceph cluster with a partition on SSD for block.db
On Tue, Sep 08, 2020 at 07:14:16AM -, kle...@psi-net.si wrote: I found out that it's already possible to specify a storage path in the OSD service specification yaml. It works for data_devices, but unfortunately not for db_devices and wal_devices, at least not in my case.

Aside from the question whether db/wal/journal_devices should accept paths as a filter, I'd like to point out that partitions are only a valid argument when calling `ceph-volume lvm prepare/create`. OSD service specs are quite tightly coupled to the batch subcommand, which has no support for partitions. The batch subcommand will soon gain support for handling logical volumes too. I'll explore if we can extend osd service specs accordingly. Until then I'm afraid you're stuck using the create or prepare subcommand for "uncommon" deployments like this (db devices collocated with the root device).

service_type: osd
service_id: osd_spec_default
placement:
  host_pattern: '*'
data_devices:
  paths:
    - /dev/vdb1
db_devices:
  paths:
    - /dev/vdb2
wal_devices:
  paths:
    - /dev/vdb3

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
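For anyone stuck with this, the manual route mentioned above looks roughly like the following. A sketch only: the device names follow the spec in the message, and under cephadm the resulting OSD still needs to be adopted or otherwise brought under management.

```
# Prepare one OSD by hand with data/db/wal on existing partitions:
ceph-volume lvm prepare --bluestore \
    --data /dev/vdb1 \
    --block.db /dev/vdb2 \
    --block.wal /dev/vdb3

# Then activate it (systemd units are created for the new OSD):
ceph-volume lvm activate --all
```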
[ceph-users] Problem unusable after deleting pool with a billion objects
Hi all, I have build testing cluster with 4 hosts, 1 SSD's and 11 HDD on each host. Running ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable) on Ubuntu. Because we want to save small size object, I set bluestore_min_alloc_size 8192 (it is maybe important in this case) I have filled it through rados gw with approx billion of small objects. After tests I changed min_alloc_size back and deleted rados pools (to emtpy whole cluster) and I was waiting till cluster deletes data from OSD's, but that destabilized the cluster. I never reached health OK. OSD's were killed in random order. I can start them back but they will again get out from cluster with.. ``` -18> 2020-09-05 22:11:19.430 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282 -17> 2020-09-05 22:11:19.430 7f7a3ee40700 5 bluestore.MempoolThread(0x555a9d0efb70) _trim_shards cache_size: 1932735282 kv_alloc: 1644167168 kv_used: 1644135504 meta_alloc: 142606336 meta_used: 143595 data_alloc: 142606336 data_used: 98304 -16> 2020-09-05 22:11:20.434 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064941056 unmapped: 8126464 heap: 2073067520 old mem: 1932735282 new mem: 1932735282 -15> 2020-09-05 22:11:21.434 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282 -14> 2020-09-05 22:11:22.258 7f7a2b81f700 5 osd.42 103257 heartbeat osd_stat(store_statfs(0x1ce1829/0x2d08c/0x1d18000, data 0x23143355/0x974a, compress 0x0/0x0/0x0, omap 0x1f11e, meta 0x2d08a0ee2), peers [3,4,6,7,8,11,12,13,14,16,17,18,19,21,23,24,25,27,28,29,31,32,33,34,41,43] op hist []) -13> 2020-09-05 22:11:22.438 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282 -12> 2020-09-05 22:11:23.442 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282 -11> 2020-09-05 22:11:24.442 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064285696 unmapped: 8781824 heap: 2073067520 old mem: 1932735282 new mem: 1932735282 -10> 2020-09-05 22:11:24.442 7f7a3ee40700 5 bluestore.MempoolThread(0x555a9d0efb70) _trim_shards cache_size: 1932735282 kv_alloc: 1644167168 kv_used: 1644119840 meta_alloc: 142606336 meta_used: 143595 data_alloc: 142606336 data_used: 98304 -9> 2020-09-05 22:11:24.442 7f7a2e024700 0 bluestore(/var/lib/ceph/osd/ceph-42) log_latency_fn slow operation observed for _collection_list, latency = 151.113s, lat = 2m cid =5.47_head start #5:e2000# end #MAX# max 2147483647 -8> 2020-09-05 22:11:24.446 7f7a2e024700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f7a2e024700' had timed out after 15 -7> 2020-09-05 22:11:24.446 7f7a2e024700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f7a2e024700' had suicide timed out after 150 -6> 2020-09-05 22:11:24.446 7f7a4c2a4700 10 monclient: get_auth_request con 0x555b15d07680 auth_method 0 -5> 2020-09-05 22:11:24.446 7f7a3c494700 2 osd.42 103257 ms_handle_reset con 0x555b15963600 session 0x555a9f9d6d00 -4> 2020-09-05 22:11:24.446 7f7a3c494700 2 osd.42 103257 ms_handle_reset con 0x555b15961b00 session 0x555a9f9d7980 -3> 2020-09-05 22:11:24.446 7f7a3c494700 2 osd.42 103257 ms_handle_reset con 0x555b15963a80 session 0x555a9f9d6a80 -2> 2020-09-05 22:11:24.446 7f7a3c494700 2 
osd.42 103257 ms_handle_reset con 0x555b15960480 session 0x555a9f9d6f80 -1> 2020-09-05 22:11:24.446 7f7a3c494700 3 osd.42 103257 handle_osd_map epochs [103258,103259], i have 103257, src has [83902,103259] 0> 2020-09-05 22:11:24.450 7f7a2e024700 -1 *** Caught signal (Aborted) ** ``` I have approx 12 OSD's down with this error. I decided to wipe problematic OSD's so I cannot debug it, but I'm curious what I did wrong (deleting pool with many small data?) or what to do next time. I did that before but not with bilion object and without bluestore_min_alloc_size change, and it worked without problems. With regards Jan Pekar -- Ing. Jan Pekař jan.pe...@imatic.cz Imatic | Jagellonská 14 | Praha 3 | 130 00 http://www.imatic.cz | +420326555326 -- ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Problem unusable after deleting pool with a billion objects
Hi Jan, most likely this is a known issue with the slow and ineffective pool removal procedure in Ceph. I gave a presentation on the topic at yesterday's weekly performance meeting; presumably a recording will be available in a couple of days. An additional accompanying issue not covered during this meeting is RocksDB's misbehavior after (or during) such massive removals. At some point it starts to slow down the handling of read operations (e.g. collection listing), which results in OSD suicide timeouts. Exactly what is observed in your case. There were multiple discussions of this issue in this mailing list too. In short, the current workaround is to perform manual DB compaction using ceph-kvstore-tool. Pool removal will most likely proceed afterwards, hence one might face similar assertions after a while. So there might be a need for multiple "compaction-restart" iterations until the pool is finally removed. And yet another potential issue (or at least an additional factor) with your setup is a pretty high DB vs. Main devices ratio (1:11). Delete procedures from multiple OSDs result in a pretty high load on the DB volume, which becomes overburdened... Thanks, Igor

On 9/11/2020 3:00 PM, Jan Pekař - Imatic wrote: Hi all, I have build testing cluster with 4 hosts, 1 SSD's and 11 HDD on each host. Running ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable) on Ubuntu. Because we want to save small size object, I set bluestore_min_alloc_size 8192 (it is maybe important in this case) I have filled it through rados gw with approx billion of small objects. After tests I changed min_alloc_size back and deleted rados pools (to emtpy whole cluster) and I was waiting till cluster deletes data from OSD's, but that destabilized the cluster. I never reached health OK. OSD's were killed in random order. I can start them back but they will again get out from cluster with..
``` -18> 2020-09-05 22:11:19.430 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282 -17> 2020-09-05 22:11:19.430 7f7a3ee40700 5 bluestore.MempoolThread(0x555a9d0efb70) _trim_shards cache_size: 1932735282 kv_alloc: 1644167168 kv_used: 1644135504 meta_alloc: 142606336 meta_used: 143595 data_alloc: 142606336 data_used: 98304 -16> 2020-09-05 22:11:20.434 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064941056 unmapped: 8126464 heap: 2073067520 old mem: 1932735282 new mem: 1932735282 -15> 2020-09-05 22:11:21.434 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282 -14> 2020-09-05 22:11:22.258 7f7a2b81f700 5 osd.42 103257 heartbeat osd_stat(store_statfs(0x1ce1829/0x2d08c/0x1d18000, data 0x23143355/0x974a, compress 0x0/0x0/0x0, omap 0x1f11e, meta 0x2d08a0ee2), peers [3,4,6,7,8,11,12,13,14,16,17,18,19,21,23,24,25,27,28,29,31,32,33,34,41,43] op hist []) -13> 2020-09-05 22:11:22.438 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282 -12> 2020-09-05 22:11:23.442 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282 -11> 2020-09-05 22:11:24.442 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064285696 unmapped: 8781824 heap: 2073067520 old mem: 1932735282 new mem: 1932735282 -10> 2020-09-05 22:11:24.442 7f7a3ee40700 5 bluestore.MempoolThread(0x555a9d0efb70) _trim_shards cache_size: 1932735282 kv_alloc: 1644167168 kv_used: 1644119840 meta_alloc: 142606336 meta_used: 143595 data_alloc: 142606336 data_used: 98304 -9> 2020-09-05 22:11:24.442 7f7a2e024700 0 bluestore(/var/lib/ceph/osd/ceph-42) log_latency_fn slow operation observed for _collection_list, latency = 151.113s, lat = 2m cid =5.47_head start #5:e2000# end #MAX# max 2147483647 -8> 2020-09-05 22:11:24.446 7f7a2e024700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f7a2e024700' had timed out after 15 -7> 2020-09-05 22:11:24.446 7f7a2e024700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f7a2e024700' had suicide timed out after 150 -6> 2020-09-05 22:11:24.446 7f7a4c2a4700 10 monclient: get_auth_request con 0x555b15d07680 auth_method 0 -5> 2020-09-05 22:11:24.446 7f7a3c494700 2 osd.42 103257 ms_handle_reset con 0x555b15963600 session 0x555a9f9d6d00 -4> 2020-09-05 22:11:24.446 7f7a3c494700 2 osd.42 103257 ms_handle_reset con 0x555b15961b00 session 0x555a9f9d7980 -3> 2020-09-05 22:11:24.446 7f7a3c494700 2 osd.42 103257 ms_handle_reset con 0x555b15963a80 session 0x555a9f9d6a80 -2> 2020-09-05 22:11:24.446 7f7a3c494700 2 osd.42 103257 ms_handle_reset con 0x555b15960480 ses
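A concrete version of the workaround Igor describes above is to stop the affected OSD, compact its RocksDB offline, then start it again. A sketch only: osd.42 is taken from the log above, and the path assumes a default data directory layout.

```
systemctl stop ceph-osd@42

# Offline compaction of the OSD's RocksDB; this can take a while on HDD:
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-42 compact

systemctl start ceph-osd@42
# Repeat per affected OSD; several compact/restart rounds may be needed
# until the pool deletion finally completes.
```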
[ceph-users] Is it possible to assign osd id numbers?
Hello, I have been wondering for quite some time whether or not it is possible to influence the osd.id numbers that are assigned during an install. I have made an attempt to keep our osds in order over the last few years, but it is a losing battle without having some control over the osd assignment. I am currently using ceph-deploy to handle adding nodes to the cluster. Thanks in advance, Shain Shain Miley | Director of Platform and Infrastructure | Digital Media | smi...@npr.org ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Problem unusable after deleting pool with a billion objects
Hi Igor, thank you, I also think that it is the problem you described. I recreated OSD's now and also noticed strange warnings - HEALTH_WARN Degraded data redundancy: 106763/723 objects degraded (14766.667%) Maybe there are some "phantom", zero sized objects (OMAPs?), that cluster is recovering, but I don't need them (are not listed in ceph df). You mentioned DB vs. Main devices ratio (1:11) - I'm not separating DB from device - each device has it's own RockDB on it. With regards Jan Pekar On 11/09/2020 14.36, Igor Fedotov wrote: Hi Jan, most likely this is a known issue with slow and ineffective pool removal procedure in Ceph. I did some presentation on the topic at yesterday's weekly performance meeting, presumably a recording will be available in a couple of days. An additional accompanying issue not covered during this meeting is RocksDB's misbehavior after (or during) such massive removals. At some point it starts to slow down reading operations handling (e.g. collection listing) which results in OSD suicide timeouts. Exactly what is observed in your case. There were multiple discussion on this issue in this mailing list too. In short the currect workaround is to perform manual DB compaction using ceph-kvstore-tool. Pool removal will most likely to proceed hence one might face similar assertions after a while. Hence there might be a need for multiple "compaction-restart" iterations until pool is finally removed. And yet another potential issue (or at least an additional factor) with your setup is a pretty high DB vs. Main devices ratio (1:11). Deleting procedures from multiple OSDs result in a pretty highload on DB volume which becomes overburdened... Thanks, Igor On 9/11/2020 3:00 PM, Jan Pekař - Imatic wrote: Hi all, I have build testing cluster with 4 hosts, 1 SSD's and 11 HDD on each host. Running ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable) on Ubuntu. Because we want to save small size object, I set bluestore_min_alloc_size 8192 (it is maybe important in this case) I have filled it through rados gw with approx billion of small objects. After tests I changed min_alloc_size back and deleted rados pools (to emtpy whole cluster) and I was waiting till cluster deletes data from OSD's, but that destabilized the cluster. I never reached health OK. OSD's were killed in random order. I can start them back but they will again get out from cluster with.. 
``` -18> 2020-09-05 22:11:19.430 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282 -17> 2020-09-05 22:11:19.430 7f7a3ee40700 5 bluestore.MempoolThread(0x555a9d0efb70) _trim_shards cache_size: 1932735282 kv_alloc: 1644167168 kv_used: 1644135504 meta_alloc: 142606336 meta_used: 143595 data_alloc: 142606336 data_used: 98304 -16> 2020-09-05 22:11:20.434 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064941056 unmapped: 8126464 heap: 2073067520 old mem: 1932735282 new mem: 1932735282 -15> 2020-09-05 22:11:21.434 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282 -14> 2020-09-05 22:11:22.258 7f7a2b81f700 5 osd.42 103257 heartbeat osd_stat(store_statfs(0x1ce1829/0x2d08c/0x1d18000, data 0x23143355/0x974a, compress 0x0/0x0/0x0, omap 0x1f11e, meta 0x2d08a0ee2), peers [3,4,6,7,8,11,12,13,14,16,17,18,19,21,23,24,25,27,28,29,31,32,33,34,41,43] op hist []) -13> 2020-09-05 22:11:22.438 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282 -12> 2020-09-05 22:11:23.442 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282 -11> 2020-09-05 22:11:24.442 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064285696 unmapped: 8781824 heap: 2073067520 old mem: 1932735282 new mem: 1932735282 -10> 2020-09-05 22:11:24.442 7f7a3ee40700 5 bluestore.MempoolThread(0x555a9d0efb70) _trim_shards cache_size: 1932735282 kv_alloc: 1644167168 kv_used: 1644119840 meta_alloc: 142606336 meta_used: 143595 data_alloc: 142606336 data_used: 98304 -9> 2020-09-05 22:11:24.442 7f7a2e024700 0 bluestore(/var/lib/ceph/osd/ceph-42) log_latency_fn slow operation observed for _collection_list, latency = 151.113s, lat = 2m cid =5.47_head start #5:e2000# end #MAX# max 2147483647 -8> 2020-09-05 22:11:24.446 7f7a2e024700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f7a2e024700' had timed out after 15 -7> 2020-09-05 22:11:24.446 7f7a2e024700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f7a2e024700' had suicide timed out after 150 -6> 2020-09-05 22:11:24.446 7f7a4c2a4700 10 monclient:
[ceph-users] Re: Problem unusable after deleting pool with a billion objects
Jan, please see inline On 9/11/2020 4:13 PM, Jan Pekař - Imatic wrote: Hi Igor, thank you, I also think that it is the problem you described. I recreated OSD's now and also noticed strange warnings - HEALTH_WARN Degraded data redundancy: 106763/723 objects degraded (14766.667%) Maybe there are some "phantom", zero sized objects (OMAPs?), that cluster is recovering, but I don't need them (are not listed in ceph df). The above look pretty weird but I don't know what's happening here... You mentioned DB vs. Main devices ratio (1:11) - I'm not separating DB from device - each device has it's own RockDB on it. Are you saying that DB is colocated with main data and resides on HDD? If so this is another significant (or may be the major) trigger for the issue. RocksDB + HDD is a bad pair for high load DB operation handling which bulk pool removal is. With regards Jan Pekar On 11/09/2020 14.36, Igor Fedotov wrote: Hi Jan, most likely this is a known issue with slow and ineffective pool removal procedure in Ceph. I did some presentation on the topic at yesterday's weekly performance meeting, presumably a recording will be available in a couple of days. An additional accompanying issue not covered during this meeting is RocksDB's misbehavior after (or during) such massive removals. At some point it starts to slow down reading operations handling (e.g. collection listing) which results in OSD suicide timeouts. Exactly what is observed in your case. There were multiple discussion on this issue in this mailing list too. In short the currect workaround is to perform manual DB compaction using ceph-kvstore-tool. Pool removal will most likely to proceed hence one might face similar assertions after a while. Hence there might be a need for multiple "compaction-restart" iterations until pool is finally removed. And yet another potential issue (or at least an additional factor) with your setup is a pretty high DB vs. Main devices ratio (1:11). Deleting procedures from multiple OSDs result in a pretty highload on DB volume which becomes overburdened... Thanks, Igor On 9/11/2020 3:00 PM, Jan Pekař - Imatic wrote: Hi all, I have build testing cluster with 4 hosts, 1 SSD's and 11 HDD on each host. Running ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable) on Ubuntu. Because we want to save small size object, I set bluestore_min_alloc_size 8192 (it is maybe important in this case) I have filled it through rados gw with approx billion of small objects. After tests I changed min_alloc_size back and deleted rados pools (to emtpy whole cluster) and I was waiting till cluster deletes data from OSD's, but that destabilized the cluster. I never reached health OK. OSD's were killed in random order. I can start them back but they will again get out from cluster with.. 
``` -18> 2020-09-05 22:11:19.430 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282 -17> 2020-09-05 22:11:19.430 7f7a3ee40700 5 bluestore.MempoolThread(0x555a9d0efb70) _trim_shards cache_size: 1932735282 kv_alloc: 1644167168 kv_used: 1644135504 meta_alloc: 142606336 meta_used: 143595 data_alloc: 142606336 data_used: 98304 -16> 2020-09-05 22:11:20.434 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064941056 unmapped: 8126464 heap: 2073067520 old mem: 1932735282 new mem: 1932735282 -15> 2020-09-05 22:11:21.434 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282 -14> 2020-09-05 22:11:22.258 7f7a2b81f700 5 osd.42 103257 heartbeat osd_stat(store_statfs(0x1ce1829/0x2d08c/0x1d18000, data 0x23143355/0x974a, compress 0x0/0x0/0x0, omap 0x1f11e, meta 0x2d08a0ee2), peers [3,4,6,7,8,11,12,13,14,16,17,18,19,21,23,24,25,27,28,29,31,32,33,34,41,43] op hist []) -13> 2020-09-05 22:11:22.438 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282 -12> 2020-09-05 22:11:23.442 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 2073067520 old mem: 1932735282 new mem: 1932735282 -11> 2020-09-05 22:11:24.442 7f7a3ee40700 5 prioritycache tune_memory target: 3221225472 mapped: 2064285696 unmapped: 8781824 heap: 2073067520 old mem: 1932735282 new mem: 1932735282 -10> 2020-09-05 22:11:24.442 7f7a3ee40700 5 bluestore.MempoolThread(0x555a9d0efb70) _trim_shards cache_size: 1932735282 kv_alloc: 1644167168 kv_used: 1644119840 meta_alloc: 142606336 meta_used: 143595 data_alloc: 142606336 data_used: 98304 -9> 2020-09-05 22:11:24.442 7f7a2e024700 0 bluestore(/var/lib/ceph/osd/ceph-42) log_latency_fn slow operation observed for _collection_list, latency
[ceph-users] Re: Is it possible to assign osd id numbers?
On 11/09/2020 16:11, Shain Miley wrote: Hello, I have been wondering for quite some time whether or not it is possible to influence the osd.id numbers that are assigned during an install. I have made an attempt to keep our osds in order over the last few years, but it is a losing battle without having some control over the osd assignment. I am currently using ceph-deploy to handle adding nodes to the cluster.

You can reuse osd numbers, but I strongly advise you not to focus on precise IDs. The reason is that you can have such a combination of server faults that IDs get swapped no matter what. It's a false sense of beauty to have the ID of the OSD match the ID in the name of the server.

How to reuse osd nums? The OSD number is used (and should be cleaned up if an OSD dies) in three places in Ceph:

1) Crush map: ceph osd crush rm osd.x
2) osd list: ceph osd rm osd.x
3) auth: ceph auth rm osd.x

The last one is often forgotten and is a usual reason for ceph-ansible to fail on a new disk in the server. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Issues with the ceph-bluestore-tool during cluster upgrade from Mimic to Nautilus
Could you please run: CEPH_ARGS="--log-file log --debug-asok 5" ceph-bluestore-tool repair --path <...> ; cat log | grep asok > out and share 'out' file. Thanks, Igor On 9/11/2020 5:15 PM, Jean-Philippe Méthot wrote: Hi, We’re upgrading our cluster OSD node per OSD node to Nautilus from Mimic. From some release notes, it was recommended to run the following command to fix stats after an upgrade : ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0 However, running that command gives us the following error message: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: In function 'virtual Allocator::SocketHook::~SocketHook()' thread 7f1a6467eec0 time 2020-09-10 14:40:25.872353 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: 53 : FAILED ceph_assert(r == 0) ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7f1a5a823025] 2: (()+0x25c1ed) [0x7f1a5a8231ed] 3: (()+0x3c7a4f) [0x55b33537ca4f] 4: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517] 5: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082] 6: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x55b335274528] 7: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1) [0x55b3352749a1] 8: (main()+0x10b3) [0x55b335187493] 9: (__libc_start_main()+0xf5) [0x7f1a574aa555] 10: (()+0x1f9b5f) [0x55b3351aeb5f] 2020-09-10 14:40:25.873 7f1a6467eec0 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: In function 'virtual Allocator::SocketHook::~SocketHook()' thread 7f1a6467eec0 time 2020-09-10 14:40:25.872353 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: 53: FAILED ceph_assert(r == 0) ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7f1a5a823025] 2: (()+0x25c1ed) [0x7f1a5a8231ed] 3: (()+0x3c7a4f) [0x55b33537ca4f] 4: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517] 5: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082] 6: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x55b335274528] 7: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1) [0x55b3352749a1] 8: (main()+0x10b3) [0x55b335187493] 9: (__libc_start_main()+0xf5) [0x7f1a574aa555] 10: (()+0x1f9b5f) [0x55b3351aeb5f] *** Caught signal (Aborted) ** in thread 7f1a6467eec0 thread_name:ceph-bluestore- ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable) 1: (()+0xf630) [0x7f1a58cf0630] 2: (gsignal()+0x37) [0x7f1a574be387] 3: (abort()+0x148) [0x7f1a574bfa78] 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x7f1a5a823074] 5: (()+0x25c1ed) [0x7f1a5a8231ed] 6: (()+0x3c7a4f) [0x55b33537ca4f] 7: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517] 8: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082] 9: (BlueStore::_close_db_and_around(bool)+0x2f8) 
[0x55b335274528] 10: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1) [0x55b3352749a1] 11: (main()+0x10b3) [0x55b335187493] 12: (__libc_start_main()+0xf5) [0x7f1a574aa555] 13: (()+0x1f9b5f) [0x55b3351aeb5f] 2020-09-10 14:40:25.874 7f1a6467eec0 -1 *** Caught signal (Aborted) ** in thread 7f1a6467eec0 thread_name:ceph-bluestore- ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable) 1: (()+0xf630) [0x7f1a58cf0630] 2: (gsignal()+0x37) [0x7f1a574be387] 3: (abort()+0x148) [0x7f1a574bfa78] 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x7f1a5a823074] 5: (()+0x25c1ed) [0x7f1a5a8231ed] 6: (()+0x3c7a4f) [0x55b33537ca4f] 7: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517] 8: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082] 9: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x55b335274528] 10: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1) [0x55b3352749a1] 11: (main()+0x10b3) [0x55b335187493] 12: (__libc_start_main()+0xf5) [0x7f1a574aa555] 13: (()+0x1f9b5f) [0x55b3351aeb5f] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. What could be the source of this error? I haven’t found much of anything about it online. Jean-Philippe Méthot Senior Openstack system administrator Administrateur systè
[ceph-users] Re: ceph-osd performance on ram disk
On 9/11/20 4:15 AM, George Shuklin wrote: On 10/09/2020 19:37, Mark Nelson wrote: On 9/10/20 11:03 AM, George Shuklin wrote: ... Are there any knobs to tweak to see higher performance for ceph-osd? I'm pretty sure it's not any kind of leveling, GC or other 'iops-related' issues (brd has performance of two order of magnitude higher). So as you've seen, Ceph does a lot more than just write a chunk of data out to a block on disk. There's tons of encoding/decoding happening, crc checksums, crush calculations, onode lookups, write-ahead-logging, and other work involved that all adds latency. You can overcome some of that through parallelism, but 30K IOPs per OSD is probably pretty on-point for a nautilus era OSD. For octopus+ the cache refactor in bluestore should get you farther (40-50k+ for and OSD in isolation). The maximum performance we've seen in-house is around 70-80K IOPs on a single OSD using very fast NVMe and highly tuned settings. A couple of things you can try: - upgrade to octopus+ for the cache refactor - Make sure you are using the equivalent of the latency-performance or latency-network tuned profile. The most important part is disabling CPU cstate transitions. - increase osd_memory_target if you have a larger dataset (onode cache misses in bluestore add a lot of latency) - enable turbo if it's disabled (higher clock speed generally helps) On the write path you are correct that there is a limitation regarding a single kv sync thread. Over the years we've made this less of a bottleneck but it's possible you still could be hitting it. In our test lab we've managed to utilize up to around 12-14 cores on a single OSD in isolation with 16 tp_osd_tp worker threads and on a larger cluster about 6-7 cores per OSD. There's probably multiple factors at play, including context switching, cache thrashing, memory throughput, object creation/destruction, etc. If you decide to look into it further you may want to try wallclock profiling the OSD under load and seeing where it is spending its time. Thank you for feedback. I forgot to mention this, it's Octopus, fresh installation. I've disabled CSTATE (governor=performance), it make no difference - same iops, same CPU use by ceph-osd I've just can't force Ceph to consume more than 330% of CPU. I can force read up to 150k IOPS (both network and local), hitting CPU limit, but write is somewhat restricted by ceph itself. Ok, can I assume block/db/wal are all on the ramdisk? I'd start a benchmark and attach gdbpmp to the OSD and see if you can get a callgraph (1000 samples is nice if you don't mind waiting a bit). That will tell us a lot more about where the code is spending time. It will slow the benchmark way down fwiw. Some other things you could try: Try to tweak the number of osd worker threads to better match the number of cores in your system. Too many and you end up with context switching. Too few and you limit parallelism. You can also check rocksdb compaction stats in the osd logs using this tool: https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py Given that you are on ramdisk the 1GB default WAL limit should be plenty to let you avoid WAL throttling during compaction, but just verifying that compactions are not taking a long time is good peace of mind. Mark ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: ceph rbox test on passive compressed pool
On 09/11 09:36, Marc Roos wrote: > > Hi David, > > Just to let you know, this hint is being set, what is the reason for > ceph of doing only half the objects? Can it be that there is some issue > with my osd's? Like some maybe have an old fs (still using disk not > volume)? Is this still to be expected or does ceph under pressure drop > compressing? > > https://github.com/ceph-dovecot/dovecot-ceph-plugin/blob/56d6c900cc9ec07dfb98ef2abac07aae466b7610/src/librmb/rados-storage-impl.cpp#L75 I was trying to look into this a bit :), can you give me more info about the OSDs that you are using? What filesystem are they on? Cheers! > > Thanks, > Marc > > > > -Original Message- > Cc: jan.radon > Subject: Re: [ceph-users] ceph rbox test on passive compressed pool > > The hints have to be given from the client side as far as I understand, > can you share the client code too? > > Also,not seems that there's no guarantees that it will actually do > anything (best effort I guess): > https://docs.ceph.com/docs/mimic/rados/api/librados/#c.rados_set_alloc_hint > > Cheers > > > On 6 September 2020 15:59:01 BST, Marc Roos > wrote: > > > > I have been inserting 10790 exactly the same 64kb text message to a > > passive compressing enabled pool. I am still counting, but it looks > like > only half the objects are compressed. > > mail/b08c3218dbf1545ff43052412a8e mtime 2020-09-06 > 16:27:39.00, > size 63580 > mail/00f6043775f1545ff43052412a8e mtime 2020-09-06 > 16:25:57.00, > size 525 > mail/b875f40571f1545ff43052412a8e mtime 2020-09-06 > 16:25:53.00, > size 63580 > mail/e87c120b19f1545ff43052412a8e mtime 2020-09-06 > 16:24:25.00, > size 525 > > I am not sure if this should be expected from passive, these > docs[1] > hint that passive 'compress if hinted COMPRESSIBLE'. From that I > would > conclude that all text messages should be compressed. > A previous test with a 64kb gzip attachment seemed to not compress, > > although I did not look at all object sizes. > > > > on 14.2.11 > > [1] > https://documentation.suse.com/ses/5.5/html/ses-all/ceph-pools.html > #sec-ceph-pool-compression > https://docs.ceph.com/docs/mimic/rados/operations/pools/ > > > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > > > -- > Sent from my Android device with K-9 Mail. Please excuse my brevity. > > -- David Caro ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: ceph rbox test on passive compressed pool
It is a hdd pool, all bluestore, configured with ceph-disk. Upgrades seem not to have 'updated' bluefs, some osds report like this: { "/dev/sdb2": { "osd_uuid": "xxx", "size": 4000681103360, "btime": "2019-01-08 13:45:59.488533", "description": "main", "bluefs": "1", "ceph_fsid": "x", "kv_backend": "rocksdb", "magic": "ceph osd volume v026", "mkfs_done": "yes", "ready": "ready", "require_osd_release": "14", "whoami": "3" } } And some like this: { "/dev/sdh2": { "osd_uuid": "xxx", "size": 3000487051264, "btime": "2017-07-14 14:45:59.212792", "description": "main", "require_osd_release": "14" } } -Original Message- Cc: ceph-users Subject: Re: [ceph-users] ceph rbox test on passive compressed pool On 09/11 09:36, Marc Roos wrote: > > Hi David, > > Just to let you know, this hint is being set, what is the reason for > ceph of doing only half the objects? Can it be that there is some > issue with my osd's? Like some maybe have an old fs (still using disk > not volume)? Is this still to be expected or does ceph under pressure > drop compressing? > > https://github.com/ceph-dovecot/dovecot-ceph-plugin/blob/56d6c900cc9ec > 07dfb98ef2abac07aae466b7610/src/librmb/rados-storage-impl.cpp#L75 I was trying to look into this a bit :), can you give me more info about the OSDs that you are using? What filesystem are they on? Cheers! > > Thanks, > Marc > > > > -Original Message- > Cc: jan.radon > Subject: Re: [ceph-users] ceph rbox test on passive compressed pool > > The hints have to be given from the client side as far as I > understand, can you share the client code too? > > Also,not seems that there's no guarantees that it will actually do > anything (best effort I guess): > https://docs.ceph.com/docs/mimic/rados/api/librados/#c.rados_set_alloc > _hint > > Cheers > > > On 6 September 2020 15:59:01 BST, Marc Roos > wrote: > > > > I have been inserting 10790 exactly the same 64kb text message to a > > passive compressing enabled pool. I am still counting, but it looks > like > only half the objects are compressed. > > mail/b08c3218dbf1545ff43052412a8e mtime 2020-09-06 > 16:27:39.00, > size 63580 > mail/00f6043775f1545ff43052412a8e mtime 2020-09-06 > 16:25:57.00, > size 525 > mail/b875f40571f1545ff43052412a8e mtime 2020-09-06 > 16:25:53.00, > size 63580 > mail/e87c120b19f1545ff43052412a8e mtime 2020-09-06 > 16:24:25.00, > size 525 > > I am not sure if this should be expected from passive, these docs[1] > hint that passive 'compress if hinted COMPRESSIBLE'. From that I > would > conclude that all text messages should be compressed. > A previous test with a 64kb gzip attachment seemed to not compress, > > although I did not look at all object sizes. > > > > on 14.2.11 > > [1] > https://documentation.suse.com/ses/5.5/html/ses-all/ceph-pools.html > #sec-ceph-pool-compression > https://docs.ceph.com/docs/mimic/rados/operations/pools/ > > > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > > > -- > Sent from my Android device with K-9 Mail. Please excuse my brevity. > > -- David Caro ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
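One way to narrow this down is to map one of the uncompressed objects and one of the compressed ones to their OSDs and compare the bluestore compression counters of old-style vs. newer OSDs. A sketch only: the pool/object names are taken from the earlier listing, osd.3 is one of the OSDs labeled above, and `ceph daemon` has to run on the host of that OSD.

```
# Which OSDs hold the 63580-byte object vs. the 525-byte one?
ceph osd map mail b08c3218dbf1545ff43052412a8e
ceph osd map mail 00f6043775f1545ff43052412a8e

# Compare compression activity per OSD (repeat for an OSD whose label
# lacks the bluefs/kv_backend keys):
ceph daemon osd.3 perf dump | grep -i compress
```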
[ceph-users] Issues with the ceph-bluestore-tool during cluster upgrade from Mimic to Nautilus
Hi, We’re upgrading our cluster OSD node per OSD node to Nautilus from Mimic. From some release notes, it was recommended to run the following command to fix stats after an upgrade : ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0 However, running that command gives us the following error message: > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: > In > function 'virtual Allocator::SocketHook::~SocketHook()' thread 7f1a6467eec0 > time 2020-09-10 14:40:25.872353 > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: > 53 > : FAILED ceph_assert(r == 0) > ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus > (stable) > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x14a) [0x7f1a5a823025] > 2: (()+0x25c1ed) [0x7f1a5a8231ed] > 3: (()+0x3c7a4f) [0x55b33537ca4f] > 4: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517] > 5: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082] > 6: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x55b335274528] > 7: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1) [0x55b3352749a1] > 8: (main()+0x10b3) [0x55b335187493] > 9: (__libc_start_main()+0xf5) [0x7f1a574aa555] > 10: (()+0x1f9b5f) [0x55b3351aeb5f] > 2020-09-10 14:40:25.873 7f1a6467eec0 -1 > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: > In function 'virtual Allocator::SocketHook::~SocketHook()' thread > 7f1a6467eec0 time 2020-09-10 14:40:25.872353 > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: > 53: FAILED ceph_assert(r == 0) > > ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus > (stable) > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x14a) [0x7f1a5a823025] > 2: (()+0x25c1ed) [0x7f1a5a8231ed] > 3: (()+0x3c7a4f) [0x55b33537ca4f] > 4: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517] > 5: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082] > 6: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x55b335274528] > 7: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1) [0x55b3352749a1] > 8: (main()+0x10b3) [0x55b335187493] > 9: (__libc_start_main()+0xf5) [0x7f1a574aa555] > 10: (()+0x1f9b5f) [0x55b3351aeb5f] > *** Caught signal (Aborted) ** > in thread 7f1a6467eec0 thread_name:ceph-bluestore- > ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus > (stable) > 1: (()+0xf630) [0x7f1a58cf0630] > 2: (gsignal()+0x37) [0x7f1a574be387] > 3: (abort()+0x148) [0x7f1a574bfa78] > 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x199) [0x7f1a5a823074] > 5: (()+0x25c1ed) [0x7f1a5a8231ed] > 6: (()+0x3c7a4f) [0x55b33537ca4f] > 7: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517] > 8: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082] > 9: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x55b335274528] > 10: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1) [0x55b3352749a1] > 11: (main()+0x10b3) 
[0x55b335187493] > 12: (__libc_start_main()+0xf5) [0x7f1a574aa555] > 13: (()+0x1f9b5f) [0x55b3351aeb5f] > 2020-09-10 14:40:25.874 7f1a6467eec0 -1 *** Caught signal (Aborted) ** > in thread 7f1a6467eec0 thread_name:ceph-bluestore- > > ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus > (stable) > 1: (()+0xf630) [0x7f1a58cf0630] > 2: (gsignal()+0x37) [0x7f1a574be387] > 3: (abort()+0x148) [0x7f1a574bfa78] > 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x199) [0x7f1a5a823074] > 5: (()+0x25c1ed) [0x7f1a5a8231ed] > 6: (()+0x3c7a4f) [0x55b33537ca4f] > 7: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517] > 8: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082] > 9: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x55b335274528] > 10: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1) [0x55b3352749a1] > 11: (main()+0x10b3) [0x55b335187493] > 12: (__libc_start_main()+0xf5) [0x7f1a574aa555] > 13: (()+0x1f9b5f) [0x55b3351aeb5f] > NOTE: a copy of the executable, or `objdump -rdS ` is needed to > interpret this. What could be the source of this error? I haven’t found much of anything about it online. Jean-Philippe Méthot Senior Openstack system administrator Administrateur système Openstack sénior PlanetHoster inc. 4414-4416 Louis B Mayer Laval, QC, H7P 0G1, Canada TEL : +1.514.802.1644 - Poste : 26
[ceph-users] Re: OSDs and tmpfs
> We have a 23 node cluster and normally when we add OSDs they end up > mounting like > this: > > /dev/sde1 3.7T 2.0T 1.8T 54% /var/lib/ceph/osd/ceph-15 > > /dev/sdj1 3.7T 2.0T 1.7T 55% /var/lib/ceph/osd/ceph-20 > > /dev/sdd1 3.7T 2.1T 1.6T 58% /var/lib/ceph/osd/ceph-14 > > /dev/sdc1 3.7T 1.8T 1.9T 49% /var/lib/ceph/osd/ceph-13 > I'm pretty sure those OSDs have been deployed with Filestore backend as the first partition of the device is the data partition and needs to be mounted. > However I noticed this morning that the 3 new servers have the OSDs > mounted like > this: > > tmpfs47G 28K 47G 1% /var/lib/ceph/osd/ceph-246 > > tmpfs47G 28K 47G 1% /var/lib/ceph/osd/ceph-240 > > tmpfs47G 28K 47G 1% /var/lib/ceph/osd/ceph-248 > > tmpfs47G 28K 47G 1% /var/lib/ceph/osd/ceph-237 > And here, it looks like those OSDs are using Bluestore backend because this backend doesn't need to mount any data partitions. What you're seeing is the Bluestore metadata in this tmpfs. You should find in the mount point some usefull information (fsid, keyring and symlinks to the data block and/or db/wal). I don't know if you're using ceph-disk or ceph-volume but you can find information about this by running either: - ceph-disk list - ceph-volume lvm list ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
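A quick way to confirm which backend each of those OSDs uses, without looking at the mounts at all (a sketch; the ids come from the df output above):

```
# Prints "filestore" for the xfs-mounted OSDs, "bluestore" for the tmpfs ones:
ceph osd metadata 15 | grep osd_objectstore
ceph osd metadata 246 | grep osd_objectstore
```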
[ceph-users] Re: OSDs and tmpfs
I have also these mounts with bluestore /dev/sde1 on /var/lib/ceph/osd/ceph-32 type xfs (rw,relatime,attr2,inode64,noquota) /dev/sdb1 on /var/lib/ceph/osd/ceph-3 type xfs (rw,relatime,attr2,inode64,noquota) /dev/sdc1 on /var/lib/ceph/osd/ceph-6 type xfs (rw,relatime,attr2,inode64,noquota) /dev/sdd1 on /var/lib/ceph/osd/ceph-8 type xfs (rw,relatime,attr2,inode64,noquota) /dev/sdj1 on /var/lib/ceph/osd/ceph-19 type xfs (rw,relatime,attr2,inode64,noquota) [@c01 ~]# ls -l /var/lib/ceph/osd/ceph-0 total 52 -rw-r--r-- 1 ceph ceph 3 Aug 24 2017 active lrwxrwxrwx 1 ceph ceph 58 Jun 30 2017 block -> /dev/disk/by-partuuid/63b970b7-2759-4eae-a66e-b84335eba598 -rw-r--r-- 1 ceph ceph 37 Jun 30 2017 block_uuid -rw-r--r-- 1 ceph ceph 2 Jun 30 2017 bluefs -rw-r--r-- 1 ceph ceph 37 Jun 30 2017 ceph_fsid -rw-r--r-- 1 ceph ceph 37 Jun 30 2017 fsid -rw--- 1 ceph ceph 56 Jun 30 2017 keyring -rw-r--r-- 1 ceph ceph 8 Jun 30 2017 kv_backend -rw-r--r-- 1 ceph ceph 21 Jun 30 2017 magic -rw-r--r-- 1 ceph ceph 4 Jun 30 2017 mkfs_done -rw-r--r-- 1 ceph ceph 6 Jun 30 2017 ready -rw-r--r-- 1 ceph ceph 3 Oct 19 2019 require_osd_release -rw-r--r-- 1 ceph ceph 0 Sep 26 2019 systemd -rw-r--r-- 1 ceph ceph 10 Jun 30 2017 type -rw-r--r-- 1 ceph ceph 2 Jun 30 2017 whoami -Original Message- To: ceph-users@ceph.io Subject: [ceph-users] Re: OSDs and tmpfs > We have a 23 node cluster and normally when we add OSDs they end > up mounting like > this: > > /dev/sde1 3.7T 2.0T 1.8T 54% /var/lib/ceph/osd/ceph-15 > > /dev/sdj1 3.7T 2.0T 1.7T 55% /var/lib/ceph/osd/ceph-20 > > /dev/sdd1 3.7T 2.1T 1.6T 58% /var/lib/ceph/osd/ceph-14 > > /dev/sdc1 3.7T 1.8T 1.9T 49% /var/lib/ceph/osd/ceph-13 > I'm pretty sure those OSDs have been deployed with Filestore backend as the first partition of the device is the data partition and needs to be mounted. > However I noticed this morning that the 3 new servers have the > OSDs mounted like > this: > > tmpfs47G 28K 47G 1% /var/lib/ceph/osd/ceph-246 > > tmpfs47G 28K 47G 1% /var/lib/ceph/osd/ceph-240 > > tmpfs47G 28K 47G 1% /var/lib/ceph/osd/ceph-248 > > tmpfs47G 28K 47G 1% /var/lib/ceph/osd/ceph-237 > And here, it looks like those OSDs are using Bluestore backend because this backend doesn't need to mount any data partitions. What you're seeing is the Bluestore metadata in this tmpfs. You should find in the mount point some usefull information (fsid, keyring and symlinks to the data block and/or db/wal). I don't know if you're using ceph-disk or ceph-volume but you can find information about this by running either: - ceph-disk list - ceph-volume lvm list ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Is it possible to assign osd id numbers?
Thank you for your answer below. I'm not looking to reuse them as much as I am trying to control what unused number is actually used. For example if I have 20 osds and 2 have failed...when I replace a disk in one server I don't want it to automatically use the next lowest number for the osd assignment. I understand what you mean about not focusing on the osd ids...but my ocd is making me ask the question. Thanks, Shain On 9/11/20, 9:45 AM, "George Shuklin" wrote: On 11/09/2020 16:11, Shain Miley wrote: > Hello, > I have been wondering for quite some time whether or not it is possible to influence the osd.id numbers that are assigned during an install. > > I have made an attempt to keep our osds in order over the last few years, but it is a losing battle without having some control over the osd assignment. > > I am currently using ceph-deploy to handle adding nodes to the cluster. > You can reuse osd numbers, but I strongly advice you not to focus on precise IDs. The reason is that you can have such combination of server faults, which will swap IDs no matter what. It's a false sense of beauty to have 'ID of OSD match ID in the name of the server'. How to reuse osd nums? OSD number is used (and should be cleaned if OSD dies) in three places in Ceph: 1) Crush map: ceph osd crush rm osd.x 2) osd list: ceph osd rm osd.x 3) auth: ceph auth rm osd.x The last one is often forgoten and is a usual reason for ceph-ansible to fail on new disk in the server. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: OSDs and tmpfs
Hi together, I believe the deciding factor is whether the OSD was deployed using ceph-disk (in "ceph-volume" speak, a "simple" OSD), which means the metadata will be on a separate partition, or whether it was deployed with "ceph-volume lvm". The latter stores the metadata in LVM tags, so the extra partition is not needed anymore. Of course, since the general recommendation to move from filestore to bluestore came at a similar time as the ceph-disk deprecation, there's usually correlation between these two ingredients ;-). Cheers, Oliver Am 11.09.20 um 20:33 schrieb Marc Roos: > > > I have also these mounts with bluestore > > /dev/sde1 on /var/lib/ceph/osd/ceph-32 type xfs > (rw,relatime,attr2,inode64,noquota) > /dev/sdb1 on /var/lib/ceph/osd/ceph-3 type xfs > (rw,relatime,attr2,inode64,noquota) > /dev/sdc1 on /var/lib/ceph/osd/ceph-6 type xfs > (rw,relatime,attr2,inode64,noquota) > /dev/sdd1 on /var/lib/ceph/osd/ceph-8 type xfs > (rw,relatime,attr2,inode64,noquota) > /dev/sdj1 on /var/lib/ceph/osd/ceph-19 type xfs > (rw,relatime,attr2,inode64,noquota) > > [@c01 ~]# ls -l /var/lib/ceph/osd/ceph-0 > total 52 > -rw-r--r-- 1 ceph ceph 3 Aug 24 2017 active > lrwxrwxrwx 1 ceph ceph 58 Jun 30 2017 block -> > /dev/disk/by-partuuid/63b970b7-2759-4eae-a66e-b84335eba598 > -rw-r--r-- 1 ceph ceph 37 Jun 30 2017 block_uuid > -rw-r--r-- 1 ceph ceph 2 Jun 30 2017 bluefs > -rw-r--r-- 1 ceph ceph 37 Jun 30 2017 ceph_fsid > -rw-r--r-- 1 ceph ceph 37 Jun 30 2017 fsid > -rw--- 1 ceph ceph 56 Jun 30 2017 keyring > -rw-r--r-- 1 ceph ceph 8 Jun 30 2017 kv_backend > -rw-r--r-- 1 ceph ceph 21 Jun 30 2017 magic > -rw-r--r-- 1 ceph ceph 4 Jun 30 2017 mkfs_done > -rw-r--r-- 1 ceph ceph 6 Jun 30 2017 ready > -rw-r--r-- 1 ceph ceph 3 Oct 19 2019 require_osd_release > -rw-r--r-- 1 ceph ceph 0 Sep 26 2019 systemd > -rw-r--r-- 1 ceph ceph 10 Jun 30 2017 type > -rw-r--r-- 1 ceph ceph 2 Jun 30 2017 whoami > > > -Original Message- > To: ceph-users@ceph.io > Subject: [ceph-users] Re: OSDs and tmpfs > >> We have a 23 node cluster and normally when we add OSDs they end >> up mounting like >> this: >> >> /dev/sde1 3.7T 2.0T 1.8T 54% /var/lib/ceph/osd/ceph-15 >> >> /dev/sdj1 3.7T 2.0T 1.7T 55% /var/lib/ceph/osd/ceph-20 >> >> /dev/sdd1 3.7T 2.1T 1.6T 58% /var/lib/ceph/osd/ceph-14 >> >> /dev/sdc1 3.7T 1.8T 1.9T 49% /var/lib/ceph/osd/ceph-13 >> > > I'm pretty sure those OSDs have been deployed with Filestore backend as > the first partition of the device is the data partition and needs to be > mounted. > >> However I noticed this morning that the 3 new servers have the >> OSDs mounted like >> this: >> >> tmpfs47G 28K 47G 1% /var/lib/ceph/osd/ceph-246 >> >> tmpfs47G 28K 47G 1% /var/lib/ceph/osd/ceph-240 >> >> tmpfs47G 28K 47G 1% /var/lib/ceph/osd/ceph-248 >> >> tmpfs47G 28K 47G 1% /var/lib/ceph/osd/ceph-237 >> > > And here, it looks like those OSDs are using Bluestore backend because > this backend doesn't need to mount any data partitions. > What you're seeing is the Bluestore metadata in this tmpfs. > You should find in the mount point some usefull information (fsid, > keyring and symlinks to the data block and/or db/wal). 
> > I don't know if you're using ceph-disk or ceph-volume but you can find > information about this by running either: > - ceph-disk list > - ceph-volume lvm list > ___ > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an > email to ceph-users-le...@ceph.io > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
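To see the metadata Oliver describes directly, the LVM tags (for ceph-volume lvm OSDs) or a simple scan (for ceph-disk OSDs) can be inspected on the OSD host. A sketch:

```
# ceph-volume lvm stores the OSD metadata as LVM tags on the logical volume:
lvs -o lv_name,lv_tags | grep ceph.osd_id

# ceph-disk ("simple") OSDs keep it in the small metadata partition:
ceph-volume simple scan
```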
[ceph-users] Re: Is it possible to assign osd id numbers?
Now that's a *very* different question from numbers assigned during an install. With recent releases, instead of going down the full removal litany listed below, you can down/out the OSD and `destroy` it. That preserves the CRUSH bucket and OSD ID; then, when you use ceph-disk, ceph-volume, or what-have-you to deploy a replacement, you can specify that same, desired OSD ID on the command line. Note that as of 12.2.2 you'll want to record and re-set any override reweight (manual or reweight-by-utilization), as that does not reliably survive the replacement. Also note that, again as of that release, if the replacement drive is a different size the CRUSH weight is not adjusted, so you may (or may not) want to adjust it yourself. Slight differences aren't usually a huge deal; big differences can mean you have unused capacity, or overloaded drives.

— Anthony

> Thank you for your answer below.
>
> I'm not looking to reuse them as much as I am trying to control which unused number is actually used.
>
> For example, if I have 20 osds and 2 have failed...when I replace a disk in one server I don't want it to automatically use the next lowest number for the osd assignment.
>
> I understand what you mean about not focusing on the osd ids...but my OCD is making me ask the question.
>
> Thanks,
> Shain
>
> On 9/11/20, 9:45 AM, "George Shuklin" wrote:
>
> On 11/09/2020 16:11, Shain Miley wrote:
>> Hello,
>> I have been wondering for quite some time whether or not it is possible to influence the osd.id numbers that are assigned during an install.
>>
>> I have made an attempt to keep our osds in order over the last few years, but it is a losing battle without having some control over the osd assignment.
>>
>> I am currently using ceph-deploy to handle adding nodes to the cluster.
>>
> You can reuse osd numbers, but I strongly advise you not to focus on precise IDs. The reason is that you can have such a combination of server faults that IDs will get swapped no matter what.
>
> It's a false sense of beauty to have the OSD ID match the ID in the name of the server.
>
> How to reuse osd numbers?
>
> An OSD number is used (and should be cleaned up if the OSD dies) in three places in Ceph:
>
> 1) Crush map: ceph osd crush rm osd.x
>
> 2) osd list: ceph osd rm osd.x
>
> 3) auth: ceph auth rm osd.x
>
> The last one is often forgotten and is a usual reason for ceph-ansible to fail on a new disk in the server.
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
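As a concrete illustration of the destroy-and-replace flow described above, a rough sketch where osd 12 and /dev/sdX are placeholders for the failed OSD and its replacement disk, not values taken from this thread:

ceph osd out 12
systemctl stop ceph-osd@12                          # on the host that carries the OSD
ceph osd destroy 12 --yes-i-really-mean-it          # keeps the CRUSH bucket and OSD id so the replacement can reuse them
ceph-volume lvm create --osd-id 12 --data /dev/sdX  # redeploys the replacement under the same id
ceph osd reweight 12 1.0                            # re-apply any override reweight you recorded beforehand (1.0 is only an example)

This avoids the full crush rm / osd rm / auth rm sequence while still letting you control which id the replacement gets.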
[ceph-users] Re: Issues with the ceph-bluestore-tool during cluster upgrade from Mimic to Nautilus
Here’s the out file, as requested. Jean-Philippe Méthot Senior Openstack system administrator Administrateur système Openstack sénior PlanetHoster inc. 4414-4416 Louis B Mayer Laval, QC, H7P 0G1, Canada TEL : +1.514.802.1644 - Poste : 2644 FAX : +1.514.612.0678 CA/US : 1.855.774.4678 FR : 01 76 60 41 43 UK : 0808 189 0423 > Le 11 sept. 2020 à 10:38, Igor Fedotov a écrit : > > Could you please run: > > CEPH_ARGS="--log-file log --debug-asok 5" ceph-bluestore-tool repair --path > <...> ; cat log | grep asok > out > > and share 'out' file. > > > Thanks, > > Igor > > On 9/11/2020 5:15 PM, Jean-Philippe Méthot wrote: >> Hi, >> >> We’re upgrading our cluster OSD node per OSD node to Nautilus from Mimic. >> From some release notes, it was recommended to run the following command to >> fix stats after an upgrade : >> >> ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0 >> >> However, running that command gives us the following error message: >> >>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: >>> In >>> function 'virtual Allocator::SocketHook::~SocketHook()' thread >>> 7f1a6467eec0 time 2020-09-10 14:40:25.872353 >>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: >>> 53 >>> : FAILED ceph_assert(r == 0) >>> ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus >>> (stable) >>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char >>> const*)+0x14a) [0x7f1a5a823025] >>> 2: (()+0x25c1ed) [0x7f1a5a8231ed] >>> 3: (()+0x3c7a4f) [0x55b33537ca4f] >>> 4: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517] >>> 5: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082] >>> 6: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x55b335274528] >>> 7: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1) [0x55b3352749a1] >>> 8: (main()+0x10b3) [0x55b335187493] >>> 9: (__libc_start_main()+0xf5) [0x7f1a574aa555] >>> 10: (()+0x1f9b5f) [0x55b3351aeb5f] >>> 2020-09-10 14:40:25.873 7f1a6467eec0 -1 >>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: >>> In function 'virtual Allocator::SocketHook::~SocketHook()' thread >>> 7f1a6467eec0 time 2020-09-10 14:40:25.872353 >>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: >>> 53: FAILED ceph_assert(r == 0) >>> >>> ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus >>> (stable) >>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char >>> const*)+0x14a) [0x7f1a5a823025] >>> 2: (()+0x25c1ed) [0x7f1a5a8231ed] >>> 3: (()+0x3c7a4f) [0x55b33537ca4f] >>> 4: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517] >>> 5: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082] >>> 6: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x55b335274528] >>> 7: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1) [0x55b3352749a1] >>> 8: (main()+0x10b3) [0x55b335187493] >>> 9: (__libc_start_main()+0xf5) [0x7f1a574aa555] >>> 10: (()+0x1f9b5f) [0x55b3351aeb5f] >>> *** Caught 
signal (Aborted) ** >>> in thread 7f1a6467eec0 thread_name:ceph-bluestore- >>> ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus >>> (stable) >>> 1: (()+0xf630) [0x7f1a58cf0630] >>> 2: (gsignal()+0x37) [0x7f1a574be387] >>> 3: (abort()+0x148) [0x7f1a574bfa78] >>> 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char >>> const*)+0x199) [0x7f1a5a823074] >>> 5: (()+0x25c1ed) [0x7f1a5a8231ed] >>> 6: (()+0x3c7a4f) [0x55b33537ca4f] >>> 7: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517] >>> 8: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082] >>> 9: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x55b335274528] >>> 10: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1) [0x55b3352749a1] >>> 11: (main()+0x10b3) [0x55b335187493] >>> 12: (__libc_start_main()+0xf5) [0x7f1a574aa555] >>> 13: (()+0x1f9b5f) [0x55b3351aeb5f] >>> 2020-09-10 14:40:25.874 7f1a6467eec0 -1 *** Caught signal (Aborted) ** >>> in thread 7f1a6467eec0 thread_name:ceph-bluestore- >>> >>> ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus >>> (stable) >>> 1: (()+0xf630) [0x7f1a58cf0630] >>> 2: (gsignal()+0x37) [0x7f1a574be387] >>> 3: (abort()+0x148) [0x7f1a574bfa78] >>> 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char >>> const*)+0x199) [0x7f1a5a823074] >>> 5: (()+0x25c1ed) [0x7f1a5a8231ed] >>> 6: (()
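For reference, the usual way to run this repair, as a minimal sketch assuming osd.0 is only an example id and the OSD is taken down first (the tool needs exclusive access to the bluestore device):

systemctl stop ceph-osd@0
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0    # read-only consistency check
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0  # the stats fix recommended in the release notes
systemctl start ceph-osd@0

The CEPH_ARGS invocation Igor quotes above is the same repair run with admin-socket debug logging turned up so the relevant lines can be grepped out of the log.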
[ceph-users] Re: The confusing output of ceph df command
Igor, I think I misunderstood the output of USED. It should be the allocated size, which is not always equal to 1.5 * STORED. For example: when writing a 4k file, Ceph may allocate 64k, which looks like it uses more space, but if you write another 4k it can use the same blob (I will validate this guess). So ceph df may not reflect how many new files we can still store. I'm reading the Ceph code and will reply to the thread when I have the correct meaning. Thanks, Norman

On 11/9/2020 7:40 AM, Igor Fedotov wrote:
Norman,
> default-fs-data0 9 374 TiB 1.48G 939 TiB 74.71 212 TiB
Given the above numbers, the 'default-fs-data0' pool has an average object size of around 256K (374 TiB / 1.48G objects). Are you sure that the absolute majority of your objects in this pool are 4M? Wondering what the df report looks like for the 'good' cluster? Additionally (given that default-fs-data0 keeps most of the data for the cluster) you might want to estimate allocation losses by inspecting performance counters: bluestore_stored vs. bluestore_allocated. Summing the delta between them over all (hdd?) OSDs you might get the total loss. A simpler way to do the same is to learn the deltas from 2-3 OSDs and just multiply the average delta by the number of OSDs. This is less precise but statistically should be good enough... Thanks, Igor

On 9/10/2020 5:10 AM, norman wrote:
Igor, thanks for your reply. The object size is 4M and there are almost no overwrites in the pool, so why did space loss happen in the pool? I have another cluster with the same config whose USED is almost equal to 1.5 * STORED; the difference between them is that the cluster has different OSD sizes (12T and 8T). Norman

On 9/9/2020 7:17 PM, Igor Fedotov wrote:
Hi Norman, not pretending to know the exact root cause, but IMO one working hypothesis might be as follows, presuming spinners as backing devices for your OSDs and hence a 64K allocation unit (the bluestore min_alloc_size_hdd param): 1) 1.48G user objects result in 1.48G * 6 = 8.88G EC shards. 2) Shards tend to be unaligned with the 64K allocation unit, which might result in an average loss of 32K per shard. 3) Hence the total loss due to allocation overhead is estimated at 32K * 8.88G = 284 TiB, which looks close enough to your numbers for default-fs-data0: 939 TiB - (374 TiB / 4 * 6) = 378 TiB of space loss. An additional issue which might result in space loss is space amplification caused by partial unaligned overwrites of objects in an EC pool. See my post "Root cause analysis for space overhead with erasure coded pools." to the d...@ceph.io mailing list on Jan 23. Migrating to a 4K min alloc size seems to be the only known way to fix (or rather work around) these issues. The upcoming Pacific release is going to bring downsizing to 4K (for new OSD deployments) along with some additional changes to smooth the corresponding negative performance impacts. Hope this helps.
Igor

On 9/9/2020 2:30 AM, norman kern wrote:
Hi, I have changed most of the pools from 3-replica to EC 4+2 in my cluster. When I use the ceph df command to show the used capacity of the cluster:

RAW STORAGE:
    CLASS        SIZE        AVAIL       USED        RAW USED    %RAW USED
    hdd          1.8 PiB     788 TiB     1.0 PiB     1.0 PiB     57.22
    ssd          7.9 TiB     4.6 TiB     181 GiB     3.2 TiB     41.15
    ssd-cache    5.2 TiB     5.2 TiB     67 GiB      73 GiB      1.36
    TOTAL        1.8 PiB     798 TiB     1.0 PiB     1.0 PiB     56.99

POOLS:
    POOL                               ID    STORED      OBJECTS     USED        %USED    MAX AVAIL
    default-oss.rgw.control             1    0 B         8           0 B         0        1.3 TiB
    default-oss.rgw.meta                2    22 KiB      97          3.9 MiB     0        1.3 TiB
    default-oss.rgw.log                 3    525 KiB     223         621 KiB     0        1.3 TiB
    default-oss.rgw.buckets.index       4    33 MiB      34          33 MiB      0        1.3 TiB
    default-oss.rgw.buckets.non-ec      5    1.6 MiB     48          3.8 MiB     0        1.3 TiB
    .rgw.root                           6    3.8 KiB     16          720 KiB     0        1.3 TiB
    default-oss.rgw.buckets.data        7    274 GiB     185.39k     450 GiB     0.14     212 TiB
    default-fs-metadata                 8    488 GiB     153.10M     490 GiB     10.65    1.3 TiB
    default-fs-data0                    9    374 TiB     1.48G       939 TiB     74.71    212 TiB
    ...

The USED = 3 * STORED in 3-replica mode is completely right, but for the EC 4+2 pool (default-fs-data0) USED is not equal to 1.5 * STORED, why...:(
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
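To put numbers on the bluestore_allocated vs. bluestore_stored comparison Igor suggests above, one possible way to pull the two counters from a single OSD (osd.0 is only an example id; run it on the host where that OSD lives, then repeat for a couple more OSDs and extrapolate):

ceph daemon osd.0 perf dump | grep -E '"bluestore_(allocated|stored)"'   # the difference approximates the allocation overhead on that OSD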