[ceph-users] Re: ceph rbox test on passive compressed pool

2020-09-11 Thread Marc Roos


Hi David,

Just to let you know, this hint is being set. What is the reason for 
ceph compressing only half the objects? Can it be that there is some issue 
with my OSDs? Like some maybe having an old on-disk setup (still deployed 
with ceph-disk, not ceph-volume)? Is this still to be expected, or does ceph 
drop compression under pressure?

https://github.com/ceph-dovecot/dovecot-ceph-plugin/blob/56d6c900cc9ec07dfb98ef2abac07aae466b7610/src/librmb/rados-storage-impl.cpp#L75

 Thanks,
Marc



-Original Message-
Cc: jan.radon
Subject: Re: [ceph-users] ceph rbox test on passive compressed pool

The hints have to be given from the client side as far as I understand, 
can you share the client code too?

Also, note that it seems there are no guarantees that it will actually do 
anything (best effort I guess):
https://docs.ceph.com/docs/mimic/rados/api/librados/#c.rados_set_alloc_hint

Cheers


On 6 September 2020 15:59:01 BST, Marc Roos  
wrote:



I have been inserting 10790 copies of exactly the same 64kb text message 
into a pool with passive compression enabled. I am still counting, but it 
looks like only half the objects are compressed.

mail/b08c3218dbf1545ff43052412a8e mtime 2020-09-06 16:27:39.00, size 63580
mail/00f6043775f1545ff43052412a8e mtime 2020-09-06 16:25:57.00, size 525
mail/b875f40571f1545ff43052412a8e mtime 2020-09-06 16:25:53.00, size 63580
mail/e87c120b19f1545ff43052412a8e mtime 2020-09-06 16:24:25.00, size 525

I am not sure if this should be expected from passive; these docs[1] hint 
that passive means 'compress if hinted COMPRESSIBLE'. From that I would 
conclude that all text messages should be compressed.
A previous test with a 64kb gzip attachment seemed to not compress, 
although I did not look at all object sizes.



on 14.2.11

[1]
https://documentation.suse.com/ses/5.5/html/ses-all/ceph-pools.html#sec-ceph-pool-compression
https://docs.ceph.com/docs/mimic/rados/operations/pools/
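
As a quick check (a sketch; the pool name is a placeholder and 
compression_required_ratio is only a guess at a possible factor):

```
# Pool name is a placeholder - use the pool the rbox objects are stored in.
POOL=mail

# Confirm the mode and algorithm actually set on the pool.
ceph osd pool get $POOL compression_mode
ceph osd pool get $POOL compression_algorithm

# Chunks are only stored compressed when they shrink enough; this may report
# "option ... not set" if the pool relies on the global default (0.875).
ceph osd pool get $POOL compression_required_ratio
```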


ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: slow "rados ls"

2020-09-11 Thread Marcel Kuiper
Hi Stefan

I can't recall that that was the case, and unfortunately we do not have
enough history in our performance measurements to look back.

We are on nautilus. Please let me know your findings when you do your pg
expansion on nautilus.

Grtz

Marcel


> OK, I'm really curious if you observed the following behaviour:
>
> During, or shortly after the rebalance, did you see high CPU usage of
> the OSDs? In particular the ones that hosted the PGs before they were
> moved to the new nodes? As in ~ 300 % CPU per OSD (increasing from a few
> percent to 300% non stop)? RocksDB is doing housekeeping, and we
> observed before, and today again, on Mimic 13.2.8, that with a lot of
> OMAP/META data the OSDs that have to clean up consume a ridiculous
> amount of CPU (for hours on end), triggering loads of slow ops and
> latency spikes of sometimes (tens of) seconds.
>
> Are you running nautilus? If you haven't seen this behaviour it might
> have been fixed in Nautilus. Or your cluster is different from ours. We
> will do PG expansion after we have upgraded to Nautilus, so we'll
> definitely know by then.
>
> Thanks,
>
> Stefan
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-osd performance on ram disk

2020-09-11 Thread George Shuklin

On 10/09/2020 19:37, Mark Nelson wrote:

On 9/10/20 11:03 AM, George Shuklin wrote:


...
Are there any knobs to tweak to see higher performance for ceph-osd? 
I'm pretty sure it's not any kind of leveling, GC or other 
'iops-related' issues (brd has performance two orders of magnitude 
higher).



So as you've seen, Ceph does a lot more than just write a chunk of 
data out to a block on disk.  There's tons of encoding/decoding 
happening, crc checksums, crush calculations, onode lookups, 
write-ahead-logging, and other work involved that all adds latency.  
You can overcome some of that through parallelism, but 30K IOPs per 
OSD is probably pretty on-point for a nautilus era OSD.  For octopus+ 
the cache refactor in bluestore should get you farther (40-50k+ for 
an OSD in isolation).  The maximum performance we've seen in-house is 
around 70-80K IOPs on a single OSD using very fast NVMe and highly 
tuned settings.



A couple of things you can try:


- upgrade to octopus+ for the cache refactor

- Make sure you are using the equivalent of the latency-performance or 
latency-network tuned profile.  The most important part is disabling 
CPU cstate transitions.


- increase osd_memory_target if you have a larger dataset (onode cache 
misses in bluestore add a lot of latency)


- enable turbo if it's disabled (higher clock speed generally helps)


On the write path you are correct that there is a limitation regarding 
a single kv sync thread.  Over the years we've made this less of a 
bottleneck but it's possible you still could be hitting it.  In our 
test lab we've managed to utilize up to around 12-14 cores on a single 
OSD in isolation with 16 tp_osd_tp worker threads and on a larger 
cluster about 6-7 cores per OSD.  There's probably multiple factors at 
play, including context switching, cache thrashing, memory throughput, 
object creation/destruction, etc.  If you decide to look into it 
further you may want to try wallclock profiling the OSD under load and 
seeing where it is spending its time. 


Thank you for feedback.

I forgot to mention this, it's Octopus, fresh installation.

I've disabled C-states (governor=performance); it makes no difference - 
same iops, same CPU use by ceph-osd. I just can't force Ceph to 
consume more than 330% of CPU. I can push reads up to 150k IOPS (both 
network and local), hitting the CPU limit, but writes are somewhat restricted 
by ceph itself.
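
For completeness, a sketch of how the remaining suggestions from Mark's 
list are usually applied (the 8 GiB memory target is only an example value):

```
# Low-latency tuned profile; its main point here is keeping CPUs out of deep C-states.
tuned-adm profile latency-performance

# Verify the governor and which idle states are currently in use.
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cpupower idle-info

# Larger onode cache if the data set is big; 8 GiB is just an illustration.
ceph config set osd osd_memory_target 8589934592
```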


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm - How to deploy ceph cluster with a partition on SSD for block.db

2020-09-11 Thread Jan Fajerski

On Tue, Sep 08, 2020 at 07:14:16AM -, kle...@psi-net.si wrote:

I found out that it's already possible to specify a storage path in the OSD service 
specification yaml. It works for data_devices, but unfortunately not for 
db_devices and wal_devices, at least not in my case.


Aside from the question whether db/wal/journal_devices should accept paths as a 
filter, I'd like to point out that partitions are only a valid argument when 
calling `ceph-volume lvm prepare/create`.
OSD service specs are quite tightly coupled to the batch subcommand, which has 
no support for partitions.
The batch subcommand will soon gain support for handling logical volumes too. I'll 
explore if we can extend osd service specs accordingly.
Until then I'm afraid you're stuck using the create or prepare subcommand for 
"uncommon" deployments like this (db devices collocated with the root device).




service_type: osd
service_id: osd_spec_default
placement:
  host_pattern: '*'
data_devices:
  paths:
  - /dev/vdb1
db_devices:
  paths:
  - /dev/vdb2
wal_devices:
  paths:
  - /dev/vdb3
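
For reference, a sketch of the manual prepare/create route for the layout 
in the spec above (paths taken from the example, not verified here):

```
# Prepare one OSD with data, block.db and block.wal on the given partitions.
ceph-volume lvm prepare --bluestore \
    --data /dev/vdb1 \
    --block.db /dev/vdb2 \
    --block.wal /dev/vdb3

# Activate whatever was prepared above.
ceph-volume lvm activate --all
```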
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




[ceph-users] Problem unusable after deleting pool with bilion objects

2020-09-11 Thread Jan Pekař - Imatic

Hi all,

I have built a testing cluster with 4 hosts, 1 SSD and 11 HDDs on each host.
Running ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) 
nautilus (stable) on Ubuntu.

Because we want to store small objects, I set bluestore_min_alloc_size to 8192 
(it is maybe important in this case).

I have filled it through rados gw with approx a billion small objects. After the tests I changed min_alloc_size back and deleted the rados pools 
(to empty the whole cluster) and I was waiting till the cluster deleted the data from the OSDs, but that destabilized the cluster. I never reached health 
OK. OSDs were killed in random order. I can start them back but they will again drop out of the cluster with..


```

   -18> 2020-09-05 22:11:19.430 7f7a3ee40700  5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 
2073067520 old mem: 1932735282 new mem: 1932735282
   -17> 2020-09-05 22:11:19.430 7f7a3ee40700  5 bluestore.MempoolThread(0x555a9d0efb70) _trim_shards cache_size: 1932735282 kv_alloc: 
1644167168 kv_used: 1644135504 meta_alloc: 142606336 meta_used: 143595 data_alloc: 142606336 data_used: 98304
   -16> 2020-09-05 22:11:20.434 7f7a3ee40700  5 prioritycache tune_memory target: 3221225472 mapped: 2064941056 unmapped: 8126464 heap: 
2073067520 old mem: 1932735282 new mem: 1932735282
   -15> 2020-09-05 22:11:21.434 7f7a3ee40700  5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 
2073067520 old mem: 1932735282 new mem: 1932735282
   -14> 2020-09-05 22:11:22.258 7f7a2b81f700  5 osd.42 103257 heartbeat osd_stat(store_statfs(0x1ce1829/0x2d08c/0x1d18000, data 
0x23143355/0x974a, compress 0x0/0x0/0x0, omap 0x1f11e, meta 0x2d08a0ee2), peers 
[3,4,6,7,8,11,12,13,14,16,17,18,19,21,23,24,25,27,28,29,31,32,33,34,41,43] op hist [])
   -13> 2020-09-05 22:11:22.438 7f7a3ee40700  5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 
2073067520 old mem: 1932735282 new mem: 1932735282
   -12> 2020-09-05 22:11:23.442 7f7a3ee40700  5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 
2073067520 old mem: 1932735282 new mem: 1932735282
   -11> 2020-09-05 22:11:24.442 7f7a3ee40700  5 prioritycache tune_memory target: 3221225472 mapped: 2064285696 unmapped: 8781824 heap: 
2073067520 old mem: 1932735282 new mem: 1932735282
   -10> 2020-09-05 22:11:24.442 7f7a3ee40700  5 bluestore.MempoolThread(0x555a9d0efb70) _trim_shards cache_size: 1932735282 kv_alloc: 
1644167168 kv_used: 1644119840 meta_alloc: 142606336 meta_used: 143595 data_alloc: 142606336 data_used: 98304
    -9> 2020-09-05 22:11:24.442 7f7a2e024700  0 bluestore(/var/lib/ceph/osd/ceph-42) log_latency_fn slow operation observed for 
_collection_list, latency = 151.113s, lat = 2m cid =5.47_head start #5:e2000# end #MAX# max 2147483647

    -8> 2020-09-05 22:11:24.446 7f7a2e024700  1 heartbeat_map reset_timeout 
'OSD::osd_op_tp thread 0x7f7a2e024700' had timed out after 15
    -7> 2020-09-05 22:11:24.446 7f7a2e024700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f7a2e024700' had suicide timed out 
after 150

    -6> 2020-09-05 22:11:24.446 7f7a4c2a4700 10 monclient: get_auth_request con 
0x555b15d07680 auth_method 0
    -5> 2020-09-05 22:11:24.446 7f7a3c494700  2 osd.42 103257 ms_handle_reset 
con 0x555b15963600 session 0x555a9f9d6d00
    -4> 2020-09-05 22:11:24.446 7f7a3c494700  2 osd.42 103257 ms_handle_reset 
con 0x555b15961b00 session 0x555a9f9d7980
    -3> 2020-09-05 22:11:24.446 7f7a3c494700  2 osd.42 103257 ms_handle_reset 
con 0x555b15963a80 session 0x555a9f9d6a80
    -2> 2020-09-05 22:11:24.446 7f7a3c494700  2 osd.42 103257 ms_handle_reset 
con 0x555b15960480 session 0x555a9f9d6f80
    -1> 2020-09-05 22:11:24.446 7f7a3c494700  3 osd.42 103257 handle_osd_map 
epochs [103258,103259], i have 103257, src has [83902,103259]
 0> 2020-09-05 22:11:24.450 7f7a2e024700 -1 *** Caught signal (Aborted) **
```

I have approx 12 OSD's down with this error.

I decided to wipe the problematic OSDs so I cannot debug it further, but I'm curious what I did wrong (deleting a pool with many small objects?) and what to 
do next time.


I did that before, but not with a billion objects and without the 
bluestore_min_alloc_size change, and it worked without problems.

With regards
Jan Pekar

--

Ing. Jan Pekař
jan.pe...@imatic.cz

Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz | +420326555326

--
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Problem unusable after deleting pool with bilion objects

2020-09-11 Thread Igor Fedotov

Hi Jan,

most likely this is a known issue with the slow and ineffective pool removal 
procedure in Ceph.


I did some presentation on the topic at yesterday's weekly performance 
meeting, presumably a recording will be available in a couple of days.


An additional accompanying issue not covered during this meeting is 
RocksDB's misbehavior after (or during) such massive removals. At some 
point it starts to slow down read operation handling (e.g. 
collection listing), which results in OSD suicide timeouts. Exactly what 
is observed in your case. There were multiple discussions on this issue 
in this mailing list too. In short, the current workaround is to perform 
manual DB compaction using ceph-kvstore-tool. Pool removal will most 
likely proceed, hence one might face similar assertions after a while. 
Hence there might be a need for multiple "compaction-restart" iterations 
until the pool is finally removed.
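
A sketch of that workaround for one of the affected OSDs (osd.42 from the 
log above); the OSD must be stopped while ceph-kvstore-tool runs:

```
systemctl stop ceph-osd@42
# Offline compaction of the OSD's RocksDB.
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-42 compact
systemctl start ceph-osd@42
```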



And yet another potential issue (or at least an additional factor) with 
your setup is a pretty high DB vs. main devices ratio (1:11). Deleting 
procedures from multiple OSDs result in a pretty high load on the DB volume, 
which becomes overburdened...



Thanks,

Igor

On 9/11/2020 3:00 PM, Jan Pekař - Imatic wrote:

Hi all,

I have build testing cluster with 4 hosts, 1 SSD's  and 11 HDD on each 
host.
Running ceph version 14.2.10 
(b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable) on Ubuntu.


Because we want to save small size object, I set 
bluestore_min_alloc_size 8192 (it is maybe important in this case)


I have filled it through rados gw with approx billion of small 
objects. After tests I changed min_alloc_size back and deleted rados 
pools (to emtpy whole cluster) and I was waiting till cluster deletes 
data from OSD's, but that destabilized the cluster. I never reached 
health OK. OSD's were killed in random order. I can start them back 
but they will again get out from cluster with..


```

   -18> 2020-09-05 22:11:19.430 7f7a3ee40700  5 prioritycache 
tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 
heap: 2073067520 old mem: 1932735282 new mem: 1932735282
   -17> 2020-09-05 22:11:19.430 7f7a3ee40700  5 
bluestore.MempoolThread(0x555a9d0efb70) _trim_shards cache_size: 
1932735282 kv_alloc: 1644167168 kv_used: 1644135504 meta_alloc: 
142606336 meta_used: 143595 data_alloc: 142606336 data_used: 98304
   -16> 2020-09-05 22:11:20.434 7f7a3ee40700  5 prioritycache 
tune_memory target: 3221225472 mapped: 2064941056 unmapped: 8126464 
heap: 2073067520 old mem: 1932735282 new mem: 1932735282
   -15> 2020-09-05 22:11:21.434 7f7a3ee40700  5 prioritycache 
tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 
heap: 2073067520 old mem: 1932735282 new mem: 1932735282
   -14> 2020-09-05 22:11:22.258 7f7a2b81f700  5 osd.42 103257 
heartbeat 
osd_stat(store_statfs(0x1ce1829/0x2d08c/0x1d18000, data 
0x23143355/0x974a, compress 0x0/0x0/0x0, omap 0x1f11e, meta 
0x2d08a0ee2), peers 
[3,4,6,7,8,11,12,13,14,16,17,18,19,21,23,24,25,27,28,29,31,32,33,34,41,43] 
op hist [])
   -13> 2020-09-05 22:11:22.438 7f7a3ee40700  5 prioritycache 
tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 
heap: 2073067520 old mem: 1932735282 new mem: 1932735282
   -12> 2020-09-05 22:11:23.442 7f7a3ee40700  5 prioritycache 
tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 
heap: 2073067520 old mem: 1932735282 new mem: 1932735282
   -11> 2020-09-05 22:11:24.442 7f7a3ee40700  5 prioritycache 
tune_memory target: 3221225472 mapped: 2064285696 unmapped: 8781824 
heap: 2073067520 old mem: 1932735282 new mem: 1932735282
   -10> 2020-09-05 22:11:24.442 7f7a3ee40700  5 
bluestore.MempoolThread(0x555a9d0efb70) _trim_shards cache_size: 
1932735282 kv_alloc: 1644167168 kv_used: 1644119840 meta_alloc: 
142606336 meta_used: 143595 data_alloc: 142606336 data_used: 98304
    -9> 2020-09-05 22:11:24.442 7f7a2e024700  0 
bluestore(/var/lib/ceph/osd/ceph-42) log_latency_fn slow operation 
observed for _collection_list, latency = 151.113s, lat = 2m cid 
=5.47_head start #5:e2000# end #MAX# max 2147483647
    -8> 2020-09-05 22:11:24.446 7f7a2e024700  1 heartbeat_map 
reset_timeout 'OSD::osd_op_tp thread 0x7f7a2e024700' had timed out 
after 15
    -7> 2020-09-05 22:11:24.446 7f7a2e024700  1 heartbeat_map 
reset_timeout 'OSD::osd_op_tp thread 0x7f7a2e024700' had suicide timed 
out after 150
    -6> 2020-09-05 22:11:24.446 7f7a4c2a4700 10 monclient: 
get_auth_request con 0x555b15d07680 auth_method 0
    -5> 2020-09-05 22:11:24.446 7f7a3c494700  2 osd.42 103257 
ms_handle_reset con 0x555b15963600 session 0x555a9f9d6d00
    -4> 2020-09-05 22:11:24.446 7f7a3c494700  2 osd.42 103257 
ms_handle_reset con 0x555b15961b00 session 0x555a9f9d7980
    -3> 2020-09-05 22:11:24.446 7f7a3c494700  2 osd.42 103257 
ms_handle_reset con 0x555b15963a80 session 0x555a9f9d6a80
    -2> 2020-09-05 22:11:24.446 7f7a3c494700  2 osd.42 103257 
ms_handle_reset con 0x555b15960480 ses

[ceph-users] Is it possible to assign osd id numbers?

2020-09-11 Thread Shain Miley
Hello,
I have been wondering for quite some time whether or not it is possible to 
influence the osd.id numbers that are  assigned during an install.

I have made an attempt to keep our osds in order over the last few years, but 
it is a losing battle without having some control over the osd assignment.

I am currently using ceph-deploy to handle adding nodes to the cluster.

Thanks in advance,
Shain

Shain Miley | Director of Platform and Infrastructure | Digital Media | 
smi...@npr.org
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Problem unusable after deleting pool with bilion objects

2020-09-11 Thread Jan Pekař - Imatic

Hi Igor,

thank you, I also think that it is the problem you described.

I recreated OSD's now and also noticed strange warnings -

HEALTH_WARN Degraded data redundancy: 106763/723 objects degraded (14766.667%)

Maybe there are some "phantom", zero-sized objects (OMAPs?) that the cluster is 
recovering, but I don't need them (they are not listed in ceph df).

You mentioned the DB vs. main devices ratio (1:11) - I'm not separating the DB from 
the device - each device has its own RocksDB on it.

With regards
Jan Pekar

On 11/09/2020 14.36, Igor Fedotov wrote:


Hi Jan,

most likely this is a known issue with slow and ineffective pool removal 
procedure in Ceph.

I did some presentation on the topic at yesterday's weekly performance meeting, 
presumably a recording will be available in a couple of days.

An additional accompanying issue not covered during this meeting is RocksDB's misbehavior after (or during) such massive removals. At some 
point it starts to slow  down reading operations handling (e.g. collection listing) which results in OSD suicide timeouts. Exactly what is 
observed in your case. There were multiple discussion on this issue in this mailing list too. In short the currect workaround is to 
perform manual DB compaction using ceph-kvstore-tool. Pool removal will most likely to proceed hence one might face similar assertions 
after a while. Hence there might be a need for multiple "compaction-restart" iterations until pool is finally removed.



And yet another potential issue (or at least an additional factor) with your setup is a pretty high DB vs. Main devices ratio (1:11). 
Deleting procedures from multiple OSDs result in a pretty highload on DB volume which becomes overburdened...



Thanks,

Igor

On 9/11/2020 3:00 PM, Jan Pekař - Imatic wrote:

Hi all,

I have build testing cluster with 4 hosts, 1 SSD's  and 11 HDD on each host.
Running ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) 
nautilus (stable) on Ubuntu.

Because we want to save small size object, I set bluestore_min_alloc_size 8192 
(it is maybe important in this case)

I have filled it through rados gw with approx billion of small objects. After tests I changed min_alloc_size back and deleted rados pools 
(to emtpy whole cluster) and I was waiting till cluster deletes data from OSD's, but that destabilized the cluster. I never reached 
health OK. OSD's were killed in random order. I can start them back but they will again get out from cluster with..


```

   -18> 2020-09-05 22:11:19.430 7f7a3ee40700  5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 
2073067520 old mem: 1932735282 new mem: 1932735282
   -17> 2020-09-05 22:11:19.430 7f7a3ee40700  5 bluestore.MempoolThread(0x555a9d0efb70) _trim_shards cache_size: 1932735282 kv_alloc: 
1644167168 kv_used: 1644135504 meta_alloc: 142606336 meta_used: 143595 data_alloc: 142606336 data_used: 98304
   -16> 2020-09-05 22:11:20.434 7f7a3ee40700  5 prioritycache tune_memory target: 3221225472 mapped: 2064941056 unmapped: 8126464 heap: 
2073067520 old mem: 1932735282 new mem: 1932735282
   -15> 2020-09-05 22:11:21.434 7f7a3ee40700  5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 
2073067520 old mem: 1932735282 new mem: 1932735282
   -14> 2020-09-05 22:11:22.258 7f7a2b81f700  5 osd.42 103257 heartbeat osd_stat(store_statfs(0x1ce1829/0x2d08c/0x1d18000, 
data 0x23143355/0x974a, compress 0x0/0x0/0x0, omap 0x1f11e, meta 0x2d08a0ee2), peers 
[3,4,6,7,8,11,12,13,14,16,17,18,19,21,23,24,25,27,28,29,31,32,33,34,41,43] op hist [])
   -13> 2020-09-05 22:11:22.438 7f7a3ee40700  5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 
2073067520 old mem: 1932735282 new mem: 1932735282
   -12> 2020-09-05 22:11:23.442 7f7a3ee40700  5 prioritycache tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 heap: 
2073067520 old mem: 1932735282 new mem: 1932735282
   -11> 2020-09-05 22:11:24.442 7f7a3ee40700  5 prioritycache tune_memory target: 3221225472 mapped: 2064285696 unmapped: 8781824 heap: 
2073067520 old mem: 1932735282 new mem: 1932735282
   -10> 2020-09-05 22:11:24.442 7f7a3ee40700  5 bluestore.MempoolThread(0x555a9d0efb70) _trim_shards cache_size: 1932735282 kv_alloc: 
1644167168 kv_used: 1644119840 meta_alloc: 142606336 meta_used: 143595 data_alloc: 142606336 data_used: 98304
    -9> 2020-09-05 22:11:24.442 7f7a2e024700  0 bluestore(/var/lib/ceph/osd/ceph-42) log_latency_fn slow operation observed for 
_collection_list, latency = 151.113s, lat = 2m cid =5.47_head start #5:e2000# end #MAX# max 2147483647

    -8> 2020-09-05 22:11:24.446 7f7a2e024700  1 heartbeat_map reset_timeout 
'OSD::osd_op_tp thread 0x7f7a2e024700' had timed out after 15
    -7> 2020-09-05 22:11:24.446 7f7a2e024700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f7a2e024700' had suicide timed out 
after 150

    -6> 2020-09-05 22:11:24.446 7f7a4c2a4700 10 monclient: 

[ceph-users] Re: Problem unusable after deleting pool with bilion objects

2020-09-11 Thread Igor Fedotov

Jan,

please see inline

On 9/11/2020 4:13 PM, Jan Pekař - Imatic wrote:


Hi Igor,

thank you, I also think that it is the problem you described.

I recreated OSD's now and also noticed strange warnings -

HEALTH_WARN Degraded data redundancy: 106763/723 objects degraded 
(14766.667%)


Maybe there are some "phantom", zero sized objects (OMAPs?), that 
cluster is recovering, but I don't need them (are not listed in ceph df).



The above look pretty weird but I don't know what's happening here...


You mentioned DB vs. Main devices ratio (1:11) - I'm not separating DB 
from device - each device has it's own RockDB on it.


Are you saying that the DB is colocated with the main data and resides on HDD? 
If so, this is another significant (or maybe the major) trigger for the 
issue. RocksDB + HDD is a bad pair for the kind of high-load DB operation 
handling that bulk pool removal is.




With regards
Jan Pekar

On 11/09/2020 14.36, Igor Fedotov wrote:


Hi Jan,

most likely this is a known issue with slow and ineffective pool 
removal procedure in Ceph.


I did some presentation on the topic at yesterday's weekly 
performance meeting, presumably a recording will be available in a 
couple of days.


An additional accompanying issue not covered during this meeting is 
RocksDB's misbehavior after (or during) such massive removals. At 
some point it starts to slow  down reading  operations handling (e.g. 
collection listing) which results in OSD suicide timeouts. Exactly 
what is observed in your case. There were multiple discussion on this 
issue in this mailing list too. In short the currect workaround is to 
perform manual DB compaction using ceph-kvstore-tool. Pool removal 
will most likely to proceed hence one might face similar assertions 
after a while. Hence there might be a need for multiple 
"compaction-restart" iterations until pool is finally removed.



And yet another potential issue (or at least an additional factor) 
with your setup is a pretty high DB vs. Main devices ratio (1:11). 
Deleting procedures from multiple OSDs result in a pretty highload on 
DB volume which becomes overburdened...



Thanks,

Igor

On 9/11/2020 3:00 PM, Jan Pekař - Imatic wrote:

Hi all,

I have build testing cluster with 4 hosts, 1 SSD's  and 11 HDD on 
each host.
Running ceph version 14.2.10 
(b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable) on Ubuntu.


Because we want to save small size object, I set 
bluestore_min_alloc_size 8192 (it is maybe important in this case)


I have filled it through rados gw with approx billion of small 
objects. After tests I changed min_alloc_size back and deleted rados 
pools (to emtpy whole cluster) and I was waiting till cluster 
deletes data from OSD's, but that destabilized the cluster. I never 
reached health OK. OSD's were killed in random order. I can start 
them back but they will again get out from cluster with..


```

   -18> 2020-09-05 22:11:19.430 7f7a3ee40700  5 prioritycache 
tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 
heap: 2073067520 old mem: 1932735282 new mem: 1932735282
   -17> 2020-09-05 22:11:19.430 7f7a3ee40700  5 
bluestore.MempoolThread(0x555a9d0efb70) _trim_shards cache_size: 
1932735282 kv_alloc: 1644167168 kv_used: 1644135504 meta_alloc: 
142606336 meta_used: 143595 data_alloc: 142606336 data_used: 98304
   -16> 2020-09-05 22:11:20.434 7f7a3ee40700  5 prioritycache 
tune_memory target: 3221225472 mapped: 2064941056 unmapped: 8126464 
heap: 2073067520 old mem: 1932735282 new mem: 1932735282
   -15> 2020-09-05 22:11:21.434 7f7a3ee40700  5 prioritycache 
tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 
heap: 2073067520 old mem: 1932735282 new mem: 1932735282
   -14> 2020-09-05 22:11:22.258 7f7a2b81f700  5 osd.42 103257 
heartbeat 
osd_stat(store_statfs(0x1ce1829/0x2d08c/0x1d18000, data 
0x23143355/0x974a, compress 0x0/0x0/0x0, omap 0x1f11e, meta 
0x2d08a0ee2), peers 
[3,4,6,7,8,11,12,13,14,16,17,18,19,21,23,24,25,27,28,29,31,32,33,34,41,43] 
op hist [])
   -13> 2020-09-05 22:11:22.438 7f7a3ee40700  5 prioritycache 
tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 
heap: 2073067520 old mem: 1932735282 new mem: 1932735282
   -12> 2020-09-05 22:11:23.442 7f7a3ee40700  5 prioritycache 
tune_memory target: 3221225472 mapped: 2064359424 unmapped: 8708096 
heap: 2073067520 old mem: 1932735282 new mem: 1932735282
   -11> 2020-09-05 22:11:24.442 7f7a3ee40700  5 prioritycache 
tune_memory target: 3221225472 mapped: 2064285696 unmapped: 8781824 
heap: 2073067520 old mem: 1932735282 new mem: 1932735282
   -10> 2020-09-05 22:11:24.442 7f7a3ee40700  5 
bluestore.MempoolThread(0x555a9d0efb70) _trim_shards cache_size: 
1932735282 kv_alloc: 1644167168 kv_used: 1644119840 meta_alloc: 
142606336 meta_used: 143595 data_alloc: 142606336 data_used: 98304
    -9> 2020-09-05 22:11:24.442 7f7a2e024700  0 
bluestore(/var/lib/ceph/osd/ceph-42) log_latency_fn slow operation 
observed for _collection_list, latency 

[ceph-users] Re: Is it possible to assign osd id numbers?

2020-09-11 Thread George Shuklin

On 11/09/2020 16:11, Shain Miley wrote:

Hello,
I have been wondering for quite some time whether or not it is possible to 
influence the osd.id numbers that are  assigned during an install.

I have made an attempt to keep our osds in order over the last few years, but 
it is a losing battle without having some control over the osd assignment.

I am currently using ceph-deploy to handle adding nodes to the cluster.

You can reuse OSD numbers, but I strongly advise you not to focus on 
precise IDs. The reason is that you can run into combinations of server 
faults which will swap IDs no matter what.


It's a false sense of beauty to have the OSD ID match the ID in the name of 
the server.


How to reuse osd nums?

The OSD number is used (and should be cleaned up if the OSD dies) in three places 
in Ceph:


1) Crush map: ceph osd crush rm osd.x

2) osd list: ceph osd rm osd.x

3) auth: ceph auth rm osd.x

The last one is often forgotten and is a usual reason for ceph-ansible to 
fail on a new disk in the server.
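
On Luminous and newer releases the same cleanup can also be done in one 
command (a sketch):

```
# Removes osd.x from the CRUSH map, the OSD map and auth in a single step.
ceph osd purge osd.x --yes-i-really-mean-it
```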

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Issues with the ceph-bluestore-tool during cluster upgrade from Mimic to Nautilus

2020-09-11 Thread Igor Fedotov

Could you please run:

CEPH_ARGS="--log-file log --debug-asok 5" ceph-bluestore-tool repair 
--path <...> ; cat log | grep asok > out


and share 'out' file.


Thanks,

Igor

On 9/11/2020 5:15 PM, Jean-Philippe Méthot wrote:

Hi,

We’re upgrading our cluster OSD node per OSD node to Nautilus from Mimic. From 
some release notes, it was recommended to run the following command to fix 
stats after an upgrade :

ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0

However, running that command gives us the following error message:


/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc:
 In
  function 'virtual Allocator::SocketHook::~SocketHook()' thread 7f1a6467eec0 
time 2020-09-10 14:40:25.872353
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc:
 53
: FAILED ceph_assert(r == 0)
  ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus 
(stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x14a) [0x7f1a5a823025]
  2: (()+0x25c1ed) [0x7f1a5a8231ed]
  3: (()+0x3c7a4f) [0x55b33537ca4f]
  4: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
  5: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
  6: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x55b335274528]
  7: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1) [0x55b3352749a1]
  8: (main()+0x10b3) [0x55b335187493]
  9: (__libc_start_main()+0xf5) [0x7f1a574aa555]
  10: (()+0x1f9b5f) [0x55b3351aeb5f]
2020-09-10 14:40:25.873 7f1a6467eec0 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc:
 In function 'virtual Allocator::SocketHook::~SocketHook()' thread 7f1a6467eec0 
time 2020-09-10 14:40:25.872353
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc:
 53: FAILED ceph_assert(r == 0)

  ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus 
(stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x14a) [0x7f1a5a823025]
  2: (()+0x25c1ed) [0x7f1a5a8231ed]
  3: (()+0x3c7a4f) [0x55b33537ca4f]
  4: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
  5: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
  6: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x55b335274528]
  7: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1) [0x55b3352749a1]
  8: (main()+0x10b3) [0x55b335187493]
  9: (__libc_start_main()+0xf5) [0x7f1a574aa555]
  10: (()+0x1f9b5f) [0x55b3351aeb5f]
*** Caught signal (Aborted) **
  in thread 7f1a6467eec0 thread_name:ceph-bluestore-
ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus 
(stable)
  1: (()+0xf630) [0x7f1a58cf0630]
  2: (gsignal()+0x37) [0x7f1a574be387]
  3: (abort()+0x148) [0x7f1a574bfa78]
  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x199) [0x7f1a5a823074]
  5: (()+0x25c1ed) [0x7f1a5a8231ed]
  6: (()+0x3c7a4f) [0x55b33537ca4f]
  7: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
  8: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
  9: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x55b335274528]
  10: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1) [0x55b3352749a1]
  11: (main()+0x10b3) [0x55b335187493]
  12: (__libc_start_main()+0xf5) [0x7f1a574aa555]
  13: (()+0x1f9b5f) [0x55b3351aeb5f]
2020-09-10 14:40:25.874 7f1a6467eec0 -1 *** Caught signal (Aborted) **
  in thread 7f1a6467eec0 thread_name:ceph-bluestore-

  ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus 
(stable)
  1: (()+0xf630) [0x7f1a58cf0630]
  2: (gsignal()+0x37) [0x7f1a574be387]
  3: (abort()+0x148) [0x7f1a574bfa78]
  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x199) [0x7f1a5a823074]
  5: (()+0x25c1ed) [0x7f1a5a8231ed]
  6: (()+0x3c7a4f) [0x55b33537ca4f]
  7: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
  8: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
  9: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x55b335274528]
  10: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1) [0x55b3352749a1]
  11: (main()+0x10b3) [0x55b335187493]
  12: (__libc_start_main()+0xf5) [0x7f1a574aa555]
  13: (()+0x1f9b5f) [0x55b3351aeb5f]
  NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.


What could be the source of this error? I haven’t found much of anything about 
it online.


Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior

[ceph-users] Re: ceph-osd performance on ram disk

2020-09-11 Thread Mark Nelson


On 9/11/20 4:15 AM, George Shuklin wrote:

On 10/09/2020 19:37, Mark Nelson wrote:

On 9/10/20 11:03 AM, George Shuklin wrote:


...
Are there any knobs to tweak to see higher performance for ceph-osd? 
I'm pretty sure it's not any kind of leveling, GC or other 
'iops-related' issues (brd has performance of two order of magnitude 
higher).



So as you've seen, Ceph does a lot more than just write a chunk of 
data out to a block on disk.  There's tons of encoding/decoding 
happening, crc checksums, crush calculations, onode lookups, 
write-ahead-logging, and other work involved that all adds latency.  
You can overcome some of that through parallelism, but 30K IOPs per 
OSD is probably pretty on-point for a nautilus era OSD.  For octopus+ 
the cache refactor in bluestore should get you farther (40-50k+ for 
and OSD in isolation).  The maximum performance we've seen in-house 
is around 70-80K IOPs on a single OSD using very fast NVMe and highly 
tuned settings.



A couple of things you can try:


- upgrade to octopus+ for the cache refactor

- Make sure you are using the equivalent of the latency-performance 
or latency-network tuned profile.  The most important part is 
disabling CPU cstate transitions.


- increase osd_memory_target if you have a larger dataset (onode 
cache misses in bluestore add a lot of latency)


- enable turbo if it's disabled (higher clock speed generally helps)


On the write path you are correct that there is a limitation 
regarding a single kv sync thread.  Over the years we've made this 
less of a bottleneck but it's possible you still could be hitting 
it.  In our test lab we've managed to utilize up to around 12-14 
cores on a single OSD in isolation with 16 tp_osd_tp worker threads 
and on a larger cluster about 6-7 cores per OSD.  There's probably 
multiple factors at play, including context switching, cache 
thrashing, memory throughput, object creation/destruction, etc.  If 
you decide to look into it further you may want to try wallclock 
profiling the OSD under load and seeing where it is spending its time. 


Thank you for feedback.

I forgot to mention this, it's Octopus, fresh installation.

I've disabled CSTATE (governor=performance), it make no difference - 
same iops, same CPU use by ceph-osd  I've just can't force Ceph to 
consume more than 330% of CPU. I can force read up to 150k IOPS (both 
network and local), hitting CPU limit, but write is somewhat 
restricted by ceph itself.



Ok, can I assume block/db/wal are all on the ramdisk?  I'd start a 
benchmark and attach gdbpmp to the OSD and see if you can get a 
callgraph (1000 samples is nice if you don't mind waiting a bit). That 
will tell us a lot more about where the code is spending time.  It will 
slow the benchmark way down fwiw.  Some other things you could try:  Try 
to tweak the number of osd worker threads to better match the number of 
cores in your system.  Too many and you end up with context switching.  
Too few and you limit parallelism.  You can also check rocksdb 
compaction stats in the osd logs using this tool:



https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py


Given that you are on ramdisk the 1GB default WAL limit should be plenty 
to let you avoid WAL throttling during compaction, but just verifying 
that compactions are not taking a long time is good peace of mind.
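
A sketch of those checks (the gdbpmp invocation is from memory of that 
tool's README, and the shard/thread values are examples only):

```
# Wallclock-profile the OSD under load (assumes a single ceph-osd on the host).
./gdbpmp.py -p $(pidof ceph-osd) -n 1000 -o osd.gdbpmp

# Worker thread pool sizing: shards x threads_per_shard = tp_osd_tp threads.
ceph config set osd osd_op_num_shards 8
ceph config set osd osd_op_num_threads_per_shard 2

# Quick look at compaction events in the OSD log (debug_rocksdb 4/5 logs them).
grep -i compaction /var/log/ceph/ceph-osd.0.log | tail -n 20
```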



Mark






___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph rbox test on passive compressed pool

2020-09-11 Thread david
On 09/11 09:36, Marc Roos wrote:
> 
> Hi David,
> 
> Just to let you know, this hint is being set, what is the reason for 
> ceph of doing only half the objects? Can it be that there is some issue 
> with my osd's? Like some maybe have an old fs (still using disk not 
> volume)? Is this still to be expected or does ceph under pressure drop 
> compressing?
> 
> https://github.com/ceph-dovecot/dovecot-ceph-plugin/blob/56d6c900cc9ec07dfb98ef2abac07aae466b7610/src/librmb/rados-storage-impl.cpp#L75


I was trying to look into this a bit :), can you give me more info about the
OSDs that you are using?
What filesystem are they on?

Cheers!
> 
>  Thanks,
> Marc
> 
> 
> 
> -Original Message-
> Cc: jan.radon
> Subject: Re: [ceph-users] ceph rbox test on passive compressed pool
> 
> The hints have to be given from the client side as far as I understand, 
> can you share the client code too?
> 
> Also,not seems that there's no guarantees that it will actually do 
> anything (best effort I guess):
> https://docs.ceph.com/docs/mimic/rados/api/librados/#c.rados_set_alloc_hint
> 
> Cheers
> 
> 
> On 6 September 2020 15:59:01 BST, Marc Roos  
> wrote:
> 
>   
>   
>   I have been inserting 10790 exactly the same 64kb text message to a 
> 
>   passive compressing enabled pool. I am still counting, but it looks 
> like 
>   only half the objects are compressed.  
>   
>   mail/b08c3218dbf1545ff43052412a8e mtime 2020-09-06 
> 16:27:39.00, 
>   size 63580
>   mail/00f6043775f1545ff43052412a8e mtime 2020-09-06 
> 16:25:57.00, 
>   size 525
>   mail/b875f40571f1545ff43052412a8e mtime 2020-09-06 
> 16:25:53.00, 
>   size 63580
>   mail/e87c120b19f1545ff43052412a8e mtime 2020-09-06 
> 16:24:25.00, 
>   size 525
>   
>   I am not sure if this should be expected from passive, these 
> docs[1] 
>   hint that passive 'compress if hinted COMPRESSIBLE'. From that I 
> would 
>   conclude that all text messages should be compressed. 
>   A previous test with a 64kb gzip attachment seemed to not compress, 
> 
>   although I did not look at all object sizes.
>   
>   
>   
>   on 14.2.11
>   
>   [1]
>   https://documentation.suse.com/ses/5.5/html/ses-all/ceph-pools.html
> #sec-ceph-pool-compression
>   https://docs.ceph.com/docs/mimic/rados/operations/pools/
> 
> 
>   ceph-users mailing list -- ceph-users@ceph.io
>   To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
> 
> 

-- 
David Caro
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph rbox test on passive compressed pool

2020-09-11 Thread Marc Roos
 
It is an hdd pool, all bluestore, configured with ceph-disk. Upgrades 
seem not to have 'updated' bluefs; some osds report like this:

{
"/dev/sdb2": {
"osd_uuid": "xxx",
"size": 4000681103360,
"btime": "2019-01-08 13:45:59.488533",
"description": "main",
"bluefs": "1",
"ceph_fsid": "x",
"kv_backend": "rocksdb",
"magic": "ceph osd volume v026",
"mkfs_done": "yes",
"ready": "ready",
"require_osd_release": "14",
"whoami": "3"
}
}

And some like this:
{
"/dev/sdh2": {
"osd_uuid": "xxx",
"size": 3000487051264,
"btime": "2017-07-14 14:45:59.212792",
"description": "main",
"require_osd_release": "14"
}
}
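
To see whether only some OSDs actually compress, the per-OSD BlueStore 
counters can be compared; a sketch (OSD ids are examples, run on the node 
hosting each OSD since it uses the local admin socket):

```
# Compare compression counters of an older and a newer OSD.
for id in 3 19; do
    echo "== osd.$id =="
    ceph daemon osd.$id perf dump bluestore | grep -i compress
done
```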



-Original Message-
Cc: ceph-users
Subject: Re: [ceph-users] ceph rbox test on passive compressed pool

On 09/11 09:36, Marc Roos wrote:
> 
> Hi David,
> 
> Just to let you know, this hint is being set, what is the reason for 
> ceph of doing only half the objects? Can it be that there is some 
> issue with my osd's? Like some maybe have an old fs (still using disk 
> not volume)? Is this still to be expected or does ceph under pressure 
> drop compressing?
> 
> https://github.com/ceph-dovecot/dovecot-ceph-plugin/blob/56d6c900cc9ec
> 07dfb98ef2abac07aae466b7610/src/librmb/rados-storage-impl.cpp#L75


I was trying to look into this a bit :), can you give me more info about 
the OSDs that you are using?
What filesystem are they on?

Cheers!
> 
>  Thanks,
> Marc
> 
> 
> 
> -Original Message-
> Cc: jan.radon
> Subject: Re: [ceph-users] ceph rbox test on passive compressed pool
> 
> The hints have to be given from the client side as far as I 
> understand, can you share the client code too?
> 
> Also,not seems that there's no guarantees that it will actually do 
> anything (best effort I guess):
> https://docs.ceph.com/docs/mimic/rados/api/librados/#c.rados_set_alloc
> _hint
> 
> Cheers
> 
> 
> On 6 September 2020 15:59:01 BST, Marc Roos 
> wrote:
> 
>   
>   
>   I have been inserting 10790 exactly the same 64kb text message to 
a
> 
>   passive compressing enabled pool. I am still counting, but it 
looks 
> like
>   only half the objects are compressed.  
>   
>   mail/b08c3218dbf1545ff43052412a8e mtime 2020-09-06 
> 16:27:39.00,
>   size 63580
>   mail/00f6043775f1545ff43052412a8e mtime 2020-09-06 
> 16:25:57.00,
>   size 525
>   mail/b875f40571f1545ff43052412a8e mtime 2020-09-06 
> 16:25:53.00,
>   size 63580
>   mail/e87c120b19f1545ff43052412a8e mtime 2020-09-06 
> 16:24:25.00,
>   size 525
>   
>   I am not sure if this should be expected from passive, these 
docs[1]
>   hint that passive 'compress if hinted COMPRESSIBLE'. From that I 
> would
>   conclude that all text messages should be compressed. 
>   A previous test with a 64kb gzip attachment seemed to not 
compress,
> 
>   although I did not look at all object sizes.
>   
>   
>   
>   on 14.2.11
>   
>   [1]
>   
https://documentation.suse.com/ses/5.5/html/ses-all/ceph-pools.html
> #sec-ceph-pool-compression
>   https://docs.ceph.com/docs/mimic/rados/operations/pools/
> 
> 
>   ceph-users mailing list -- ceph-users@ceph.io
>   To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
> 
> 

--
David Caro

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Issues with the ceph-bluestore-tool during cluster upgrade from Mimic to Nautilus

2020-09-11 Thread Jean-Philippe Méthot
Hi,

We’re upgrading our cluster OSD node per OSD node to Nautilus from Mimic. From 
some release notes, it was recommended to run the following command to fix 
stats after an upgrade :

ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0

However, running that command gives us the following error message:

> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc:
>  In
>  function 'virtual Allocator::SocketHook::~SocketHook()' thread 7f1a6467eec0 
> time 2020-09-10 14:40:25.872353
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc:
>  53
> : FAILED ceph_assert(r == 0)
>  ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus 
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x14a) [0x7f1a5a823025]
>  2: (()+0x25c1ed) [0x7f1a5a8231ed]
>  3: (()+0x3c7a4f) [0x55b33537ca4f]
>  4: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
>  5: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
>  6: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x55b335274528]
>  7: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1) [0x55b3352749a1]
>  8: (main()+0x10b3) [0x55b335187493]
>  9: (__libc_start_main()+0xf5) [0x7f1a574aa555]
>  10: (()+0x1f9b5f) [0x55b3351aeb5f]
> 2020-09-10 14:40:25.873 7f1a6467eec0 -1 
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc:
>  In function 'virtual Allocator::SocketHook::~SocketHook()' thread 
> 7f1a6467eec0 time 2020-09-10 14:40:25.872353
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc:
>  53: FAILED ceph_assert(r == 0)
> 
>  ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus 
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x14a) [0x7f1a5a823025]
>  2: (()+0x25c1ed) [0x7f1a5a8231ed]
>  3: (()+0x3c7a4f) [0x55b33537ca4f]
>  4: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
>  5: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
>  6: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x55b335274528]
>  7: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1) [0x55b3352749a1]
>  8: (main()+0x10b3) [0x55b335187493]
>  9: (__libc_start_main()+0xf5) [0x7f1a574aa555]
>  10: (()+0x1f9b5f) [0x55b3351aeb5f]
> *** Caught signal (Aborted) **
>  in thread 7f1a6467eec0 thread_name:ceph-bluestore-
> ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus 
> (stable)
>  1: (()+0xf630) [0x7f1a58cf0630]
>  2: (gsignal()+0x37) [0x7f1a574be387]
>  3: (abort()+0x148) [0x7f1a574bfa78]
>  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x199) [0x7f1a5a823074]
>  5: (()+0x25c1ed) [0x7f1a5a8231ed]
>  6: (()+0x3c7a4f) [0x55b33537ca4f]
>  7: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
>  8: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
>  9: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x55b335274528]
>  10: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1) [0x55b3352749a1]
>  11: (main()+0x10b3) [0x55b335187493]
>  12: (__libc_start_main()+0xf5) [0x7f1a574aa555]
>  13: (()+0x1f9b5f) [0x55b3351aeb5f]
> 2020-09-10 14:40:25.874 7f1a6467eec0 -1 *** Caught signal (Aborted) **
>  in thread 7f1a6467eec0 thread_name:ceph-bluestore-
> 
>  ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus 
> (stable)
>  1: (()+0xf630) [0x7f1a58cf0630]
>  2: (gsignal()+0x37) [0x7f1a574be387]
>  3: (abort()+0x148) [0x7f1a574bfa78]
>  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x199) [0x7f1a5a823074]
>  5: (()+0x25c1ed) [0x7f1a5a8231ed]
>  6: (()+0x3c7a4f) [0x55b33537ca4f]
>  7: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
>  8: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
>  9: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x55b335274528]
>  10: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1) [0x55b3352749a1]
>  11: (main()+0x10b3) [0x55b335187493]
>  12: (__libc_start_main()+0xf5) [0x7f1a574aa555]
>  13: (()+0x1f9b5f) [0x55b3351aeb5f]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
> interpret this.


What could be the source of this error? I haven’t found much of anything about 
it online.


Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
4414-4416 Louis B Mayer
Laval, QC, H7P 0G1, Canada
TEL : +1.514.802.1644 - Poste : 26

[ceph-users] Re: OSDs and tmpfs

2020-09-11 Thread Dimitri Savineau
> We have a 23 node cluster and normally when we add OSDs they end up 
> mounting like
> this:
> 
> /dev/sde1   3.7T  2.0T  1.8T  54% /var/lib/ceph/osd/ceph-15
> 
> /dev/sdj1   3.7T  2.0T  1.7T  55% /var/lib/ceph/osd/ceph-20
> 
> /dev/sdd1   3.7T  2.1T  1.6T  58% /var/lib/ceph/osd/ceph-14
> 
> /dev/sdc1   3.7T  1.8T  1.9T  49% /var/lib/ceph/osd/ceph-13
> 

I'm pretty sure those OSDs have been deployed with Filestore backend as the 
first partition of the device is the data partition and needs to be mounted.

> However I noticed this morning that the 3 new servers have the OSDs 
> mounted like
> this:
> 
> tmpfs47G   28K   47G   1% /var/lib/ceph/osd/ceph-246
> 
> tmpfs47G   28K   47G   1% /var/lib/ceph/osd/ceph-240
> 
> tmpfs47G   28K   47G   1% /var/lib/ceph/osd/ceph-248
> 
> tmpfs47G   28K   47G   1% /var/lib/ceph/osd/ceph-237
> 

And here, it looks like those OSDs are using Bluestore backend because this 
backend doesn't need to mount any data partitions.
What you're seeing is the Bluestore metadata in this tmpfs.
You should find in the mount point some useful information (fsid, keyring and 
symlinks to the data block and/or db/wal).

I don't know if you're using ceph-disk or ceph-volume but you can find 
information about this by running either:
  - ceph-disk list
  - ceph-volume lvm list
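
Another quick way to confirm the backend per OSD (the ids here are taken 
from the examples above):

```
# Reports "filestore" or "bluestore" for the given OSD id.
ceph osd metadata 15 | grep osd_objectstore
ceph osd metadata 246 | grep osd_objectstore
```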
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSDs and tmpfs

2020-09-11 Thread Marc Roos
 

I have also these mounts with bluestore

/dev/sde1 on /var/lib/ceph/osd/ceph-32 type xfs 
(rw,relatime,attr2,inode64,noquota)
/dev/sdb1 on /var/lib/ceph/osd/ceph-3 type xfs 
(rw,relatime,attr2,inode64,noquota)
/dev/sdc1 on /var/lib/ceph/osd/ceph-6 type xfs 
(rw,relatime,attr2,inode64,noquota)
/dev/sdd1 on /var/lib/ceph/osd/ceph-8 type xfs 
(rw,relatime,attr2,inode64,noquota)
/dev/sdj1 on /var/lib/ceph/osd/ceph-19 type xfs 
(rw,relatime,attr2,inode64,noquota)

[@c01 ~]# ls -l /var/lib/ceph/osd/ceph-0
total 52
-rw-r--r-- 1 ceph ceph  3 Aug 24  2017 active
lrwxrwxrwx 1 ceph ceph 58 Jun 30  2017 block -> 
/dev/disk/by-partuuid/63b970b7-2759-4eae-a66e-b84335eba598
-rw-r--r-- 1 ceph ceph 37 Jun 30  2017 block_uuid
-rw-r--r-- 1 ceph ceph  2 Jun 30  2017 bluefs
-rw-r--r-- 1 ceph ceph 37 Jun 30  2017 ceph_fsid
-rw-r--r-- 1 ceph ceph 37 Jun 30  2017 fsid
-rw--- 1 ceph ceph 56 Jun 30  2017 keyring
-rw-r--r-- 1 ceph ceph  8 Jun 30  2017 kv_backend
-rw-r--r-- 1 ceph ceph 21 Jun 30  2017 magic
-rw-r--r-- 1 ceph ceph  4 Jun 30  2017 mkfs_done
-rw-r--r-- 1 ceph ceph  6 Jun 30  2017 ready
-rw-r--r-- 1 ceph ceph  3 Oct 19  2019 require_osd_release
-rw-r--r-- 1 ceph ceph  0 Sep 26  2019 systemd
-rw-r--r-- 1 ceph ceph 10 Jun 30  2017 type
-rw-r--r-- 1 ceph ceph  2 Jun 30  2017 whoami


-Original Message-
To: ceph-users@ceph.io
Subject: [ceph-users] Re: OSDs and tmpfs

> We have a 23 node cluster and normally when we add OSDs they end 
> up mounting like
> this:
> 
> /dev/sde1   3.7T  2.0T  1.8T  54% /var/lib/ceph/osd/ceph-15
> 
> /dev/sdj1   3.7T  2.0T  1.7T  55% /var/lib/ceph/osd/ceph-20
> 
> /dev/sdd1   3.7T  2.1T  1.6T  58% /var/lib/ceph/osd/ceph-14
> 
> /dev/sdc1   3.7T  1.8T  1.9T  49% /var/lib/ceph/osd/ceph-13
> 

I'm pretty sure those OSDs have been deployed with Filestore backend as 
the first partition of the device is the data partition and needs to be 
mounted.

> However I noticed this morning that the 3 new servers have the 
> OSDs mounted like
> this:
> 
> tmpfs47G   28K   47G   1% /var/lib/ceph/osd/ceph-246
> 
> tmpfs47G   28K   47G   1% /var/lib/ceph/osd/ceph-240
> 
> tmpfs47G   28K   47G   1% /var/lib/ceph/osd/ceph-248
> 
> tmpfs47G   28K   47G   1% /var/lib/ceph/osd/ceph-237
> 

And here, it looks like those OSDs are using Bluestore backend because 
this backend doesn't need to mount any data partitions.
What you're seeing is the Bluestore metadata in this tmpfs.
You should find in the mount point some usefull information (fsid, 
keyring and symlinks to the data block and/or db/wal).

I don't know if you're using ceph-disk or ceph-volume but you can find 
information about this by running either:
  - ceph-disk list
  - ceph-volume lvm list
___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an 
email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Is it possible to assign osd id numbers?

2020-09-11 Thread Shain Miley
Thank you for your answer below.

I'm not looking to reuse them as much as I am trying to control what unused 
number is actually used.

For example if I have 20 osds and 2 have failed...when I replace a disk in one 
server I don't want it to automatically use the next lowest number for the osd 
assignment.

I understand what you mean about not focusing on the osd ids...but my ocd is 
making me ask the question.

Thanks,
Shain

On 9/11/20, 9:45 AM, "George Shuklin"  wrote:

On 11/09/2020 16:11, Shain Miley wrote:
> Hello,
> I have been wondering for quite some time whether or not it is possible 
to influence the osd.id numbers that are  assigned during an install.
>
> I have made an attempt to keep our osds in order over the last few years, 
but it is a losing battle without having some control over the osd assignment.
>
> I am currently using ceph-deploy to handle adding nodes to the cluster.
>
You can reuse osd numbers, but I strongly advice you not to focus on 
precise IDs. The reason is that you can have such combination of server 
faults, which will swap IDs no matter what.

It's a false sense of beauty to have 'ID of OSD match ID in the name of 
the server'.

How to reuse osd nums?

OSD number is used (and should be cleaned if OSD dies) in three places 
in Ceph:

1) Crush map: ceph osd crush rm osd.x

2) osd list: ceph osd rm osd.x

3) auth: ceph auth rm osd.x

The last one is often forgoten and is a usual reason for ceph-ansible to 
fail on new disk in the server.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSDs and tmpfs

2020-09-11 Thread Oliver Freyermuth
Hi together,

I believe the deciding factor is whether the OSD was deployed using ceph-disk 
(in "ceph-volume" speak, a "simple" OSD),
which means the metadata will be on a separate partition, or whether it was 
deployed with "ceph-volume lvm". 

The latter stores the metadata in LVM tags, so the extra partition is not 
needed anymore. 
Of course, since the general recommendation to move from filestore to bluestore 
came at a similar time as the ceph-disk deprecation,
there's usually correlation between these two ingredients ;-). 
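
A sketch of how to see that metadata directly, besides `ceph-volume lvm 
list` (assuming the standard ceph.* LVM tags):

```
# ceph-volume lvm OSDs: the metadata lives in LVM tags on the logical volume.
lvs -o lv_name,lv_tags | grep ceph.osd_id

# ceph-disk ("simple") OSDs can still be inventoried with:
ceph-volume simple scan
```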

Cheers,
Oliver

Am 11.09.20 um 20:33 schrieb Marc Roos:
>  
> 
> I have also these mounts with bluestore
> 
> /dev/sde1 on /var/lib/ceph/osd/ceph-32 type xfs 
> (rw,relatime,attr2,inode64,noquota)
> /dev/sdb1 on /var/lib/ceph/osd/ceph-3 type xfs 
> (rw,relatime,attr2,inode64,noquota)
> /dev/sdc1 on /var/lib/ceph/osd/ceph-6 type xfs 
> (rw,relatime,attr2,inode64,noquota)
> /dev/sdd1 on /var/lib/ceph/osd/ceph-8 type xfs 
> (rw,relatime,attr2,inode64,noquota)
> /dev/sdj1 on /var/lib/ceph/osd/ceph-19 type xfs 
> (rw,relatime,attr2,inode64,noquota)
> 
> [@c01 ~]# ls -l /var/lib/ceph/osd/ceph-0
> total 52
> -rw-r--r-- 1 ceph ceph  3 Aug 24  2017 active
> lrwxrwxrwx 1 ceph ceph 58 Jun 30  2017 block -> 
> /dev/disk/by-partuuid/63b970b7-2759-4eae-a66e-b84335eba598
> -rw-r--r-- 1 ceph ceph 37 Jun 30  2017 block_uuid
> -rw-r--r-- 1 ceph ceph  2 Jun 30  2017 bluefs
> -rw-r--r-- 1 ceph ceph 37 Jun 30  2017 ceph_fsid
> -rw-r--r-- 1 ceph ceph 37 Jun 30  2017 fsid
> -rw--- 1 ceph ceph 56 Jun 30  2017 keyring
> -rw-r--r-- 1 ceph ceph  8 Jun 30  2017 kv_backend
> -rw-r--r-- 1 ceph ceph 21 Jun 30  2017 magic
> -rw-r--r-- 1 ceph ceph  4 Jun 30  2017 mkfs_done
> -rw-r--r-- 1 ceph ceph  6 Jun 30  2017 ready
> -rw-r--r-- 1 ceph ceph  3 Oct 19  2019 require_osd_release
> -rw-r--r-- 1 ceph ceph  0 Sep 26  2019 systemd
> -rw-r--r-- 1 ceph ceph 10 Jun 30  2017 type
> -rw-r--r-- 1 ceph ceph  2 Jun 30  2017 whoami
> 
> 
> -Original Message-
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: OSDs and tmpfs
> 
>> We have a 23 node cluster and normally when we add OSDs they end 
>> up mounting like
>> this:
>>
>> /dev/sde1   3.7T  2.0T  1.8T  54% /var/lib/ceph/osd/ceph-15
>>
>> /dev/sdj1   3.7T  2.0T  1.7T  55% /var/lib/ceph/osd/ceph-20
>>
>> /dev/sdd1   3.7T  2.1T  1.6T  58% /var/lib/ceph/osd/ceph-14
>>
>> /dev/sdc1   3.7T  1.8T  1.9T  49% /var/lib/ceph/osd/ceph-13
>>
> 
> I'm pretty sure those OSDs have been deployed with Filestore backend as 
> the first partition of the device is the data partition and needs to be 
> mounted.
> 
>> However I noticed this morning that the 3 new servers have the 
>> OSDs mounted like
>> this:
>>
>> tmpfs47G   28K   47G   1% /var/lib/ceph/osd/ceph-246
>>
>> tmpfs47G   28K   47G   1% /var/lib/ceph/osd/ceph-240
>>
>> tmpfs47G   28K   47G   1% /var/lib/ceph/osd/ceph-248
>>
>> tmpfs47G   28K   47G   1% /var/lib/ceph/osd/ceph-237
>>
> 
> And here, it looks like those OSDs are using Bluestore backend because 
> this backend doesn't need to mount any data partitions.
> What you're seeing is the Bluestore metadata in this tmpfs.
> You should find some useful information in the mount point (fsid, 
> keyring and symlinks to the data block and/or db/wal).
> 
> I don't know if you're using ceph-disk or ceph-volume but you can find 
> information about this by running either:
>   - ceph-disk list
>   - ceph-volume lvm list
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an 
> email to ceph-users-le...@ceph.io
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Is it possible to assign osd id numbers?

2020-09-11 Thread Anthony D'Atri

Now that’s a *very* different question from numbers assigned during an install.

With recent releases, instead of going down the full removal litany listed 
below, you can down/out the OSD and `destroy` it.  That preserves the 
CRUSH bucket and OSD ID; then, when you use ceph-disk, ceph-volume, or 
what-have-you to deploy a replacement, you can specify that same, desired OSD 
ID on the command line.
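
As a rough sketch (osd.12 and /dev/sdx are hypothetical; adjust to your setup):

  ceph osd destroy 12 --yes-i-really-mean-it           # keeps the CRUSH bucket and the ID
  ceph-volume lvm create --osd-id 12 --data /dev/sdx   # redeploy reusing the same ID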

Note that as of 12.2.2, you’ll want to record and re-set any override reweight 
(manual or reweight-by-utilization), as that rarely if ever survives.  Also 
note that, again as of that release, if the replacement drive is a different 
size, the CRUSH weight is not adjusted, so you may (or may not) want to adjust 
the CRUSH weight yourself.  Slight differences aren’t usually a huge deal; big 
differences can mean you have unused capacity or overloaded drives.
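
For example (all values below are hypothetical), to capture and restore the 
override reweight and, if needed, adjust the CRUSH weight for a 
different-sized replacement drive:

  ceph osd df tree | grep 'osd\.12'        # note the current REWEIGHT and CRUSH weight
  ceph osd reweight 12 0.95000             # re-apply the recorded override reweight
  ceph osd crush reweight osd.12 7.27739   # optional: match the new drive's capacity (TiB)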

— Anthony

> 
> Thank you for your answer below.
> 
> I'm not looking to reuse them as much as I am trying to control what unused 
> number is actually used.
> 
> For example if I have 20 osds and 2 have failed...when I replace a disk in 
> one server I don't want it to automatically use the next lowest number for 
> the osd assignment.
> 
> I understand what you mean about not focusing on the osd ids...but my ocd is 
> making me ask the question.
> 
> Thanks,
> Shain
> 
> On 9/11/20, 9:45 AM, "George Shuklin"  wrote:
> 
>On 11/09/2020 16:11, Shain Miley wrote:
>> Hello,
>> I have been wondering for quite some time whether or not it is possible to 
>> influence the osd.id numbers that are  assigned during an install.
>> 
>> I have made an attempt to keep our osds in order over the last few years, 
>> but it is a losing battle without having some control over the osd 
>> assignment.
>> 
>> I am currently using ceph-deploy to handle adding nodes to the cluster.
>> 
>You can reuse OSD numbers, but I strongly advise you not to focus on 
>precise IDs. The reason is that certain combinations of server faults 
>will swap IDs no matter what you do.
> 
>It's a false sense of beauty to have the OSD ID match an ID in the name 
>of the server.
> 
>How to reuse osd nums?
> 
>OSD number is used (and should be cleaned if OSD dies) in three places 
>in Ceph:
> 
>1) Crush map: ceph osd crush rm osd.x
> 
>2) osd list: ceph osd rm osd.x
> 
>3) auth: ceph auth rm osd.x
> 
>The last one is often forgotten and is a common reason for ceph-ansible to 
>fail on a new disk in the server.
>___
>ceph-users mailing list -- ceph-users@ceph.io
>To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Issues with the ceph-bluestore-tool during cluster upgrade from Mimic to Nautilus

2020-09-11 Thread Jean-Philippe Méthot
Here’s the out file, as requested.



Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
4414-4416 Louis B Mayer
Laval, QC, H7P 0G1, Canada
TEL : +1.514.802.1644 - Poste : 2644
FAX : +1.514.612.0678
CA/US : 1.855.774.4678
FR : 01 76 60 41 43
UK : 0808 189 0423






> Le 11 sept. 2020 à 10:38, Igor Fedotov  a écrit :
> 
> Could you please run:
> 
> CEPH_ARGS="--log-file log --debug-asok 5" ceph-bluestore-tool repair --path 
> <...> ; cat log | grep asok > out
> 
> and share 'out' file.
> 
> 
> Thanks,
> 
> Igor
> 
> On 9/11/2020 5:15 PM, Jean-Philippe Méthot wrote:
>> Hi,
>> 
>> We’re upgrading our cluster OSD node per OSD node to Nautilus from Mimic. 
>> From some release notes, it was recommended to run the following command to 
>> fix stats after an upgrade :
>> 
>> ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0
>> 
>> However, running that command gives us the following error message:
>> 
>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc:
>>>  In
>>>  function 'virtual Allocator::SocketHook::~SocketHook()' thread 
>>> 7f1a6467eec0 time 2020-09-10 14:40:25.872353
>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc:
>>>  53
>>> : FAILED ceph_assert(r == 0)
>>>  ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus 
>>> (stable)
>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>>> const*)+0x14a) [0x7f1a5a823025]
>>>  2: (()+0x25c1ed) [0x7f1a5a8231ed]
>>>  3: (()+0x3c7a4f) [0x55b33537ca4f]
>>>  4: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
>>>  5: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
>>>  6: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x55b335274528]
>>>  7: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1) [0x55b3352749a1]
>>>  8: (main()+0x10b3) [0x55b335187493]
>>>  9: (__libc_start_main()+0xf5) [0x7f1a574aa555]
>>>  10: (()+0x1f9b5f) [0x55b3351aeb5f]
>>> 2020-09-10 14:40:25.873 7f1a6467eec0 -1 
>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc:
>>>  In function 'virtual Allocator::SocketHook::~SocketHook()' thread 
>>> 7f1a6467eec0 time 2020-09-10 14:40:25.872353
>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc:
>>>  53: FAILED ceph_assert(r == 0)
>>> 
>>>  ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus 
>>> (stable)
>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>>> const*)+0x14a) [0x7f1a5a823025]
>>>  2: (()+0x25c1ed) [0x7f1a5a8231ed]
>>>  3: (()+0x3c7a4f) [0x55b33537ca4f]
>>>  4: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
>>>  5: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
>>>  6: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x55b335274528]
>>>  7: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1) [0x55b3352749a1]
>>>  8: (main()+0x10b3) [0x55b335187493]
>>>  9: (__libc_start_main()+0xf5) [0x7f1a574aa555]
>>>  10: (()+0x1f9b5f) [0x55b3351aeb5f]
>>> *** Caught signal (Aborted) **
>>>  in thread 7f1a6467eec0 thread_name:ceph-bluestore-
>>> ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus 
>>> (stable)
>>>  1: (()+0xf630) [0x7f1a58cf0630]
>>>  2: (gsignal()+0x37) [0x7f1a574be387]
>>>  3: (abort()+0x148) [0x7f1a574bfa78]
>>>  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>>> const*)+0x199) [0x7f1a5a823074]
>>>  5: (()+0x25c1ed) [0x7f1a5a8231ed]
>>>  6: (()+0x3c7a4f) [0x55b33537ca4f]
>>>  7: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
>>>  8: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
>>>  9: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x55b335274528]
>>>  10: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1) [0x55b3352749a1]
>>>  11: (main()+0x10b3) [0x55b335187493]
>>>  12: (__libc_start_main()+0xf5) [0x7f1a574aa555]
>>>  13: (()+0x1f9b5f) [0x55b3351aeb5f]
>>> 2020-09-10 14:40:25.874 7f1a6467eec0 -1 *** Caught signal (Aborted) **
>>>  in thread 7f1a6467eec0 thread_name:ceph-bluestore-
>>> 
>>>  ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus 
>>> (stable)
>>>  1: (()+0xf630) [0x7f1a58cf0630]
>>>  2: (gsignal()+0x37) [0x7f1a574be387]
>>>  3: (abort()+0x148) [0x7f1a574bfa78]
>>>  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>>> const*)+0x199) [0x7f1a5a823074]
>>>  5: (()+0x25c1ed) [0x7f1a5a8231ed]
>>>  6: (()

[ceph-users] Re: The confusing output of ceph df command

2020-09-11 Thread norman

Igor,

I think I misunderstood the output of USED. The value reported is the 
allocated size, which is why it does not always equal 1.5 * STORED.


For example, when writing a 4k file, BlueStore may allocate 64k, which 
appears to use more space; but if you write another 4k, it can reuse the 
same blob. (I will validate this guess.)
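
A trivial sketch of that rounding (assuming the default 64K 
bluestore_min_alloc_size_hdd on spinners):

  # a single 4K write still consumes one full 64K allocation unit on disk
  echo $(( ((4096 + 65535) / 65536) * 65536 ))   # prints 65536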


So ceph df may not accurately reflect how much new data we can still store.

I'm reading the Ceph code and will reply to this thread once I understand 
the correct behaviour.


Thanks,

Norman

On 11/9/2020 上午7:40, Igor Fedotov wrote:

Norman,

> default-fs-data0    9    374 TiB    1.48G    939 TiB    74.71    212 TiB


Given the above numbers, the 'default-fs-data0' pool has an average object 
size of around 256K (374 TiB / 1.48G objects). Are you sure that the 
absolute majority of your objects in this pool are 4M?



Wondering what the df report looks like for the 'good' cluster?


Additionally (given that default-fs-data0 keeps most of the data for the 
cluster) you might want to estimate allocation losses via performance 
counter inspection: bluestore_stored vs. bluestore_allocated.


Summing the delta between them over all (HDD?) OSDs would give you the 
total loss. A simpler way is to take the deltas from 2-3 OSDs and 
multiply the average delta by the number of OSDs. This is less precise 
but statistically should be good enough...
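
A quick sketch of how one might read those two counters on a single OSD 
(assuming shell access to the OSD host and its admin socket; osd.0 is just 
an example):

  ceph daemon osd.0 perf dump | grep -E '"bluestore_(allocated|stored)"'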



Thanks,

Igor


On 9/10/2020 5:10 AM, norman wrote:

Igor,

Thanks for your reply.  The object size is 4M and there are almost no 
overwrites in the pool, so why did the space loss happen?


I have another cluster with the same config whose USED is almost equal 
to 1.5 * STORED; the difference between the two clusters is:


The cluster has different OSD sizes (12T and 8T).

Norman
On 9/9/2020 下午7:17, Igor Fedotov wrote:

Hi Norman,

I am not pretending to know the exact root cause, but IMO one working 
hypothesis might be as follows:


Presuming spinners as backing devices for your OSDs, and hence a 64K 
allocation unit (the bluestore min_alloc_size_hdd param).


1) 1.48G user objects result in 1.48G * 6 = 8.88G EC shards.

2) Shards tend to be unaligned with the 64K allocation unit, which might 
result in an average loss of 32K per shard.


3) Hence the total loss due to allocation overhead can be estimated at 
32K * 8.88G = 284T, which looks close enough to your numbers for 
default-fs-data0:


939 TiB - (374 TiB / 4 * 6) = 378 TiB of space loss.


An additional issue which might result in space loss is the space 
amplification caused by partial, unaligned overwrites of objects in an 
EC pool. See my post "Root cause analysis for space overhead with 
erasure coded pools" to the d...@ceph.io mailing list on Jan 23.



Migrating to a 4K min alloc size seems to be the only known way to fix 
(or rather work around) these issues. The upcoming Pacific release is 
going to bring the default down to 4K (for new OSD deployments), along 
with some additional changes to smooth out the corresponding negative 
performance impacts.
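
For reference, the configured value can be checked via the admin socket (a 
sketch; osd.0 is just an example, and note that the allocation unit is baked 
in at OSD creation time, so this shows the current config, not necessarily 
what an existing OSD was formatted with):

  ceph daemon osd.0 config get bluestore_min_alloc_size_hdd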



Hope this helps.

Igor



On 9/9/2020 2:30 AM, norman kern wrote:

Hi,

I have changed most of the pools from 3-replica to EC 4+2 in my 
cluster. When I use the ceph df command to show the used capacity of 
the cluster:

RAW STORAGE:
    CLASS      SIZE     AVAIL    USED     RAW USED  %RAW USED
    hdd        1.8 PiB  788 TiB  1.0 PiB  1.0 PiB   57.22
    ssd        7.9 TiB  4.6 TiB  181 GiB  3.2 TiB   41.15
    ssd-cache  5.2 TiB  5.2 TiB  67 GiB   73 GiB    1.36
    TOTAL      1.8 PiB  798 TiB  1.0 PiB  1.0 PiB   56.99


POOLS:
    POOL                            ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
    default-oss.rgw.control         1   0 B      8        0 B      0      1.3 TiB
    default-oss.rgw.meta            2   22 KiB   97       3.9 MiB  0      1.3 TiB
    default-oss.rgw.log             3   525 KiB  223      621 KiB  0      1.3 TiB
    default-oss.rgw.buckets.index   4   33 MiB   34       33 MiB   0      1.3 TiB
    default-oss.rgw.buckets.non-ec  5   1.6 MiB  48       3.8 MiB  0      1.3 TiB
    .rgw.root                       6   3.8 KiB  16       720 KiB  0      1.3 TiB
    default-oss.rgw.buckets.data    7   274 GiB  185.39k  450 GiB  0.14   212 TiB
    default-fs-metadata             8   488 GiB  153.10M  490 GiB  10.65  1.3 TiB
    default-fs-data0                9   374 TiB  1.48G    939 TiB  74.71  212 TiB


   ...

The relationship USED = 3 * STORED is completely right in 3-replica mode, 
but for the EC 4+2 pool (default-fs-data0) the USED is not equal to 
1.5 * STORED, why... :(


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-u