[ceph-users] Re: osd_pglog memory hoarding - another case

2020-12-14 Thread Kalle Happonen
Hi all,
Ok, so I have some updates on this.

We noticed that we had a bucket with tons of RGW garbage collection pending. It 
was growing faster than we could clean it up.

We suspect this was because users tried to do "s3cmd sync" operations on 
SWIFT-uploaded large files. This could plausibly cause issues, since S3 and 
SWIFT calculate md5sums differently for large objects. 

The following command showed the pending GC and also which buckets were 
affected: 

radosgw-admin gc list | grep oid > garbagecollectionlist.txt

Our total RGW GC backlog was up to ~40 million entries.

We stopped the main s3sync workflow which was affecting the GC growth. Then we 
started running more aggressive radosgw garbage collection.
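
For reference, kicking off the more aggressive GC looks roughly like this (the 
flag below processes entries even before their normal expiration; the tuning 
options are only examples of what to look at, not exactly what we ran):

radosgw-admin gc process --include-all

The background GC behaviour is governed by options like rgw_gc_max_concurrent_io, 
rgw_gc_processor_max_time and rgw_gc_obj_min_wait, which can be adjusted in the 
RGW section of ceph.conf if the defaults can't keep up with the backlog.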

This really helped with the memory use. It dropped a lot, and now that the GC 
backlog has been cleaned up, memory has *knock on wood* stayed at a lower, more 
stable level.

So we hope we found the (or a) trigger for the problem.

Hopefully this gives another thread to pull for others debugging the same issue 
(and for us when we hit it again).

Cheers,
Kalle

- Original Message -
> From: "Dan van der Ster" 
> To: "Kalle Happonen" 
> Cc: "ceph-users" 
> Sent: Tuesday, 1 December, 2020 16:53:50
> Subject: Re: [ceph-users] Re: osd_pglog memory hoarding - another case

> Hi Kalle,
> 
> Thanks for the update. Unfortunately I haven't made any progress on
> understanding the root cause of this issue.
> (We are still tracking our mempools closely in grafana and in our case
> they are no longer exploding like in the incident.)
> 
> Cheers, Dan
> 
> On Tue, Dec 1, 2020 at 3:49 PM Kalle Happonen  wrote:
>>
>> Quick update, restarting OSDs is not enough for us to compact the db. So we
>> stop the osd
>> ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$osd compact
>> start the osd
>>
>> It seems to fix the spillover. Until it grows again.
>>
>> Cheers,
>> Kalle
>>
>> - Original Message -
>> > From: "Kalle Happonen" 
>> > To: "Dan van der Ster" 
>> > Cc: "ceph-users" 
>> > Sent: Tuesday, 1 December, 2020 15:09:37
>> > Subject: [ceph-users] Re: osd_pglog memory hoarding - another case
>>
>> > Hi All,
>> > back to this. Dan, it seems we're following exactly in your footsteps.
>> >
>> > We recovered from our large pg_log, and got the cluster running. A week 
>> > after
>> > our cluster was ok, we started seeing big memory increases again. I don't 
>> > know
>> > if we had buffer_anon issues before or if our big pg_logs were masking it. 
>> > But
>> > we started seeing bluefs spillover and buffer_anon growth.
>> >
>> > This led to whole other series of problems with OOM killing, which probably
>> > resulted in mon node db growth which filled the disk, which  resulted in 
>> > all
>> > mons going down, and a bigger mess of bringing everything up.
>> >
>> > However. We're back. But I think we can confirm the buffer_anon growth, and
>> > bluefs spillover.
>> >
>> > We now have a job that constatly writes 10k objects in a buckets and 
>> > deletes
>> > them.
>> >
>> > This may curb the memory growth, but I don't think it stops the problem. 
>> > We're
>> > just testing restarting OSDs and while it takes a while, it seems it may 
>> > help.
>> > Of course this is not the greatest fix in production.
>> >
>> > Has anybody gleaned any new information on this issue? Things to tweaks? 
>> > Fixes
>> > in the horizon? Other mitigations?
>> >
>> > Cheers,
>> > Kalle
>> >
>> >
>> > - Original Message -
>> >> From: "Kalle Happonen" 
>> >> To: "Dan van der Ster" 
>> >> Cc: "ceph-users" 
>> >> Sent: Thursday, 19 November, 2020 13:56:37
>> >> Subject: [ceph-users] Re: osd_pglog memory hoarding - another case
>> >
>> >> Hello,
>> >> I thought I'd post an update.
>> >>
>> >> Setting the pg_log size to 500, and running the offline trim operation
>> >> sequentially on all OSDs seems to help. With our current setup, it takes 
>> >> about
>> >> 12-48h per node, depending on the pgs per osd. The PG amounts per OSD we 
>> >> have
>> >> are ~180-750, with a majority around 200, and some nodes consistently 
>> >> have 500
>> >> per OSD. The limiting factor of the recovery time seems to be our nvme, 
>> >> which
>> >> we use for rocksdb for the OSDs.
>> >>
>> >> We haven't fully recovered yet, we're working on it. Almost all our PGs 
>> >> are back
>> >> up, we still have ~40/18000 PGs down, but I think we'll get there. 
>> >> Currently
>> >> ~40 OSDs/1200 down.
>> >>
>> >> It seems like the previous mention of 32kB / pg_log entry seems in the 
>> >> correct
>> >> magnitude for us too. If we count 32kB * 200 pgs * 3000 log entries, we're
>> >> close to the 20 GB / OSD process.
>> >>
>> >> For the nodes that have been trimmed, we're hovering around 100 GB/node of
>> >> memory use, or ~4 GB per OSD, and so far seems stable, but we don't have 
>> >> longer
>> >> term data on that, and we don't know exactly how it behaves when load is
>> >> applied. However if we're currently at the pg_log limit of 500, adding 
>> >> load
>> >> sh

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-14 Thread huxia...@horebdata.cn
Hello, Kalle,

Your comments about some bugs with pg_log memory and buffer_anon memory 
growth worry me a lot, as I am planning to build a cluster with the latest 
Nautilus version.

Could you please comment on how to safely deal with these bugs, or how to 
avoid them if they do occur?

thanks a lot,

samuel



huxia...@horebdata.cn
 
From: Kalle Happonen
Date: 2020-12-14 08:28
To: Stefan Wild
CC: ceph-users
Subject: [ceph-users] Re: OSD reboot loop after running out of memory
Hi Stefan,
we had been seeing OSDs OOMing on 14.2.13, but on a larger scale. In our case 
we hit some bugs with pg_log memory growth and buffer_anon memory growth. Can 
you check what's taking up the memory on the OSD with the following command?
 
ceph daemon osd.123 dump_mempools
 
 
Cheers,
Kalle
 
- Original Message -
> From: "Stefan Wild" 
> To: "Igor Fedotov" , "ceph-users" 
> Sent: Sunday, 13 December, 2020 14:46:44
> Subject: [ceph-users] Re: OSD reboot loop after running out of memory
 
> Hi Igor,
> 
> Full osd logs from startup to failed exit:
> https://tiltworks.com/osd.1.log
> 
> In other news, can I expect osd.10 to go down next?
> 
> Dec 13 07:40:14 ceph-tpa-server1 bash[1825010]: debug
> 2020-12-13T12:40:14.823+ 7ff37c2e1700 -1 osd.7 13375 heartbeat_check: no
> reply from 172.18.189.20:6878 osd.10 since back 
> 2020-12-13T12:39:43.310905+
> front 2020-12-13T12:39:43.311164+ (oldest deadline
> 2020-12-13T12:40:06.810981+)
> Dec 13 07:40:15 ceph-tpa-server1 bash[1824817]: debug
> 2020-12-13T12:40:15.055+ 7f9220af3700 -1 osd.11 13375 heartbeat_check: no
> reply from 172.18.189.20:6878 osd.10 since back 
> 2020-12-13T12:39:42.972558+
> front 2020-12-13T12:39:42.972702+ (oldest deadline
> 2020-12-13T12:40:05.272435+)
> Dec 13 07:40:15 ceph-tpa-server1 bash[2060428]: debug
> 2020-12-13T12:40:15.155+ 7fb247eaf700 -1 osd.8 13375 heartbeat_check: no
> reply from 172.18.189.20:6878 osd.10 since back 
> 2020-12-13T12:39:42.181904+
> front 2020-12-13T12:39:42.181856+ (oldest deadline
> 2020-12-13T12:40:06.281648+)
> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
> 2020-12-13T12:40:15.171+ 7fe929be8700  1 
> mon.ceph-tpa-server1@0(leader).osd
> e13375 prepare_failure osd.10
> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.2
> is reporting failure:0
> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
> 2020-12-13T12:40:15.171+ 7fe929be8700  0 log_channel(cluster) log [DBG] :
> osd.10 failure report canceled by osd.2
> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: cluster
> 2020-12-13T12:40:15.176057+ mon.ceph-tpa-server1 (mon.0) 1172513 : cluster
> [DBG] osd.10 failure report canceled by osd.2
> Dec 13 07:40:15 ceph-tpa-server1 bash[1824779]: debug
> 2020-12-13T12:40:15.295+ 7fa60679a700 -1 osd.0 13375 heartbeat_check: no
> reply from 172.18.189.20:6878 osd.10 since back 
> 2020-12-13T12:39:43.326792+
> front 2020-12-13T12:39:43.32+ (oldest deadline
> 2020-12-13T12:40:07.426786+)
> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
> 2020-12-13T12:40:15.423+ 7fe929be8700  1 
> mon.ceph-tpa-server1@0(leader).osd
> e13375 prepare_failure osd.10
> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.6
> is reporting failure:0
> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
> 2020-12-13T12:40:15.423+ 7fe929be8700  0 log_channel(cluster) log [DBG] :
> osd.10 failure report canceled by osd.6
> Dec 13 07:40:15 ceph-tpa-server1 bash[1824845]: debug
> 2020-12-13T12:40:15.447+ 7f85048db700 -1 osd.3 13375 heartbeat_check: no
> reply from 172.18.189.20:6878 osd.10 since back 
> 2020-12-13T12:39:39.770822+
> front 2020-12-13T12:39:39.770700+ (oldest deadline
> 2020-12-13T12:40:05.070662+)
> Dec 13 07:40:15 ceph-tpa-server1 bash[231499]: debug
> 2020-12-13T12:40:15.687+ 7fa8e1800700 -1 osd.4 13375 heartbeat_check: no
> reply from 172.18.189.20:6878 osd.10 since back 
> 2020-12-13T12:39:39.977106+
> front 2020-12-13T12:39:39.977176+ (oldest deadline
> 2020-12-13T12:40:04.677320+)
> Dec 13 07:40:15 ceph-tpa-server1 bash[1825010]: debug
> 2020-12-13T12:40:15.799+ 7ff37c2e1700 -1 osd.7 13375 heartbeat_check: no
> reply from 172.18.189.20:6878 osd.10 since back 
> 2020-12-13T12:39:43.310905+
> front 2020-12-13T12:39:43.311164+ (oldest deadline
> 2020-12-13T12:40:06.810981+)
> Dec 13 07:40:16 ceph-tpa-server1 bash[1824817]: debug
> 2020-12-13T12:40:16.019+ 7f9220af3700 -1 osd.11 13375 heartbeat_check: no
> reply from 172.18.189.20:6878 osd.10 since back 
> 2020-12-13T12:39:42.972558+
> front 2020-12-13T12:39:42.972702+ (oldest deadline
> 2020-12-13T12:40:05.272435+)
> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
> 2020-12-13T12:40:16.179+ 7fe929be8700  1 
> mon.ceph-tpa-server1@0(leader).osd
> e13375 prepare_failure osd.10
> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.4
> i

[ceph-users] Removing an applied service set

2020-12-14 Thread Michael Wodniok
Hi,

we created multiple CephFS filesystems, which involved deploying multiple MDS 
services using `ceph orch apply mds [...]`. That worked like a charm.

Now the filesystem has been removed and its leftovers should also be removed, 
but I can't delete the services, as the cephadm/orchestration module keeps 
recreating them. What is the "official" way to delete such an applied service 
set? Setting the placement size to 0 is not possible in Ceph 15.

Kind regards,
Michael




[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-14 Thread Kalle Happonen
Hi Samuel,
I think we're hitting some niche cases. Most of our experience (and links to 
other posts) is here.

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/EWPPEMPAJQT6GGYSHM7GIM3BZWS2PSUY/

For the pg_log issue, the default of 3000 might be too large for some 
installations, depending on your PG count. We have set it to 400.
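
In case it's useful, this is roughly how such a limit can be applied via the 
config framework (the option names are the standard ones; 400 is just the value 
we picked, and existing logs only shrink once the PGs trim):

ceph config set osd osd_min_pg_log_entries 400
ceph config set osd osd_max_pg_log_entries 400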

For the buffer_anon problem, there is some speculation that it started when 
buffer_anon trimming changed. I assume it will be fixed in a new version; these 
two PRs may be candidates for fixes.

https://github.com/ceph/ceph/pull/35171
https://github.com/ceph/ceph/pull/35584

Cheers,
Kalle

- Original Message -
> From: huxia...@horebdata.cn
> To: "Kalle Happonen" , "Stefan Wild" 
> 
> Cc: "ceph-users" 
> Sent: Monday, 14 December, 2020 10:27:57
> Subject: Re: [ceph-users] Re: OSD reboot loop after running out of memory

> Hello, Kalle,
> 
> Your comments abount some bugs with  pg_log memory and buffer_anon memory 
> growth
> worry me a lot, as i am planning to build a cluster with the latest Nautilous
> version.
> 
> Could you please comment on, how to safely deal with these bugs or to avoid, 
> if
> indeed they occur?
> 
> thanks a lot,
> 
> samuel
> 
> 
> 
> huxia...@horebdata.cn
> 
> From: Kalle Happonen
> Date: 2020-12-14 08:28
> To: Stefan Wild
> CC: ceph-users
> Subject: [ceph-users] Re: OSD reboot loop after running out of memory
> Hi Stefan,
> we had been seeing OSDs OOMing on 14.2.13, but on a larger scale. In our case 
> we
> hit a some bugs with pg_log memory growth and buffer_anon memory growth. Can
> you check what's taking up the memory on the OSD with the following command?
> 
> ceph daemon osd.123 dump_mempools
> 
> 
> Cheers,
> Kalle
> 
> - Original Message -
>> From: "Stefan Wild" 
>> To: "Igor Fedotov" , "ceph-users" 
>> Sent: Sunday, 13 December, 2020 14:46:44
>> Subject: [ceph-users] Re: OSD reboot loop after running out of memory
> 
>> Hi Igor,
>> 
>> Full osd logs from startup to failed exit:
>> https://tiltworks.com/osd.1.log
>> 
>> In other news, can I expect osd.10 to go down next?
>> 
>> Dec 13 07:40:14 ceph-tpa-server1 bash[1825010]: debug
>> 2020-12-13T12:40:14.823+ 7ff37c2e1700 -1 osd.7 13375 heartbeat_check: no
>> reply from 172.18.189.20:6878 osd.10 since back 
>> 2020-12-13T12:39:43.310905+
>> front 2020-12-13T12:39:43.311164+ (oldest deadline
>> 2020-12-13T12:40:06.810981+)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1824817]: debug
>> 2020-12-13T12:40:15.055+ 7f9220af3700 -1 osd.11 13375 heartbeat_check: no
>> reply from 172.18.189.20:6878 osd.10 since back 
>> 2020-12-13T12:39:42.972558+
>> front 2020-12-13T12:39:42.972702+ (oldest deadline
>> 2020-12-13T12:40:05.272435+)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[2060428]: debug
>> 2020-12-13T12:40:15.155+ 7fb247eaf700 -1 osd.8 13375 heartbeat_check: no
>> reply from 172.18.189.20:6878 osd.10 since back 
>> 2020-12-13T12:39:42.181904+
>> front 2020-12-13T12:39:42.181856+ (oldest deadline
>> 2020-12-13T12:40:06.281648+)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:15.171+ 7fe929be8700  1 
>> mon.ceph-tpa-server1@0(leader).osd
>> e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from 
>> osd.2
>> is reporting failure:0
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:15.171+ 7fe929be8700  0 log_channel(cluster) log [DBG] :
>> osd.10 failure report canceled by osd.2
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:15.176057+ mon.ceph-tpa-server1 (mon.0) 1172513 : 
>> cluster
>> [DBG] osd.10 failure report canceled by osd.2
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1824779]: debug
>> 2020-12-13T12:40:15.295+ 7fa60679a700 -1 osd.0 13375 heartbeat_check: no
>> reply from 172.18.189.20:6878 osd.10 since back 
>> 2020-12-13T12:39:43.326792+
>> front 2020-12-13T12:39:43.32+ (oldest deadline
>> 2020-12-13T12:40:07.426786+)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:15.423+ 7fe929be8700  1 
>> mon.ceph-tpa-server1@0(leader).osd
>> e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from 
>> osd.6
>> is reporting failure:0
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:15.423+ 7fe929be8700  0 log_channel(cluster) log [DBG] :
>> osd.10 failure report canceled by osd.6
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1824845]: debug
>> 2020-12-13T12:40:15.447+ 7f85048db700 -1 osd.3 13375 heartbeat_check: no
>> reply from 172.18.189.20:6878 osd.10 since back 
>> 2020-12-13T12:39:39.770822+
>> front 2020-12-13T12:39:39.770700+ (oldest deadline
>> 2020-12-13T12:40:05.070662+)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[231499]: debug
>> 2020-12-13T12:40:15.687+ 7fa8e1800700 -1 osd.4 13375 heartbeat_check: no
>> reply from 172.18.189.20:6878 osd.10 s

[ceph-users] Re: Removing an applied service set

2020-12-14 Thread Eugen Block
Do you have a spec file for the MDS services, or how did you deploy them? If 
you have a yml file with the MDS placement, just remove the entries from that 
file and run 'ceph orch apply -i mds.yml'.


You can export your current config with this command and then modify  
the file to your needs:


cephadm:~ # ceph orch ls mds mds.cephfs --export yaml
service_type: mds
service_id: cephfs
service_name: mds.cephfs
placement:
  hosts:
  - host5
  - host6


Then if you apply the changes, cephadm should not redeploy those  
deleted services. It's possible that you will have to clean up `ceph auth  
ls` and remove the deleted MDS keyrings. Also check `cephadm ls` on the  
MDS nodes to verify the containers have been removed.
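
For example, something along these lines (the daemon name below is made up, 
yours will differ):

ceph auth ls | grep mds.cephfs
ceph auth rm mds.cephfs.host5.abcdef

cephadm ls | grep mds     # on the former MDS hosts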


Regards,
Eugen


Zitat von Michael Wodniok :


Hi,

we created multiple CephFS, this invloved deploying mutliple  
mds-services using `ceph orch apply mds [...]`. Worked like a charm.


Now the filesystem has been removed and the leftovers of the  
filesystem should also be removed, but I can't delete the services  
as cephadm/orchestration module is recreating them. What is the  
"official" way to delete this applied service set? Setting placement  
size to 0 is not possible in Ceph 15.


Kind regards,
Michael





[ceph-users] Re: PGs down

2020-12-14 Thread Igor Fedotov

Hi Jeremy,

I think you have lost the data for OSD.11 & .12. I'm not aware of any 
reliable enough way to recover RocksDB from this sort of error.


Theoretically you might want to disable auto compaction for RocksDB on 
these daemons, try to bring them up, and then attempt to drain the data out 
of them to different OSDs. As the log you shared currently shows an 
error during compaction, there is some chance that during regular 
operation the OSD wouldn't need the broken data (at least for some time). In 
fact I've never heard of anyone trying this approach, so this would be a 
pretty cutting-edge investigation...


Honestly the chance of 100% success is pretty low but some additional 
data might be saved.
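
If anyone wants to experiment with that, a rough sketch of what I mean (the 
exact option string is an assumption on my side and should be tried on a 
scratch OSD first; note that bluestore_rocksdb_options replaces the whole 
default string, so the existing defaults have to be repeated):

[osd.11]
bluestore_rocksdb_options = <existing default options>,disable_auto_compactions=true

Then start the OSD and mark it out (ceph osd out 11) so that recovery drains 
whatever it can still read to other OSDs.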



Back to the root causes of the DB corruption itself.

As it looks like we have some data consistency issues with RocksDB in 
the latest Octopus and Nautilus releases, I'm currently trying to collect 
stats for the known cases. Hence I'd highly appreciate it if you could 
answer the following questions:


1) Have I got it right that the hardware issue happened on the same 
node where OSD.11 & .12 are located? Or are they on a different node 
and only crashed after the hardware failure happened on that node, and have 
been unable to start since then?


2) If they're on the same node - do they have standalone DB/WAL volumes? 
If so, have you checked those for hardware failures as well?


3) Not sure if it makes sense but just in case - have you checked dmesg 
output for any disk errors as well?


4) Have you performed a Ceph upgrade recently? Or more generally - was 
the cluster deployed with the current Ceph version, or with an earlier one?



Thanks,

Igor


On 12/14/2020 5:05 AM, Jeremy Austin wrote:
OSD 12 looks much the same. I don't have logs back to the original 
date, but this looks very similar — db/sst corruption. The standard 
fsck approaches couldn't fix it. I believe it was a form of ATA 
failure — OSD 11 and 12, if I recall correctly, did not actually 
experience SMARTD-reportable errors. (Essentially, fans died on an 
internal SATA enclosure. As the enclosure had no sensor mechanism, I 
didn't realize it until drive temps started to climb. I believe most 
of the drives survived OK, but the enclosure itself I ultimately had 
to completely bypass, even after replacing fans.)


My assumption, once ceph fsck approaches failed, was that I'd need to 
mark 11 and 12 (and maybe 4) as lost, but I was reluctant to do so 
until I confirmed that I had absolutely lost data beyond recall.


On Sat, Dec 12, 2020 at 10:24 PM Igor Fedotov wrote:


Hi Jeremy,

wondering what were the OSDs' logs when they crashed for the first
time?

And does OSD.12 reports the similar problem for now:

3> 2020-12-12 20:23:45.756 7f2d21404700 -1 rocksdb: submit_common
error: Corruption: block checksum mismatch: expected 3113305400,
got 1242690251 in db/000348.sst offset 47935290 size 4704 code = 2
Rocksdb transaction:

?

Thanks,
Igor
On 12/13/2020 8:48 AM, Jeremy Austin wrote:

I could use some input from more experienced folks…

First time seeing this behavior. I've been running ceph in production
(replicated) since 2016 or earlier.

This, however, is a small 3-node cluster for testing EC. Crush map rules
should sustain the loss of an entire node.
Here's the EC rule:

rule cephfs425 {
id 6
type erasure
min_size 3
max_size 6
step set_chooseleaf_tries 40
step set_choose_tries 400
step take default
step choose indep 3 type host
step choose indep 2 type osd
step emit
}


I had actual hardware failure on one node. Interestingly, this appears to
have resulted in data loss. OSDs began to crash in a cascade on other nodes
(i.e., nodes with no known hardware failure). Not a low RAM problem.

I could use some pointers about how to get the down PGs back up — I *think*
there are enough EC shards, even disregarding the OSDs that crash on start.

nautilus 14.2.15

  ceph osd tree
ID  CLASS WEIGHT   TYPE NAME   STATUS REWEIGHT PRI-AFF
  -1   54.75960 root default
-10   16.81067 host sumia
   1   hdd  5.57719 osd.1   up  1.0 1.0
   5   hdd  5.58469 osd.5   up  1.0 1.0
   6   hdd  5.64879 osd.6   up  1.0 1.0
  -7   16.73048 host sumib
   0   hdd  5.57899 osd.0   up  1.0 1.0
   2   hdd  5.56549 osd.2   up  1.0 1.0
   3   hdd  5.58600 osd.3   up  1.0 1.0
  -3   21.21844 host tower1
   4   hdd  3.71680 osd.4   up0 1.0
   7   hdd  1.84799 osd.7   up  1.0 1.0
   8   hdd  3.71680 osd.8   up  1.0 1.0
   9   hdd  1.84929 osd.9   up  1.0 1.0
  10   hdd  2.72899 osd.10  up  1.0 1.0
  11   hdd  3.71989 osd.11down0 1.0
  

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-14 Thread Igor Fedotov

Hi Stefan,

given the crash backtrace in your log I presume some data removal is in 
progress:


Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  3: 
(KernelDevice::direct_read_unaligned(unsigned long, unsigned long, 
char*)+0xd8) [0x5587b9364a48]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  4: 
(KernelDevice::read_random(unsigned long, unsigned long, char*, 
bool)+0x1b3) [0x5587b93653e3]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  5: 
(BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, 
char*)+0x674) [0x5587b9328cb4]

...

Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  19: 
(BlueStore::_do_omap_clear(BlueStore::TransContext*, 
boost::intrusive_ptr&)+0xa2) [0x5587b922f0e2]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  20: 
(BlueStore::_do_remove(BlueStore::TransContext*, 
boost::intrusive_ptr&, 
boost::intrusive_ptr)+0xc65) [0x5587b923b555]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  21: 
(BlueStore::_remove(BlueStore::TransContext*, 
boost::intrusive_ptr&, 
boost::intrusive_ptr&)+0x64) [0x5587b923c3b4]

...

Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  24: 
(ObjectStore::queue_transaction(boost::intrusive_ptr&, 
ceph::os::Transaction&&, boost::intrusive_ptr, 
ThreadPool::TPHandle*)+0x85) [0x5587b8dcf745]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  25: 
(PG::do_delete_work(ceph::os::Transaction&)+0xb2e) [0x5587b8e269ee]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  26: 
(PeeringState::Deleting::react(PeeringState::DeleteSome const&)+0x3e) 
[0x5587b8fd6ede]

...

Did you initiate some large pool removal recently? Or maybe data 
rebalancing triggered PG migration (and hence source PG removal) for you?


Highly likely you're facing a well-known RocksDB/BlueFS performance issue 
caused by massive data removal.


So your OSDs are just processing I/O very slowly, which triggers the suicide 
timeout.


We've had multiple threads on the issue in this mailing list - the 
latest one is at 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YBHNOSWW72ZVQ6PD5NABEEYRDMX7OZTT/


For now a good-enough workaround is manual offline DB compaction for 
all the OSDs (this might have only a temporary effect though, as the removal 
proceeds).


Additionally, there are user reports that the recent change of the default 
value for the bluefs_buffered_io setting has a negative impact as well (or 
just worsens the existing issue with massive removal). So you might 
want to switch it back to true.
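
For reference, the offline compaction and the setting mentioned above look 
roughly like this (the OSD has to be stopped first; the ID and path are 
placeholders, and with cephadm/containers the stop/start commands and paths 
will differ):

systemctl stop ceph-osd@<id>
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<id> compact
systemctl start ceph-osd@<id>

ceph config set osd bluefs_buffered_io true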


As for OSD.10 - I can't say for sure as I haven't seen its logs, but I 
think it's experiencing the same issue, which might eventually lead it 
into an unresponsive state as well. Just grep its log for "heartbeat_map 
is_healthy 'OSD::osd_op_tp thread" strings.



Thanks,

Igor

On 12/13/2020 3:46 PM, Stefan Wild wrote:

Hi Igor,

Full osd logs from startup to failed exit:
https://tiltworks.com/osd.1.log

In other news, can I expect osd.10 to go down next?

Dec 13 07:40:14 ceph-tpa-server1 bash[1825010]: debug 
2020-12-13T12:40:14.823+ 7ff37c2e1700 -1 osd.7 13375 heartbeat_check: no 
reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.310905+ 
front 2020-12-13T12:39:43.311164+ (oldest deadline 
2020-12-13T12:40:06.810981+)
Dec 13 07:40:15 ceph-tpa-server1 bash[1824817]: debug 
2020-12-13T12:40:15.055+ 7f9220af3700 -1 osd.11 13375 heartbeat_check: no 
reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.972558+ 
front 2020-12-13T12:39:42.972702+ (oldest deadline 
2020-12-13T12:40:05.272435+)
Dec 13 07:40:15 ceph-tpa-server1 bash[2060428]: debug 
2020-12-13T12:40:15.155+ 7fb247eaf700 -1 osd.8 13375 heartbeat_check: no 
reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.181904+ 
front 2020-12-13T12:39:42.181856+ (oldest deadline 
2020-12-13T12:40:06.281648+)
Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug 
2020-12-13T12:40:15.171+ 7fe929be8700  1 mon.ceph-tpa-server1@0(leader).osd 
e13375 prepare_failure osd.10 
[v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.2 
is reporting failure:0
Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug 
2020-12-13T12:40:15.171+ 7fe929be8700  0 log_channel(cluster) log [DBG] : 
osd.10 failure report canceled by osd.2
Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: cluster 
2020-12-13T12:40:15.176057+ mon.ceph-tpa-server1 (mon.0) 1172513 : cluster 
[DBG] osd.10 failure report canceled by osd.2
Dec 13 07:40:15 ceph-tpa-server1 bash[1824779]: debug 
2020-12-13T12:40:15.295+ 7fa60679a700 -1 osd.0 13375 heartbeat_check: no 
reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.326792+ 
front 2020-12-13T12:39:43.32+ (oldest deadline 
2020-12-13T12:40:07.426786+)
Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug 
2020-12-13T12:40:15.423+ 7fe929be8700  1 mon.ceph-tpa-server1@0(leader).osd 
e13375 prepare_failure osd.10 
[v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from 

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-14 Thread Stefan Wild
Hi Kalle,

Memory usage is back on track for the OSDs since the OOM crash. I don’t know 
what caused it back then, but until all OSDs were back up together, each one of 
them (10 TiB capacity, 7 TiB used) ballooned to over 15 GB memory used. I’m 
happy to dump the stats if they’re showing any history from 2 weeks ago, but 
ballooning and running out of memory is not the issue anymore.

Thanks,
Stefan


From: Kalle Happonen 
Sent: Monday, December 14, 2020 5:00:17 AM
To: huxia...@horebdata.cn 
Cc: Stefan Wild ; ceph-users 
Subject: Re: [ceph-users] Re: OSD reboot loop after running out of memory

Hi Samuel,
I think we're hitting some niche cases. Most of our experience (and links to 
other posts) is here.

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/EWPPEMPAJQT6GGYSHM7GIM3BZWS2PSUY/

For the pg_log issue, the default of 3000 might be too large for some 
installations, depending on your PG count. We have set it to 400.

For the buffer_anon problem, there some speculation that it started when 
buffer_anon trimming changed. I assume it'll be fixed in a new version, these 
two may be candidates for fixes.

https://github.com/ceph/ceph/pull/35171
https://github.com/ceph/ceph/pull/35584

Cheers,
Kalle

- Original Message -
> From: huxia...@horebdata.cn
> To: "Kalle Happonen" , "Stefan Wild" 
> 
> Cc: "ceph-users" 
> Sent: Monday, 14 December, 2020 10:27:57
> Subject: Re: [ceph-users] Re: OSD reboot loop after running out of memory

> Hello, Kalle,
>
> Your comments abount some bugs with  pg_log memory and buffer_anon memory 
> growth
> worry me a lot, as i am planning to build a cluster with the latest Nautilous
> version.
>
> Could you please comment on, how to safely deal with these bugs or to avoid, 
> if
> indeed they occur?
>
> thanks a lot,
>
> samuel
>
>
>
> huxia...@horebdata.cn
>
> From: Kalle Happonen
> Date: 2020-12-14 08:28
> To: Stefan Wild
> CC: ceph-users
> Subject: [ceph-users] Re: OSD reboot loop after running out of memory
> Hi Stefan,
> we had been seeing OSDs OOMing on 14.2.13, but on a larger scale. In our case 
> we
> hit a some bugs with pg_log memory growth and buffer_anon memory growth. Can
> you check what's taking up the memory on the OSD with the following command?
>
> ceph daemon osd.123 dump_mempools
>
>
> Cheers,
> Kalle
>
> - Original Message -
>> From: "Stefan Wild" 
>> To: "Igor Fedotov" , "ceph-users" 
>> Sent: Sunday, 13 December, 2020 14:46:44
>> Subject: [ceph-users] Re: OSD reboot loop after running out of memory
>
>> Hi Igor,
>>
>> Full osd logs from startup to failed exit:
>> https://tiltworks.com/osd.1.log
>>
>> In other news, can I expect osd.10 to go down next?
>>
>> Dec 13 07:40:14 ceph-tpa-server1 bash[1825010]: debug
>> 2020-12-13T12:40:14.823+ 7ff37c2e1700 -1 osd.7 13375 heartbeat_check: no
>> reply from 172.18.189.20:6878 osd.10 since back 
>> 2020-12-13T12:39:43.310905+
>> front 2020-12-13T12:39:43.311164+ (oldest deadline
>> 2020-12-13T12:40:06.810981+)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1824817]: debug
>> 2020-12-13T12:40:15.055+ 7f9220af3700 -1 osd.11 13375 heartbeat_check: no
>> reply from 172.18.189.20:6878 osd.10 since back 
>> 2020-12-13T12:39:42.972558+
>> front 2020-12-13T12:39:42.972702+ (oldest deadline
>> 2020-12-13T12:40:05.272435+)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[2060428]: debug
>> 2020-12-13T12:40:15.155+ 7fb247eaf700 -1 osd.8 13375 heartbeat_check: no
>> reply from 172.18.189.20:6878 osd.10 since back 
>> 2020-12-13T12:39:42.181904+
>> front 2020-12-13T12:39:42.181856+ (oldest deadline
>> 2020-12-13T12:40:06.281648+)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:15.171+ 7fe929be8700  1 
>> mon.ceph-tpa-server1@0(leader).osd
>> e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from 
>> osd.2
>> is reporting failure:0
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:15.171+ 7fe929be8700  0 log_channel(cluster) log [DBG] :
>> osd.10 failure report canceled by osd.2
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:15.176057+ mon.ceph-tpa-server1 (mon.0) 1172513 : 
>> cluster
>> [DBG] osd.10 failure report canceled by osd.2
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1824779]: debug
>> 2020-12-13T12:40:15.295+ 7fa60679a700 -1 osd.0 13375 heartbeat_check: no
>> reply from 172.18.189.20:6878 osd.10 since back 
>> 2020-12-13T12:39:43.326792+
>> front 2020-12-13T12:39:43.32+ (oldest deadline
>> 2020-12-13T12:40:07.426786+)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:15.423+ 7fe929be8700  1 
>> mon.ceph-tpa-server1@0(leader).osd
>> e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from 
>> osd.6
>> is reporting failure:0
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-14 Thread Igor Fedotov
Just a note - all of the below is almost completely unrelated to the high RAM 
usage. The latter is a different issue, which presumably just triggered 
the PG removal one...



On 12/14/2020 2:39 PM, Igor Fedotov wrote:

Hi Stefan,

given the crash backtrace in your log I presume some data removal is 
in progress:


Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  3: 
(KernelDevice::direct_read_unaligned(unsigned long, unsigned long, 
char*)+0xd8) [0x5587b9364a48]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  4: 
(KernelDevice::read_random(unsigned long, unsigned long, char*, 
bool)+0x1b3) [0x5587b93653e3]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  5: 
(BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned 
long, char*)+0x674) [0x5587b9328cb4]

...

Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  19: 
(BlueStore::_do_omap_clear(BlueStore::TransContext*, 
boost::intrusive_ptr&)+0xa2) [0x5587b922f0e2]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  20: 
(BlueStore::_do_remove(BlueStore::TransContext*, 
boost::intrusive_ptr&, 
boost::intrusive_ptr)+0xc65) [0x5587b923b555]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  21: 
(BlueStore::_remove(BlueStore::TransContext*, 
boost::intrusive_ptr&, 
boost::intrusive_ptr&)+0x64) [0x5587b923c3b4]

...

Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  24: 
(ObjectStore::queue_transaction(boost::intrusive_ptr&, 
ceph::os::Transaction&&, boost::intrusive_ptr, 
ThreadPool::TPHandle*)+0x85) [0x5587b8dcf745]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  25: 
(PG::do_delete_work(ceph::os::Transaction&)+0xb2e) [0x5587b8e269ee]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  26: 
(PeeringState::Deleting::react(PeeringState::DeleteSome const&)+0x3e) 
[0x5587b8fd6ede]

...

Did you initiate some large pool removal recently? Or may be data 
rebalancing triggered PG migration (and hence source PG removal) for you?


Highly likely you're facing a well known issue with RocksDB/BlueFS 
performance issues caused by massive data removal.


So your OSDs are just processing I/O very slowly which triggers 
suicide timeout.


We've had multiple threads on the issue in this mailing list - the 
latest one is at 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YBHNOSWW72ZVQ6PD5NABEEYRDMX7OZTT/


For now the good enough workaround is manual offline DB compaction for 
all the OSDs (this might have temporary effect though as the removal 
proceeds).


Additionally there are users' reports that recent default value's 
modification  for bluefs_buffered_io setting has negative impact (or 
just worsen existing issue with massive removal) as well. So you might 
want to switch it back to true.


As for OSD.10 - can't say for sure as I haven't seen its' logs but I 
think it's experiencing the same issue which might eventually lead it 
into unresponsive state as well. Just grep its log for "heartbeat_map 
is_healthy 'OSD::osd_op_tp thread" strings.



Thanks,

Igor

On 12/13/2020 3:46 PM, Stefan Wild wrote:

Hi Igor,

Full osd logs from startup to failed exit:
https://tiltworks.com/osd.1.log

In other news, can I expect osd.10 to go down next?

Dec 13 07:40:14 ceph-tpa-server1 bash[1825010]: debug 
2020-12-13T12:40:14.823+ 7ff37c2e1700 -1 osd.7 13375 
heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back 
2020-12-13T12:39:43.310905+ front 2020-12-13T12:39:43.311164+ 
(oldest deadline 2020-12-13T12:40:06.810981+)
Dec 13 07:40:15 ceph-tpa-server1 bash[1824817]: debug 
2020-12-13T12:40:15.055+ 7f9220af3700 -1 osd.11 13375 
heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back 
2020-12-13T12:39:42.972558+ front 2020-12-13T12:39:42.972702+ 
(oldest deadline 2020-12-13T12:40:05.272435+)
Dec 13 07:40:15 ceph-tpa-server1 bash[2060428]: debug 
2020-12-13T12:40:15.155+ 7fb247eaf700 -1 osd.8 13375 
heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back 
2020-12-13T12:39:42.181904+ front 2020-12-13T12:39:42.181856+ 
(oldest deadline 2020-12-13T12:40:06.281648+)
Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug 
2020-12-13T12:40:15.171+ 7fe929be8700  1 
mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10 
[v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] 
from osd.2 is reporting failure:0
Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug 
2020-12-13T12:40:15.171+ 7fe929be8700  0 log_channel(cluster) log 
[DBG] : osd.10 failure report canceled by osd.2
Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: cluster 
2020-12-13T12:40:15.176057+ mon.ceph-tpa-server1 (mon.0) 1172513 
: cluster [DBG] osd.10 failure report canceled by osd.2
Dec 13 07:40:15 ceph-tpa-server1 bash[1824779]: debug 
2020-12-13T12:40:15.295+ 7fa60679a700 -1 osd.0 13375 
heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back 
2020-12-13T12:39:43.326792+ front 2020-12-13T12:39:43.32+ 
(oldest deadline 2020-12-13T12:40:07.426786+)
Dec 13 07:40:15 ceph-tpa-server1

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-14 Thread Stefan Wild
Hi Igor,

Thank you for the detailed analysis. That makes me hopeful we can get the 
cluster back on track. No pools have been removed, but yes, due to the initial 
crash of multiple OSDs and the subsequent issues with individual OSDs we’ve had 
substantial PG remappings happening constantly.

I will look up the referenced thread(s) and try the offline DB compaction. It 
would be amazing if that does the trick.

Will keep you posted, here.

Thanks,
Stefan


From: Igor Fedotov 
Sent: Monday, December 14, 2020 6:39:28 AM
To: Stefan Wild ; ceph-users@ceph.io 
Subject: Re: [ceph-users] Re: OSD reboot loop after running out of memory

Hi Stefan,

given the crash backtrace in your log I presume some data removal is in
progress:

Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  3:
(KernelDevice::direct_read_unaligned(unsigned long, unsigned long,
char*)+0xd8) [0x5587b9364a48]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  4:
(KernelDevice::read_random(unsigned long, unsigned long, char*,
bool)+0x1b3) [0x5587b93653e3]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  5:
(BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long,
char*)+0x674) [0x5587b9328cb4]
...

Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  19:
(BlueStore::_do_omap_clear(BlueStore::TransContext*,
boost::intrusive_ptr&)+0xa2) [0x5587b922f0e2]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  20:
(BlueStore::_do_remove(BlueStore::TransContext*,
boost::intrusive_ptr&,
boost::intrusive_ptr)+0xc65) [0x5587b923b555]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  21:
(BlueStore::_remove(BlueStore::TransContext*,
boost::intrusive_ptr&,
boost::intrusive_ptr&)+0x64) [0x5587b923c3b4]
...

Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  24:
(ObjectStore::queue_transaction(boost::intrusive_ptr&,
ceph::os::Transaction&&, boost::intrusive_ptr,
ThreadPool::TPHandle*)+0x85) [0x5587b8dcf745]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  25:
(PG::do_delete_work(ceph::os::Transaction&)+0xb2e) [0x5587b8e269ee]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  26:
(PeeringState::Deleting::react(PeeringState::DeleteSome const&)+0x3e)
[0x5587b8fd6ede]
...

Did you initiate some large pool removal recently? Or may be data
rebalancing triggered PG migration (and hence source PG removal) for you?

Highly likely you're facing a well known issue with RocksDB/BlueFS
performance issues caused by massive data removal.

So your OSDs are just processing I/O very slowly which triggers suicide
timeout.

We've had multiple threads on the issue in this mailing list - the
latest one is at
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YBHNOSWW72ZVQ6PD5NABEEYRDMX7OZTT/

For now the good enough workaround is manual offline DB compaction for
all the OSDs (this might have temporary effect though as the removal
proceeds).

Additionally there are users' reports that recent default value's
modification  for bluefs_buffered_io setting has negative impact (or
just worsen existing issue with massive removal) as well. So you might
want to switch it back to true.

As for OSD.10 - can't say for sure as I haven't seen its' logs but I
think it's experiencing the same issue which might eventually lead it
into unresponsive state as well. Just grep its log for "heartbeat_map
is_healthy 'OSD::osd_op_tp thread" strings.


Thanks,

Igor

On 12/13/2020 3:46 PM, Stefan Wild wrote:
> Hi Igor,
>
> Full osd logs from startup to failed exit:
> https://tiltworks.com/osd.1.log
>
> In other news, can I expect osd.10 to go down next?
>
> Dec 13 07:40:14 ceph-tpa-server1 bash[1825010]: debug 
> 2020-12-13T12:40:14.823+ 7ff37c2e1700 -1 osd.7 13375 heartbeat_check: no 
> reply from 172.18.189.20:6878 osd.10 since back 
> 2020-12-13T12:39:43.310905+ front 2020-12-13T12:39:43.311164+ (oldest 
> deadline 2020-12-13T12:40:06.810981+)
> Dec 13 07:40:15 ceph-tpa-server1 bash[1824817]: debug 
> 2020-12-13T12:40:15.055+ 7f9220af3700 -1 osd.11 13375 heartbeat_check: no 
> reply from 172.18.189.20:6878 osd.10 since back 
> 2020-12-13T12:39:42.972558+ front 2020-12-13T12:39:42.972702+ (oldest 
> deadline 2020-12-13T12:40:05.272435+)
> Dec 13 07:40:15 ceph-tpa-server1 bash[2060428]: debug 
> 2020-12-13T12:40:15.155+ 7fb247eaf700 -1 osd.8 13375 heartbeat_check: no 
> reply from 172.18.189.20:6878 osd.10 since back 
> 2020-12-13T12:39:42.181904+ front 2020-12-13T12:39:42.181856+ (oldest 
> deadline 2020-12-13T12:40:06.281648+)
> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug 
> 2020-12-13T12:40:15.171+ 7fe929be8700  1 
> mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10 
> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from 
> osd.2 is reporting failure:0
> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug 
> 2020-12-13T12:40:15.171+ 7fe929be8700  0 log_channel(cluster) log [DBG] : 
> osd.10 failure report canceled by osd.2
> Dec 13 07:40:15 ceph-tpa-server1 bash[182249

[ceph-users] Re: Removing an applied service set

2020-12-14 Thread Michael Wodniok
Thank you Eugen, it worked.

For the record, this is what I did to remove the services completely. My 
CephFS had the name "testfs".

* `ceph orch ls mds mds.testfs --export yaml > change.yaml`
* removed the placement spec from `change.yaml`
* reapplied using `cephadm shell -m change.yaml -- ceph orch apply -i /mnt/change.yaml`
* removed the daemons with `ceph orch daemon rm mds.testfs.` for each daemon listed in `ceph orch ps | grep mds.testfs`
* removed the mds set with `ceph orch rm mds.testfs`
* removed the auth entries using `ceph auth rm mds.testfs.` for each daemon listed in `ceph auth ls | grep mds.testfs`

Regards,
Michael

-Original Message-
From: Eugen Block [mailto:ebl...@nde.ag] 
Sent: Monday, 14 December 2020 11:06
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Removing an applied service set

Do you have a spec file for the mds services or how did you deploy the  
services? If you have a yml file with the mds placement just remove  
the entries from that file and run 'ceph orch apply -i mds.yml'.

You can export your current config with this command and then modify  
the file to your need:

cephadm:~ # ceph orch ls mds mds.cephfs --export yaml
service_type: mds
service_id: cephfs
service_name: mds.cephfs
placement:
   hosts:
   - host5
   - host6


Then if you apply the changes cephadm should not redeploy those  
deleted services. It's possible that you have to clean-up `ceph auth  
ls` and remove deleted mds keyrings. Also check `cephadm ls` on the  
mds nodes if the containers have been removed.

Regards,
Eugen


Zitat von Michael Wodniok :

> Hi,
>
> we created multiple CephFS, this invloved deploying mutliple  
> mds-services using `ceph orch apply mds [...]`. Worked like a charm.
>
> Now the filesystem has been removed and the leftovers of the  
> filesystem should also be removed, but I can't delete the services  
> as cephadm/orchestration module is recreating them. What is the  
> "official" way to delete this applied service set? Setting placement  
> size to 0 is not possible in Ceph 15.
>
> Kind regards,
> Michael




[ceph-users] The ceph balancer sets upmap items which violates my crushrule

2020-12-14 Thread Manuel Lausch
The ceph balancer sets upmap items which violate my CRUSH rule.

the rule:

rule cslivebapfirst {
id 0
type replicated
min_size 2
max_size 4
step take csliveeubap-u01dc
step chooseleaf firstn 2 type room
step emit
step take csliveeubs-u01dc
step chooseleaf firstn 2 type room
step emit
}

So my intention is that the first two replicas are stored in the
datacenter "csliveeubap-u01dc" and the next two replicas are stored in
the datacenter "csliveeubs-u01dc".

The cluster has 49152 PGs, and 665 of them have at least 3 replicas in
one datacenter, which is not expected!

One example is PG 3.96e. The acting OSDs are, in this order:
504 -> DC: csliveeubap-u01dc, room: csliveeubap-u01r03
1968 -> DC: csliveeubap-u01dc, room: csliveeubap-u01r01
420 -> DC: csliveeubap-u01dc, room: csliveeubap-u01r02
1945 -> DC: csliveeubs-u01dc, room: csliveeubs-u01r01
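
(For reference, the room/DC of each OSD can be double-checked directly, e.g.

ceph osd find 504

which prints the OSD's crush_location.)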

This PG has one upmap item: 
ceph osd dump | grep 3.96e
3.96e pg_upmap_items 3.96e [2013,420]

OSD 2013 is in the DC: csliveeubs-u01dc 

I checked this by hand with ceph osd pg-upmap-items.
If I try to set two replicas in one room I get an appropriate error
in the mon log and nothing happens. But setting a replica to the other DC
unfortunately worked.
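
For the record, while investigating, the offending entry can be removed by
hand so that CRUSH places the PG according to the rule again (PG ID taken
from the example above):

ceph osd rm-pg-upmap-items 3.96e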


I would suggest this is an ugly bug. What do you think?

ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf)
nautilus (stable) 


Manuel


[ceph-users] iscsi and iser

2020-12-14 Thread Marc Boisis


Hi,

I would like to know if you support iSER in gwcli like traditional targetcli 
does, or if this is planned for a future version of Ceph?

Thanks

Marc


[ceph-users] Re: Ceph 15.2.4 segfault, msgr-worker

2020-12-14 Thread alexandre derumier

Hi,

I had an osd crash yesterday, with 15.2.7.

It seems similar:

ceph crash info 
2020-12-13T02:37:57.475315Z_63f91999-ca9c-49a5-b381-5fad9780dbbb

{
    "backtrace": [
    "(()+0x12730) [0x7f6bccbb5730]",
"(std::_Rb_tree, 
boost::intrusive_ptr, 
std::_Identity >, 
std::less >, 
std::allocator > 
>::find(boost::intrusive_ptr const&) const+0x24) 
[0x559442799394]",

    "(AsyncConnection::_stop()+0xa7) [0x5594427939d7]",
    "(ProtocolV2::stop()+0x8b) [0x5594427bb41b]",
    "(ProtocolV2::_fault()+0x6b) [0x5594427bb59b]",
"(ProtocolV2::handle_read_frame_preamble_main(std::unique_ptrceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x328) 
[0x5594427d15e8]",

"(ProtocolV2::run_continuation(Ct&)+0x34) [0x5594427bc114]",
    "(AsyncConnection::process()+0x79c) [0x55944279682c]",
    "(EventCenter::process_events(unsigned int, 
std::chrono::duration 
>*)+0xa2d) [0x5594425fa91d]",

    "(()+0x11f41cb) [0x5594426001cb]",
    "(()+0xbbb2f) [0x7f6bcca7ab2f]",
    "(()+0x7fa3) [0x7f6bccbaafa3]",
    "(clone()+0x3f) [0x7f6bcc7584cf]"
    ],
    "ceph_version": "15.2.7",
    "crash_id": 
"2020-12-13T02:37:57.475315Z_63f91999-ca9c-49a5-b381-5fad9780dbbb",

    "entity_name": "osd.57",
    "os_id": "10",
    "os_name": "Debian GNU/Linux 10 (buster)",
    "os_version": "10 (buster)",
    "os_version_id": "10",
    "process_name": "ceph-osd",
    "stack_sig": 
"897fe7f6bf2184fafd5b8a29905a147cb66850db318f6e874292a278aeb615bb",

    "timestamp": "2020-12-13T02:37:57.475315Z",
    "utsname_hostname": "ceph5-9",
    "utsname_machine": "x86_64",
    "utsname_release": "4.19.0-11-amd64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Debian 4.19.146-1 (2020-09-17)"
}

On 02/12/2020 20:43, Ivan Kurnosov wrote:

Hi Team,

last night I caught the following segfault.

Nothing else looks suspicious (but I'm a quite newbie in ceph management,
so perhaps just don't know where to look at).

I could not google any similar segfault from anybody else.

Was it a known problem fixed in later versions?

The cluster has been running for quite a while now (several months) and
this has happened for the first time.

```
debug  0> 2020-12-02T10:31:09.295+ 7f3e61943700 -1 *** Caught
signal (Segmentation fault) **
  in thread 7f3e61943700 thread_name:msgr-worker-1

  ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus
(stable)
  1: (()+0x12dd0) [0x7f3e65933dd0]
  2: (std::_Rb_tree,
boost::intrusive_ptr,
std::_Identity >,
std::less >,
std::allocator >

::find(boost::intrusive_ptr const&) const+0x2c)

[0x55cee407ca7c]
  3: (AsyncConnection::_stop()+0xab) [0x55cee40767eb]
  4: (ProtocolV2::stop()+0x8f) [0x55cee40a189f]
  5: (ProtocolV2::_fault()+0x133) [0x55cee40a1b03]
  6:
(ProtocolV2::handle_read_frame_preamble_main(std::unique_ptr&&, int)+0x551) [0x55cee40a63d1]
  7: (ProtocolV2::run_continuation(Ct&)+0x3c) [0x55cee40a273c]
  8: (AsyncConnection::process()+0x8a9) [0x55cee4079ab9]
  9: (EventCenter::process_events(unsigned int,
std::chrono::duration >*)+0xcb7)
[0x55cee3eceb67]
  10: (()+0xdb914c) [0x55cee3ed414c]
  11: (()+0xc2b73) [0x7f3e64f83b73]
  12: (()+0x82de) [0x7f3e659292de]
  13: (clone()+0x43) [0x7f3e64660e83]
  NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_rwl
0/ 5 journaler
0/ 5 objectcacher
0/ 5 immutable_obj_cache
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
1/ 5 prioritycache
0/ 5 test
   -2/-2 (syslog threshold)
   99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
   7f3e3d88b700 / osd_srv_heartbt
   7f3e41893700 / tp_osd_tp
   7f3e4589b700 / tp_osd_tp
   7f3e4f8af700 / rocksdb:dump_st
   7f3e52a8c700 / safe_timer
   7f3e53a8e700 / ms_dispatch
   7f3e566b7700 / bstore_mempool
   7f3e5d0ca700 / safe_timer
   7f3e61943700 / msgr-worker-1
   7f3e62144700 / msgr-worker-0
   max_recent 1
   max_new 1000
   log_file
/var/lib/ceph/crash/2020-12-02T10:31:09.301492Z_84e8430f-30fd-469f-8e22-c2e1ccc675da/log
--- end dump of recent events ---
reraise_fatal:

[ceph-users] Re: iscsi and iser

2020-12-14 Thread Jason Dillaman
On Mon, Dec 14, 2020 at 9:39 AM Marc Boisis  wrote:
>
>
> Hi,
>
> I would like to know if you support iser in gwcli like the traditional 
> targetcli or if this is planned in a future version of ceph ?

We don't have the (HW) resources to test with iSER so it's not
something that anyone is looking at as far as I know. Plus, even if
you RDMAed the data between the initiator and target, it wouldn't RDMA
the data over to the OSD. Instead, it would be copied to and from
LIO's UIO mmap region between kernel and userspace and then re-sent
via librbd + librados. I'm also pretty confident that the current
librados RDMA support involves yet another buffer copy as well.

> Thanks
>
> Marc
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Jason


[ceph-users] Re: Slow Replication on Campus

2020-12-14 Thread Eugen Block

Hi,

could you share more information about your setup? How much bandwidth  
does the uplink have? Are there any custom configs regarding  
rbd_journal or rbd_mirror settings? If there were lots of changes on  
those images, the sync would always be somewhat behind by design. But if  
there's no activity it should eventually catch up, I assume.


You can review the settings from this output:

ceph config show-with-defaults mgr. | grep -E "rbd_mirror|rbd_journal"


I assume there aren't many journal entries in the pool?

rados -p  ls | grep journal

Although I'd expect a different status, maybe the sync was interrupted  
and a resync should be initiated? Or have you already tried that?
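
If you do go for a resync, it is requested against the secondary (DR)  
cluster, roughly like this (pool/image names are placeholders, and note  
that it re-transfers the whole image):

rbd --cluster <dr-cluster> mirror image resync <pool>/<image>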


Regards,
Eugen


Zitat von Vikas Rana :


Hi Friends,



We have 2 Ceph clusters on campus and we setup the second cluster as the DR
solution.

The images on the DR side are always behind the master.



Ceph Version : 12.2.11





VMWARE_LUN0:

  global_id:   23460954-6986-4961-9579-0f2a1e58e2b2

  state:   up+replaying

  description: replaying, master_position=[object_number=2632711,
tag_tid=24, entry_tid=1967382595], mirror_position=[object_number=1452837,
tag_tid=24, entry_tid=456440697], entries_behind_master=1510941898

  last_update: 2020-11-30 14:13:38



VMWARE_LUN1:

  global_id:   cb579579-13b0-4522-b65f-c64ec44cbfaf

  state:   up+replaying

  description: replaying, master_position=[object_number=1883943,
tag_tid=28, entry_tid=1028822927], mirror_position=[object_number=1359161,
tag_tid=28, entry_tid=358296085], entries_behind_master=670526842

  last_update: 2020-11-30 14:13:33



Any suggestion on tuning or any parameters we can set on RBD-mirror to speed
up the replication. Both cluster have very little activity.





Appreciate your help.



Thanks,

-Vikas



[ceph-users] performance degredation every 30 seconds

2020-12-14 Thread Philip Brown


I have a new 3 node octopus cluster, set up on SSDs.

I'm running fio to benchmark the setup, with

fio --filename=/dev/rbd0 --direct=1 --rw=randrw --bs=4k --ioengine=libaio 
--iodepth=256 --numjobs=1 --time_based --group_reporting --name=iops-test-job 
--runtime=120 --eta-newline=1



However, I notice that, approximately every 30 seconds, performance tanks for a 
bit.

Any ideas on why, and better yet, how to get rid of the problem?


Sample debug output below. Notice the transitions at [eta 01m:27s] and [eta 
00m:49s]. 
It happens again at [00m:09], but I figured I didn't need to redundantly post that.



Jobs: 1 (f=1): [m(1)][2.5%][r=43.4MiB/s,w=43.3MiB/s][r=11.1k,w=11.1k IOPS][eta 
01m:58s]
Jobs: 1 (f=1): [m(1)][4.1%][r=47.3MiB/s,w=47.8MiB/s][r=12.1k,w=12.2k IOPS][eta 
01m:56s]
Jobs: 1 (f=1): [m(1)][5.8%][r=48.6MiB/s,w=49.3MiB/s][r=12.5k,w=12.6k IOPS][eta 
01m:54s]
Jobs: 1 (f=1): [m(1)][7.4%][r=52.4MiB/s,w=53.1MiB/s][r=13.4k,w=13.6k IOPS][eta 
01m:52s]
Jobs: 1 (f=1): [m(1)][9.1%][r=54.7MiB/s,w=54.1MiB/s][r=13.0k,w=13.8k IOPS][eta 
01m:50s]
Jobs: 1 (f=1): [m(1)][10.7%][r=41.5MiB/s,w=42.6MiB/s][r=10.6k,w=10.9k IOPS][eta 
01m:48s]
Jobs: 1 (f=1): [m(1)][12.4%][r=51.5MiB/s,w=50.6MiB/s][r=13.2k,w=12.0k IOPS][eta 
01m:46s]
Jobs: 1 (f=1): [m(1)][14.0%][r=16.6MiB/s,w=16.0MiB/s][r=4248,w=4098 IOPS][eta 
01m:44s]
Jobs: 1 (f=1): [m(1)][14.9%][r=33.3MiB/s,w=33.5MiB/s][r=8526,w=8579 IOPS][eta 
01m:43s]
Jobs: 1 (f=1): [m(1)][16.5%][r=47.1MiB/s,w=47.4MiB/s][r=12.1k,w=12.1k IOPS][eta 
01m:41s]
Jobs: 1 (f=1): [m(1)][18.2%][r=49.6MiB/s,w=49.0MiB/s][r=12.7k,w=12.8k IOPS][eta 
01m:39s]
Jobs: 1 (f=1): [m(1)][19.8%][r=50.3MiB/s,w=51.4MiB/s][r=12.9k,w=13.1k IOPS][eta 
01m:37s]
Jobs: 1 (f=1): [m(1)][21.5%][r=53.5MiB/s,w=52.9MiB/s][r=13.7k,w=13.5k IOPS][eta 
01m:35s]
Jobs: 1 (f=1): [m(1)][23.1%][r=52.7MiB/s,w=52.1MiB/s][r=13.5k,w=13.3k IOPS][eta 
01m:33s]
Jobs: 1 (f=1): [m(1)][24.8%][r=55.3MiB/s,w=54.9MiB/s][r=14.1k,w=14.1k IOPS][eta 
01m:31s]
Jobs: 1 (f=1): [m(1)][26.4%][r=44.0MiB/s,w=45.2MiB/s][r=11.5k,w=11.6k IOPS][eta 
01m:29s]
Jobs: 1 (f=1): [m(1)][28.1%][r=12.1MiB/s,w=11.8MiB/s][r=3105,w=3011 IOPS][eta 
01m:27s]
Jobs: 1 (f=1): [m(1)][29.8%][r=16.6MiB/s,w=17.3MiB/s][r=4238,w=4422 IOPS][eta 
01m:25s]
Jobs: 1 (f=1): [m(1)][31.4%][r=9820KiB/s,w=9516KiB/s][r=2455,w=2379 IOPS][eta 
01m:23s]
Jobs: 1 (f=1): [m(1)][33.1%][r=6974KiB/s,w=7099KiB/s][r=1743,w=1774 IOPS][eta 
01m:21s]
Jobs: 1 (f=1): [m(1)][34.7%][r=49.5MiB/s,w=49.2MiB/s][r=12.7k,w=12.6k IOPS][eta 
01m:19s]
Jobs: 1 (f=1): [m(1)][36.4%][r=49.3MiB/s,w=49.8MiB/s][r=12.6k,w=12.8k IOPS][eta 
01m:17s]
Jobs: 1 (f=1): [m(1)][38.0%][r=36.4MiB/s,w=35.9MiB/s][r=9326,w=9200 IOPS][eta 
01m:15s]
Jobs: 1 (f=1): [m(1)][39.7%][r=43.4MiB/s,w=43.3MiB/s][r=11.1k,w=11.1k IOPS][eta 
01m:13s]
Jobs: 1 (f=1): [m(1)][41.3%][r=47.1MiB/s,w=47.1MiB/s][r=12.1k,w=12.1k IOPS][eta 
01m:11s]
Jobs: 1 (f=1): [m(1)][43.0%][r=47.9MiB/s,w=48.0MiB/s][r=12.3k,w=12.5k IOPS][eta 
01m:09s]
Jobs: 1 (f=1): [m(1)][44.6%][r=49.9MiB/s,w=48.8MiB/s][r=12.8k,w=12.5k IOPS][eta 
01m:07s]
Jobs: 1 (f=1): [m(1)][46.3%][r=46.4MiB/s,w=46.9MiB/s][r=11.9k,w=11.0k IOPS][eta 
01m:05s]
Jobs: 1 (f=1): [m(1)][47.9%][r=46.7MiB/s,w=46.4MiB/s][r=11.0k,w=11.9k IOPS][eta 
01m:03s]
Jobs: 1 (f=1): [m(1)][49.6%][r=55.3MiB/s,w=55.3MiB/s][r=14.1k,w=14.2k IOPS][eta 
01m:01s]
Jobs: 1 (f=1): [m(1)][51.2%][r=54.1MiB/s,w=53.2MiB/s][r=13.8k,w=13.6k IOPS][eta 
00m:59s]
Jobs: 1 (f=1): [m(1)][52.9%][r=53.4MiB/s,w=52.9MiB/s][r=13.7k,w=13.6k IOPS][eta 
00m:57s]
Jobs: 1 (f=1): [m(1)][54.5%][r=58.8MiB/s,w=58.0MiB/s][r=15.1k,w=15.1k IOPS][eta 
00m:55s]
Jobs: 1 (f=1): [m(1)][56.2%][r=60.0MiB/s,w=58.6MiB/s][r=15.4k,w=15.0k IOPS][eta 
00m:53s]
Jobs: 1 (f=1): [m(1)][57.9%][r=57.7MiB/s,w=58.1MiB/s][r=14.8k,w=14.9k IOPS][eta 
00m:51s]
Jobs: 1 (f=1): [m(1)][59.5%][r=14.0MiB/s,w=14.3MiB/s][r=3592,w=3651 IOPS][eta 
00m:49s]
Jobs: 1 (f=1): [m(1)][61.2%][r=17.4MiB/s,w=17.4MiB/s][r=4443,w=4457 IOPS][eta 
00m:47s]
Jobs: 1 (f=1): [m(1)][62.8%][r=18.1MiB/s,w=18.7MiB/s][r=4640,w=4783 IOPS][eta 
00m:45s]
Jobs: 1 (f=1): [m(1)][64.5%][r=7896KiB/s,w=8300KiB/s][r=1974,w=2075 IOPS][eta 
00m:43s]
Jobs: 1 (f=1): [m(1)][66.1%][r=47.8MiB/s,w=47.3MiB/s][r=12.2k,w=12.1k IOPS][eta 
00m:41s]



--
Philip Brown| Sr. Linux System Administrator | Medata, Inc. 
5 Peters Canyon Rd Suite 250 
Irvine CA 92606 
Office 714.918.1310| Fax 714.918.1325 
pbr...@medata.com| www.medata.com
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: performance degradation every 30 seconds

2020-12-14 Thread Jason Dillaman
On Mon, Dec 14, 2020 at 11:28 AM Philip Brown  wrote:
>
>
> I have a new 3 node octopus cluster, set up on SSDs.
>
> I'm running fio to benchmark the setup, with
>
> fio --filename=/dev/rbd0 --direct=1 --rw=randrw --bs=4k --ioengine=libaio 
> --iodepth=256 --numjobs=1 --time_based --group_reporting --name=iops-test-job 
> --runtime=120 --eta-newline=1
>
>
>
> However, I notice that, approximately every 30 seconds, performance tanks for 
> a bit.
>
> Any ideas on why, and better yet, how to get rid of the problem?

Does the same issue appear when running a direct rados bench? What
brand are your SSDs (i.e. are they data center grade)?
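For reference, a direct RADOS-level small-block test along those lines might look like the following (a sketch; the pool name "rbd", runtime and thread count are assumptions, and note that rados bench writes new objects rather than doing 4k random overwrites, so it is only a rough comparison):

# 4 KiB writes straight to the pool, keeping the objects for the read pass
rados bench -p rbd 120 write -b 4096 -t 32 --no-cleanup
# random reads against the objects written above
rados bench -p rbd 120 rand -t 32
# remove the benchmark objects afterwards
rados -p rbd cleanup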

>
> Sample debug output below. Notice the transitions at [eta 01m:27s] and [eta 
> 00m:49s]
> It happens again at [00m:09], but figured I didnt need to redundantly post 
> that.
>
>
>
> Jobs: 1 (f=1): [m(1)][2.5%][r=43.4MiB/s,w=43.3MiB/s][r=11.1k,w=11.1k 
> IOPS][eta 01m:58s]
> Jobs: 1 (f=1): [m(1)][4.1%][r=47.3MiB/s,w=47.8MiB/s][r=12.1k,w=12.2k 
> IOPS][eta 01m:56s]
> Jobs: 1 (f=1): [m(1)][5.8%][r=48.6MiB/s,w=49.3MiB/s][r=12.5k,w=12.6k 
> IOPS][eta 01m:54s]
> Jobs: 1 (f=1): [m(1)][7.4%][r=52.4MiB/s,w=53.1MiB/s][r=13.4k,w=13.6k 
> IOPS][eta 01m:52s]
> Jobs: 1 (f=1): [m(1)][9.1%][r=54.7MiB/s,w=54.1MiB/s][r=13.0k,w=13.8k 
> IOPS][eta 01m:50s]
> Jobs: 1 (f=1): [m(1)][10.7%][r=41.5MiB/s,w=42.6MiB/s][r=10.6k,w=10.9k 
> IOPS][eta 01m:48s]
> Jobs: 1 (f=1): [m(1)][12.4%][r=51.5MiB/s,w=50.6MiB/s][r=13.2k,w=12.0k 
> IOPS][eta 01m:46s]
> Jobs: 1 (f=1): [m(1)][14.0%][r=16.6MiB/s,w=16.0MiB/s][r=4248,w=4098 IOPS][eta 
> 01m:44s]
> Jobs: 1 (f=1): [m(1)][14.9%][r=33.3MiB/s,w=33.5MiB/s][r=8526,w=8579 IOPS][eta 
> 01m:43s]
> Jobs: 1 (f=1): [m(1)][16.5%][r=47.1MiB/s,w=47.4MiB/s][r=12.1k,w=12.1k 
> IOPS][eta 01m:41s]
> Jobs: 1 (f=1): [m(1)][18.2%][r=49.6MiB/s,w=49.0MiB/s][r=12.7k,w=12.8k 
> IOPS][eta 01m:39s]
> Jobs: 1 (f=1): [m(1)][19.8%][r=50.3MiB/s,w=51.4MiB/s][r=12.9k,w=13.1k 
> IOPS][eta 01m:37s]
> Jobs: 1 (f=1): [m(1)][21.5%][r=53.5MiB/s,w=52.9MiB/s][r=13.7k,w=13.5k 
> IOPS][eta 01m:35s]
> Jobs: 1 (f=1): [m(1)][23.1%][r=52.7MiB/s,w=52.1MiB/s][r=13.5k,w=13.3k 
> IOPS][eta 01m:33s]
> Jobs: 1 (f=1): [m(1)][24.8%][r=55.3MiB/s,w=54.9MiB/s][r=14.1k,w=14.1k 
> IOPS][eta 01m:31s]
> Jobs: 1 (f=1): [m(1)][26.4%][r=44.0MiB/s,w=45.2MiB/s][r=11.5k,w=11.6k 
> IOPS][eta 01m:29s]
> Jobs: 1 (f=1): [m(1)][28.1%][r=12.1MiB/s,w=11.8MiB/s][r=3105,w=3011 IOPS][eta 
> 01m:27s]
> Jobs: 1 (f=1): [m(1)][29.8%][r=16.6MiB/s,w=17.3MiB/s][r=4238,w=4422 IOPS][eta 
> 01m:25s]
> Jobs: 1 (f=1): [m(1)][31.4%][r=9820KiB/s,w=9516KiB/s][r=2455,w=2379 IOPS][eta 
> 01m:23s]
> Jobs: 1 (f=1): [m(1)][33.1%][r=6974KiB/s,w=7099KiB/s][r=1743,w=1774 IOPS][eta 
> 01m:21s]
> Jobs: 1 (f=1): [m(1)][34.7%][r=49.5MiB/s,w=49.2MiB/s][r=12.7k,w=12.6k 
> IOPS][eta 01m:19s]
> Jobs: 1 (f=1): [m(1)][36.4%][r=49.3MiB/s,w=49.8MiB/s][r=12.6k,w=12.8k 
> IOPS][eta 01m:17s]
> Jobs: 1 (f=1): [m(1)][38.0%][r=36.4MiB/s,w=35.9MiB/s][r=9326,w=9200 IOPS][eta 
> 01m:15s]
> Jobs: 1 (f=1): [m(1)][39.7%][r=43.4MiB/s,w=43.3MiB/s][r=11.1k,w=11.1k 
> IOPS][eta 01m:13s]
> Jobs: 1 (f=1): [m(1)][41.3%][r=47.1MiB/s,w=47.1MiB/s][r=12.1k,w=12.1k 
> IOPS][eta 01m:11s]
> Jobs: 1 (f=1): [m(1)][43.0%][r=47.9MiB/s,w=48.0MiB/s][r=12.3k,w=12.5k 
> IOPS][eta 01m:09s]
> Jobs: 1 (f=1): [m(1)][44.6%][r=49.9MiB/s,w=48.8MiB/s][r=12.8k,w=12.5k 
> IOPS][eta 01m:07s]
> Jobs: 1 (f=1): [m(1)][46.3%][r=46.4MiB/s,w=46.9MiB/s][r=11.9k,w=11.0k 
> IOPS][eta 01m:05s]
> Jobs: 1 (f=1): [m(1)][47.9%][r=46.7MiB/s,w=46.4MiB/s][r=11.0k,w=11.9k 
> IOPS][eta 01m:03s]
> Jobs: 1 (f=1): [m(1)][49.6%][r=55.3MiB/s,w=55.3MiB/s][r=14.1k,w=14.2k 
> IOPS][eta 01m:01s]
> Jobs: 1 (f=1): [m(1)][51.2%][r=54.1MiB/s,w=53.2MiB/s][r=13.8k,w=13.6k 
> IOPS][eta 00m:59s]
> Jobs: 1 (f=1): [m(1)][52.9%][r=53.4MiB/s,w=52.9MiB/s][r=13.7k,w=13.6k 
> IOPS][eta 00m:57s]
> Jobs: 1 (f=1): [m(1)][54.5%][r=58.8MiB/s,w=58.0MiB/s][r=15.1k,w=15.1k 
> IOPS][eta 00m:55s]
> Jobs: 1 (f=1): [m(1)][56.2%][r=60.0MiB/s,w=58.6MiB/s][r=15.4k,w=15.0k 
> IOPS][eta 00m:53s]
> Jobs: 1 (f=1): [m(1)][57.9%][r=57.7MiB/s,w=58.1MiB/s][r=14.8k,w=14.9k 
> IOPS][eta 00m:51s]
> Jobs: 1 (f=1): [m(1)][59.5%][r=14.0MiB/s,w=14.3MiB/s][r=3592,w=3651 IOPS][eta 
> 00m:49s]
> Jobs: 1 (f=1): [m(1)][61.2%][r=17.4MiB/s,w=17.4MiB/s][r=4443,w=4457 IOPS][eta 
> 00m:47s]
> Jobs: 1 (f=1): [m(1)][62.8%][r=18.1MiB/s,w=18.7MiB/s][r=4640,w=4783 IOPS][eta 
> 00m:45s]
> Jobs: 1 (f=1): [m(1)][64.5%][r=7896KiB/s,w=8300KiB/s][r=1974,w=2075 IOPS][eta 
> 00m:43s]
> Jobs: 1 (f=1): [m(1)][66.1%][r=47.8MiB/s,w=47.3MiB/s][r=12.2k,w=12.1k 
> IOPS][eta 00m:41s]
>
>
>
> --
> Philip Brown| Sr. Linux System Administrator | Medata, Inc.
> 5 Peters Canyon Rd Suite 250
> Irvine CA 92606
> Office 714.918.1310| Fax 714.918.1325
> pbr...@medata.com| www.medata.com
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Jason
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: performance degradation every 30 seconds

2020-12-14 Thread Philip Brown
Aha! Insightful question!
Running rados bench write to the same pool does not exhibit any problems. It 
consistently shows around 480 MB/s throughput, every second.

So this would seem to be something to do with using RBD devices, which we need 
to do.

For what it's worth, I'm using Micron 5200 Pro SSDs on all nodes.


- Original Message -
From: "Jason Dillaman" 
To: "Philip Brown" 
Cc: "ceph-users" 
Sent: Monday, December 14, 2020 8:33:09 AM
Subject: Re: [ceph-users] performance degredation every 30 seconds

On Mon, Dec 14, 2020 at 11:28 AM Philip Brown  wrote:
>
>
> I have a new 3 node octopus cluster, set up on SSDs.
>
> I'm running fio to benchmark the setup, with
>
> fio --filename=/dev/rbd0 --direct=1 --rw=randrw --bs=4k --ioengine=libaio 
> --iodepth=256 --numjobs=1 --time_based --group_reporting --name=iops-test-job 
> --runtime=120 --eta-newline=1
>
>
>
> However, I notice that, approximately every 30 seconds, performance tanks for 
> a bit.
>
> Any ideas on why, and better yet, how to get rid of the problem?

Does the same issue appear when running a direct rados bench? What
brand are your SSDs (i.e. are they data center grade)?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: performance degradation every 30 seconds

2020-12-14 Thread Philip Brown
Further experimentation with fio's --rw flag, setting it to rw=read and 
rw=randwrite in addition to the original rw=randrw, indicates that the problem 
is tied to writes.

Possibly some kind of buffer flush or cache sync delay when using the rbd 
device, even though fio specified --direct=1?
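One way to narrow that down (a sketch, assuming the fio build includes the librbd "rbd" ioengine, and that the pool/image/client names are adjusted to your setup) is to repeat the write test through librbd instead of the /dev/rbd0 kernel mapping, logging bandwidth per second so the dips show up in the log files rather than only in the ETA lines:

fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimg \
    --direct=1 --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 \
    --time_based --runtime=120 --name=librbd-4k-randwrite \
    --write_bw_log=librbd-4k --write_iops_log=librbd-4k --log_avg_msec=1000

If librbd shows the same periodic dips as the mapped device, the krbd kernel client is probably not the culprit.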



- Original Message -
From: "Philip Brown" 
To: "dillaman" 
Cc: "ceph-users" 
Sent: Monday, December 14, 2020 9:01:21 AM
Subject: Re: [ceph-users] performance degredation every 30 seconds

Aha Insightful question!
running rados bench write to the same pool, does not exhibit any problems. It 
consistently shows around 480M/sec throughput, every second.

So this would seem to be something to do with using rbd devices. Which we need 
to do.

For what it's worth, I'm using Micron 5200 Pro SSDs on all nodes.


- Original Message -
From: "Jason Dillaman" 
To: "Philip Brown" 
Cc: "ceph-users" 
Sent: Monday, December 14, 2020 8:33:09 AM
Subject: Re: [ceph-users] performance degredation every 30 seconds

On Mon, Dec 14, 2020 at 11:28 AM Philip Brown  wrote:
>
>
> I have a new 3 node octopus cluster, set up on SSDs.
>
> I'm running fio to benchmark the setup, with
>
> fio --filename=/dev/rbd0 --direct=1 --rw=randrw --bs=4k --ioengine=libaio 
> --iodepth=256 --numjobs=1 --time_based --group_reporting --name=iops-test-job 
> --runtime=120 --eta-newline=1
>
>
>
> However, I notice that, approximately every 30 seconds, performance tanks for 
> a bit.
>
> Any ideas on why, and better yet, how to get rid of the problem?

Does the same issue appear when running a direct rados bench? What
brand are your SSDs (i.e. are they data center grade)?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: performance degradation every 30 seconds

2020-12-14 Thread Jason Dillaman
On Mon, Dec 14, 2020 at 12:46 PM Philip Brown  wrote:
>
> Further experimentation with fio's -rw flag, setting to rw=read, and 
> rw=randwrite, in addition to the original rw=randrw, indicates that it is 
> tied to writes.
>
> Possibly some kind of buffer flush delay or cache sync delay when using rbd 
> device, even though fio specified --direct=1   ?

It might be worthwhile testing with a more realistic io-depth instead
of 256 in case you are hitting weird limits due to an untested corner
case? Does the performance still degrade with "--iodepth=16" or
"--iodepth=32"?

>
>
> - Original Message -
> From: "Philip Brown" 
> To: "dillaman" 
> Cc: "ceph-users" 
> Sent: Monday, December 14, 2020 9:01:21 AM
> Subject: Re: [ceph-users] performance degredation every 30 seconds
>
> Aha Insightful question!
> running rados bench write to the same pool, does not exhibit any problems. It 
> consistently shows around 480M/sec throughput, every second.
>
> So this would seem to be something to do with using rbd devices. Which we 
> need to do.
>
> For what it's worth, I'm using Micron 5200 Pro SSDs on all nodes.
>
>
> - Original Message -
> From: "Jason Dillaman" 
> To: "Philip Brown" 
> Cc: "ceph-users" 
> Sent: Monday, December 14, 2020 8:33:09 AM
> Subject: Re: [ceph-users] performance degredation every 30 seconds
>
> On Mon, Dec 14, 2020 at 11:28 AM Philip Brown  wrote:
> >
> >
> > I have a new 3 node octopus cluster, set up on SSDs.
> >
> > I'm running fio to benchmark the setup, with
> >
> > fio --filename=/dev/rbd0 --direct=1 --rw=randrw --bs=4k --ioengine=libaio 
> > --iodepth=256 --numjobs=1 --time_based --group_reporting 
> > --name=iops-test-job --runtime=120 --eta-newline=1
> >
> >
> >
> > However, I notice that, approximately every 30 seconds, performance tanks 
> > for a bit.
> >
> > Any ideas on why, and better yet, how to get rid of the problem?
>
> Does the same issue appear when running a direct rados bench? What
> brand are your SSDs (i.e. are they data center grade)?
>


-- 
Jason
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: performance degradation every 30 seconds

2020-12-14 Thread Philip Brown
Our goal is to put up a high performance ceph cluster that can deal with 100 
very active clients. So for us, testing with iodepth=256 is actually fairly 
realistic.

But it does also exhibit the problem with iodepth=32:

[root@irviscsi03 ~]# fio --filename=/dev/rbd0 --direct=1 --rw=randwrite --bs=4k 
--ioengine=libaio --iodepth=32 --numjobs=1 --time_based --group_reporting 
--name=iops-test-job --runtime=120 --eta-newline=1
iops-test-job: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
4096B-4096B, ioengine=libaio, iodepth=32
fio-3.7
Starting 1 process
fio: file /dev/rbd0 exceeds 32-bit tausworthe random generator.
fio: Switching to tausworthe64. Use the random_generator= option to get rid of 
this warning.
Jobs: 1 (f=1): [w(1)][2.5%][r=0KiB/s,w=20.5MiB/s][r=0,w=5258 IOPS][eta 01m:58s]
Jobs: 1 (f=1): [w(1)][4.1%][r=0KiB/s,w=41.1MiB/s][r=0,w=10.5k IOPS][eta 01m:56s]
Jobs: 1 (f=1): [w(1)][5.8%][r=0KiB/s,w=45.7MiB/s][r=0,w=11.7k IOPS][eta 01m:54s]
Jobs: 1 (f=1): [w(1)][7.4%][r=0KiB/s,w=55.3MiB/s][r=0,w=14.2k IOPS][eta 01m:52s]
Jobs: 1 (f=1): [w(1)][9.1%][r=0KiB/s,w=54.4MiB/s][r=0,w=13.9k IOPS][eta 01m:50s]
Jobs: 1 (f=1): [w(1)][10.7%][r=0KiB/s,w=53.4MiB/s][r=0,w=13.7k IOPS][eta 
01m:48s]
Jobs: 1 (f=1): [w(1)][12.4%][r=0KiB/s,w=53.7MiB/s][r=0,w=13.7k IOPS][eta 
01m:46s]
Jobs: 1 (f=1): [w(1)][14.0%][r=0KiB/s,w=55.7MiB/s][r=0,w=14.3k IOPS][eta 
01m:44s]
Jobs: 1 (f=1): [w(1)][15.7%][r=0KiB/s,w=54.4MiB/s][r=0,w=13.9k IOPS][eta 
01m:42s]
Jobs: 1 (f=1): [w(1)][17.4%][r=0KiB/s,w=51.6MiB/s][r=0,w=13.2k IOPS][eta 
01m:40s]
Jobs: 1 (f=1): [w(1)][19.0%][r=0KiB/s,w=38.1MiB/s][r=0,w=9748 IOPS][eta 01m:38s]
Jobs: 1 (f=1): [w(1)][20.7%][r=0KiB/s,w=24.1MiB/s][r=0,w=6158 IOPS][eta 01m:36s]
Jobs: 1 (f=1): [w(1)][22.3%][r=0KiB/s,w=12.4MiB/s][r=0,w=3178 IOPS][eta 01m:34s]
Jobs: 1 (f=1): [w(1)][24.0%][r=0KiB/s,w=31.5MiB/s][r=0,w=8056 IOPS][eta 01m:32s]
Jobs: 1 (f=1): [w(1)][25.6%][r=0KiB/s,w=48.6MiB/s][r=0,w=12.4k IOPS][eta 
01m:30s]
Jobs: 1 (f=1): [w(1)][27.3%][r=0KiB/s,w=52.2MiB/s][r=0,w=13.4k IOPS][eta 
01m:28s]
Jobs: 1 (f=1): [w(1)][28.9%][r=0KiB/s,w=54.3MiB/s][r=0,w=13.9k IOPS][eta 
01m:26s]
Jobs: 1 (f=1): [w(1)][30.6%][r=0KiB/s,w=52.6MiB/s][r=0,w=13.5k IOPS][eta 
01m:24s]
Jobs: 1 (f=1): [w(1)][32.2%][r=0KiB/s,w=55.1MiB/s][r=0,w=14.1k IOPS][eta 
01m:22s]
Jobs: 1 (f=1): [w(1)][33.9%][r=0KiB/s,w=34.3MiB/s][r=0,w=8775 IOPS][eta 01m:20s]
Jobs: 1 (f=1): [w(1)][35.5%][r=0KiB/s,w=52.5MiB/s][r=0,w=13.4k IOPS][eta 
01m:18s]
Jobs: 1 (f=1): [w(1)][37.2%][r=0KiB/s,w=52.7MiB/s][r=0,w=13.5k IOPS][eta 
01m:16s]
Jobs: 1 (f=1): [w(1)][38.8%][r=0KiB/s,w=53.9MiB/s][r=0,w=13.8k IOPS][eta 
01m:14s]
  .. etc.


- Original Message -
From: "Jason Dillaman" 
To: "Philip Brown" 
Cc: "ceph-users" 
Sent: Monday, December 14, 2020 10:19:48 AM
Subject: Re: [ceph-users] performance degredation every 30 seconds

On Mon, Dec 14, 2020 at 12:46 PM Philip Brown  wrote:
>
> Further experimentation with fio's -rw flag, setting to rw=read, and 
> rw=randwrite, in addition to the original rw=randrw, indicates that it is 
> tied to writes.
>
> Possibly some kind of buffer flush delay or cache sync delay when using rbd 
> device, even though fio specified --direct=1   ?

It might be worthwhile testing with a more realistic io-depth instead
of 256 in case you are hitting weird limits due to an untested corner
case? Does the performance still degrade with "--iodepth=16" or
"--iodepth=32"?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-14 Thread Frédéric Nass

Hi Stefan,

The initial data removal could also have resulted from a snapshot removal 
leading to OSDs OOMing, then PG remappings leading to more removals 
after the OOMed OSDs rejoined the cluster, and so on.


As mentioned by Igor : "Additionally there are users' reports that 
recent default value's modification for bluefs_buffered_io setting has 
negative impact (or just worsen existing issue with massive removal) as 
well. So you might want to switch it back to true."


We're among them. Our cluster suffered from a severe performance drop 
during snapshot removal right after upgrading to Nautilus, due to 
bluefs_buffered_io being set to false by default, with slow requests 
observed around the cluster.
Once it was back to true (which can be done with ceph tell osd.* injectargs 
'--bluefs_buffered_io=true'), snap trimming was fast again, as before 
the upgrade, with no more slow requests.


But of course we've seen the excessive memory swap usage described here: 
https://github.com/ceph/ceph/pull/34224
So we lowered osd_memory_target from 8 GB to 4 GB and haven't observed any 
swap usage since then. You can also have a look here: 
https://github.com/ceph/ceph/pull/38044


What you need to look at to understand whether your cluster would benefit 
from changing bluefs_buffered_io back to true is the %util of your 
RocksDB devices in iostat. Run iostat -dmx 1 /dev/sdX (if you're 
using SSD RocksDB devices) and compare the %util of the device with 
bluefs_buffered_io=false and with bluefs_buffered_io=true. If, with 
bluefs_buffered_io=false, the %util is over 75% most of the time, then 
you'd better change it to true. :-)
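As a concrete sketch of those checks and changes (the device name, OSD id and the 4 GiB memory target below are placeholders to adapt):

# watch %util on the RocksDB/WAL device while a removal is in progress
iostat -dmx 1 /dev/sdX
# check what a given OSD is currently running with
ceph daemon osd.0 config get bluefs_buffered_io
# flip it at runtime on all OSDs...
ceph tell osd.* injectargs '--bluefs_buffered_io=true'
# ...and persist it in the cluster configuration database
ceph config set osd bluefs_buffered_io true
# optionally lower the OSD memory target (4 GiB here) to limit swapping
ceph config set osd osd_memory_target 4294967296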


Regards,

Frédéric.

Le 14/12/2020 à 12:47, Stefan Wild a écrit :

Hi Igor,

Thank you for the detailed analysis. That makes me hopeful we can get the 
cluster back on track. No pools have been removed, but yes, due to the initial 
crash of multiple OSDs and the subsequent issues with individual OSDs we’ve had 
substantial PG remappings happening constantly.

I will look up the referenced thread(s) and try the offline DB compaction. It 
would be amazing if that does the trick.
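For reference, the offline compaction described in those threads is usually along these lines (a sketch; the OSD id, data path and systemd unit name are assumptions for a non-containerized deployment):

OSD=10   # the OSD to compact
systemctl stop ceph-osd@${OSD}
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-${OSD} compact
systemctl start ceph-osd@${OSD}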

Will keep you posted, here.

Thanks,
Stefan


From: Igor Fedotov 
Sent: Monday, December 14, 2020 6:39:28 AM
To: Stefan Wild ; ceph-users@ceph.io 
Subject: Re: [ceph-users] Re: OSD reboot loop after running out of memory

Hi Stefan,

given the crash backtrace in your log I presume some data removal is in
progress:

Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  3:
(KernelDevice::direct_read_unaligned(unsigned long, unsigned long,
char*)+0xd8) [0x5587b9364a48]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  4:
(KernelDevice::read_random(unsigned long, unsigned long, char*,
bool)+0x1b3) [0x5587b93653e3]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  5:
(BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long,
char*)+0x674) [0x5587b9328cb4]
...

Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  19:
(BlueStore::_do_omap_clear(BlueStore::TransContext*,
boost::intrusive_ptr&)+0xa2) [0x5587b922f0e2]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  20:
(BlueStore::_do_remove(BlueStore::TransContext*,
boost::intrusive_ptr&,
boost::intrusive_ptr)+0xc65) [0x5587b923b555]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  21:
(BlueStore::_remove(BlueStore::TransContext*,
boost::intrusive_ptr&,
boost::intrusive_ptr&)+0x64) [0x5587b923c3b4]
...

Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  24:
(ObjectStore::queue_transaction(boost::intrusive_ptr&,
ceph::os::Transaction&&, boost::intrusive_ptr,
ThreadPool::TPHandle*)+0x85) [0x5587b8dcf745]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  25:
(PG::do_delete_work(ceph::os::Transaction&)+0xb2e) [0x5587b8e269ee]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  26:
(PeeringState::Deleting::react(PeeringState::DeleteSome const&)+0x3e)
[0x5587b8fd6ede]
...

Did you initiate some large pool removal recently? Or may be data
rebalancing triggered PG migration (and hence source PG removal) for you?

Highly likely you're facing a well known issue with RocksDB/BlueFS
performance issues caused by massive data removal.

So your OSDs are just processing I/O very slowly which triggers suicide
timeout.

We've had multiple threads on the issue in this mailing list - the
latest one is at
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YBHNOSWW72ZVQ6PD5NABEEYRDMX7OZTT/

For now the good enough workaround is manual offline DB compaction for
all the OSDs (this might have temporary effect though as the removal
proceeds).

Additionally there are users' reports that recent default value's
modification  for bluefs_buffered_io setting has negative impact (or
just worsen existing issue with massive removal) as well. So you might
want to switch it back to true.

As for OSD.10 - can't say for sure as I haven't seen its' logs but I
think it's experiencing the same issue which might eventually lead it
into unresponsive state as well. Just grep its log for "heartb

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-14 Thread Stefan Wild
Hi Frédéric,

Thanks for the additional input. We are currently only running RGW on the 
cluster, so no snapshot removal, but there have been plenty of remappings with 
the OSDs failing (all of them at first during and after the OOM incident, then 
one-by-one). I haven't had a chance to look into or test the bluefs_buffered_io 
setting, but will do that next. Initial results from compacting all OSDs' 
RocksDBs look promising (thank you, Igor!). Things have been stable for the 
past two hours, including the two OSDs with issues (one in reboot loop, the 
other with some heartbeats missed), while 15 degraded PGs are backfilling.

The ballooning of each OSD to over 15GB memory right after the initial crash 
was even with osd_memory_target set to 2GB. The only thing that helped at that 
point was to temporarily add enough swap space to fit 12 x 15GB and let them do 
their thing. Once they had all booted, memory usage went back down to normal 
levels.

I will report back here with more details when the cluster is hopefully back to 
a healthy state.

Thanks,
Stefan



On 12/14/20, 3:35 PM, "Frédéric Nass"  wrote:

Hi Stefan,

Initial data removal could also have resulted from a snapshot removal 
leading to OSDs OOMing and then pg remappings leading to more removals 
after OOMed OSDs rejoined the cluster and so on.

As mentioned by Igor : "Additionally there are users' reports that 
recent default value's modification for bluefs_buffered_io setting has 
negative impact (or just worsen existing issue with massive removal) as 
well. So you might want to switch it back to true."

We're some of them. Our cluster suffered from a severe performance drop 
during snapshot removal right after upgrading to Nautilus, due to 
bluefs_buffered_io being set to false by default, with slow requests 
observed around the cluster.
Once back to true (can be done with ceph tell osd.* injectargs 
'--bluefs_buffered_io=true') snap trimming would be fast again so as 
before the upgrade, with no more slow requests.

But of course we've seen the excessive memory swap usage described here 
: https://github.com/ceph/ceph/pull/34224
So we lower osd_memory_target from 8MB to 4MB and haven't observed any 
swap usage since then. You can also have a look here : 
https://github.com/ceph/ceph/pull/38044

What you need to look at to understand if your cluster would benefit 
from changing bluefs_buffered_io back to true is the %util of your 
RocksDBD devices on an iostat. Run an iostat -dmx 1 /dev/sdX (if you're 
using SSD RocksDB devices) and look at the %util of the device with 
bluefs_buffered_io=false and with bluefs_buffered_io=true. If with 
bluefs_buffered_io=false, the %util is over 75% most of the time, then 
you'd better change it to true. :-)

Regards,

Frédéric.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-14 Thread Frédéric Nass
I forgot to mention "If with bluefs_buffered_io=false, the %util is over 
75% most of the time ** during data removal (like snapshot removal) **, 
then you'd better change it to true."


Regards,

Frédéric.

Le 14/12/2020 à 21:35, Frédéric Nass a écrit :

Hi Stefan,

Initial data removal could also have resulted from a snapshot removal 
leading to OSDs OOMing and then pg remappings leading to more removals 
after OOMed OSDs rejoined the cluster and so on.


As mentioned by Igor : "Additionally there are users' reports that 
recent default value's modification for bluefs_buffered_io setting has 
negative impact (or just worsen existing issue with massive removal) 
as well. So you might want to switch it back to true."


We're some of them. Our cluster suffered from a severe performance 
drop during snapshot removal right after upgrading to Nautilus, due to 
bluefs_buffered_io being set to false by default, with slow requests 
observed around the cluster.
Once back to true (can be done with ceph tell osd.* injectargs 
'--bluefs_buffered_io=true') snap trimming would be fast again so as 
before the upgrade, with no more slow requests.


But of course we've seen the excessive memory swap usage described 
here : https://github.com/ceph/ceph/pull/34224
So we lower osd_memory_target from 8MB to 4MB and haven't observed any 
swap usage since then. You can also have a look here : 
https://github.com/ceph/ceph/pull/38044


What you need to look at to understand if your cluster would benefit 
from changing bluefs_buffered_io back to true is the %util of your 
RocksDBD devices on an iostat. Run an iostat -dmx 1 /dev/sdX (if 
you're using SSD RocksDB devices) and look at the %util of the device 
with bluefs_buffered_io=false and with bluefs_buffered_io=true. If 
with bluefs_buffered_io=false, the %util is over 75% most of the time, 
then you'd better change it to true. :-)


Regards,

Frédéric.

Le 14/12/2020 à 12:47, Stefan Wild a écrit :

Hi Igor,

Thank you for the detailed analysis. That makes me hopeful we can get 
the cluster back on track. No pools have been removed, but yes, due 
to the initial crash of multiple OSDs and the subsequent issues with 
individual OSDs we’ve had substantial PG remappings happening 
constantly.


I will look up the referenced thread(s) and try the offline DB 
compaction. It would be amazing if that does the trick.


Will keep you posted, here.

Thanks,
Stefan


From: Igor Fedotov 
Sent: Monday, December 14, 2020 6:39:28 AM
To: Stefan Wild ; ceph-users@ceph.io 

Subject: Re: [ceph-users] Re: OSD reboot loop after running out of 
memory


Hi Stefan,

given the crash backtrace in your log I presume some data removal is in
progress:

Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  3:
(KernelDevice::direct_read_unaligned(unsigned long, unsigned long,
char*)+0xd8) [0x5587b9364a48]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  4:
(KernelDevice::read_random(unsigned long, unsigned long, char*,
bool)+0x1b3) [0x5587b93653e3]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  5:
(BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long,
char*)+0x674) [0x5587b9328cb4]
...

Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  19:
(BlueStore::_do_omap_clear(BlueStore::TransContext*,
boost::intrusive_ptr&)+0xa2) [0x5587b922f0e2]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  20:
(BlueStore::_do_remove(BlueStore::TransContext*,
boost::intrusive_ptr&,
boost::intrusive_ptr)+0xc65) [0x5587b923b555]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  21:
(BlueStore::_remove(BlueStore::TransContext*,
boost::intrusive_ptr&,
boost::intrusive_ptr&)+0x64) [0x5587b923c3b4]
...

Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  24:
(ObjectStore::queue_transaction(boost::intrusive_ptr&, 


ceph::os::Transaction&&, boost::intrusive_ptr,
ThreadPool::TPHandle*)+0x85) [0x5587b8dcf745]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  25:
(PG::do_delete_work(ceph::os::Transaction&)+0xb2e) [0x5587b8e269ee]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  26:
(PeeringState::Deleting::react(PeeringState::DeleteSome const&)+0x3e)
[0x5587b8fd6ede]
...

Did you initiate some large pool removal recently? Or may be data
rebalancing triggered PG migration (and hence source PG removal) for 
you?


Highly likely you're facing a well known issue with RocksDB/BlueFS
performance issues caused by massive data removal.

So your OSDs are just processing I/O very slowly which triggers suicide
timeout.

We've had multiple threads on the issue in this mailing list - the
latest one is at
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YBHNOSWW72ZVQ6PD5NABEEYRDMX7OZTT/ 



For now the good enough workaround is manual offline DB compaction for
all the OSDs (this might have temporary effect though as the removal
proceeds).

Additionally there are users' reports that recent default value's
modification  for bluefs_buffered_io setting has negative impact (or
just worsen existing issue with mass

[ceph-users] Re: Provide more documentation for MDS performance tuning on large file systems

2020-12-14 Thread Patrick Donnelly
On Mon, Dec 7, 2020 at 12:06 PM Patrick Donnelly  wrote:
>
> Hi Dan & Janek,
>
> On Sat, Dec 5, 2020 at 6:26 AM Dan van der Ster  wrote:
> > My understanding is that the recall thresholds (see my list below)
> > should be scaled proportionally. OTOH, I haven't played with the decay
> > rates (and don't know if there's any significant value to tuning
> > those).
>
> I haven't gone through this thread yet but I want to note for those
> reading that we do now have documentation (thanks for the frequent
> pokes Janek!) for the recall configurations:
>
> https://docs.ceph.com/en/latest/cephfs/cache-configuration/#mds-recall
>
> Please let us know if it's missing information or if something could
> be more clear.

I also now have a PR open for updating the defaults based on these and
other discussions: https://github.com/ceph/ceph/pull/38574

Feedback welcome.
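For anyone wanting to experiment before those new defaults land, the recall settings can be inspected and adjusted roughly like this (a sketch; the values are purely illustrative, not recommendations from this thread, and the option names should be double-checked against your release with "ceph config help"):

# see what a running MDS is using (mds.<name> is a placeholder)
ceph config show mds.<name> | grep mds_recall
# example of scaling two of the recall settings (illustrative values only)
ceph config set mds mds_recall_max_decay_threshold 32768
ceph config set mds mds_recall_max_caps 30000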

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Monitors not starting, getting "e3 handle_auth_request failed to assign global_id"

2020-12-14 Thread Wesley Dillingham
We had to rebuild our mons on a few occasions because of this. Only one mon
was ever dropped from quorum at a time in our case. In other scenarios with
the same error the mon was able to rejoin after thirty minutes or so. We
believe we may have tracked it down (in our case) to the upgrade of an AV /
packet inspection security technology being run on the servers. Perhaps
you've made similar updates.

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Tue, Dec 8, 2020 at 7:46 PM Wesley Dillingham 
wrote:

> We have also had this issue multiple times in 14.2.11
>
> On Tue, Dec 8, 2020, 5:11 PM  wrote:
>
>> I have same issue. My cluster runing 14.2.11 versions. What is your
>> version ceph?
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Monitors not starting, getting "e3 handle_auth_request failed to assign global_id"

2020-12-14 Thread Hoan Nguyen Van
I found a merge request; ceph mon has a new option: mon_sync_max_payload_keys

https://github.com/ceph/ceph/commit/d6037b7f484e13cfc9136e63e4cf7fac6ad68960#diff-495ccc5deb4f8fbd94e795e66c3720677f821314d4b9042f99664cd48a9506fd

My value of the mon_sync_max_payload_size option is 4096.
If the mon_sync_max_payload_keys option is left at its default, the ceph mon may
fail to sync because of slow ops.

What do you think?
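To compare the two values on a running monitor, something like this should work (a sketch; the monitor id and the injected value are placeholders):

# current settings as seen by a running mon
ceph daemon mon.$(hostname -s) config get mon_sync_max_payload_size
ceph daemon mon.$(hostname -s) config get mon_sync_max_payload_keys
# both can be changed at runtime if you want to test smaller sync payloads
ceph tell mon.* injectargs '--mon_sync_max_payload_keys=1000'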
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: performance degradation every 30 seconds

2020-12-14 Thread Jason Dillaman
On Mon, Dec 14, 2020 at 1:28 PM Philip Brown  wrote:
>
> Our goal is to put up a high performance ceph cluster that can deal with 100 
> very active clients. So for us, testing with iodepth=256 is actually fairly 
> realistic.

100 active clients on the same node or just 100 active clients?

> but it does also exhibit the problem with iodepth=32
>
> [root@irviscsi03 ~]# fio --filename=/dev/rbd0 --direct=1 --rw=randwrite 
> --bs=4k --ioengine=libaio --iodepth=32 --numjobs=1 --time_based 
> --group_reporting --name=iops-test-job --runtime=120 --eta-newline=1
> iops-test-job: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
> 4096B-4096B, ioengine=libaio, iodepth=32
> fio-3.7
> Starting 1 process
> fio: file /dev/rbd0 exceeds 32-bit tausworthe random generator.
> fio: Switching to tausworthe64. Use the random_generator= option to get rid 
> of this warning.
> Jobs: 1 (f=1): [w(1)][2.5%][r=0KiB/s,w=20.5MiB/s][r=0,w=5258 IOPS][eta 
> 01m:58s]
> Jobs: 1 (f=1): [w(1)][4.1%][r=0KiB/s,w=41.1MiB/s][r=0,w=10.5k IOPS][eta 
> 01m:56s]
> Jobs: 1 (f=1): [w(1)][5.8%][r=0KiB/s,w=45.7MiB/s][r=0,w=11.7k IOPS][eta 
> 01m:54s]
> Jobs: 1 (f=1): [w(1)][7.4%][r=0KiB/s,w=55.3MiB/s][r=0,w=14.2k IOPS][eta 
> 01m:52s]
> Jobs: 1 (f=1): [w(1)][9.1%][r=0KiB/s,w=54.4MiB/s][r=0,w=13.9k IOPS][eta 
> 01m:50s]
> Jobs: 1 (f=1): [w(1)][10.7%][r=0KiB/s,w=53.4MiB/s][r=0,w=13.7k IOPS][eta 
> 01m:48s]
> Jobs: 1 (f=1): [w(1)][12.4%][r=0KiB/s,w=53.7MiB/s][r=0,w=13.7k IOPS][eta 
> 01m:46s]
> Jobs: 1 (f=1): [w(1)][14.0%][r=0KiB/s,w=55.7MiB/s][r=0,w=14.3k IOPS][eta 
> 01m:44s]
> Jobs: 1 (f=1): [w(1)][15.7%][r=0KiB/s,w=54.4MiB/s][r=0,w=13.9k IOPS][eta 
> 01m:42s]
> Jobs: 1 (f=1): [w(1)][17.4%][r=0KiB/s,w=51.6MiB/s][r=0,w=13.2k IOPS][eta 
> 01m:40s]
> Jobs: 1 (f=1): [w(1)][19.0%][r=0KiB/s,w=38.1MiB/s][r=0,w=9748 IOPS][eta 
> 01m:38s]
> Jobs: 1 (f=1): [w(1)][20.7%][r=0KiB/s,w=24.1MiB/s][r=0,w=6158 IOPS][eta 
> 01m:36s]
> Jobs: 1 (f=1): [w(1)][22.3%][r=0KiB/s,w=12.4MiB/s][r=0,w=3178 IOPS][eta 
> 01m:34s]
> Jobs: 1 (f=1): [w(1)][24.0%][r=0KiB/s,w=31.5MiB/s][r=0,w=8056 IOPS][eta 
> 01m:32s]
> Jobs: 1 (f=1): [w(1)][25.6%][r=0KiB/s,w=48.6MiB/s][r=0,w=12.4k IOPS][eta 
> 01m:30s]
> Jobs: 1 (f=1): [w(1)][27.3%][r=0KiB/s,w=52.2MiB/s][r=0,w=13.4k IOPS][eta 
> 01m:28s]
> Jobs: 1 (f=1): [w(1)][28.9%][r=0KiB/s,w=54.3MiB/s][r=0,w=13.9k IOPS][eta 
> 01m:26s]
> Jobs: 1 (f=1): [w(1)][30.6%][r=0KiB/s,w=52.6MiB/s][r=0,w=13.5k IOPS][eta 
> 01m:24s]
> Jobs: 1 (f=1): [w(1)][32.2%][r=0KiB/s,w=55.1MiB/s][r=0,w=14.1k IOPS][eta 
> 01m:22s]
> Jobs: 1 (f=1): [w(1)][33.9%][r=0KiB/s,w=34.3MiB/s][r=0,w=8775 IOPS][eta 
> 01m:20s]
> Jobs: 1 (f=1): [w(1)][35.5%][r=0KiB/s,w=52.5MiB/s][r=0,w=13.4k IOPS][eta 
> 01m:18s]
> Jobs: 1 (f=1): [w(1)][37.2%][r=0KiB/s,w=52.7MiB/s][r=0,w=13.5k IOPS][eta 
> 01m:16s]
> Jobs: 1 (f=1): [w(1)][38.8%][r=0KiB/s,w=53.9MiB/s][r=0,w=13.8k IOPS][eta 
> 01m:14s]

Have you tried different kernel versions? It might also be worthwhile
testing using fio's "rados" engine [1] (vs your rados bench test),
since it might not have been comparing apples-to-apples given the
>400MiB/s throughput you listed (i.e. large IOs are handled
differently than small IOs internally).
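A small-block run with fio's rados engine, roughly matching the RBD test, could look like this (a sketch based on the example file linked below; the pool and client names are assumptions):

fio --ioengine=rados --clientname=admin --pool=rbd \
    --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 \
    --time_based --runtime=120 --name=rados-4k-randwrite --eta-newline=1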

>   .. etc.
>
>
> - Original Message -
> From: "Jason Dillaman" 
> To: "Philip Brown" 
> Cc: "ceph-users" 
> Sent: Monday, December 14, 2020 10:19:48 AM
> Subject: Re: [ceph-users] performance degredation every 30 seconds
>
> On Mon, Dec 14, 2020 at 12:46 PM Philip Brown  wrote:
> >
> > Further experimentation with fio's -rw flag, setting to rw=read, and 
> > rw=randwrite, in addition to the original rw=randrw, indicates that it is 
> > tied to writes.
> >
> > Possibly some kind of buffer flush delay or cache sync delay when using rbd 
> > device, even though fio specified --direct=1   ?
>
> It might be worthwhile testing with a more realistic io-depth instead
> of 256 in case you are hitting weird limits due to an untested corner
> case? Does the performance still degrade with "--iodepth=16" or
> "--iodepth=32"?
>

[1] https://github.com/axboe/fio/blob/master/examples/rados.fio

-- 
Jason
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: performance degradation every 30 seconds

2020-12-14 Thread Nathan Fish
Perhaps the WAL is filling up when the iodepth is so high? Is the WAL on the same
SSDs? If you double the WAL size, does it change?
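One way to check that (a sketch; osd.0 is a placeholder, jq is assumed to be installed, and the counter names may vary slightly between releases) is to watch the bluefs counters on an OSD while the fio run is active:

ceph daemon osd.0 perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes, wal_total_bytes, wal_used_bytes, slow_used_bytes}'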


On Mon, Dec 14, 2020 at 9:05 PM Jason Dillaman  wrote:
>
> On Mon, Dec 14, 2020 at 1:28 PM Philip Brown  wrote:
> >
> > Our goal is to put up a high performance ceph cluster that can deal with 
> > 100 very active clients. So for us, testing with iodepth=256 is actually 
> > fairly realistic.
>
> 100 active clients on the same node or just 100 active clients?
>
> > but it does also exhibit the problem with iodepth=32
> >
> > [root@irviscsi03 ~]# fio --filename=/dev/rbd0 --direct=1 --rw=randwrite 
> > --bs=4k --ioengine=libaio --iodepth=32 --numjobs=1 --time_based 
> > --group_reporting --name=iops-test-job --runtime=120 --eta-newline=1
> > iops-test-job: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, 
> > (T) 4096B-4096B, ioengine=libaio, iodepth=32
> > fio-3.7
> > Starting 1 process
> > fio: file /dev/rbd0 exceeds 32-bit tausworthe random generator.
> > fio: Switching to tausworthe64. Use the random_generator= option to get rid 
> > of this warning.
> > Jobs: 1 (f=1): [w(1)][2.5%][r=0KiB/s,w=20.5MiB/s][r=0,w=5258 IOPS][eta 
> > 01m:58s]
> > Jobs: 1 (f=1): [w(1)][4.1%][r=0KiB/s,w=41.1MiB/s][r=0,w=10.5k IOPS][eta 
> > 01m:56s]
> > Jobs: 1 (f=1): [w(1)][5.8%][r=0KiB/s,w=45.7MiB/s][r=0,w=11.7k IOPS][eta 
> > 01m:54s]
> > Jobs: 1 (f=1): [w(1)][7.4%][r=0KiB/s,w=55.3MiB/s][r=0,w=14.2k IOPS][eta 
> > 01m:52s]
> > Jobs: 1 (f=1): [w(1)][9.1%][r=0KiB/s,w=54.4MiB/s][r=0,w=13.9k IOPS][eta 
> > 01m:50s]
> > Jobs: 1 (f=1): [w(1)][10.7%][r=0KiB/s,w=53.4MiB/s][r=0,w=13.7k IOPS][eta 
> > 01m:48s]
> > Jobs: 1 (f=1): [w(1)][12.4%][r=0KiB/s,w=53.7MiB/s][r=0,w=13.7k IOPS][eta 
> > 01m:46s]
> > Jobs: 1 (f=1): [w(1)][14.0%][r=0KiB/s,w=55.7MiB/s][r=0,w=14.3k IOPS][eta 
> > 01m:44s]
> > Jobs: 1 (f=1): [w(1)][15.7%][r=0KiB/s,w=54.4MiB/s][r=0,w=13.9k IOPS][eta 
> > 01m:42s]
> > Jobs: 1 (f=1): [w(1)][17.4%][r=0KiB/s,w=51.6MiB/s][r=0,w=13.2k IOPS][eta 
> > 01m:40s]
> > Jobs: 1 (f=1): [w(1)][19.0%][r=0KiB/s,w=38.1MiB/s][r=0,w=9748 IOPS][eta 
> > 01m:38s]
> > Jobs: 1 (f=1): [w(1)][20.7%][r=0KiB/s,w=24.1MiB/s][r=0,w=6158 IOPS][eta 
> > 01m:36s]
> > Jobs: 1 (f=1): [w(1)][22.3%][r=0KiB/s,w=12.4MiB/s][r=0,w=3178 IOPS][eta 
> > 01m:34s]
> > Jobs: 1 (f=1): [w(1)][24.0%][r=0KiB/s,w=31.5MiB/s][r=0,w=8056 IOPS][eta 
> > 01m:32s]
> > Jobs: 1 (f=1): [w(1)][25.6%][r=0KiB/s,w=48.6MiB/s][r=0,w=12.4k IOPS][eta 
> > 01m:30s]
> > Jobs: 1 (f=1): [w(1)][27.3%][r=0KiB/s,w=52.2MiB/s][r=0,w=13.4k IOPS][eta 
> > 01m:28s]
> > Jobs: 1 (f=1): [w(1)][28.9%][r=0KiB/s,w=54.3MiB/s][r=0,w=13.9k IOPS][eta 
> > 01m:26s]
> > Jobs: 1 (f=1): [w(1)][30.6%][r=0KiB/s,w=52.6MiB/s][r=0,w=13.5k IOPS][eta 
> > 01m:24s]
> > Jobs: 1 (f=1): [w(1)][32.2%][r=0KiB/s,w=55.1MiB/s][r=0,w=14.1k IOPS][eta 
> > 01m:22s]
> > Jobs: 1 (f=1): [w(1)][33.9%][r=0KiB/s,w=34.3MiB/s][r=0,w=8775 IOPS][eta 
> > 01m:20s]
> > Jobs: 1 (f=1): [w(1)][35.5%][r=0KiB/s,w=52.5MiB/s][r=0,w=13.4k IOPS][eta 
> > 01m:18s]
> > Jobs: 1 (f=1): [w(1)][37.2%][r=0KiB/s,w=52.7MiB/s][r=0,w=13.5k IOPS][eta 
> > 01m:16s]
> > Jobs: 1 (f=1): [w(1)][38.8%][r=0KiB/s,w=53.9MiB/s][r=0,w=13.8k IOPS][eta 
> > 01m:14s]
>
> Have you tried different kernel versions? Might also be worthwhile
> testing using fio's "rados" engine [1] (vs your rados bench test)
> since it might not have been comparing apples-to-apples given the
> >400MiB/s throughout you listed (i.e. large IOs are handled
> differently than small IOs internally).
>
> >   .. etc.
> >
> >
> > - Original Message -
> > From: "Jason Dillaman" 
> > To: "Philip Brown" 
> > Cc: "ceph-users" 
> > Sent: Monday, December 14, 2020 10:19:48 AM
> > Subject: Re: [ceph-users] performance degredation every 30 seconds
> >
> > On Mon, Dec 14, 2020 at 12:46 PM Philip Brown  wrote:
> > >
> > > Further experimentation with fio's -rw flag, setting to rw=read, and 
> > > rw=randwrite, in addition to the original rw=randrw, indicates that it is 
> > > tied to writes.
> > >
> > > Possibly some kind of buffer flush delay or cache sync delay when using 
> > > rbd device, even though fio specified --direct=1   ?
> >
> > It might be worthwhile testing with a more realistic io-depth instead
> > of 256 in case you are hitting weird limits due to an untested corner
> > case? Does the performance still degrade with "--iodepth=16" or
> > "--iodepth=32"?
> >
>
> [1] https://github.com/axboe/fio/blob/master/examples/rados.fio
>
> --
> Jason
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Weird ceph df

2020-12-14 Thread Osama Elswah
ceph df [detail] output (POOLS section) has been modified in plain format:

  - 'BYTES USED' column renamed to 'STORED'. Represents the amount of data
    stored by the user.
  - 'USED' column now represents the amount of space allocated purely for data
    by all OSD nodes in KB.

source: https://docs.ceph.com/en/nautilus/releases/nautilus/
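For completeness, both accountings and the quota being compared against can be pulled with standard commands (the pool name is taken from the quoted output below):

ceph df detail
ceph osd pool get-quota k8s-dbss-w-mdc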
On Tue, Dec 15, 2020 at 6:35 AM Szabo, Istvan (Agoda) <
istvan.sz...@agoda.com> wrote:

> Hi,
>
> It is a nautilus 14.2.13 ceph.
>
> The quota on the pool is 745GiB, how can be the stored data 788GiB? (2
> replicas pool).
> Based on the used column it means just 334GiB is used because the pool has
> 2 replicas only. I don't understand.
>
> POOLS:
> POOL            ID  STORED   OBJECTS  USED     %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY    USED COMPR  UNDER COMPR
> k8s-dbss-w-mdc  12  788 GiB  202.42k  668 GiB  0.75   43 TiB     N/A            745 GiB      202.42k  0 B         0 B
>
> Thank you
>
> 
> This message is confidential and is for the sole use of the intended
> recipient(s). It may also be privileged or otherwise protected by copyright
> or other legal rules. If you have received it by mistake please let us know
> by reply email and delete it from your system. It is prohibited to copy
> this message or disclose its content to anyone. Any confidentiality or
> privilege is not waived or lost by any mistaken delivery or unauthorized
> disclosure of the message. All messages sent to and from Agoda may be
> monitored to ensure compliance with company policies, to protect the
> company's interests and to remove potential malware. Electronic messages
> may be intercepted, amended, lost or deleted, or contain viruses.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io