[ceph-users] Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

2021-11-04 Thread Manuel Lausch
On Tue, 2 Nov 2021 09:02:31 -0500
Sage Weil  wrote:


> 
> Just to be clear, you should try
>   osd_fast_shutdown = true
>   osd_fast_shutdown_notify_mon = false

I added some logs to the tracker ticket with this options set.


> You write if the osd rejects messenger connections, because it is
> > stopped, the peering process will skip the read_lease timeout. If
> > the OSD announces its shutdown, can we not skip this read_lease
> > timeout as well?
> >  
> 
> If memory serves, yes, but the notify_mon process can take more time
> than a peer OSD getting ECONNREFUSED.  The combination above is the
> recommended combination (and the default).

In my tests yesterday I again saw that it took about 2 seconds between
stopping an OSD and the first blame entry in ceph.log.
With the notification enabled, I got the down message immediately.
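
For reference, a minimal sketch of applying the two options above via the
central config store, assuming the cluster uses `ceph config` rather than a
plain ceph.conf (the values can equally be set under [osd] in ceph.conf):

```
# Apply the recommended combination cluster-wide for OSDs
ceph config set osd osd_fast_shutdown true
ceph config set osd osd_fast_shutdown_notify_mon false

# Verify what the OSDs will actually use
ceph config get osd osd_fast_shutdown
ceph config get osd osd_fast_shutdown_notify_mon
```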




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: High ceph_osd_commit_latency_ms on Toshiba MG07ACA14TE HDDs

2021-11-04 Thread Dan van der Ster
Hello Benoît, (and others in this great thread),

Apologies for replying to this ancient thread.

We have been debugging similar issues during an ongoing migration to
new servers with TOSHIBA MG07ACA14TE hdds.

We see a similar commit_latency_ms issue on the new drives (~60ms in
our env vs ~20ms for some old 6TB Seagates).
However, disabling the write cache (hdparm -W 0) made absolutely no
difference for us.

So we're wondering:
* Are we running the same firmware as you? (We have 0104). I wonder if
Toshiba has changed the implementation of the cache in the meantime...
* Is anyone aware of some HBA or other setting in the middle that
might be masking this setting from reaching the drive?
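
A hedged sketch of checking whether the cache setting actually reaches the
drive, querying it both through the block layer and through smartmontools;
/dev/sdX is a placeholder for one of the HDDs behind the HBA:

```
# Report the current volatile write-cache state as the kernel sees it
hdparm -W /dev/sdX

# Disable the write cache
hdparm -W 0 /dev/sdX

# Cross-check via smartmontools, which queries the drive more directly
smartctl -g wcache /dev/sdX
```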

Best Regards,

Dan



On Wed, Jun 24, 2020 at 9:44 AM Benoît Knecht  wrote:
>
> Hi,
>
> We have a Nautilus (14.2.9) Ceph cluster with two types of HDDs:
>
> - TOSHIBA MG07ACA14TE   [1]
> - HGST HUH721212ALE604  [2]
>
> They're all bluestore OSDs with no separate DB+WAL and part of the same pool.
>
> We noticed that while the HGST OSDs have a commit latency of about 15ms, the 
> Toshiba OSDs hover around 150ms (these values come from the 
> `ceph_osd_commit_latency_ms` metric in Prometheus).
>
> On paper, it seems like those drives have very similar specs, so it's not 
> clear to me why we're seeing such a large difference when it comes to commit 
> latency.
>
> Has anyone had any experience with those Toshiba drives? Or looking at the 
> specs, do you spot anything suspicious?
>
> And if you're running a Ceph cluster with various disk brands/models, have 
> you ever noticed some of them standing out when looking at 
> `ceph_osd_commit_latency_ms`?
>
> Thanks in advance for your feedback.
>
> Cheers,
>
> --
> Ben
>
> [1]: 
> https://toshiba.semicon-storage.com/content/dam/toshiba-ss/asia-pacific/docs/product/storage/product-manual/eHDD-MG07ACA-Product-Manual.pdf
> [2]: 
> https://documents.westerndigital.com/content/dam/doc-library/en_us/assets/public/western-digital/product/data-center-drives/ultrastar-dc-hc500-series/data-sheet-ultrastar-dc-hc520.pdf
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: High ceph_osd_commit_latency_ms on Toshiba MG07ACA14TE HDDs

2021-11-04 Thread Mark Nelson

Hi Dan,


I can't speak for those specific Toshiba drives, but we have absolutely 
seen very strange behavior (sometimes with cache enabled and sometimes 
not) with different drives and firmwares over the years from various 
manufacturers.  There was one especially bad case from back in the 
Inktank days, but my memory is a bit fuzzy.  I think we were seeing 
weird periodic commit latency spikes that grew worse over time.  That 
one might have been cache related.  I believe we ended up doing a lot of 
tests with blktrace and iowatcher to show the manufacturer what we were 
seeing, but I don't recall if anything ever got fixed.



Mark


On 11/4/21 5:33 AM, Dan van der Ster wrote:

Hello Benoît, (and others in this great thread),

Apologies for replying to this ancient thread.

We have been debugging similar issues during an ongoing migration to
new servers with TOSHIBA MG07ACA14TE hdds.

We see a similar commit_latency_ms issue on the new drives (~60ms in
our env vs ~20ms for some old 6TB Seagates).
However, disabling the write cache (hdparm -W 0) made absolutely no
difference for us.

So we're wondering:
* Are we running the same firmware as you? (We have 0104). I wonder if
Toshiba has changed the implementation of the cache in the meantime...
* Is anyone aware of some HBA or other setting in the middle that
might be masking this setting from reaching the drive?

Best Regards,

Dan



On Wed, Jun 24, 2020 at 9:44 AM Benoît Knecht  wrote:

Hi,

We have a Nautilus (14.2.9) Ceph cluster with two types of HDDs:

- TOSHIBA MG07ACA14TE   [1]
- HGST HUH721212ALE604  [2]

They're all bluestore OSDs with no separate DB+WAL and part of the same pool.

We noticed that while the HGST OSDs have a commit latency of about 15ms, the 
Toshiba OSDs hover around 150ms (these values come from the 
`ceph_osd_commit_latency_ms` metric in Prometheus).

On paper, it seems like those drives have very similar specs, so it's not clear 
to me why we're seeing such a large difference when it comes to commit latency.

Has anyone had any experience with those Toshiba drives? Or looking at the 
specs, do you spot anything suspicious?

And if you're running a Ceph cluster with various disk brands/models, have you 
ever noticed some of them standing out when looking at 
`ceph_osd_commit_latency_ms`?

Thanks in advance for your feedback.

Cheers,

--
Ben

[1]: 
https://toshiba.semicon-storage.com/content/dam/toshiba-ss/asia-pacific/docs/product/storage/product-manual/eHDD-MG07ACA-Product-Manual.pdf
[2]: 
https://documents.westerndigital.com/content/dam/doc-library/en_us/assets/public/western-digital/product/data-center-drives/ultrastar-dc-hc500-series/data-sheet-ultrastar-dc-hc520.pdf
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Multisite replication is on gateway layer right?

2021-11-04 Thread Janne Johansson
Den tors 4 nov. 2021 kl 13:37 skrev Szabo, Istvan (Agoda)
:
> Hi,
>
> In the case of bucket replication, does the replication happen at the OSD level 
> or at the gateway layer?

bucket == gateway layer.

> Could it be a problem that, in my 3-cluster multisite environment, the 
> cluster networks use MTU 9000 in 2 clusters and MTU 1500 in 1 cluster?

Probably not.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: High ceph_osd_commit_latency_ms on Toshiba MG07ACA14TE HDDs

2021-11-04 Thread Dan van der Ster
Thanks Mark.

With the help of the crowd on Telegram, we found that (at least here)
the drive cache needs to be disabled like this:

```
for x in /sys/class/scsi_disk/*/cache_type; do echo 'write through' > $x; done
```

This disables the cache (confirmed afterwards with hdparm), but more
importantly fio --fsync=1 --direct=1 is now giving 400 iops (as
opposed to 80 iops out of the box).

And commit_latency_ms drops from 80ms to 3ms :-)

We'll do more testing here but this looks like a magic switch. (we
should consider documenting or automating this to some extent, imho).
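
A sketch of the fio check and of persisting the sysfs setting with a udev
rule; the fio target file, runtime and rule path are assumptions, and the fio
run should point at a scratch file on the affected disk, not at a device
holding data:

```
# Single-threaded sync-write latency test, roughly the kind of run quoted above
fio --name=synctest --filename=/mnt/scratch/fio.tmp --size=256m \
    --rw=write --bs=4k --fsync=1 --direct=1 --runtime=30 --time_based

# Persist "write through" across reboots for all SCSI disks
cat > /etc/udev/rules.d/99-scsi-write-through.rules <<'EOF'
ACTION=="add|change", SUBSYSTEM=="scsi_disk", ATTR{cache_type}="write through"
EOF
udevadm control --reload && udevadm trigger
```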

Cheers, Dan

On Thu, Nov 4, 2021 at 11:48 AM Mark Nelson  wrote:
>
> Hi Dan,
>
>
> I can't speak for those specific Toshiba drives, but we have absolutely
> seen very strange behavior (sometimes with cache enabled and sometimes
> not) with different drives and firmwares over the years from various
> manufacturers.  There was one especially bad case from back in the
> Inktank days, but my memory is a bit fuzzy.  I think we were seeing
> weird periodic commit latency spikes that grew worse over time.  That
> one might have been cache related.  I believe we ended up doing a lot of
> tests with blktrace and iowatcher to show the manufacturer what we were
> seeing, but I don't recall if anything ever got fixed.
>
>
> Mark
>
>
> On 11/4/21 5:33 AM, Dan van der Ster wrote:
> > Hello Benoît, (and others in this great thread),
> >
> > Apologies for replying to this ancient thread.
> >
> > We have been debugging similar issues during an ongoing migration to
> > new servers with TOSHIBA MG07ACA14TE hdds.
> >
> > We see a similar commit_latency_ms issue on the new drives (~60ms in
> > our env vs ~20ms for some old 6TB Seagates).
> > However, disabling the write cache (hdparm -W 0) made absolutely no
> > difference for us.
> >
> > So we're wondering:
> > * Are we running the same firmware as you? (We have 0104). I wonder if
> > Toshiba has changed the implementation of the cache in the meantime...
> > * Is anyone aware of some HBA or other setting in the middle that
> > might be masking this setting from reaching the drive?
> >
> > Best Regards,
> >
> > Dan
> >
> >
> >
> > On Wed, Jun 24, 2020 at 9:44 AM Benoît Knecht  wrote:
> >> Hi,
> >>
> >> We have a Nautilus (14.2.9) Ceph cluster with two types of HDDs:
> >>
> >> - TOSHIBA MG07ACA14TE   [1]
> >> - HGST HUH721212ALE604  [2]
> >>
> >> They're all bluestore OSDs with no separate DB+WAL and part of the same 
> >> pool.
> >>
> >> We noticed that while the HGST OSDs have a commit latency of about 15ms, 
> >> the Toshiba OSDs hover around 150ms (these values come from the 
> >> `ceph_osd_commit_latency_ms` metric in Prometheus).
> >>
> >> On paper, it seems like those drives have very similar specs, so it's not 
> >> clear to me why we're seeing such a large difference when it comes to 
> >> commit latency.
> >>
> >> Has anyone had any experience with those Toshiba drives? Or looking at the 
> >> specs, do you spot anything suspicious?
> >>
> >> And if you're running a Ceph cluster with various disk brands/models, have 
> >> you ever noticed some of them standing out when looking at 
> >> `ceph_osd_commit_latency_ms`?
> >>
> >> Thanks in advance for your feedback.
> >>
> >> Cheers,
> >>
> >> --
> >> Ben
> >>
> >> [1]: 
> >> https://toshiba.semicon-storage.com/content/dam/toshiba-ss/asia-pacific/docs/product/storage/product-manual/eHDD-MG07ACA-Product-Manual.pdf
> >> [2]: 
> >> https://documents.westerndigital.com/content/dam/doc-library/en_us/assets/public/western-digital/product/data-center-drives/ultrastar-dc-hc500-series/data-sheet-ultrastar-dc-hc520.pdf
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: High cephfs MDS latency and CPU load

2021-11-04 Thread Patrick Donnelly
Hi Andras,

On Wed, Nov 3, 2021 at 10:18 AM Andras Pataki
 wrote:
>
> Hi cephers,
>
> Recently we've started using cephfs snapshots more - and seem to be
> running into a rather annoying performance issue with the MDS.  The
> cluster in question is on Nautilus 14.2.20.
>
> Typically, the MDS processes a few thousand requests per second with all
> operations showing latencies in the few millisecond range (in mds perf
> dump) and the system seems quite responsive (directory listings, general
> metadata operations feel quick).  Every so often, the MDS transitions
> into a period of super high latency: 0.1 to 2 seconds per operation (as
> measured by increases in the latency counters in mds perf dump).  During
> these high latency periods, the request rate is about the same (low
> 1000s requests/s) - but one thread of the MDS called 'fn_anonymous' is
> 100% busy.  Pointing the debugger to it and getting a stack trace at
> random times always shows a similar picture:

Thanks for the report and useful stack trace. This is probably
corrected by the new use of a "fair" mutex in the MDS:

https://tracker.ceph.com/issues/52441

The fix will be in 16.2.7.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] large bucket index in multisite environement (how to deal with large omap objects warning)?

2021-11-04 Thread Boris Behrens
Hi everybody,

we maintain three ceph clusters (2x octopus, 1x nautilus) that use three
zonegroups to sync metadata, without syncing the actual data (only one zone
per zonegroup).

A customer has buckets with >4M objects in our largest cluster (the
other two are very fresh, with close to 0 data in them).

How do I handle that with regard to the "Large OMAP objects" warning?
- Sharding is not an option, because it is a multisite environment (at
least that's what I read everywhere).
- Limiting the customer is not a great option, because they already have that
huge number of objects in their buckets.
- Disabling the warning / increasing the threshold is IMHO a bad option
(people probably put some thought into that limit, and having 40x the limit
is far beyond the "just roll with it" threshold).

I really hope that someone does have an answer, or maybe there is some
roadmap which addresses this issue.

Cheers
 Boris
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

2021-11-04 Thread Gregory Farnum
On Tue, Nov 2, 2021 at 7:03 AM Sage Weil  wrote:

> On Tue, Nov 2, 2021 at 8:29 AM Manuel Lausch 
> wrote:
>
> > Hi Sage,
> >
> > The "osd_fast_shutdown" is set to "false"
> > As we upgraded to luminous I also had blocked IO issuses with this
> > enabled.
> >
> > Some weeks ago I tried out the options "osd_fast_shutdown" and
> > "osd_fast_shutdown_notify_mon" and also got slow ops while
> > stopping/starting OSDs. But I didn't ceck if this triggert the
> > problem with the read_leases or if this triggert my old issue
>
> with the fast shutodnw.
> >
>
> Just to be clear, you should try
>   osd_fast_shutdown = true
>   osd_fast_shutdown_notify_mon = false
>
> You write if the osd rejects messenger connections, because it is
> > stopped, the peering process will skip the read_lease timeout. If the
> > OSD annouces its shutdown, can we not skip this read_lease timeout as
> > well?
> >
>
> If memory serves, yes, but the notify_mon process can take more time than a
> peer OSD getting ECONNREFUSED.  The combination above is the recommended
> combination (and the default).


Hmmm, if the OSDs are detecting shutdown based on networking error codes,
could a networking configuration or security switch prevent them from
seeing the “correct” failure result?
-Greg



>
>
> > These days I will test the fast_shutdown switch again and will share the
> > corresponding logs with you.
> >
>
> Thanks!
> sage
>
>
>
> >
> >
> > Best regards from Karlsruhe
> > Manuel
> >
> >
> > On Mon, 1 Nov 2021 15:55:35 -0500
> > Sage Weil  wrote:
> >
> > > Hi Manuel,
> > >
> > > I'm looking at the ticket for this issue (
> > > https://tracker.ceph.com/issues/51463) and tried to reproduce.  This
> > > was initially trivial to do with vstart (rados bench paused for many
> > > seconds afters stopping an osd) but it turns out that was because the
> > > vstart ceph.conf includes `osd_fast_shutdown = false`.  Once I
> > > enabled that again (as it is by default on a normal cluster) I did
> > > not see any noticeable interruption when an OSD was stopped.
> > >
> > > Can you confirm what osd_fast_shutdown and
> > > osd_fast_shutdown_notify_mon are set to on your cluster?
> > >
> > > The intent is that when an OSD goes down, it will no longer accept
> > > messenger connection attempts, and peer OSDs will inform the monitor
> > > with a flag indicating the OSD is definitely dead (vs slow or
> > > unresponsive).  This will allow the peering process to skip waiting
> > > for the read lease to time out.  If you're seeing the laggy or
> > > 'waiting for readable' state, then that isn't happening.. probably
> > > because the OSD shutdown isn't working as originally intended.
> > >
> > > If it's not one of those two options, make you can include a 'ceph
> > > config dump' (or jsut a list of the changed values at least) so we
> > > can see what else might be affecting OSD shutdown...
> > >
> > > Thanks!
> > > sage
> > >
> >
> >
> >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: large bucket index in multisite environement (how to deal with large omap objects warning)?

2021-11-04 Thread Teoman Onay
AFAIK dynamic resharding is not supported for multisite setups but you can
reshard manually.
Note that this is a very expensive process which requires you to:

- Disable the sync of the bucket you want to reshard.
- Stop all the RGWs (no more S3 access to your cluster).
- On a node of the master zone, reshard the bucket.
- On the secondary zone, purge the bucket.
- Restart the RGW(s).
- Re-enable sync of the bucket.

4M objects per bucket is way too much...
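
For reference, a rough sketch of the radosgw-admin commands behind the steps
listed above; the bucket name and shard count are placeholders, and the exact
procedure should be checked against the documentation for your release:

```
# On the master zone (with the RGWs stopped):
radosgw-admin bucket sync disable --bucket=BIGBUCKET
radosgw-admin bucket reshard --bucket=BIGBUCKET --num-shards=101

# On the secondary zone(s):
radosgw-admin bucket rm --bucket=BIGBUCKET --purge-objects

# Restart the RGWs, then re-enable sync:
radosgw-admin bucket sync enable --bucket=BIGBUCKET
```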

Regards

Teoman

On Thu, Nov 4, 2021 at 5:57 PM Boris Behrens  wrote:

> Hi everybody,
>
> we maintain three ceph clusters (2x octopus, 1x nautilus) that use three
> zonegroups to sync metadata, without syncing the actual data (only one zone
> per zonegroup).
>
> Some customer got buckets with >4m objects in our largest cluster (the
> other two a very fresh with close to 0 data in it)
>
> How do I handle that in regards of the "Large OMAP objects" warning?
> - Sharding is not an option, because it is a multisite environment (at
> least thats what I read everywere)
> - Limiting the customers is not a great option, because he already got that
> huge amount of files in their buckets
> - disabling the warning / increasing the threashold, is IMHO a bad option
> (people might have put some thinking in that limit and having 40x the limit
> is far off the "just roll with it" threashold)
>
> I really hope that someone does have an answer, or maybe there is some
> roadmap which addresses this issue.
>
> Cheers
>  Boris
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] How to setup radosgw with https on pacific?

2021-11-04 Thread Scharfenberg, Carsten
Hello everybody,

I'm quite new to ceph and I'm facing a myriad of issues trying to use it. So 
I've subscribed to this mailing list. Hopefully you guys can help me with some 
of those issues.

My current goal is to setup a local S3 storage -- i.e. a ceph "cluster" with 
radosgw. In my test environment this is the only purpose of ceph so I get along 
with a single ceph node.

I failed to set up Ceph with cephadm (maybe I'll file an additional request for 
this), so I've installed Proxmox, using its built-in Ceph support. This works 
nicely.
As proxmox does not feature radosgw support I've followed this procedure to set 
it up: https://pve.proxmox.com/wiki/User:Grin/Ceph_Object_Gateway
Because I'm running a single node cluster I had to modify the crushmap: 
https://www.cnblogs.com/boshen-hzb/p/13305560.html

Now I have a running radosgw listening on port 7480. This is the actual 
starting point of this request.

The next step would be to setup https on the radosgw. I followed this 
procedure: https://greenstatic.dev/posts/2020/ssl-tls-rgw-ceph-config/
My current radosgw settings are:

[client.radosgw.pve]
host = pve
keyring = /var/lib/ceph/radosgw/ceph-pve/keyring
log file = /var/log/ceph/client.radosgw.$host.log
rgw_frontends = beast ssl_endpoint=0.0.0.0:7480 
ssl_certificate=config://rgw/cert/terraform/default.crt 
ssl_private_key=config://rgw/cert/terraform/default.key

This is the result in the logs:

2021-11-04T18:05:35.668+0100 7fdf8d2ce6c0  0 framework: beast
2021-11-04T18:05:35.668+0100 7fdf8d2ce6c0  0 framework conf key: 
ssl_certificate, val: config://rgw/cert/$realm/$zone.crt
2021-11-04T18:05:35.668+0100 7fdf8d2ce6c0  0 framework conf key: 
ssl_private_key, val: config://rgw/cert/$realm/$zone.key
2021-11-04T18:05:35.668+0100 7fdf8d2ce6c0  0 starting handler: beast
2021-11-04T18:05:35.668+0100 7fdf8d2ce6c0 -1 ssl_private_key was not found: 
rgw/cert/terraform/default.key
2021-11-04T18:05:35.668+0100 7fdf8d2ce6c0 -1 ssl_private_key was not found: 
rgw/cert/terraform/default.crt
2021-11-04T18:05:35.668+0100 7fdf8d2ce6c0 -1 no ssl_certificate configured for 
ssl_endpoint
2021-11-04T18:05:35.668+0100 7fdf8d2ce6c0 -1 ERROR: failed initializing frontend

The referenced config keys do exist:

root@pve:~# ceph config-key get rgw/cert/terraform/default.crt
-BEGIN CERTIFICATE-
...

root@pve:~# ceph config-key get rgw/cert/terraform/default.key
-BEGIN RSA PRIVATE KEY-
...

Trying to use local files does not improve things:

2021-11-04T18:13:41.680+0100 7f05df2f46c0  0 framework: beast
2021-11-04T18:13:41.680+0100 7f05df2f46c0  0 framework conf key: 
ssl_certificate, val: config://rgw/cert/$realm/$zone.crt
2021-11-04T18:13:41.680+0100 7f05df2f46c0  0 framework conf key: 
ssl_private_key, val: config://rgw/cert/$realm/$zone.key
2021-11-04T18:13:41.680+0100 7f05df2f46c0  0 starting handler: beast
2021-11-04T18:13:41.680+0100 7f0575feb700  0 INFO: RGWReshardLock::lock found 
lock on reshard.02 to be held by another RGW process; skipping for now
2021-11-04T18:13:41.680+0100 7f05df2f46c0 -1 failed to add 
ssl_private_key=/root/default.key: No such file or directory
2021-11-04T18:13:41.680+0100 7f05df2f46c0 -1 failed to use 
ssl_certificate=/root/default.crt as a private key: No such file or directory
2021-11-04T18:13:41.680+0100 7f05df2f46c0 -1 no ssl_certificate configured for 
ssl_endpoint
2021-11-04T18:13:41.680+0100 7f05df2f46c0 -1 ERROR: failed initializing frontend

With:

root@pve:~# cat /root/default.crt
-BEGIN CERTIFICATE-
...

root@pve:~# cat /root/default.key
-BEGIN RSA PRIVATE KEY-
...

For me this behavior looks like a bug, but please correct me if I'm wrong.
So how would I setup https for radosgw?



I've also tried to set up Apache as a TLS endpoint by following these 
instructions: https://docs.ceph.com/en/pacific/man/8/radosgw/
Communication is expected to take place via unix domain sockets, but radosgw 
does not create the socket file, so that does not work either.
Of course the next attempt would be to skip unix domain sockets and listen on 
localhost instead...

BTW: I'm using this software setup:

  *   Proxmox 7.0-11, based on

  *   Debian 11.0 bullseye
  *   Ceph 16.2.6 pacific


I hope anybody can help me.
Regards,

Carsten
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: large bucket index in multisite environement (how to deal with large omap objects warning)?

2021-11-04 Thread Сергей Процун
If sharding is not an option at all, you can increase the
osd_deep_scrub_large_omap_object_key_threshold setting, which is not the best idea.
I would still go with resharding, which might require taking at least the
secondary sites offline. In the future you can set a higher number of shards
during the initial creation of buckets that will store a large number of
objects.
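
If you do go the threshold route despite the caveat above, a minimal sketch
(the value is only an example; the default is 200000 keys):

```
ceph config set osd osd_deep_scrub_large_omap_object_key_threshold 800000
```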

On Thu, Nov 4, 2021 at 19:21, Teoman Onay wrote:

> AFAIK dynamic resharding is not supported for multisite setups but you can
> reshard manually.
> Note that this is a very expensive process which requires you to:
>
> - disable the sync of the bucket you want to reshard.
> - Stops all the RGW (no more access to your Ceph cluster)
> - On a node of the master zone, reshard the bucket
> - On the secondary zone, purge the bucket
> - Restart the RGW(s)
> - re-enable sync of the bucket.
>
> 4m objects/bucket is way to much...
>
> Regards
>
> Teoman
>
> On Thu, Nov 4, 2021 at 5:57 PM Boris Behrens  wrote:
>
> > Hi everybody,
> >
> > we maintain three ceph clusters (2x octopus, 1x nautilus) that use three
> > zonegroups to sync metadata, without syncing the actual data (only one
> zone
> > per zonegroup).
> >
> > Some customer got buckets with >4m objects in our largest cluster (the
> > other two a very fresh with close to 0 data in it)
> >
> > How do I handle that in regards of the "Large OMAP objects" warning?
> > - Sharding is not an option, because it is a multisite environment (at
> > least thats what I read everywere)
> > - Limiting the customers is not a great option, because he already got
> that
> > huge amount of files in their buckets
> > - disabling the warning / increasing the threashold, is IMHO a bad option
> > (people might have put some thinking in that limit and having 40x the
> limit
> > is far off the "just roll with it" threashold)
> >
> > I really hope that someone does have an answer, or maybe there is some
> > roadmap which addresses this issue.
> >
> > Cheers
> >  Boris
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] fresh pacific installation does not detect available disks

2021-11-04 Thread Scharfenberg, Carsten
Hello everybody,

as a Ceph newbie I've tried setting up Ceph Pacific according to the official 
documentation: https://docs.ceph.com/en/latest/cephadm/install/
The intention was to set up a single-node "cluster" with radosgw to provide 
local S3 storage.
This failed because my ceph "cluster" would not detect OSDs.
I started from a Debian 11.1 (bullseye) VM hosted on VMware Workstation. Of 
course I've added some additional disk images to be used as OSDs.
These are the steps I've performed:

curl --silent --remote-name --location 
https://github.com/ceph/ceph/raw/pacific/src/cephadm/cephadm
chmod +x cephadm
./cephadm add-repo --release pacific
./cephadm install

apt install -y cephadm

cephadm bootstrap --mon-ip 

cephadm add-repo --release pacific

cephadm install ceph-common

ceph orch apply osd --all-available-devices


The last command would have no effect. Its sole output is:

Scheduled osd.all-available-devices update...



Also ceph -s shows that no OSDs were added:

  cluster:

id: 655a7a32-3bbf-11ec-920e-000c29da2e6a

health: HEALTH_WARN

OSD count 0 < osd_pool_default_size 1



  services:

mon: 1 daemons, quorum terraformdemo (age 2d)

mgr: terraformdemo.aylzbb(active, since 2d)

osd: 0 osds: 0 up, 0 in (since 2d)



  data:

pools:   0 pools, 0 pgs

objects: 0 objects, 0 B

usage:   0 B used, 0 B / 0 B avail

pgs:


To find out what may be going wrong I've also tried out this:

cephadm install ceph-osd

ceph-volume inventory
This results in a list that makes more sense:

Device Path   Size rotates available Model name

/dev/sdc  20.00 GB TrueTrue  VMware Virtual S

/dev/sde  20.00 GB TrueTrue  VMware Virtual S

/dev/sda  20.00 GB TrueFalse VMware Virtual S

/dev/sdb  20.00 GB TrueFalse VMware Virtual S

/dev/sdd  20.00 GB TrueFalse VMware Virtual S


So how can I convince cephadm to use the available devices?

Regards,
Carsten

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Grafana embed in dashboard no longer functional

2021-11-04 Thread Zach Heise (SSCC)

  
We're using cephadm with all 5 nodes on 16.2.6. Until today,
  grafana has been running only on ceph05.

Before the 16.2.6 update, the embedded frames would pop up an
  expected security error for self-signed certificates, but after
  accepting would work. After the 16.2.6 update, the frames still
  pop up the cert warning, but then after accepting it instead of
  seeing graphs I now get orange text that simply states:
 If you're
  seeing this Grafana has failed to load its application files

  
 1. This could be
  caused by your reverse proxy settings.
  
  2. If you host grafana under subpath make sure your grafana.ini
  root_url setting includes subpath
  
  3. If you have a local dev build make sure you build frontend
  using: yarn start, yarn start:hot, or yarn build

 4. Sometimes
restarting grafana-server can help
I have restarted grafana with ceph orch stop/start grafana
  commands - worked, but no change.
I deleted the grafana daemon entirely from the cluster, then
  recreated it and cephadm placed it on ceph01. Dashboard required a
  disable/enable in order for the new IP address to start being
  used, but after that, same error message.

Grafana is functional and accessible when accessed directly at
  ceph01:3000 and the command ceph dashboard
  get-grafana-api-ssl-verify is showing https://IP:3000 as expected.

I've made no modifications from whatever cephadm has as its
  default grafana/prometheus containerized daemon settings; this
  problem only started showing up after a completely ordinary
  upgrade from 16.2.5 to 16.2.6.
Searching "If you're seeing this Grafana has failed to load its
  application files" with "ceph" query doesn't yield much. We are
  not using a reverse proxy AFAIK.
Thanks for any insight you might have,
-Zach

  

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: fresh pacific installation does not detect available disks

2021-11-04 Thread Zach Heise
Hi Carsten,

When I had problems on my physical hosts (recycled systems that we wanted to
just use in a test cluster) I found that I needed to use sgdisk --zap-all
/dev/sd{letter} to clean all partition maps off the disks before ceph would
recognize them as available. Worth a shot in your case, even though as fresh
virtual volumes they shouldn't have anything on them (yet) anyway.
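
A short sketch of both approaches, assuming /dev/sdX on host <hostname>; both
are destructive to whatever is on the disk:

```
# Wipe the partition table directly on the host
sgdisk --zap-all /dev/sdX

# Or let the orchestrator zap it, which also clears LVM metadata
ceph orch device zap <hostname> /dev/sdX --force
```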

-Original Message-
From: Scharfenberg, Carsten  
Sent: Thursday, November 4, 2021 12:59 PM
To: ceph-users@ceph.io
Subject: [ceph-users] fresh pacific installation does not detect available
disks

Hello everybody,

as ceph newbie I've tried out setting up ceph pacific according to the
official documentation: https://docs.ceph.com/en/latest/cephadm/install/
The intention was to setup a single node "cluster" with radosgw to feature
local S3 storage.
This failed because my ceph "cluster" would not detect OSDs.
I started from a Debain 11.1 (bullseye) VM hosted on VMware workstation. Of
course I've added some additional disk images to be used as OSDs.
These are the steps I've performed:

curl --silent --remote-name --location
https://github.com/ceph/ceph/raw/pacific/src/cephadm/cephadm
chmod +x cephadm
./cephadm add-repo --release pacific
./cephadm install

apt install -y cephadm

cephadm bootstrap --mon-ip 

cephadm add-repo --release pacific

cephadm install ceph-common

ceph orch apply osd --all-available-devices


The last command would have no effect. Its sole output is:

Scheduled osd.all-available-devices update...



Also ceph -s shows that no OSDs were added:

  cluster:

id: 655a7a32-3bbf-11ec-920e-000c29da2e6a

health: HEALTH_WARN

OSD count 0 < osd_pool_default_size 1



  services:

mon: 1 daemons, quorum terraformdemo (age 2d)

mgr: terraformdemo.aylzbb(active, since 2d)

osd: 0 osds: 0 up, 0 in (since 2d)



  data:

pools:   0 pools, 0 pgs

objects: 0 objects, 0 B

usage:   0 B used, 0 B / 0 B avail

pgs:


To find out what may be going wrong I've also tried out this:

cephadm install ceph-osd

ceph-volume inventory
This results in a list that makes more sense:

Device Path   Size rotates available Model name

/dev/sdc  20.00 GB TrueTrue  VMware Virtual S

/dev/sde  20.00 GB TrueTrue  VMware Virtual S

/dev/sda  20.00 GB TrueFalse VMware Virtual S

/dev/sdb  20.00 GB TrueFalse VMware Virtual S

/dev/sdd  20.00 GB TrueFalse VMware Virtual S


So how can I convince cephadm to use the available devices?

Regards,
Carsten

___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email
to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: fresh pacific installation does not detect available disks

2021-11-04 Thread Yury Kirsanov
Hi,
You should erase any partitions or LVM groups on the disks and restart the OSD
hosts so Ceph would be able to detect the drives. I usually just do 'dd
if=/dev/zero of=/dev/ bs=1M count=1024' and then reboot the host to make
sure it will definitely be clean. Or, alternatively, you can zap the
drives, or you can just remove LVM groups using pvremove or remove
partitions using fdisk.

Regards,
Yury.

On Fri, 5 Nov 2021, 07:24 Zach Heise,  wrote:

> Hi Carsten,
>
> When I had problems on my physical hosts (recycled systems that we wanted
> to
> just use in a test cluster) I found that I needed to use sgdisk --zap-all
> /dev/sd{letter} to clean all partition maps off the disks before ceph would
> recognize them as available. Worth a shot in your case, even though as
> fresh
> virtual volumes they shouldn't have anything on them (yet) anyway.
>
> -Original Message-
> From: Scharfenberg, Carsten 
> Sent: Thursday, November 4, 2021 12:59 PM
> To: ceph-users@ceph.io
> Subject: [ceph-users] fresh pacific installation does not detect available
> disks
>
> Hello everybody,
>
> as ceph newbie I've tried out setting up ceph pacific according to the
> official documentation: https://docs.ceph.com/en/latest/cephadm/install/
> The intention was to setup a single node "cluster" with radosgw to feature
> local S3 storage.
> This failed because my ceph "cluster" would not detect OSDs.
> I started from a Debain 11.1 (bullseye) VM hosted on VMware workstation. Of
> course I've added some additional disk images to be used as OSDs.
> These are the steps I've performed:
>
> curl --silent --remote-name --location
> https://github.com/ceph/ceph/raw/pacific/src/cephadm/cephadm
> chmod +x cephadm
> ./cephadm add-repo --release pacific
> ./cephadm install
>
> apt install -y cephadm
>
> cephadm bootstrap --mon-ip 
>
> cephadm add-repo --release pacific
>
> cephadm install ceph-common
>
> ceph orch apply osd --all-available-devices
>
>
> The last command would have no effect. Its sole output is:
>
> Scheduled osd.all-available-devices update...
>
>
>
> Also ceph -s shows that no OSDs were added:
>
>   cluster:
>
> id: 655a7a32-3bbf-11ec-920e-000c29da2e6a
>
> health: HEALTH_WARN
>
> OSD count 0 < osd_pool_default_size 1
>
>
>
>   services:
>
> mon: 1 daemons, quorum terraformdemo (age 2d)
>
> mgr: terraformdemo.aylzbb(active, since 2d)
>
> osd: 0 osds: 0 up, 0 in (since 2d)
>
>
>
>   data:
>
> pools:   0 pools, 0 pgs
>
> objects: 0 objects, 0 B
>
> usage:   0 B used, 0 B / 0 B avail
>
> pgs:
>
>
> To find out what may be going wrong I've also tried out this:
>
> cephadm install ceph-osd
>
> ceph-volume inventory
> This results in a list that makes more sense:
>
> Device Path   Size rotates available Model name
>
> /dev/sdc  20.00 GB TrueTrue  VMware Virtual S
>
> /dev/sde  20.00 GB TrueTrue  VMware Virtual S
>
> /dev/sda  20.00 GB TrueFalse VMware Virtual S
>
> /dev/sdb  20.00 GB TrueFalse VMware Virtual S
>
> /dev/sdd  20.00 GB TrueFalse VMware Virtual S
>
>
> So how can I convince cephadm to use the available devices?
>
> Regards,
> Carsten
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email
> to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: fresh pacific installation does not detect available disks

2021-11-04 Thread Сергей Процун
Hello,

I agree with that point. When Ceph creates LVM volumes it adds LVM tags to
them. That's how Ceph finds that they are occupied by Ceph. So you
should remove the LVM volumes and, even better, clean all data on those LVM
volumes. Usually it's enough to clean just the head of the LVM partition, where
it stores the volume metadata itself.
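
A hedged sketch of inspecting the tags and clearing a previously used device
(the device path is a placeholder; zap --destroy wipes the disk):

```
# Show the ceph.* LVM tags that mark a volume as belonging to an OSD
lvs -o +lv_tags | grep ceph

# Remove the LVM volumes and wipe the device header so it shows up as available
ceph-volume lvm zap --destroy /dev/sdX
```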

---
Sergey Protsun


On Thu, Nov 4, 2021 at 22:29, Yury Kirsanov wrote:

> Hi,
> You should erase any partitions or LVM groups on the disks and restart OSD
> hosts so CEPH would be able to detect drives. I usually just do 'dd
> if=/dev/zero of=/dev/ bs=1M count=1024' and then reboot host to make
> sure it will definitely be clean. Or, alternatively, you can zap the
> drives, or you can just remove LVM groups using pvremove or remove
> patitions using fdisk.
>
> Regards,
> Yury.
>
> On Fri, 5 Nov 2021, 07:24 Zach Heise,  wrote:
>
> > Hi Carsten,
> >
> > When I had problems on my physical hosts (recycled systems that we wanted
> > to
> > just use in a test cluster) I found that I needed to use sgdisk --zap-all
> > /dev/sd{letter} to clean all partition maps off the disks before ceph
> would
> > recognize them as available. Worth a shot in your case, even though as
> > fresh
> > virtual volumes they shouldn't have anything on them (yet) anyway.
> >
> > -Original Message-
> > From: Scharfenberg, Carsten 
> > Sent: Thursday, November 4, 2021 12:59 PM
> > To: ceph-users@ceph.io
> > Subject: [ceph-users] fresh pacific installation does not detect
> available
> > disks
> >
> > Hello everybody,
> >
> > as ceph newbie I've tried out setting up ceph pacific according to the
> > official documentation: https://docs.ceph.com/en/latest/cephadm/install/
> > The intention was to setup a single node "cluster" with radosgw to
> feature
> > local S3 storage.
> > This failed because my ceph "cluster" would not detect OSDs.
> > I started from a Debain 11.1 (bullseye) VM hosted on VMware workstation.
> Of
> > course I've added some additional disk images to be used as OSDs.
> > These are the steps I've performed:
> >
> > curl --silent --remote-name --location
> > https://github.com/ceph/ceph/raw/pacific/src/cephadm/cephadm
> > chmod +x cephadm
> > ./cephadm add-repo --release pacific
> > ./cephadm install
> >
> > apt install -y cephadm
> >
> > cephadm bootstrap --mon-ip 
> >
> > cephadm add-repo --release pacific
> >
> > cephadm install ceph-common
> >
> > ceph orch apply osd --all-available-devices
> >
> >
> > The last command would have no effect. Its sole output is:
> >
> > Scheduled osd.all-available-devices update...
> >
> >
> >
> > Also ceph -s shows that no OSDs were added:
> >
> >   cluster:
> >
> > id: 655a7a32-3bbf-11ec-920e-000c29da2e6a
> >
> > health: HEALTH_WARN
> >
> > OSD count 0 < osd_pool_default_size 1
> >
> >
> >
> >   services:
> >
> > mon: 1 daemons, quorum terraformdemo (age 2d)
> >
> > mgr: terraformdemo.aylzbb(active, since 2d)
> >
> > osd: 0 osds: 0 up, 0 in (since 2d)
> >
> >
> >
> >   data:
> >
> > pools:   0 pools, 0 pgs
> >
> > objects: 0 objects, 0 B
> >
> > usage:   0 B used, 0 B / 0 B avail
> >
> > pgs:
> >
> >
> > To find out what may be going wrong I've also tried out this:
> >
> > cephadm install ceph-osd
> >
> > ceph-volume inventory
> > This results in a list that makes more sense:
> >
> > Device Path   Size rotates available Model name
> >
> > /dev/sdc  20.00 GB TrueTrue  VMware Virtual S
> >
> > /dev/sde  20.00 GB TrueTrue  VMware Virtual S
> >
> > /dev/sda  20.00 GB TrueFalse VMware Virtual S
> >
> > /dev/sdb  20.00 GB TrueFalse VMware Virtual S
> >
> > /dev/sdd  20.00 GB TrueFalse VMware Virtual S
> >
> >
> > So how can I convince cephadm to use the available devices?
> >
> > Regards,
> > Carsten
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> email
> > to ceph-users-le...@ceph.io
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Grafana embed in dashboard no longer functional

2021-11-04 Thread Zach Heise (SSCC)

  
Argh - that was it. Tested in Microsoft Edge and it worked fine. I was using
Firefox as my primary browser, and the "enhanced tracking protection" setting
was the issue killing the iframe loading. Once I disabled that for the mgr
daemon's URL the embeds started loading correctly.

All good now on my end. I tried from multiple computers in my testing - but
on Firefox both times.


On 2021-11-04 4:13 PM, Ernesto Puerta wrote:

> Hi Zach,
>
> Just tried and I cannot reproduce this issue. I see a "Grafana v6.7.4
> (8e44bbc5f5)" version.
>
> Can you please check if you hit this with other browsers?
>
> In order to debug this issue, it'd be helpful if you can open your
> browser's dev-tools (CTRL+SHIFT+i), click on the "Network" tab and check
> the network requests the Ceph-Dashboard is sending to Grafana.
>
> Also any messages in the JavaScript console would be useful.
>
> Kind Regards,
> Ernesto
>
>
> On Thu, Nov 4, 2021 at 8:30 PM Zach Heise (SSCC) wrote:
>
>> We're using cephadm with all 5 nodes on 16.2.6. Until today, grafana has
>> been running only on ceph05.
>>
>> Before the 16.2.6 update, the embedded frames would pop up an expected
>> security error for self-signed certificates, but after accepting would
>> work. After the 16.2.6 update, the frames still pop up the cert warning,
>> but after accepting it, instead of seeing graphs I now get orange text
>> that simply states:
>>
>> "If you're seeing this Grafana has failed to load its application files
>>
>> 1. This could be caused by your reverse proxy settings.
>> 2. If you host grafana under subpath make sure your grafana.ini root_url
>> setting includes subpath
>> 3. If you have a local dev build make sure you build frontend using: yarn
>> start, yarn start:hot, or yarn build
>> 4. Sometimes restarting grafana-server can help"
>>
>> I have restarted grafana with ceph orch stop/start grafana commands -
>> worked, but no change.
>>
>> I deleted the grafana daemon entirely from the cluster, then recreated it
>> and cephadm placed it on ceph01. Dashboard required a disable/enable in
>> order for the new IP address to start being used, but after that, same
>> error message.
>>
>> Grafana is functional and accessible when accessed directly at
>> ceph01:3000, and the command ceph dashboard get-grafana-api-ssl-verify is
>> showing https://IP:3000 as expected.
>>
>> I've made no modifications from whatever cephadm has as its default
>> grafana/prometheus containerized daemon settings; this problem only
>> started showing up after a completely ordinary upgrade from 16.2.5 to
>> 16.2.6.
>>
>> Searching "If you're seeing this Grafana has failed to load its
>> application files" with "ceph" query doesn't yield much. We are not using
>> a reverse proxy AFAIK.
>>
>> Thanks for any insight you might have,
>> -Zach
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

2021-11-04 Thread Sage Weil
Can you try setting paxos_propose_interval to a smaller number, like .3 (by
default it is 2 seconds) and see if that has any effect.

It sounds like the problem is not related to getting the OSD marked down
(or at least that is not the only thing going on).  My next guess is that
the peering process that follows needs to get OSDs' up_thru values to
update and there is delay there.
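
For reference, a minimal sketch of applying that suggestion through the
central config store (a monitor restart may be needed for it to take effect):

```
# Lower the proposal interval from the 2-second default to 0.3 seconds
ceph config set mon paxos_propose_interval 0.3
ceph config get mon paxos_propose_interval
```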

Thanks!
sage


On Thu, Nov 4, 2021 at 4:15 AM Manuel Lausch  wrote:

> On Tue, 2 Nov 2021 09:02:31 -0500
> Sage Weil  wrote:
>
>
> >
> > Just to be clear, you should try
> >   osd_fast_shutdown = true
> >   osd_fast_shutdown_notify_mon = false
>
> I added some logs to the tracker ticket with this options set.
>
>
> > You write if the osd rejects messenger connections, because it is
> > > stopped, the peering process will skip the read_lease timeout. If
> > > the OSD announces its shutdown, can we not skip this read_lease
> > > timeout as well?
> > >
> >
> > If memory serves, yes, but the notify_mon process can take more time
> > than a peer OSD getting ECONNREFUSED.  The combination above is the
> > recommended combination (and the default).
>
> In my tests yesterday I again saw that it took about 2 seconds between
> stopping an OSD and the first blame entry in ceph.log.
> With the notification enabled, I got the down message immediately.
>
>
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] One cephFS snapshot kills performance

2021-11-04 Thread Sebastian Mazza
Hi all!

I'm new to CephFS. My test file system uses a replicated pool on NVMe SSDs for 
metadata and an erasure-coded pool on HDDs for data. All OSDs use bluestore.
I used Ceph version 16.2.6 for all daemons - created with this version and 
running this version. The Linux kernel that I used for mounting CephFS is from 
Debian: Linux file2 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) x86_64 
GNU/Linux

Until I created the first snapshot (e.g. `mkdir 
/mnt/shares/users/.snap/test-01`), the performance of the CephFS (mount point: 
`/mnt/shares`) seemed fine to me. I first noticed the performance problem while 
I was re-syncing a directory with `rsync`, since the re-sync / update took 
longer than the initial `rsync` run. After multiple days of investigation, I'm 
rather sure that the performance problem is directly related to snapshots. I 
first experienced the problem with `rsync`, but it can also be observed with a 
simple execution of `du`. Therefore, I guess that some sort of "stat" call in 
combination with snapshots is responsible for the bad performance.

My test folder `/mnt/shares/backup-remote/` contains lots of small files and many 
hard links in many subfolders.

After a restart of the whole cluster and the client, and without a single 
snapshot in the whole file system, a run of `du` takes 4m 17s. When all the 
OSD, MON and client caches are warmed, the same `du` takes only 12s. After 
unmounting and mounting the CephFS again, which should empty all the client caches but 
keep the caches on the OSD and MON side warm, the execution of `du` takes 1m 56s. 
These runtimes are all perfectly fine for me.

However, if I take a single snapshot in another folder (e.g. `mkdir 
/mnt/shares/users/.snap/test-01`) that is not even related to the 
`/mnt/shares/backup-remote/` test folder, the runtime of `du` with cold client 
caches jumps to 19m 42s. An immediate second run of `du` takes only 12s, but 
after unmounting and mounting the CephFS it again takes nearly 20 minutes. That 
is 10 times longer than without a single snapshot. I need to do a bit more 
testing, but at the moment it looks like every further snapshot adds around 
1 minute of additional runtime.
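
A minimal reproduction sketch based on the description above, assuming
/mnt/shares is the CephFS mount (defined in /etc/fstab) and backup-remote is
the test directory:

```
umount /mnt/shares && mount /mnt/shares   # cold client caches
time du -sh /mnt/shares/backup-remote     # ~2 min without snapshots (per the report)

mkdir /mnt/shares/users/.snap/test-01     # snapshot in an unrelated folder
umount /mnt/shares && mount /mnt/shares
time du -sh /mnt/shares/backup-remote     # ~20 min with a single snapshot

rmdir /mnt/shares/users/.snap/test-01     # snapshots are removed with rmdir
```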

During such a run of `du` with a snapshot anywhere in the file system, all the 
Ceph daemons seem to be bored, and the OSDs do hardly any IO. The only thing 
in the system that I can find that looks busy is a kernel worker of the client 
that mounts the FS and runs `du`. A process named "kworker/0:1+ceph-msgr" is 
constantly near 100% CPU usage. The fact that the kernel seems to spend all the 
time in a function called "ceph_update_snap_trace" makes me even more confident 
that the problem is a result of snapshots.

Kernel Stack Trace examples (`echo l > /proc/sysrq-trigger` and `dmesg`)
--
[11316.757494] Call Trace:
[11316.757494]  ceph_queue_cap_snap+0x37/0x4e0 [ceph]
[11316.757496]  ? ceph_put_snap_realm+0x28/0xd0 [ceph]
[11316.757497]  ceph_update_snap_trace+0x3f0/0x4f0 [ceph]
[11316.757498]  dispatch+0x79d/0x1520 [ceph]
[11316.757499]  ceph_con_workfn+0x1a5f/0x2850 [libceph]
[11316.757500]  ? finish_task_switch+0x72/0x250
[11316.757502]  process_one_work+0x1b6/0x350
[11316.757503]  worker_thread+0x53/0x3e0
[11316.757504]  ? process_one_work+0x350/0x350
[11316.757505]  kthread+0x11b/0x140
[11316.757506]  ? __kthread_bind_mask+0x60/0x60
[11316.757507]  ret_from_fork+0x22/0x30
--
[36120.030685] Call Trace:
[36120.030686]  sort_r+0x173/0x210
[36120.030687]  build_snap_context+0x115/0x260 [ceph]
[36120.030688]  rebuild_snap_realms+0x23/0x70 [ceph]
[36120.030689]  rebuild_snap_realms+0x3d/0x70 [ceph]
[36120.030690]  ceph_update_snap_trace+0x2eb/0x4f0 [ceph]
[36120.030691]  dispatch+0x79d/0x1520 [ceph]
[36120.030692]  ceph_con_workfn+0x1a5f/0x2850 [libceph]
[36120.030693]  ? finish_task_switch+0x72/0x250
[36120.030694]  process_one_work+0x1b6/0x350
[36120.030695]  worker_thread+0x53/0x3e0
[36120.030695]  ? process_one_work+0x350/0x350
[36120.030696]  kthread+0x11b/0x140
[36120.030697]  ? __kthread_bind_mask+0x60/0x60
[36120.030698]  ret_from_fork+0x22/0x30
[36120.030960] NMI backtrace for cpu 3 skipped: idling at 
native_safe_halt+0xe/0x10
--

Deleting all snapshots does not restore the original performance. Only after a 
recursive copy (with rsync) of the whole `backup-remote` folder to a new 
location, and using this new folder for `du`, is the performance as it was 
before taking the first snapshot.

Related issue reports I have found:
* https://tracker.ceph.com/issues/44100?next_issue_id=44099
* 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/IDMLNQMFGTJRR5QXFZ2YAYPN67UZH4Q4/


I would be very interested in an explanation for this behaviour. Of course I 
would be very thankful for a solution to the problem or any advice that could 
help.


Thanks in advance.

Best wishes,
Sebastian

[ceph-users] Optimal Erasure Code profile?

2021-11-04 Thread Zakhar Kirpichenko
Hi!

I've got a CEPH 16.2.6 cluster, the hardware is 6 x Supermicro SSG-6029P
nodes, each equipped with:

2 x Intel(R) Xeon(R) Gold 5220R CPUs
384 GB RAM
2 x boot drives
2 x 1.6 TB enterprise NVME drives (DB/WAL)
2 x 6.4 TB enterprise drives (storage tier)
9 x 9TB HDDs (storage tier)
2 x Intel XL710 NICs connected to a pair of 40/100GE switches

Please help me understand the calculation / choice of the optimal EC
profile for this setup. I would like the EC pool to span all 6 nodes on HDD
only and have the optimal combination of resiliency and efficiency with the
view that the cluster will expand. Previously when I had only 3 nodes I
tested EC with:

crush-device-class=hdd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=2
m=1
plugin=jerasure
technique=reed_sol_van
w=8

I am leaning towards using the above profile with k=4,m=2 for "production"
use, but am not sure that I understand the math correctly, that this
profile is optimal for my current setup, and that I'll be able to scale it
properly by adding new nodes. I would very much appreciate any advice!
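
For what it's worth, a sketch of creating a k=4,m=2 profile restricted to HDDs
with a host failure domain (profile, pool names and PG counts are
placeholders). Note that with 6 hosts and k+m=6, every host holds one chunk,
so there is no headroom to re-place chunks while a host is down until the
cluster grows:

```
ceph osd erasure-code-profile set ec-4-2-hdd \
    k=4 m=2 \
    plugin=jerasure technique=reed_sol_van \
    crush-failure-domain=host crush-device-class=hdd

ceph osd pool create ec-data 128 128 erasure ec-4-2-hdd
ceph osd pool set ec-data allow_ec_overwrites true   # needed for RBD/CephFS on EC pools
```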

Best regards,
Zakhar
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Are setting 'ceph auth caps' and/or adding a cache pool I/O-disruptive operations?

2021-11-04 Thread Zakhar Kirpichenko
Hi,

I'm trying to figure out if setting auth caps and/or adding a cache pool
are I/O-disruptive operations, i.e. if caps reset to 'none' for a brief
moment or client I/O momentarily stops for other reasons.

For example, I had the following auth setting in my 16.2.x cluster:

client.cinder
key: BLA
caps: [mgr] profile rbd pool=volumes, profile rbd
pool=volumes-nvme, profile rbd pool=ec-volumes-meta, profile rbd
pool=ec-volumes-data, profile rbd pool=vms
caps: [mon] profile rbd
caps: [osd] profile rbd pool=volumes, profile rbd
pool=volumes-nvme, profile rbd pool=ec-volumes-meta, profile rbd
pool=ec-volumes-data, profile rbd pool=vms, profile rbd-read-only
pool=images

I executed the following command to grant client 'cinder' access to the
'volume-cache' pool:

ceph auth caps client.cinder mgr 'profile rbd pool=volumes, profile rbd
pool=volumes-nvme, profile rbd pool=ec-volumes-meta, profile rbd
pool=ec-volumes-data, profile rbd pool=vms, profile rbd pool=volumes-cache'
mon 'profile rbd' osd 'profile rbd pool=volumes, profile rbd
pool=volumes-nvme, profile rbd pool=ec-volumes-meta, profile rbd
pool=ec-volumes-data, profile rbd pool=vms, profile rbd-read-only
pool=images, profile rbd pool=volumes-cache'

I.e., just added the necessary access to the 'volumes-cache' pool, nothing
else. And then I set the 'volumes-cache' pool as an overlay for 'volumes'
('volumes-cache' was previously set up as a writeback cache tier for
'volumes'):

ceph osd tier set-overlay volumes volumes-cache

One of these operations, i.e. either 'ceph auth caps' or 'ceph osd tier
set-overlay', resulted in a brief interruption of client I/O towards the
'volumes' pool, which caused some VMs  (qemu, librbd) running on clients to
lose their virtual disks. I'm not sure which one, and now I'm overly
cautious about touching either of these things :-)

I would very much appreciate any advice!

Best regards,
Zakhar
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Stale monitoring alerts in UI

2021-11-04 Thread Zakhar Kirpichenko
Hi,

I seem to have some stale monitoring alerts in my Mgr UI, which do not want
to go away. For example (I'm also attaching an image for your convenience):

MTU Mismatch: Node ceph04 has a different MTU size (9000) than the median
value on device storage-int.

The alert appears to be active, but doesn't reflect the actual situation:

06:00 [root@ceph04 ~]# ip li li | grep -E "ens2f0|ens3f0|8:
bond0|storage-int"
4: ens3f0:  mtu 9000 qdisc mq master
bond0 state UP mode DEFAULT group default qlen 1000
6: ens2f0:  mtu 9000 qdisc mq master
bond0 state UP mode DEFAULT group default qlen 1000
8: bond0:  mtu 9000 qdisc noqueue
state UP mode DEFAULT group default qlen 1000
10: storage-int@bond0:  mtu 9000 qdisc
noqueue state UP mode DEFAULT group default qlen 1000

I have similarly stuck alerts about 'high pg count deviation', which
triggered during the cluster rebalance but somehow never cleared, despite
all operations having finished successfully and the CLI tools reporting that
the cluster is healthy. How can I clear these alerts?

I would very much appreciate any advice.

Best regards,
Zakhar
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Are setting 'ceph auth caps' and/or adding a cache pool I/O-disruptive operations?

2021-11-04 Thread Zakhar Kirpichenko
Yes, it was an attempt to address poor performance, which didn't go well.

Btw, this isn't the first time I'm reading that the cache tier is "kind of
deprecated", but the documentation doesn't really say this; it explains how
to build a cache tier instead. Perhaps it should be made clearer that
adding a cache tier isn't a good idea.
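
In case it helps with backing the tier out again, a hedged sketch using the
pool names from the earlier message (flush and evict before removing the
overlay; check the docs for your release first):

```
ceph osd tier cache-mode volumes-cache proxy    # stop absorbing new writes
rados -p volumes-cache cache-flush-evict-all    # flush dirty objects to 'volumes'
ceph osd tier remove-overlay volumes
ceph osd tier remove volumes volumes-cache
```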

/Z

On Fri, Nov 5, 2021 at 7:55 AM Anthony D'Atri 
wrote:

> Cache tiers are kind of deprecated, they’re finicky and easy to run into
> trouble with.  Suggest avoiding.
>
> > On Nov 4, 2021, at 10:42 PM, Zakhar Kirpichenko 
> wrote:
> >
> > Hi,
> >
> > I'm trying to figure out if setting auth caps and/or adding a cache pool
> > are I/O-disruptive operations, i.e. if caps reset to 'none' for a brief
> > moment or client I/O momentarily stops for other reasons.
> >
> > For example, I had the following auth setting in my 16.2.x cluster:
> >
> > client.cinder
> >key: BLA
> >caps: [mgr] profile rbd pool=volumes, profile rbd
> > pool=volumes-nvme, profile rbd pool=ec-volumes-meta, profile rbd
> > pool=ec-volumes-data, profile rbd pool=vms
> >caps: [mon] profile rbd
> >caps: [osd] profile rbd pool=volumes, profile rbd
> > pool=volumes-nvme, profile rbd pool=ec-volumes-meta, profile rbd
> > pool=ec-volumes-data, profile rbd pool=vms, profile rbd-read-only
> > pool=images
> >
> > I executed the following command to grant client 'cinder' access to the
> > 'volume-cache' pool:
> >
> > ceph auth caps client.cinder mgr 'profile rbd pool=volumes, profile rbd
> > pool=volumes-nvme, profile rbd pool=ec-volumes-meta, profile rbd
> > pool=ec-volumes-data, profile rbd pool=vms, profile rbd
> pool=volumes-cache'
> > mon 'profile rbd' osd 'profile rbd pool=volumes, profile rbd
> > pool=volumes-nvme, profile rbd pool=ec-volumes-meta, profile rbd
> > pool=ec-volumes-data, profile rbd pool=vms, profile rbd-read-only
> > pool=images, profile rbd pool=volumes-cache'
> >
> > I.e., just added the necessary access to the 'volumes-cache' pool,
> nothing
> > else. And then I set the 'volumes-cache' pool as an overlay for 'volumes'
> > ('volumes-cache' was previously set up as a writeback cache tier for
> > 'volumes'):
> >
> > ceph osd tier set-overlay volumes volumes-cache
> >
> > One of these operations, i.e. either 'ceph auth caps' or 'ceph osd tier
> > set-overlay', resulted in a brief interruption of client I/O towards the
> > 'volumes' pool, which caused some VMs  (qemu, librbd) running on clients
> to
> > lose their virtual disks. I'm not sure which one, and now I'm overly
> > cautious about touching either of these things :-)
> >
> > I would very much appreciate any advice!
> >
> > Best regards,
> > Zakhar
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io