[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-01-24 Thread Igor Fedotov

Hey Sebastian,

thanks a lot for the update, please see more questions inline.


Thanks,

Igor

On 1/22/2022 2:13 AM, Sebastian Mazza wrote:

Hey Igor,

thank you for your response and your suggestions.


I've tried to simulate every imaginable load that the cluster might have handled
before the three OSDs crashed.
I rebooted the servers many times while the cluster was under load. If more than
a single node was rebooted at the same time, the clients hang until enough
servers are up again. Which is perfectly fine!
I really tried hard to crash it, but I failed. Which is excellent in general,
but unfortunately not helpful for finding the root cause of the problem with
the corrupted RocksDBs.

And you haven't made any environment/config changes, e.g. disabling the disk
cache, since the last issue, right?

There is an environmental change, since I'm currently missing one of my two
ethernet switches for the cluster. The switches (should) provide an MLAG for
every server, so every server uses a Linux interface bond that is connected
with one cable to each switch. However, one of the switches is currently out
for RMA because it sporadically failed to (re)boot. I did not change anything
in the network config of the servers, but of course the Linux bond driver is
currently not able to balance the network traffic across two links, since only
one is active. Could this have an influence?
Apart from disconnecting half of the network cables I did not change anything.
All the HDDs are the same and are inserted into the same drive bays.

Configuration-wise I'm not aware of any change. I only destroyed and recreated
the 3 failed OSDs.

I have now checked the write cache settings of all HDDs with `hdparm -W /dev/sdX`,
which always returns "write-caching =  1 (on)".
I also checked the OSD setting "bluefs_buffered_io" with `ceph daemon osd.X
config show | grep bluefs_buffered_io`, which returned true for all OSDs.
I'm pretty sure that all of these caches were always on.


Do you suggest disabling the HDD write caching and/or bluefs_buffered_io
for production clusters?

Generally the upstream recommendation is to disable disk write caching;
there have been multiple complaints that it might negatively impact performance
in some setups.
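For reference, a minimal sketch of switching the volatile write cache off on a
single drive - /dev/sdX is a placeholder, and on many drives the setting does
not survive a power cycle, so it is usually reapplied via a udev rule or a boot
script:

hdparm -W 0 /dev/sdX          # disable the drive's volatile write cache
smartctl -g wcache /dev/sdX   # verify; smartctl -s wcache,off works too, e.g. for SAS drives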


As for bluefs_buffered_io - please keep it on; disabling it is known
to cause a performance drop.





When rebooting a node - did you perform it via a regular OS command (reboot or
poweroff) or via a power switch?

I never did a hard reset or used the power switch. I used `init 6` to perform
a reboot. Each server has redundant power supplies, with one connected to a
battery backup and the other to the grid. Therefore, I don't think that any of
the servers ever faced an unclean shutdown or reboot.

So the original reboot which caused the failures was made in the same 
manner, right?

Best regards,
Sebastian


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph RGW 16.2.7 CLI changes

2022-01-24 Thread Александр Махов
I am trying to run a new Ceph cluster with RADOS GW using the latest software
version, 16.2.7, but when I set up the RGW nodes I found out there are some
changes in the CLI compared with version 16.2.4, which I tested before.

The following commands are missing in the 16.2.7 version:

ceph dashboard set-rgw-api-user-id $USER
ceph dashboard set-rgw-api-access-key ...
ceph dashboard set-rgw-api-secret-key ...

They don't exist in the `ceph dashboard -h` output on 16.2.7:

# ceph dashboard -h | grep set-rgw-api | grep -v reset
dashboard set-rgw-api-access-key      Set the RGW_API_ACCESS_KEY option value read from -i
dashboard set-rgw-api-admin-resource  Set the RGW_API_ADMIN_RESOURCE option value
dashboard set-rgw-api-secret-key      Set the RGW_API_SECRET_KEY option value read from -i
dashboard set-rgw-api-ssl-verify      Set the RGW_API_SSL_VERIFY option value

But on 16.2.4 everything is in place:

# ceph dashboard -h | grep set-rgw-api | grep -v reset
dashboard set-rgw-api-access-key      Set the RGW_API_ACCESS_KEY option value read from -i
dashboard set-rgw-api-admin-resource  Set the RGW_API_ADMIN_RESOURCE option value
dashboard set-rgw-api-host            Set the RGW_API_HOST option value
dashboard set-rgw-api-port            Set the RGW_API_PORT option value
dashboard set-rgw-api-scheme          Set the RGW_API_SCHEME option value
dashboard set-rgw-api-secret-key      Set the RGW_API_SECRET_KEY option value read from -i
dashboard set-rgw-api-ssl-verify      Set the RGW_API_SSL_VERIFY option value
dashboard set-rgw-api-user-id         Set the RGW_API_USER_ID option value


In both cases the host OS is Debian 10.

The set of installed ceph packages is the same on both RGW setups. 16.2.4:

# dpkg -l | grep ceph
ii  ceph                   16.2.4-1~bpo10+1  amd64  distributed storage and file system
ii  ceph-base              16.2.4-1~bpo10+1  amd64  common ceph daemon libraries and management tools
ii  ceph-common            16.2.4-1~bpo10+1  amd64  common utilities to mount and interact with a ceph storage cluster
ii  ceph-mgr               16.2.4-1~bpo10+1  amd64  manager for the ceph distributed storage system
ii  ceph-mgr-modules-core  16.2.4-1~bpo10+1  all    ceph manager modules which are always enabled
ii  ceph-mon               16.2.4-1~bpo10+1  amd64  monitor server for the ceph storage system
ii  ceph-osd               16.2.4-1~bpo10+1  amd64  OSD server for the ceph storage system
ii  libcephfs2             16.2.4-1~bpo10+1  amd64  Ceph distributed file system client library
ii  libsqlite3-mod-ceph    16.2.4-1~bpo10+1  amd64  SQLite3 VFS for Ceph
ii  python3-ceph-argparse  16.2.4-1~bpo10+1  all    Python 3 utility libraries for Ceph CLI
ii  python3-ceph-common    16.2.4-1~bpo10+1  all    Python 3 utility libraries for Ceph
ii  python3-cephfs         16.2.4-1~bpo10+1  amd64  Python 3 libraries for the Ceph libcephfs library

# dpkg -l | grep rados
ii  librados2         16.2.4-1~bpo10+1  amd64  RADOS distributed object store client library
ii  libradosstriper1   16.2.4-1~bpo10+1  amd64  RADOS striping interface
ii  python3-rados      16.2.4-1~bpo10+1  amd64  Python 3 libraries for the Ceph librados library
ii  radosgw            16.2.4-1~bpo10+1  amd64  REST gateway for RADOS distributed object store

16.2.7:

# dpkg -l | grep ceph
ii  ceph                   16.2.7-1~bpo10+1  amd64  distributed storage and file system
ii  ceph-base              16.2.7-1~bpo10+1  amd64  common ceph daemon libraries and management tools
ii  ceph-common            16.2.7-1~bpo10+1  amd64  common utilities to mount and interact with a ceph storage cluster
ii  ceph-mgr               16.2.7-1~bpo10+1  amd64  manager for the ceph distributed storage system
ii  ceph-mgr-modules-core  16.2.7-1~bpo10+1  all    ceph manager modules which are always enabled
ii  ceph-mon               16.2.7-1~bpo10+1  amd64  monitor server for the ceph storage system
ii  ceph-osd               16.2.7-1~bpo10+1  amd64  OSD server for the ceph storage system
ii  libcephfs2             16.2.7-1~bpo10+1  amd64  Ceph distributed file system client library
ii  libsqlite3-mod-ceph    16.2.7-1~bpo10+1  amd64  SQLite3 VFS for Ceph
ii  python3-ceph-argparse  16.2.7-1~bpo10+1  all    Python 3 utility l

[ceph-users] PG_SLOW_SNAP_TRIMMING and possible storage leakage on 16.2.5

2022-01-24 Thread David Prude
Hello,

   We have a 5-node, 30 hdd (6 hdds/node) cluster running 16.2.5. We
utilize a snapshot scheme within cephfs that results in 24 hourly
snapshots, 7 daily snapshots, and 2 weekly snapshots. This has been
running without overt issues for several months. As of this weekend, we
started receiving a  PG_SLOW_SNAP_TRIMMING warning on a single PG. Over
the last 24 hours we are now seeing that this warning is associated with
123 of our 1513 PGs. As recommended by the output of "ceph health
detail" we have tried tuning the following from their default values:

osd_pg_max_concurrent_snap_trims=4 (default 2)
osd_snap_trim_sleep_hdd=3 (default 5)
osd_snap_trim_sleep=0.5 (default 0, it was suggested somewhere in a
search that 0 actually disables trim?)

I am uncertain how to best measure if the above is having an effect on
the trimming process. I am unclear on how to clearly monitor the
progress of the snaptrim process or even of the total queue depth.
Interestingly, "ceph pg stat" does not show any PGs in the snaptrim state:

SNIP
1513 pgs: 2 active+clean+scrubbing+deep, 1511 active+clean; 114 TiB
data, 344 TiB used, 93 TiB / 437 TiB avail; 6.2 KiB/s rd, 2.2 MiB/s wr,
118 op/s

SNIP

We have, for the time being, disabled our snapshots in the hopes that
the cluster will catch up with the trimming process. Two potential
things of note:

1. We are unaware of any particular action which would be associated
with this happening now (there were no unusual deletions of either live
data or snapshots).
2. For the past month or two it has appeared as if there has been a
steady, unchecked growth in storage utilization, as if snapshots were not
actually being trimmed.

Any assistance in determining what exactly has prompted this behavior or
any guidance on how to evaluate the total snaptrim queue size to see if
we are making progress would be much appreciated.

Thank you,

-David 

-- 
David Prude
Systems Administrator
PGP Fingerprint: 1DAA 4418 7F7F B8AA F50C  6FDF C294 B58F A286 F847
Democracy Now!
www.democracynow.org


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph RGW 16.2.7 CLI changes

2022-01-24 Thread Ernesto Puerta
Hi Александр,

Starting with Pacific 16.2.6, cephadm now configures and manages the RGW
credentials. You can also trigger that auto-configuration on an upgraded
cluster with `ceph dashboard set-rgw-credentials` [docs].
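A rough sketch of both routes on 16.2.7 - the automatic one above, and the
manual one where the remaining set-rgw-api-* commands now read their values
from a file via -i (the access/secret key values are placeholders):

# preferred: let the dashboard create/refresh its own RGW credentials
ceph dashboard set-rgw-credentials

# manual alternative (set-rgw-api-user-id is gone in 16.2.7)
echo -n "$ACCESS_KEY" > /tmp/rgw-access-key
echo -n "$SECRET_KEY" > /tmp/rgw-secret-key
ceph dashboard set-rgw-api-access-key -i /tmp/rgw-access-key
ceph dashboard set-rgw-api-secret-key -i /tmp/rgw-secret-key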

Kind Regards,
Ernesto


On Mon, Jan 24, 2022 at 4:48 PM Александр Махов  wrote:

> I am trying to run a new Ceph cluster with Rados GW using the last software
> version 16.2.7, but when I set up RGW nodes I found out there are some
> changes in the CLI comparing with a version 16.2.4 I tested before.
>
> The next commands are missed in the 16.2.7 version:
>
> ceph dashboard set-rgw-api-user-id $USER
> ceph dashboard set-rgw-api-access-key ...
> ceph dashboard set-rgw-api-secret-key ...
>
> they don't exist in ceph dashboard -h output on the 16.2.7 version:
>
> # ceph dashboard -h | grep set-rgw-api | grep -v reset
> dashboard set-rgw-api-access-key Set the
> RGW_API_ACCESS_KEY option value read from -i
> dashboard set-rgw-api-admin-resource  Set the
> RGW_API_ADMIN_RESOURCE option value
> dashboard set-rgw-api-secret-key Set the
> RGW_API_SECRET_KEY option value read from -i
> dashboard set-rgw-api-ssl-verify  Set the
> RGW_API_SSL_VERIFY option value
>
> But on the 16.2.4 version everything is on place:
>
> # ceph dashboard -h | grep set-rgw-api | grep -v reset
> dashboard set-rgw-api-access-key
>Set the RGW_API_ACCESS_KEY option
> value read from -i 
> dashboard set-rgw-api-admin-resource 
>Set the RGW_API_ADMIN_RESOURCE
> option value
> dashboard set-rgw-api-host 
>Set the RGW_API_HOST option value
> dashboard set-rgw-api-port 
>Set the RGW_API_PORT option value
> dashboard set-rgw-api-scheme 
>Set the RGW_API_SCHEME option value
> dashboard set-rgw-api-secret-key
>Set the RGW_API_SECRET_KEY option
> value read from -i 
> dashboard set-rgw-api-ssl-verify 
>Set the RGW_API_SSL_VERIFY option
> value
> dashboard set-rgw-api-user-id 
>Set the RGW_API_USER_ID option
> value
>
>
> In both cases the host OS is Debian 10.
>
> The list of ceph packages is identical on both RGW setups: 16.2.4:
>
> # dpkg -l | grep ceph
> ii  ceph  16.2.4-1~bpo10+1
> amd64distributed storage and file system
> ii  ceph-base 16.2.4-1~bpo10+1
> amd64common ceph daemon libraries and management tools
> ii  ceph-common   16.2.4-1~bpo10+1
> amd64common utilities to mount and interact with a ceph
> storage cluster
> ii  ceph-mgr  16.2.4-1~bpo10+1
> amd64manager for the ceph distributed storage system
> ii  ceph-mgr-modules-core 16.2.4-1~bpo10+1 all
>  ceph manager modules which are always enabled
> ii  ceph-mon  16.2.4-1~bpo10+1
> amd64monitor server for the ceph storage system
> ii  ceph-osd  16.2.4-1~bpo10+1
> amd64OSD server for the ceph storage system
> ii  libcephfs216.2.4-1~bpo10+1
> amd64Ceph distributed file system client library
> ii  libsqlite3-mod-ceph   16.2.4-1~bpo10+1
> amd64SQLite3 VFS for Ceph
> ii  python3-ceph-argparse 16.2.4-1~bpo10+1 all
>  Python 3 utility libraries for Ceph CLI
> ii  python3-ceph-common   16.2.4-1~bpo10+1 all
>  Python 3 utility libraries for Ceph
> ii  python3-cephfs16.2.4-1~bpo10+1
> amd64Python 3 libraries for the Ceph libcephfs library
>
> # dpkg -l | grep rados
> ii  librados2 16.2.4-1~bpo10+1
> amd64RADOS distributed object store client library
> ii  libradosstriper1  16.2.4-1~bpo10+1
> amd64RADOS striping interface
> ii  python3-rados 16.2.4-1~bpo10+1
> amd64Python 3 libraries for the Ceph librados library
> ii  radosgw   16.2.4-1~bpo10+1
> amd64REST gateway for RADOS distributed object store
>
> 16.2.7:
>
> # dpkg -l | grep ceph
> ii  ceph  16.2.7-1~bpo10+1
> amd64distributed storage and file system
> ii  ceph-base 16.2.7-1~bpo10+1
> amd64common ceph daemon libraries and management tools
> ii  ceph-common   16.2.7-1~bpo10+1
> amd64common utilities to mount and interact with a ceph
> storage cluster
> ii  ceph-mgr  16.2.7-1~bpo10+1
> amd64manager for the ceph distributed storage system
> ii  ceph-mgr-modules-core 16.2.7-1~bpo10+1

[ceph-users] Using s3website with ceph orch?

2022-01-24 Thread Manuel Holtgrewe
Dear all,

I'm trying to configure the s3website API with a setup managed by
ceph orch. I'm trying to follow [1] in spirit. I have configured two
ingress.rgw services, "ingress.rgw.ext" and "ingress.rgw.ext-website",
and point to them via ceph-s3-ext.example.com and
ceph-s3-website-ext.example.com in DNS. I'm attempting to pass the
configuration shown below.

However, looking at the configuration of the daemons via the admin
socket tells me that the website-related configuration is not applied.
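For anyone reproducing the check, a sketch of querying a cephadm-managed RGW
daemon over the admin socket (the daemon name below is a placeholder taken
from `ceph orch ps`):

ceph orch ps --daemon-type rgw                      # find the exact daemon name
cephadm enter --name rgw.ext-website.osd-1.abcdef   # on the host running it
ceph daemon rgw.ext-website.osd-1.abcdef config get rgw_enable_apis
ceph daemon rgw.ext-website.osd-1.abcdef config get rgw_dns_s3website_name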

Is this configuration supported? Would there be a workaround?

Best wishes,
Manuel

# cat  rgw.ext.yml
service_type: rgw
service_id: ext
service_name: rgw.ext
placement:
  hosts:
- osd-1
# count_per_host: 1
# label: rgw
spec:
  rgw_frontend_port: 8100
  rgw_realm: ext
  rgw_zone: ext-default-primary
  config:
rgw_dns_name: ceph-s3-ext.example.com
rgw_dns_s3website_name: ceph-s3-website-ext.example.com
rgw_enable_apis: s3, swift, swift_auth, admin
rgw_enable_static_website: true
rgw_expose_bucket: true
rgw_resolve_cname: true
# cat rgw.ext-website.yml
service_type: rgw
service_id: ext-website
service_name: rgw.ext-website
placement:
  hosts:
- osd-1
# count_per_host: 1
# label: rgw
spec:
  rgw_frontend_port: 8200
  rgw_realm: ext
  rgw_zone: ext-default-primary
  config:
rgw_dns_name: ceph-s3-ext.example.com
rgw_dns_s3website_name: ceph-s3-website-ext.example.com
rgw_enable_apis: s3website
rgw_enable_static_website: true
rgw_resolve_cname: true


[1] https://gist.github.com/robbat2/ec0a66eed28e5f0e1ef7018e9c77910c
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG_SLOW_SNAP_TRIMMING and possible storage leakage on 16.2.5

2022-01-24 Thread Dan van der Ster
Hi David,

We observed the same here: https://tracker.ceph.com/issues/52026
You can poke the trimming by repeering the PGs.

Also, depending on your hardware, the defaults for osd_snap_trim_sleep
might be far too conservative.
We use osd_snap_trim_sleep = 0.1 on our mixed hdd block / ssd block.db OSDs.
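A minimal sketch of both knobs at runtime (the PG id and sleep value are just
examples):

ceph pg repeer 3.9b                            # re-peer one affected PG to kick trimming
ceph config set osd osd_snap_trim_sleep 0.1    # shorter per-op sleep; a non-zero value overrides the hdd/ssd-specific variants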

Cheers, Dan

On Mon, Jan 24, 2022 at 4:54 PM David Prude  wrote:
>
> Hello,
>
>We have a 5-node, 30 hdd (6 hdds/node) cluster running 16.2.5. We
> utilize a snapshot scheme within cephfs that results in 24 hourly
> snapshots, 7 daily snapshots, and 2 weekly snapshots. This has been
> running without overt issues for several months. As of this weekend, we
> started receiving a  PG_SLOW_SNAP_TRIMMING warning on a single PG. Over
> the last 24 hours we are now seeing that this warning is associated with
> 123 of our 1513 PGs. As recommended by the output of "ceph health
> detail" we have tried tuning the following from their default values:
>
> osd_pg_max_concurrent_snap_trims=4 (default 2)
> osd_snap_trim_sleep_hdd=3 (default 5)
> osd_snap_trim_sleep=0.5 (default 0, it was suggested somewhere in a
> search that 0 actually disables trim?)
>
> I am uncertain how to best measure if the above is having an effect on
> the trimming process. I am unclear on how to clearly monitor the
> progress of the snaptrim process or even of the total queue depth.
> Interestingly, "ceph pg stat" does not show any PGs in the snaptrim state:
>
> SNIP
> 1513 pgs: 2 active+clean+scrubbing+deep, 1511 active+clean; 114 TiB
> data, 344 TiB used, 93 TiB / 437 TiB avail; 6.2 KiB/s rd, 2.2 MiB/s wr,
> 118 op/s
>
> SNIP
>
> We have, for the time being, disabled our snapshots in the hopes that
> the cluster will catch up with the trimming process. Two potential
> things of note:
>
> 1. We are unaware of any particular action which would be associated
> with this happening now (there were no unusual deletions of either live
> data or snapshots).
> 2. For the past month or two it has appeared as if there has been a
> steady unchecked growth in storage utilization as if snapshots have not
> been actually being trimmed.
>
> Any assistance in determining what exactly has prompted this behavior or
> any guidance on how to evaluate the total snaptrim queue size to see if
> we are making progress would be much appreciated.
>
> Thank you,
>
> -David
>
> --
> David Prude
> Systems Administrator
> PGP Fingerprint: 1DAA 4418 7F7F B8AA F50C  6FDF C294 B58F A286 F847
> Democracy Now!
> www.democracynow.org
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG_SLOW_SNAP_TRIMMING and possible storage leakage on 16.2.5

2022-01-24 Thread David Prude
Dan,

  Thank you for replying. Since I posted I did some more digging. It
really seemed as if snaptrim simply wasn't being processed. The output
of "ceph health detail" showed that PG 3.9b had the longest queue. I
examined this PG and saw that its primary was osd.8, so I manually
restarted that daemon. This seems to have kicked off snaptrim on some PGs:

SNIP
1513 pgs: 1 active+clean+scrubbing, 1 active+clean+scrubbing+snaptrim,
44 active+clean+snaptrim, 1 active+clean+scrubbing+deep+snaptrim_wait,
1406 active+clean, 2 active+clean+scrubbing+deep, 58
active+clean+snaptrim_wait; 114 TiB data, 344 TiB used, 93 TiB / 437 TiB
avail; 2.0 KiB/s rd, 64 KiB/s wr, 5 op/s
SNIP

I can see the "snaptrimq_len" value decreasing for that PG now. I will
look into the issue you posted as well as repeering the PGs. Does an OSD
restart resulting in snaptrim proceeding seem consistent with the
behavior you saw?
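For reference, one way to watch this for a single PG - a sketch, assuming the
SNAPTRIMQ_LEN column is present in this release:

ceph pg map 3.9b                                  # up/acting set and primary OSD
ceph pg dump pgs 2>/dev/null | head -n 1          # header row, to locate SNAPTRIMQ_LEN
ceph pg dump pgs 2>/dev/null | awk '$1 == "3.9b"'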

I notice in the bug report you linked that you are somehow monitoring
snaptrimq with Grafana. Is this a global value that is readily available
for monitoring, or are you calculating it somehow? If there is an easy
way to access it, I would greatly appreciate instructions.

Thank you,

-David

On 1/24/22 11:53 AM, Dan van der Ster wrote:
> Hi David,
>
> We observed the same here: https://tracker.ceph.com/issues/52026
> You can poke the trimming by repeering the PGs.
>
> Also, depending on your hardware, the defaults for osd_snap_trim_sleep
> might be far too conservative.
> We use osd_snap_trim_sleep = 0.1 on our mixed hdd block / ssd block.db OSDs.
>
> Cheers, Dan
>
> On Mon, Jan 24, 2022 at 4:54 PM David Prude  wrote:
>> Hello,
>>
>>We have a 5-node, 30 hdd (6 hdds/node) cluster running 16.2.5. We
>> utilize a snapshot scheme within cephfs that results in 24 hourly
>> snapshots, 7 daily snapshots, and 2 weekly snapshots. This has been
>> running without overt issues for several months. As of this weekend, we
>> started receiving a  PG_SLOW_SNAP_TRIMMING warning on a single PG. Over
>> the last 24 hours we are now seeing that this warning is associated with
>> 123 of our 1513 PGs. As recommended by the output of "ceph health
>> detail" we have tried tuning the following from their default values:
>>
>> osd_pg_max_concurrent_snap_trims=4 (default 2)
>> osd_snap_trim_sleep_hdd=3 (default 5)
>> osd_snap_trim_sleep=0.5 (default 0, it was suggested somewhere in a
>> search that 0 actually disables trim?)
>>
>> I am uncertain how to best measure if the above is having an effect on
>> the trimming process. I am unclear on how to clearly monitor the
>> progress of the snaptrim process or even of the total queue depth.
>> Interestingly, "ceph pg stat" does not show any PGs in the snaptrim state:
>>
>> SNIP
>> 1513 pgs: 2 active+clean+scrubbing+deep, 1511 active+clean; 114 TiB
>> data, 344 TiB used, 93 TiB / 437 TiB avail; 6.2 KiB/s rd, 2.2 MiB/s wr,
>> 118 op/s
>>
>> SNIP
>>
>> We have, for the time being, disabled our snapshots in the hopes that
>> the cluster will catch up with the trimming process. Two potential
>> things of note:
>>
>> 1. We are unaware of any particular action which would be associated
>> with this happening now (there were no unusual deletions of either live
>> data or snapshots).
>> 2. For the past month or two it has appeared as if there has been a
>> steady unchecked growth in storage utilization as if snapshots have not
>> been actually being trimmed.
>>
>> Any assistance in determining what exactly has prompted this behavior or
>> any guidance on how to evaluate the total snaptrim queue size to see if
>> we are making progress would be much appreciated.
>>
>> Thank you,
>>
>> -David
>>
>> --
>> David Prude
>> Systems Administrator
>> PGP Fingerprint: 1DAA 4418 7F7F B8AA F50C  6FDF C294 B58F A286 F847
>> Democracy Now!
>> www.democracynow.org
>>
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io

-- 
David Prude
Systems Administrator
PGP Fingerprint: 1DAA 4418 7F7F B8AA F50C  6FDF C294 B58F A286 F847
Democracy Now!
www.democracynow.org


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG_SLOW_SNAP_TRIMMING and possible storage leakage on 16.2.5

2022-01-24 Thread Dan van der Ster
Hi,

Yes, restarting an OSD also works to re-peer and "kick" the
snaptrimming process.
(In the ticket we first noticed this because snap trimming restarted
after an unrelated OSD crashed/restarted).
Please feel free to add your experience to that ticket.

> monitoring snaptrimq

This is from our local monitoring probes, based on `ceph pg dump -f json`.
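In case it helps, a sketch of the same idea with jq - it assumes the Pacific
JSON layout (pg_map.pg_stats with a snaptrimq_len field per PG):

# total snaptrim backlog across the cluster
ceph pg dump -f json 2>/dev/null | jq '[.pg_map.pg_stats[].snaptrimq_len] | add'

# ten PGs with the longest queues
ceph pg dump -f json 2>/dev/null | \
  jq -r '.pg_map.pg_stats | sort_by(-.snaptrimq_len) | .[:10][] | "\(.pgid) \(.snaptrimq_len)"'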

-- Dan



On Mon, Jan 24, 2022 at 6:31 PM David Prude  wrote:
>
> Dan,
>
>   Thank you for replying. Since I posted I did some more digging. It
> really seemed as if snaptrim simply wasn't being processed. The output
> of "ceph health detail" showed that PG 3.9b had the longest queue. I
> examined this PG and saw that it's primary was osd.8 so I manually
> restarted that daemon. This seems to have kicked off snaptrim on some PGs:
>
> SNIP
> 1513 pgs: 1 active+clean+scrubbing, 1 active+clean+scrubbing+snaptrim,
> 44 active+clean+snaptrim, 1 active+clean+scrubbing+deep+snaptrim_wait,
> 1406 active+clean, 2 active+clean+scrubbing+deep, 58
> active+clean+snaptrim_wait; 114 TiB data, 344 TiB used, 93 TiB / 437 TiB
> avail; 2.0 KiB/s rd, 64 KiB/s wr, 5 op/s
> SNIP
>
> I can see the "snaptrimq_len* value decreasing for that PG now. I will
> look into the issue you posted as well as repeering the PGs. Does an osd
> restart resulting in snaptrim proceeding seem consistent with the
> behavior you saw?
>
> I notice in the bug report you linked, that you are somehow monitoring
> snaptrimq with grafana. Is this a global value that is readily avilable
> for monitoring or are you calculating this somehow. If there is an easy
> way to access it, I would greatly appreciate instructions.
>
> Thank you,
>
> -David
>
> On 1/24/22 11:53 AM, Dan van der Ster wrote:
> > Hi David,
> >
> > We observed the same here: https://tracker.ceph.com/issues/52026
> > You can poke the trimming by repeering the PGs.
> >
> > Also, depending on your hardware, the defaults for osd_snap_trim_sleep
> > might be far too conservative.
> > We use osd_snap_trim_sleep = 0.1 on our mixed hdd block / ssd block.db OSDs.
> >
> > Cheers, Dan
> >
> > On Mon, Jan 24, 2022 at 4:54 PM David Prude  wrote:
> >> Hello,
> >>
> >>We have a 5-node, 30 hdd (6 hdds/node) cluster running 16.2.5. We
> >> utilize a snapshot scheme within cephfs that results in 24 hourly
> >> snapshots, 7 daily snapshots, and 2 weekly snapshots. This has been
> >> running without overt issues for several months. As of this weekend, we
> >> started receiving a  PG_SLOW_SNAP_TRIMMING warning on a single PG. Over
> >> the last 24 hours we are now seeing that this warning is associated with
> >> 123 of our 1513 PGs. As recommended by the output of "ceph health
> >> detail" we have tried tuning the following from their default values:
> >>
> >> osd_pg_max_concurrent_snap_trims=4 (default 2)
> >> osd_snap_trim_sleep_hdd=3 (default 5)
> >> osd_snap_trim_sleep=0.5 (default 0, it was suggested somewhere in a
> >> search that 0 actually disables trim?)
> >>
> >> I am uncertain how to best measure if the above is having an effect on
> >> the trimming process. I am unclear on how to clearly monitor the
> >> progress of the snaptrim process or even of the total queue depth.
> >> Interestingly, "ceph pg stat" does not show any PGs in the snaptrim state:
> >>
> >> SNIP
> >> 1513 pgs: 2 active+clean+scrubbing+deep, 1511 active+clean; 114 TiB
> >> data, 344 TiB used, 93 TiB / 437 TiB avail; 6.2 KiB/s rd, 2.2 MiB/s wr,
> >> 118 op/s
> >>
> >> SNIP
> >>
> >> We have, for the time being, disabled our snapshots in the hopes that
> >> the cluster will catch up with the trimming process. Two potential
> >> things of note:
> >>
> >> 1. We are unaware of any particular action which would be associated
> >> with this happening now (there were no unusual deletions of either live
> >> data or snapshots).
> >> 2. For the past month or two it has appeared as if there has been a
> >> steady unchecked growth in storage utilization as if snapshots have not
> >> been actually being trimmed.
> >>
> >> Any assistance in determining what exactly has prompted this behavior or
> >> any guidance on how to evaluate the total snaptrim queue size to see if
> >> we are making progress would be much appreciated.
> >>
> >> Thank you,
> >>
> >> -David
> >>
> >> --
> >> David Prude
> >> Systems Administrator
> >> PGP Fingerprint: 1DAA 4418 7F7F B8AA F50C  6FDF C294 B58F A286 F847
> >> Democracy Now!
> >> www.democracynow.org
> >>
> >>
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
>
> --
> David Prude
> Systems Administrator
> PGP Fingerprint: 1DAA 4418 7F7F B8AA F50C  6FDF C294 B58F A286 F847
> Democracy Now!
> www.democracynow.org
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Fwd: Lots of OSDs crashlooping (DRAFT - feedback?)

2022-01-24 Thread Benjamin Staffin
I have a cluster where 46 out of 120 OSDs have begun crash looping with the
same stack trace (see pasted output below).  The cluster is in a very bad
state with this many OSDs down, unsurprisingly.

The day before this problem showed up, the k8s cluster was under extreme
memory pressure and a lot of pods were OOM killed, including some of the
Ceph OSDs, but after the memory pressure abated everything seemed to
stabilize for about a day.

Then we attempted to set a 4 GB memory limit on the OSD pods, because they
had been using upwards of 100 GB of RAM(!) per OSD after about a month of
uptime, and this was a contributing factor in the cluster-wide OOM
situation.  Everything seemed fine for a few minutes after Rook rolled out
the memory limit, but then OSDs gradually started to crash, a few at a
time, up to about 30 of them.  At this point I reverted the memory limit,
but I don't think the OSDs were hitting their memory limits at all.  In an
attempt to stabilize the cluster, we eventually stopped the Rook operator and
set the osd norebalance, nobackfill, noout, and norecover flags, but at this
point there were 46 OSDs down and pools were hitting backfillfull.

This is a Rook-Ceph deployment on a bare-metal Kubernetes cluster of 12
nodes.  Each node has two 7 TiB NVMe disks dedicated to Ceph, and we have 5
BlueStore OSDs per NVMe disk (so around 1.4 TiB per OSD, which ought to be
fine with a 4 GB memory target, right?).  The crash we're seeing looks very
much like the one in this bug report: https://tracker.ceph.com/issues/52220
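As far as I understand, osd_memory_target is a soft target rather than a hard
cap, so a container limit presumably needs headroom above it; a sketch of
checking and lowering it (the value is only an example, not a recommendation):

ceph config get osd osd_memory_target              # default is 4294967296 (4 GiB)
ceph config set osd osd_memory_target 3221225472   # e.g. 3 GiB, to leave headroom below a 4 GiB pod limit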

I don't know how to proceed from here, so any advice would be very much
appreciated.

Ceph version: 16.2.6
Rook version: 1.7.6
Kubernetes version: 1.21.5
Kernel version: 5.4.156-1.el7.elrepo.x86_64
Distro: CentOS 7.9

I've also attached the full log output from one of the crashing OSDs, in
case that is of any use.

begin stack trace paste
debug -1> 2022-01-24T22:09:09.405+ 7ff8b4315700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECUtil.cc:
In function 'void ECUtil::HashInfo::append(uint64_t, std::map&)' thread 7ff8b4315700 time
2022-01-24T22:09:09.398961+
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECUtil.cc:
169: FAILED ceph_assert(to_append.size() == cumulative_shard_hashes.size())

 ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific
(stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x158) [0x564f88db554c]
 2: ceph-osd(+0x56a766) [0x564f88db5766]
 3: (ECUtil::HashInfo::append(unsigned long, std::map, std::allocator > >&)+0x14b) [0x564f8910ca0b]
 4: (encode_and_write(pg_t, hobject_t const&, ECUtil::stripe_info_t const&,
std::shared_ptr&, std::set,
std::allocator > const&, unsigned long, ceph::buffer::v15_2_0::list,
unsigned int, std::shared_ptr, interval_map&, std::map,
std::allocator > >*,
DoutPrefixProvider*)+0x6ec) [0x564f8929fa7c]
 5: ceph-osd(+0xa5a611) [0x564f892a5611]
 6: (ECTransaction::generate_transactions(ECTransaction::WritePlan&,
std::shared_ptr&, pg_t, ECUtil::stripe_info_t
const&, std::map, std::less,
std::allocator > > > const&,
std::vector >&,
std::map, std::less,
std::allocator > > >*, std::map,
std::allocator > >*,
std::set, std::allocator >*,
std::set, std::allocator >*,
DoutPrefixProvider*, ceph_release_t)+0x7db) [0x564f892a6dcb]
 7: (ECBackend::try_reads_to_commit()+0x468) [0x564f8927ec28]
 8: (ECBackend::check_ops()+0x24) [0x564f89281cd4]
 9: (CallClientContexts::finish(std::pair&)+0x1278) [0x564f8929d338]
 10: (ECBackend::complete_read_op(ECBackend::ReadOp&,
RecoveryMessages*)+0x8f) [0x564f8926dfaf]
 11: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&,
RecoveryMessages*, ZTracer::Trace const&)+0x1196) [0x564f89287106]
 12: (ECBackend::_handle_message(boost::intrusive_ptr)+0x18f)
[0x564f89287bdf]
 13: (PGBackend::handle_message(boost::intrusive_ptr)+0x52)
[0x564f8908dd12]
 14: (PrimaryLogPG::do_request(boost::intrusive_ptr&,
ThreadPool::TPHandle&)+0x5de) [0x564f89030d6e]
 15: (OSD::dequeue_op(boost::intrusive_ptr,
boost::intrusive_ptr, ThreadPool::TPHandle&)+0x309)
[0x564f88eba1b9]
 16: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*,
boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x68) [0x564f89117868]
 17: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0xa58) [0x564f88eda1e8]
 18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4)
[0x564f895456c4]
 19: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x564f89548364]
 20: /lib64/libpthread.so.0(+0x814a) [0x7ff8db40e14a]
 21: clone()

debug  0> 2022-01-24T22:09:09.411+ 7ff8b4315700 -1 *** Caught
signal (Aborted) **
 in thread 7ff8b4315700 thread_name:tp_osd_tp
end paste

# ceph status
  cluster:

[ceph-users] Re: Lots of OSDs crashlooping (DRAFT - feedback?)

2022-01-24 Thread Benjamin Staffin
oh jeez, sorry about the subject line - I forgot to change it after asking
a coworker to review the message.  This is not a draft.

On Mon, Jan 24, 2022 at 6:44 PM Benjamin Staffin 
wrote:

> I have a cluster where 46 out of 120 OSDs have begun crash looping with
> the same stack trace (see pasted output below).  The cluster is in a very
> bad state with this many OSDs down, unsurprisingly.
>
> The day before this problem showed up, the k8s cluster was under extreme
> memory pressure and a lot of pods were OOM killed, including some of the
> Ceph OSDs, but after the memory pressure abated everything seemed to
> stabilize for about a day.
>
> Then we attempted to set a 4gb memory limit on the OSD pods, because they
> had been using upwards of 100gb of ram(!) per OSD after about a month of
> uptime, and this was a contributing factor in the cluster-wide OOM
> situation.  Everything seemed fine for a few minutes after Rook rolled out
> the memory limit, but then OSDs gradually started to crash, a few at a
> time, up to about 30 of them.  At this point I reverted the memory limit,
> but I don't think the OSDs were hitting their memory limits at all.  In an
> attempt to stabilize the cluster, we eventually the Rook operator and set
> the osd norebalance, nobackfill, noout, and norecover flags, but at this
> point there were 46 OSDs down and pools were hitting BackFillFull.
>
> This is a Rook-ceph deployment on bare-metal kubernetes cluster of 12
> nodes.  Each node has two 7TiB nvme disks dedicated to Ceph, and we have 5
> BlueStore OSDs per nvme disk (so around 1.4TiB per OSD, which ough to be
> fine with a 4gb memory target, right?).  The crash we're seeing looks very
> much like the one in this bug report:
> https://tracker.ceph.com/issues/52220
>
> I don't know how to proceed from here, so any advice would be very much
> appreciated.
>
> Ceph version: 16.2.6
> Rook version: 1.7.6
> Kubernetes version: 1.21.5
> Kernel version: 5.4.156-1.el7.elrepo.x86_64
> Distro: CentOS 7.9
>
> I've also attached the full log output from one of the crashing OSDs, in
> case that is of any use.
>
> begin stack trace paste
> debug -1> 2022-01-24T22:09:09.405+ 7ff8b4315700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECUtil.cc:
> In function 'void ECUtil::HashInfo::append(uint64_t, std::map ceph::buffer::v15_2_0::list>&)' thread 7ff8b4315700 time
> 2022-01-24T22:09:09.398961+
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECUtil.cc:
> 169: FAILED ceph_assert(to_append.size() == cumulative_shard_hashes.size())
>
>  ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x158) [0x564f88db554c]
>  2: ceph-osd(+0x56a766) [0x564f88db5766]
>  3: (ECUtil::HashInfo::append(unsigned long, std::map ceph::buffer::v15_2_0::list, std::less, std::allocator const, ceph::buffer::v15_2_0::list> > >&)+0x14b) [0x564f8910ca0b]
>  4: (encode_and_write(pg_t, hobject_t const&, ECUtil::stripe_info_t
> const&, std::shared_ptr&, std::set std::less, std::allocator > const&, unsigned long,
> ceph::buffer::v15_2_0::list, unsigned int,
> std::shared_ptr, interval_map ceph::buffer::v15_2_0::list, bl_split_merge>&, std::map ceph::os::Transaction, std::less,
> std::allocator > >*,
> DoutPrefixProvider*)+0x6ec) [0x564f8929fa7c]
>  5: ceph-osd(+0xa5a611) [0x564f892a5611]
>  6: (ECTransaction::generate_transactions(ECTransaction::WritePlan&,
> std::shared_ptr&, pg_t, ECUtil::stripe_info_t
> const&, std::map ceph::buffer::v15_2_0::list, bl_split_merge>, std::less,
> std::allocator ceph::buffer::v15_2_0::list, bl_split_merge> > > > const&,
> std::vector >&,
> std::map ceph::buffer::v15_2_0::list, bl_split_merge>, std::less,
> std::allocator ceph::buffer::v15_2_0::list, bl_split_merge> > > >*, std::map ceph::os::Transaction, std::less,
> std::allocator > >*,
> std::set, std::allocator >*,
> std::set, std::allocator >*,
> DoutPrefixProvider*, ceph_release_t)+0x7db) [0x564f892a6dcb]
>  7: (ECBackend::try_reads_to_commit()+0x468) [0x564f8927ec28]
>  8: (ECBackend::check_ops()+0x24) [0x564f89281cd4]
>  9: (CallClientContexts::finish(std::pair ECBackend::read_result_t&>&)+0x1278) [0x564f8929d338]
>  10: (ECBackend::complete_read_op(ECBackend::ReadOp&,
> RecoveryMessages*)+0x8f) [0x564f8926dfaf]
>  11: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&,
> RecoveryMessages*, ZTracer::Trace const&)+0x1196) [0x564f89287106]
>  12: (ECBackend::_handle_message(boost::intrusive_ptr)+0x18f)
> [0x564f89287bdf]
>  13: (PGBackend::handle_message(boost::intrusive_ptr)+0x52)
> [0x564f8908dd12]
>  14: (PrimaryLogPG::do_request(boost::intrusive_ptr&,
> ThreadPool::TP

[ceph-users] Re: Multipath and cephadm

2022-01-24 Thread Michal Strnad

Hi all,

we still have a problem adding any disk that sits behind multipath. We tried
an OSD spec in YAML, and `ceph orch daemon add osd` with mpath, dm-X or sdX
devices (for sdX we disabled the multipath daemon and flushed the multipath
table). Do you have any idea?



ceph orch daemon add osd serverX:/dev/mapper/mpathm
RuntimeError: Failed command: /usr/bin/podman run --rm --ipc=host 
--net=host --entrypoint /usr/sbin/ceph-volume --privileged 
--group-add=disk -e CONTAINER_IMAGE=quay.io/ceph/ceph:v15 -e 
NODE_NAME=serverX -e CEPH_VOLUME_OSDSPEC_AFFINITY=None -v 
/var/run/ceph/69748548-7ba4-11ec-83c5-3cfdfec3517c:/var/run/ceph:z -v 
/var/log/ceph/69748548-7ba4-11ec-83c5-3cfdfec3517c:/var/log/ceph:z -v 
/var/lib/ceph/69748548-7ba4-11ec-83c5-3cfdfec3517c/crash:/var/lib/ceph/crash:z 
-v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v 
/run/lock/lvm:/run/lock/lvm -v 
/tmp/ceph-tmpu04_fcc9:/etc/ceph/ceph.conf:z -v 
/tmp/ceph-tmpbmrjdlv2:/var/lib/ceph/bootstrap-osd/ceph.keyring:z 
quay.io/ceph/ceph:v15 lvm batch --no-auto /dev/mapper/mpathm --yes 
--no-systemd

2022-01-24T18:39:08.390014+0100 mgr.serverX.jxbuay [ERR] _Promise failed
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 294, in 
_finalize

next_result = self._on_complete(self._value)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 115, in 
return CephadmCompletion(on_complete=lambda _: f(*args, **kwargs))
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1677, in create_osds
return self.osd_service.create_from_spec(drive_group)
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 51, in 
create_from_spec

ret = create_from_spec_one(self.prepare_drivegroup(drive_group))
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 65, in 
forall_hosts_wrapper

return CephadmOrchestrator.instance._worker_pool.map(do_work, vals)
  File "/lib64/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/lib64/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
  File "/lib64/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
  File "/lib64/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 59, in do_work
return f(*arg)
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 47, in 
create_from_spec_one
host, cmd, replace_osd_ids=osd_id_claims.get(host, []), 
env_vars=env_vars
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 67, in 
create_single_host

code, '\n'.join(err)))


ceph orch daemon add osd serverX:/dev/dm-19
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1212, in _handle_command
return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 140, in 
handle_command

return dispatch[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 320, in call
return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 102, in 


wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 91, in 
wrapper

return func(*args, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/module.py", line 781, in 
_daemon_add_osd

raise_if_exception(completion)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 642, in 
raise_if_exception

raise e
RuntimeError: cephadm exited with an error code: 1, 
stderr:/usr/bin/podman: stderr --> passed data devices: 0 physical, 1 LVM

/usr/bin/podman: stderr --> relative data size: 1.0
/usr/bin/podman: stderr -->  IndexError: list index out of range
Traceback (most recent call last):
  File "", line 6251, in 
  File "", line 1359, in _infer_fsid
  File "", line 1442, in _infer_image
  File "", line 3713, in command_ceph_volume
  File "", line 1121, in call_throws
RuntimeError: Failed command: /usr/bin/podman run --rm --ipc=host 
--net=host --entrypoint /usr/sbin/ceph-volume --privileged 
--group-add=disk -e CONTAINER_IMAGE=quay.io/ceph/ceph:v15 -e 
NODE_NAME=serverX -e CEPH_VOLUME_OSDSPEC_AFFINITY=None -v 
/var/run/ceph/69748548-7ba4-11ec-83c5-3cfdfec3517c:/var/run/ceph:z -v 
/var/log/ceph/69748548-7ba4-11ec-83c5-3cfdfec3517c:/var/log/ceph:z -v 
/var/lib/ceph/69748548-7ba4-11ec-83c5-3cfdfec3517c/crash:/var/lib/ceph/crash:z 
-v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v 
/run/lock/lvm:/run/lock/lvm -v 
/tmp/ceph-tmpthn_t0il:/etc/ceph/ceph.conf:z -v 
/tmp/ceph-tmprygbv15w:/var/lib/ceph/bootstrap-osd/ceph.keyring:z 
quay.io/ceph/ceph:v15 lvm batch --no-auto /dev/dm-19 --yes --no-systemd


For context: the Ceph cluster is newly installed. The backend OS is CentOS 8 Stream.
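A possible workaround (untested here) might be to wrap the multipath device in
LVM by hand and hand the resulting LV to the orchestrator - the device and
VG/LV names below are examples, and whether this release accepts the vg/lv
form for `ceph orch daemon add osd` still needs to be verified:

pvcreate /dev/mapper/mpathm
vgcreate ceph-mpathm /dev/mapper/mpathm
lvcreate -l 100%FREE -n osd-block-mpathm ceph-mpathm
# then point the orchestrator (or an OSD service spec) at the LV
ceph orch daemon add osd serverX:ceph-mpathm/osd-block-mpathm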

Thank you

Regards,
Michal


On 12/28/21 2:31 PM, Michal Strnad wrote:

Hi Dav