[ceph-users] Re: 3 node CEPH PVE hyper-converged cluster serious fragmentation and performance loss in matter of days.

2022-03-10 Thread Igor Fedotov

Hi Sasa,

just a few thoughts/questions on your issue in an attempt to understand 
what's happening.


First of all I'd like to clarify what exact command you are using to 
assess the fragmentation. There are two options: "bluestore allocator 
score" and "bluestore allocator fragmentation".


Both are not very accurate, but it would be interesting to have 
both numbers for the case with presumably high fragmentation.
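For reference, both can be queried through the OSD admin socket; a rough sketch (osd.0 and the "block" device name are just examples):

```
# Fragmentation score in [0..1]; higher means more fragmented free space
ceph daemon osd.0 bluestore allocator score block

# A second, differently computed estimate for the same device
ceph daemon osd.0 bluestore allocator fragmentation block
```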



Secondly, I can imagine two performance issues when writing to an 
all-flash OSD under heavy fragmentation:


1) Bluestore Allocator takes too long to allocate a new block.

2) BlueStore has to issue a large number of disk write requests to process a 
single 4M user write, which might be less efficient.


I've never seen the latter be a significant issue when SSDs are in 
use (it definitely is for spinners).


But I recall we've seen some issues with 1), e.g. 
https://tracker.ceph.com/issues/52804


In this respect, could you please try to switch the bluestore and bluefs 
allocators to bitmap and run some smoke benchmarking again?
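A minimal sketch of one way to do the switch (assuming these options are read at OSD start-up, so the OSDs need a restart afterwards; please try it on a test OSD first):

```
ceph config set osd bluestore_allocator bitmap
ceph config set osd bluefs_allocator bitmap
# then restart the OSDs, e.g. one node at a time:
systemctl restart ceph-osd.target
```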


Additionally you might want to upgrade to 15.2.16 which includes a bunch 
of improvements for Avl/Hybrid allocators tail latency numbers as per 
the ticket above.



And finally it would be great to get bluestore performance counters for 
both good and bad benchmarks. This can be obtained via: ceph tell osd.N 
perf dump bluestore


but please reset the counters before each benchmarking with: ceph tell 
osd.N perf reset all
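For example, a minimal per-OSD sequence could look like this (osd.0 and the output file name are only examples):

```
ceph tell osd.0 perf reset all
ceph tell osd.0 bench
ceph tell osd.0 perf dump bluestore > osd.0-perf-after-bench.json
```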



Thanks,

Igor

On 3/8/2022 12:50 PM, Sasa Glumac wrote:

Proxmox = 6.4-8

CEPH =  15.2.15

Nodes = 3

Network = 2x100G / node

Disk = nvme Samsung PM-1733 MZWLJ3T8HBLS 4TB

 nvme Samsung PM-1733 MZWLJ1T9HBJR  2TB

CPU = EPYC 7252

CEPH pools = 2 separate pools, one for each disk type, and each disk split into 2
OSDs

Replica = 3


The VMs don't do many writes, and I migrated the main testing VMs to the 2TB pool, which
in turn fragments faster.


[SPOILER="ceph osd df"]

[CODE]ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
 3  nvme   1.74660  1.0       1.7 TiB  432 GiB  431 GiB  4.3 MiB  1.3 GiB  1.3 TiB  24.18  0.90  186  up
10  nvme   1.74660  1.0       1.7 TiB  382 GiB  381 GiB  599 KiB  1.4 GiB  1.4 TiB  21.38  0.79  151  up
 7  ssd2n  0.87329  1.0       894 GiB  279 GiB  278 GiB  2.0 MiB  1.2 GiB  615 GiB  31.19  1.16  113  up
 8  ssd2n  0.87329  1.0       894 GiB  351 GiB  349 GiB  5.8 MiB  1.2 GiB  544 GiB  39.22  1.46  143  up
 4  nvme   1.74660  1.0       1.7 TiB  427 GiB  425 GiB  9.6 MiB  1.4 GiB  1.3 TiB  23.85  0.89  180  up
11  nvme   1.74660  1.0       1.7 TiB  388 GiB  387 GiB  3.5 MiB  1.5 GiB  1.4 TiB  21.72  0.81  157  up
 2  ssd2n  0.87329  1.0       894 GiB  297 GiB  296 GiB  4.1 MiB  1.1 GiB  598 GiB  33.18  1.23  121  up
 6  ssd2n  0.87329  1.0       894 GiB  333 GiB  332 GiB  8.6 MiB  1.2 GiB  561 GiB  37.23  1.38  135  up
 5  nvme   1.74660  1.0       1.7 TiB  415 GiB  413 GiB  5.9 MiB  1.3 GiB  1.3 TiB  23.18  0.86  176  up
 9  nvme   1.74660  1.0       1.7 TiB  400 GiB  399 GiB  4.3 MiB  1.7 GiB  1.4 TiB  22.38  0.83  161  up
 0  ssd2n  0.87329  1.0       894 GiB  332 GiB  330 GiB  4.3 MiB  1.3 GiB  563 GiB  37.07  1.38  135  up
 1  ssd2n  0.87329  1.0       894 GiB  298 GiB  297 GiB  1.7 MiB  1.3 GiB  596 GiB  33.35  1.24  121  up
                    TOTAL     16 TiB   4.2 TiB  4.2 TiB  55 MiB   16 GiB   11 TiB   26.92
MIN/MAX VAR: 0.79/1.46  STDDEV: 6.88[/CODE]

[/SPOILER]


[SPOILER="ceph osd crush tree"]

[CODE]ID   CLASS  WEIGHTTYPE NAME

-12  ssd2n   5.23975  root default~ssd2n

  -9  ssd2n   1.74658  host pmx-s01~ssd2n

   7  ssd2n   0.87329  osd.7

   8  ssd2n   0.87329  osd.8

-10  ssd2n   1.74658  host pmx-s02~ssd2n

   2  ssd2n   0.87329  osd.2

   6  ssd2n   0.87329  osd.6

-11  ssd2n   1.74658  host pmx-s03~ssd2n

   0  ssd2n   0.87329  osd.0

   1  ssd2n   0.87329  osd.1

  -2   nvme  10.47958  root default~nvme

  -4   nvme   3.49319  host pmx-s01~nvme

   3   nvme   1.74660  osd.3

  10   nvme   1.74660  osd.10

  -6   nvme   3.49319  host pmx-s02~nvme

   4   nvme   1.74660  osd.4

  11   nvme   1.74660  osd.11

  -8   nvme   3.49319  host pmx-s03~nvme

   5   nvme   1.74660  osd.5

   9   nvme   1.74660  osd.9

  -1 15.71933  root default

  -3  5.23978  host pmx-s01

   3   nvme   1.74660  osd.3

  10   nvme   1.74660  osd.10

   7  ssd2n   0.87329  osd.7

   8  ssd2n   0.87329  osd.8

  -5  5.23978  host pmx-s02

   4   nvme   1.74660  osd.4

  11   nvme   1.74660  osd.11

   2  ssd2n   0.87329  osd.2

   6  ssd2n   0.87329  osd.6

  -7  5.23978  host pmx-s03

   5   nvme   1.74660  osd.5

   9   nvme   1.74660  osd.9

   0  ssd2n   0.87329   

[ceph-users] Scrubs stalled on Pacific

2022-03-10 Thread Filipe Azevedo
Hello Ceph team.

I've recently (3 weeks ago) upgraded a Ceph cluster from Nautilus to
Pacific (16.2.7), and have encountered a strange issue.

Scrubs, either deep or not, are scheduled and are shown in the cluster
status, but there is no disk IO and they never finish. At the moment all of
my PGs have fallen behind on scrubs which has me worried since data can be
inconsistent and I'm not aware.

This seems similar to:
https://tracker.ceph.com/issues/54172
but no solution was reported there, and it has not started working in my case.

and
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/A45BWXLWC2PKLGA5G7GXKCZDNHEOL2LL/#RHHJXQCBV2GGTGD7GAX5PACEY6TFSWRQ

Because I also see that message in the logs, but the mentioned backport has
not happened as far as I can tell. I'm assuming that this is not common,
otherwise there would be more reports. Is there something special about my
cluster? Any workarounds to get scrubs working again?

Thank you and best regards,
Filipe


[ceph-users] OSD(s) reporting legacy (not per-pool) BlueStore omap usage stats

2022-03-10 Thread Claas Goltz
Hi,

I’m in the process of upgrading all our ceph servers from 14.2.9 to 16.2.7.

Two of three monitors are on 16.2.6 and one is 16.2.7. I will update them
soon.



Before updating to 16.2.6/7 I set the “bluestore_fsck_quick_fix_on_mount
false” flag, and I have already upgraded more than half of my OSD hosts (10
so far) to the latest version without any problems. My health check now
says:

“92 OSD(s) reporting legacy (not per-pool) BlueStore omap usage stats”



How should I handle the warning now?

Thanks!


[ceph-users] Re: OSD(s) reporting legacy (not per-pool) BlueStore omap usage stats

2022-03-10 Thread Marc
> 
> I’m in the process of upgrading all our ceph servers from 14.2.9 to
> 16.2.7.

Why not go to 14.2.22 first? I think the standard is to upgrade from the newest 
version, no?




[ceph-users] Replace HDD with cephadm

2022-03-10 Thread Jimmy Spets
Hello,

I have a Ceph Pacific cluster managed by cephadm. The nodes have six HDDs and one NVMe that is shared between the six HDDs. The OSD spec file looks like this:

service_type: osd
service_id: osd_spec_default
placement:
  host_pattern: '*'
data_devices:
  rotational: 1
db_devices:
  rotational: 0
  size: '800G:1200G'
db_slots: 6
encrypted: true

I need to replace one of the HDDs that is broken. How do I replace the HDD in the OSD, connecting it to the old HDD's db_slot?

/Jimmy
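A possible workflow under cephadm is sketched below (osd.12, node1 and /dev/sdX are placeholders; check `ceph osd tree` and `ceph orch device ls` for the real IDs and paths):

```
# Drain the failed OSD but keep its ID reserved for the replacement disk
ceph orch osd rm 12 --replace

# After physically swapping the HDD, clean the new device if necessary
ceph orch device zap node1 /dev/sdX --force
```

With the osd_spec_default spec still applied, cephadm should then recreate the OSD on the new disk; whether it can re-use the freed DB slot on the shared NVMe may depend on the old DB LV having been cleaned up first, so it is worth verifying afterwards with `ceph orch device ls` and the OSD metadata.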


[ceph-users] Re: Scrubbing

2022-03-10 Thread Ray Cunningham

We have 16 Storage Servers each with 16TB HDDs and 2TB SSDs for DB/WAL, so we 
are using bluestore. The system is running Nautilus 14.2.19 at the moment, with 
an upgrade scheduled this month. I can't give you a complete ceph config dump 
as this is an offline customer system, but I can get answers for specific 
questions. 

Off the top of my head, we have set:

osd_max_scrubs 20
osd_scrub_auto_repair true
osd_scrub_load_threshold 0.6
We do not limit scrub hours.

Thank you,
Ray 
 

 

-Original Message-
From: norman.kern  
Sent: Wednesday, March 9, 2022 7:28 PM
To: Ray Cunningham 
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Scrubbing

Ray,

Can you provide more information about your cluster (hardware and software 
configs)?

On 3/10/22 7:40 AM, Ray Cunningham wrote:
>   make any difference. Do


[ceph-users] Re: Scrubbing

2022-03-10 Thread Ray Cunningham
We have that set to 20 at the moment.

Thank you,
Ray Cunningham

Systems Engineering and Services Manager
keepertechnology
(571) 223-7242


From: Szabo, Istvan (Agoda) 
Sent: Wednesday, March 9, 2022 7:35 PM
To: Ray Cunningham 
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Scrubbing

Have you tried to increase osd max scrubs to 2?
Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---


On 2022. Mar 10., at 6:42, Ray Cunningham <ray.cunning...@keepertech.com> wrote:


Hi Everyone,

We have a 900 OSD cluster and our pg scrubs aren't keeping up. We are always 
behind and have tried to tweak some of the scrub config settings to allow a 
higher priority and faster scrubbing, but it doesn't seem to make any 
difference. Does anyone have any suggestions for increasing scrub throughput?

Thank you,
Ray Cunningham

Systems Engineering and Services Manager
keepertechnology
(571) 223-7242






[ceph-users] ceph -s hang at Futex: futex_wait_setbit_private:futex_clock_realtime

2022-03-10 Thread Xianqiang Jing
Hi Ceph experts
Recently I have been trying to fix a ceph cluster with all monitors crashed. I created a new 
monitor with cephx disabled. Then I used the ceph-objectstore-tool to dump the 
cluster map from all ceph OSDs.
I moved the .sst files to a new folder, then used the "ceph-monstore-tool 
/tmp/monstore/ rebuild" command to recreate the store in the 
/var/lib/ceph/mon/ceph-node1/ directory, and then restarted the ceph monitor daemon.
The monitor daemon starts normally as expected. But when I run "strace ceph -s", 
it shows me that the process hangs at 
"Futex: futex_wait_setbit_private:futex_clock_realtime".
I am confused and don't know how to deal with it. Can anyone give me some 
suggestions?
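For reference, the documented mon-store recovery procedure rebuilds the store together with the admin keyring, roughly like this (a sketch; paths are examples, and a missing or mismatched keyring is one common reason for clients hanging on `ceph -s`):

```
# Gather cluster maps from every OSD on this host into a temporary store
# (run while the OSDs are stopped)
for osd in /var/lib/ceph/osd/ceph-*; do
  ceph-objectstore-tool --data-path "$osd" --no-mon-config \
      --op update-mon-db --mon-store-path /tmp/monstore
done

# Rebuild the monitor store, passing the admin keyring so the auth entities exist
ceph-monstore-tool /tmp/monstore rebuild -- --keyring /etc/ceph/ceph.client.admin.keyring
```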


Thanks so much.


Ivan Jing
jingxianqian...@126.com


[ceph-users] Re: Scrubbing

2022-03-10 Thread Ray Cunningham
From: 

osd_scrub_load_threshold
The normalized maximum load. Ceph will not scrub when the system load (as 
defined by getloadavg() / number of online CPUs) is higher than this number. 
Default is 0.5.

Does anyone know how I can run getloadavg() / number of online CPUs so I can 
see what our load is? Is that a ceph command, or an OS command? 
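For reference, it is an OS-level figure; a rough shell equivalent (getloadavg() reads the same numbers that /proc/loadavg and uptime report):

```
# 1-minute load average divided by the number of online CPUs
awk -v n="$(nproc)" '{ printf "%.2f\n", $1 / n }' /proc/loadavg
```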

Thank you,
Ray 
 

-Original Message-
From: Ray Cunningham 
Sent: Thursday, March 10, 2022 7:59 AM
To: norman.kern 
Cc: ceph-users@ceph.io
Subject: RE: [ceph-users] Scrubbing


We have 16 Storage Servers each with 16TB HDDs and 2TB SSDs for DB/WAL, so we 
are using bluestore. The system is running Nautilus 14.2.19 at the moment, with 
an upgrade scheduled this month. I can't give you a complete ceph config dump 
as this is an offline customer system, but I can get answers for specific 
questions. 

Off the top of my head, we have set:

osd_max_scrubs 20
osd_scrub_auto_repair true
osd_scrub_load_threshold 0.6
We do not limit scrub hours.

Thank you,
Ray 
 

 

-Original Message-
From: norman.kern  
Sent: Wednesday, March 9, 2022 7:28 PM
To: Ray Cunningham 
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Scrubbing

Ray,

Can you provide more information about your cluster (hardware and software 
configs)?

On 3/10/22 7:40 AM, Ray Cunningham wrote:
>   make any difference. Do


[ceph-users] Re: 3 node CEPH PVE hyper-converged cluster serious fragmentation and performance loss in matter of days.

2022-03-10 Thread Sasa Glumac
> First of all I'd like to clarify what exact command you are using to
> assess the fragmentation. There are two options: "bluestore allocator
> score" and "bluestore allocator fragmentation"
I am using this one: "ceph daemon osd.$i bluestore allocator score block"

> Both are not very accurate, but it would be interesting to have
> both numbers for the case with presumably high fragmentation.
Here are numbers from a single server to keep the email shorter, but almost the
exact same scores show up on the other 2 nodes. I recreated OSDs 0 and 1 46h ago
and they were perfect; now they are already extra slow and fragmented:

for i in 5 9 0 1 ; do echo $i ; ceph daemon osd.$i bluestore allocator score block ; done
>
> 5
> {
> "fragmentation_rating": 0.29451514185657074
> }
> 9
> {
> "fragmentation_rating": 0.29940778224909959
> }
> 0
> {
> "fragmentation_rating": 0.84247390671066713
> }
> 1
> {
> "fragmentation_rating": 0.78098161172652247
> }


for i in 5 9 0 1 ; do echo $i ; ceph daemon osd.$i bluestore allocator fragmentation block ; done

> 5
> {
> "fragmentation_rating": 0.0055253213950322861
> }
> 9
> {
> "fragmentation_rating": 0.0053455960516075665
> }
> 0
> {
> "fragmentation_rating": 0.014439265895713198
> }
> 1
> {
> "fragmentation_rating": 0.013245320572893494
> }


> In this respect, could you please try to switch the bluestore and bluefs
> allocators to bitmap and run some smoke benchmarking again?
Can I change this on a live server (is there any possibility of losing data,
etc.)? Can you please share the correct procedure.


> Additionally you might want to upgrade to 15.2.16 which includes a bunch
> of improvements for Avl/Hybrid allocators tail latency numbers as per
> the ticket above.
Atm we use the PVE repository where 15.2.15 is the latest; I will need to either
wait for .16 from them or create a second cluster without Proxmox, but I would
like to test on the existing one.
Is there any difference between PVE Ceph and the regular packages, so that I could
change the repo and install over the existing installation?

> And finally it would be great to get bluestore performance counters for
> both good and bad benchmarks. This can be obtained via: ceph tell osd.N
> perf dump bluestore
>
> but please reset the counters before each benchmarking with: ceph tell
> osd.N perf reset all
DATEBENCH=$(date +"%Y-%m-%d-%H-%M-%S") && ceph tell osd.0 perf reset all &&
ceph tell osd.0 bench >>
/root/ceph_osd_bench_results/$DATEBENCH-perf-dump-bluestore-osd-0-and-bench-fragmented.log
&& ceph tell osd.0 perf dump bluestore >>
/root/ceph_osd_bench_results/$DATEBENCH-perf-dump-bluestore-osd-0-and-bench-fragmented.log

{
> "bytes_written": 1073741824,
> "blocksize": 4194304,
> "elapsed_sec": 1.140284746999,
> "bytes_per_sec": 941643591.06348729,
> "iops": 224.50532700144942
> }
> {
> "bluestore": {
> "kv_flush_lat": {
> "avgcount": 142,
> "sum": 0.000820554,
> "avgtime": 0.05778
> },
> "kv_commit_lat": {
> "avgcount": 142,
> "sum": 1.208369108,
> "avgtime": 0.008509641
> },
> "kv_sync_lat": {
> "avgcount": 142,
> "sum": 1.209189662,
> "avgtime": 0.008515420
> },
> "kv_final_lat": {
> "avgcount": 141,
> "sum": 0.044558120,
> "avgtime": 0.000316015
> },
> "state_prepare_lat": {
> "avgcount": 407,
> "sum": 1.443276139,
> "avgtime": 0.003546133
> },
> "state_aio_wait_lat": {
> "avgcount": 407,
> "sum": 12.148961431,
> "avgtime": 0.029850028
> },
> "state_io_done_lat": {
> "avgcount": 407,
> "sum": 0.009644771,
> "avgtime": 0.23697
> },
> "state_kv_queued_lat": {
> "avgcount": 407,
> "sum": 5.441919173,
> "avgtime": 0.013370808
> },
> "state_kv_commiting_lat": {
> "avgcount": 407,
> "sum": 8.541078753,
> "avgtime": 0.020985451
> },
> "state_kv_done_lat": {
> "avgcount": 407,
> "sum": 0.000117127,
> "avgtime": 0.00287
> },
> "state_deferred_queued_lat": {
> "avgcount": 0,
> "sum": 0.0,
> "avgtime": 0.0
> },
> "state_deferred_aio_wait_lat": {
> "avgcount": 0,
> "sum": 0.0,
> "avgtime": 0.0
> },
> "state_deferred_cleanup_lat": {
> "avgcount": 0,
> "sum": 0.0,
> "avgtime": 0.0
> },
> "state_finishing_lat": {
> "avgcount": 407,
> "sum": 0.41350,
> "avgtime": 0.00101
> },
> "state_done_lat": {
> "avgcount": 407,
> "sum": 0.033037493,
> "avgtim

[ceph-users] Procedure for migrating wal.db to ssd

2022-03-10 Thread Anderson, Erik
Hi Everyone,

I am running a containerized Pacific cluster 15.2.15 with 80 spinning disks and 
20 SSDs. Currently the SSDs are being used as a cache tier and hold the metadata 
pool for cephfs. I think we could make better use of the SSDs by moving 
block.wal and block.db to the SSDs, and I have a few questions about this.


  *   How do I change my config to hold block.db and block.wal? Do I need to 
fail out drives and then re-create them? What is the proper procedure for this? 
(See the spec sketched after this list.)
  *   How do I determine the number of spinning drives that can share one SSD?
  *   Is this a good idea? Am I likely to see a performance increase – what 
kinds of workflows will benefit from this change?
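Regarding the first question: on cephadm-managed clusters, DB/WAL placement is normally expressed in an OSD service spec rather than in ceph.conf, and it only applies to OSDs created (or recreated) after the spec is applied, so existing OSDs would have to be drained and redeployed. A rough sketch with a hypothetical service id and filters (80 HDDs across 20 SSDs would suggest 4 DB slots per SSD):

```
# Hypothetical OSD service spec: HDDs as data devices, SSDs carrying block.db
# (block.wal is colocated with block.db unless wal_devices is set separately)
cat > osd_hdd_ssd_db.yaml <<'EOF'
service_type: osd
service_id: hdd_with_ssd_db
placement:
  host_pattern: '*'
data_devices:
  rotational: 1
db_devices:
  rotational: 0
db_slots: 4
EOF

# Review what would be created (if your release supports --dry-run), then apply
ceph orch apply -i osd_hdd_ssd_db.yaml --dry-run
```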

Thanks for your help!

Erik Anderson


[ceph-users] Re: Scrubbing

2022-03-10 Thread Ray Cunningham
Well that was incorrect. Someone changed it back to 1. I have now set our max 
scrubs to 2. We’ll see if that makes a difference.

Thank you,
Ray

From: Ray Cunningham
Sent: Thursday, March 10, 2022 8:00 AM
To: Szabo, Istvan (Agoda) 
Cc: ceph-users@ceph.io
Subject: RE: [ceph-users] Scrubbing

We have that set to 20 at the moment.

Thank you,
Ray

From: Szabo, Istvan (Agoda) 
mailto:istvan.sz...@agoda.com>>
Sent: Wednesday, March 9, 2022 7:35 PM
To: Ray Cunningham 
mailto:ray.cunning...@keepertech.com>>
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Scrubbing

Have you tried to increase osd max scrubs to 2?
Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2022. Mar 10., at 6:42, Ray Cunningham <ray.cunning...@keepertech.com> wrote:


Hi Everyone,

We have a 900 OSD cluster and our pg scrubs aren't keeping up. We are always 
behind and have tried to tweak some of the scrub config settings to allow a 
higher priority and faster scrubbing, but it doesn't seem to make any 
difference. Does anyone have any suggestions for increasing scrub throughput?

Thank you,
Ray Cunningham

Systems Engineering and Services Manager
keepertechnology
(571) 223-7242






[ceph-users] Election deadlock after network split in stretch cluster

2022-03-10 Thread Florian Pritz
Hi,

We have a test cluster for evaluating stretch mode and we are running
into an issue where the monitors fail to elect a leader after a network
split, even though all monitors are online and can reach each other.

I saw a presentation by Gregory Farnum from FOSDEM 2020 about stretch
clusters and his explanation of the connectivity strategy sounds like
this case should not happen, so I've put him on CC in case he can share
more details about this process.


Our cluster consists of 3 data centers, with a total of 5 monitor nodes
and 4 osd nodes. Two data centers each have 2 monitors and 2 osd nodes,
while the third data center provides a tiebreaker monitor.

We collected some debugging information with the following commands:

- `ceph daemon /var/run/ceph/ceph-mon.$(hostname).asok mon_status` for the
"mon rank".

- `ceph daemon /var/run/ceph/ceph-mon.$(hostname).asok connection scores
dump` for the "connectivity rank".

During the tests we used `watch -n1` to monitor the output of these
commands. This also means we may be missing information that changes too
quickly.
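As a sketch of how that monitoring can be scripted (assuming jq is available and that mon_status exposes the rank, state and quorum fields):

```
# Refresh the local monitor's rank, election state and quorum view every second
watch -n1 'ceph daemon /var/run/ceph/ceph-mon.$(hostname).asok mon_status | jq "{rank, state, quorum}"'
```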

The issue we encounter happens when the following events occur:

1. All nodes are online, the monitors have monitor and connectivity
ranks 0 to 4. Each occurs exactly once and the monitor rank of a node
matches the connectivity rank of that node. I will refer to these nodes
as nodes A to E now. So node A has rank 0, B has 1, C 2, D 3, E 4.

2. One data center with nodes B and C goes offline. We simulate this by
running `ip route add blackhole $ip` on all machines in that one data center
to block all IPs of the other nodes from the other data centers. We do
this on monitor and OSD nodes at the "same" time (like within one
second).

3. The surviving nodes create a new quorum between nodes A, D and E. So
ranks 0, 3 and 4 have survived. Their connectivity scores change to 0,
1, 2, but their monitor rank stays the same. They go to "leader" and
"peon" in the monitor status. `ceph status` also shows that stretch mode
has detected a data center failure.

3.1. The monitors B and C in the offlined data center retain their monitor
and connectivity ranks of 1 and 2, but are stuck in "probing" (I think)
or "electing" state in the monitor status. I'm not 100% sure which state
they were in, but if that matters I can retest. They were certainly not
part of the quorum.

4. We restore the connection of the offline data center again by
removing the blackhole routes.

5. The nodes momentarily managed to create a quorum, but it collapsed
within seconds and we are not quite sure if the quorum contained all
nodes or just a subset of nodes.

6. The quorum collapses and the connectivity rank of nodes B to E
changes to 1. Their monitor ranks are still 1 to 4. The monitor and
connectivity rank of node A is still 0.

7. All monitors are stuck in "electing" state in the monitor status for
at least several minutes. We've seen this happen for hours before, but
back then we hadn't yet analysed it in this detail. After reproducing it
now, it stayed like this for at least 5 minutes.

During that time, the "epoch" in the monitor status increases by 2 every
(roughly wall timed) 5 seconds. Nothing else happens.

We believe that the connectivity strategy is unable to create a quorum
due to only nodes 0 and 1 being supposedly online. We believe that ceph
does not notice that the connectivity rank and the actual monitor rank of a
node differ before the connectivity data is sent to other nodes. Those
nodes simply collect all the data from nodes B to E under the data for
rank 1. Thus rank 2 to 4 are seen as offline, but at least three nodes
are required to build a quorum since our cluster contains 5 monitor
nodes.

8. We restart one node where the connectivity rank and the monitor rank
mismatch. In this case we decided to use node D with monitor rank 3 and
connectivity rank 1. After the restart it changed the connectivity rank
to 3 as well.

Now the cluster is able to find 3 nodes again to build a quorum and
after a few seconds all 5 nodes join a working quorum, even though nodes
C and E still show connectivity rank 1, together with node B. Node B
also shows monitor rank 1 so connectivity rank 1 sounds correct here.


It appears that the connectivity rank can only be decreased and never
increased again unless the monitor process is restarted. When failovers
happen and the quorum is reduced, the rank reduces and eventually the
cluster enters a state where the connectivity ranks are too low to
support the quorum requirement.

During brief testing, we were unable to reproduce this issue by simply
taking monitors offline, without also taking their osd nodes offline as
well. This may thus be related to special stretch mode handling when an
entire data center fails.

I believe this is not supposed to happen and the cluster should be able
to 1) recover completely after the data center comes back online and 2)
the connectivity rank should match the monitor rank of each node.

I hope I've de

[ceph-users] Re: Election deadlock after network split in stretch cluster

2022-03-10 Thread Florian Pritz
On Thu, Mar 10, 2022 at 06:33:10PM +0100, Florian Pritz wrote:
> We have a test cluster for evaluating stretch mode and we are running
> into an issue where the monitors fail to elect a leader after a network
> split, even though all monitors are online and can reach each other.

Oh, totally forgot to mention that we are running 16.2.7 on Ubuntu,
installed with the Ceph upstream packages. So no docker containers or
stuff like that.

Florian


[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-03-10 Thread Sebastian Mazza
Hi Igor!

I hope I've hit the jackpot now. I have logs with OSD debug level 20 for 
bluefs, bdev, and bluestore. The log file ceph-osd.4.log shows 2 consecutive 
startups of osd.4, where the second startup results in:
```
rocksdb: Corruption: Bad table magic number: expected 9863518390377041911, 
found 0 in db/002193.sst
```

The log ceph-osd.4.log shows the last two starts of the OSD. However, I am also 
sending you the previous log, which contains the “*** Immediate shutdown 
(osd_fast_shutdown=true) ***” that happened before the last successful OSD 
startup as its last line. It is the last line because the server was down for 
about 39 hours. Based on what you explained to me, the corruption of the RocksDB 
must have occurred during the OSD startup that starts at the first line of 
ceph-osd.4.log.

Logs and the corrupted rocks DB file: https://we.tl/t-yVsF3akxqX

Do the logs contain what you need?
Please tell me if you need more data from the OSD. If not, I would rebuild it.


Best wishes,
Sebastian

> On 26.02.2022, at 00:12, Igor Fedotov  wrote:
> 
> Sebastian,
> 
> On 2/25/2022 7:17 PM, Sebastian Mazza wrote:
>> Hi Igor,
>> 
>>> Unfortunately the last logs (from 24-02-2022) you shared don't include the 
>>> point where actual corruption happened as the previous log did. These new 
>>> logs miss the last successful OSD start which apparently corrupts DB data. 
>>> Do you have any output prior to their content?
>> I have logs from the last successful startup of all OSDs, since this was the 
>> cluster boot where the OSD.7 failed (21-02-2022 at 13:49:28.452+0100). That 
>> also means that the logs does not include the bluestore debug info, only 
>> bluefs and bdev. I will include all logs that I have for the crashed OSD 5 
>> and 6. The last successful startup is inside the files that ends with “4”. 
>> While the startups are collected with  bluefs and bdev debug infos, the logs 
>> between the startup (21-02-2022 at 13:49:28.452+0100) and the crash 
>> (2022-02-24T00:04:57.874+0100) was collected without any debug logging, 
>> since during that time I have rebuild OSD 7.
>> 
>> Logs: https://we.tl/t-x0la8gZl0S
> Unfortunately they're of little help without verbose output :(
>> 
>> Since I have to manually rebuild the the cluster, pools, cephFS, and RBD 
>> images now, It will take several days before I can try to reproduce the 
>> problem.
>> 
>> Should, or could I enable more verbose logging than
>> ```
>> [osd]
>> debug bluefs = 20
>> debug bdev = 20
>> debug bluestore = 20
>> ```
>> in ceph.conf? Or somehow else?
> 
> This can be done either through ceph.conf or using "ceph config set" CLI 
> command.
> 
> To make the resulting log smaller you can use a bit less verbose logging levels:
> 
> debug bluefs = 10
> 
> debug bdev = 5
> 
> debug bluestore = 20
> 
> 
>> 
>> Thanks,
>> Sebastian
>> 
>> 
>> 
>>> On 25.02.2022, at 16:18, Igor Fedotov  wrote:
>>> 
>>> Hi Sebastian,
>>> 
>>> I submitted a ticket https://tracker.ceph.com/issues/54409 which shows my 
>>> analysis based on your previous log (from 21-02-2022). Which wasn't verbose 
>>> enough at debug-bluestore level to make the final conclusion.
>>> 
>>> Unfortunately the last logs (from 24-02-2022) you shared don't include the 
>>> point where actual corruption happened as the previous log did. These new 
>>> logs miss the last successful OSD start which apparently corrupts DB data. 
>>> Do you have any output prior to their content?
>>> 
>>> If not could you please reproduce that once again? Generally  I'd like to 
>>> see OSD log for a broken startup along with a couple of restarts back - the 
>>> event sequence for the failure seems to be as follows:
>>> 
>>> 1) OSD is shut down for the first time. It (for unclear reasons) keeps a set 
>>> of deferred writes to be applied once again.
>>> 
>>> 2) OSD is started up which triggers deferred writes submissions. They 
>>> overlap (again for unclear reasons so far) with DB data content written 
>>> shortly before. The OSD starts properly but DB data corruption has happened 
>>> at this point
>>> 
>>> 3) OSD is restarted again which reveals the data corruption and since that 
>>> point OSD is unable to start.
>>> 
>>> So these last new logs include 3) only for now, while I need 1) & 2) 
>>> as well...
>>> 
>>> 
>>> Thanks,
>>> 
>>> Igor
>>> 
>>> 
>>> On 2/24/2022 3:04 AM, Sebastian Mazza wrote:
>>>> Hi Igor,
>>>>
>>>> I let ceph rebuild the OSD.7. Then I added
>>>> ```
>>>> [osd]
>>>> debug bluefs = 20
>>>> debug bdev = 20
>>>> debug bluestore = 20
>>>> ```
>>>> to the ceph.conf of all 3 nodes and shut down all 3 nodes without writing 
>>>> anything to the pools on the HDDs (the Debian VM was not even running).
>>>> Immediately at the first boot OSD.5 and 6 crashed with the same "Bad table 
>>>> magic number" error. The OSDs 5 and 6 are on the same node, but not on the 
>>>> node of OSD 7, which crashed the last two times.
>>>>
>>>> Logs and corrupted rocks

[ceph-users] Re: OSD(s) reporting legacy (not per-pool) BlueStore omap usage stats

2022-03-10 Thread Dan van der Ster
Hi,

After Nautilus there were two omap usage stats upgrades:
Octopus (v15) fsck (on by default) enables per-pool omap usage stats.
Pacific (v16) fsck (off by default) enables per-pg omap usage stats.
(fsck is off by default in pacific because it takes quite some time to
update the on-disk metadata, and btw the pacific fsck had a data
corrupting bug until 16.2.7 [1]).

You're getting a warning because you skipped over Octopus (which is
ok!) but this jump means you miss the per-pool omap stats upgrade.
Confusingly, the per-pool omap warning is *on* by default, hence the
warning message.

You can disable the per-pool warning with:
ceph config set global bluestore_warn_on_no_per_pool_omap false

Or you can decide to fsck the OSDs now with 16.2.7. This will add
per-pg omap stats and clear the warning.
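If you go the fsck route, a sketch of the usual way to trigger it (the conversion runs at OSD start and can add noticeable startup time per OSD, so consider doing it one host at a time):

```
# Let each OSD run the quick-fix fsck and convert its omap usage stats
# at the next start
ceph config set osd bluestore_fsck_quick_fix_on_mount true

# Then restart the OSDs, e.g. host by host
# (systemctl example for non-cephadm deployments)
systemctl restart ceph-osd.target
```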

This is documented here:

https://docs.ceph.com/en/latest/rados/operations/health-checks/#bluestore-no-per-pool-omap

Cheers, Dan

On Thu, Mar 10, 2022 at 12:43 PM Claas Goltz  wrote:
>
> Hi,
>
> I’m in the process of upgrading all our ceph servers from 14.2.9 to 16.2.7.
>
> Two of three monitors are on 16.2.6 and one is 16.2.7. I will update them
> soon.
>
>
>
> Before updating to 16.2.6/7 I set the “bluestore_fsck_quick_fix_on_mount
> false” flag and I already upgraded more than the half of my OSD Hosts (10
> so far) to the latest Version without any problems. My Health Check now
> says:
>
> “92 OSD(s) reporting legacy (not per-pool) BlueStore omap usage stats”
>
>
>
> How should I handle the warning now?
>
> Thanks!


[ceph-users] Re: Scrubbing

2022-03-10 Thread norman.kern

Ray,

You can use node-exporter + Prometheus + Grafana to collect CPU load
statistics. You can use the uptime command to get the current values.

On 3/10/22 10:51 PM, Ray Cunningham wrote:

From:

osd_scrub_load_threshold
The normalized maximum load. Ceph will not scrub when the system load (as 
defined by getloadavg() / number of online CPUs) is higher than this number. 
Default is 0.5.

Does anyone know how I can run getloadavg() / number of online CPUs so I can 
see what our load is? Is that a ceph command, or an OS command?

Thank you,
Ray


-Original Message-
From: Ray Cunningham
Sent: Thursday, March 10, 2022 7:59 AM
To: norman.kern 
Cc: ceph-users@ceph.io
Subject: RE: [ceph-users] Scrubbing


We have 16 Storage Servers each with 16TB HDDs and 2TB SSDs for DB/WAL, so we 
are using bluestore. The system is running Nautilus 14.2.19 at the moment, with 
an upgrade scheduled this month. I can't give you a complete ceph config dump 
as this is an offline customer system, but I can get answers for specific 
questions.

Off the top of my head, we have set:

osd_max_scrubs 20
osd_scrub_auto_repair true
osd_scrub_load_threshold 0.6
We do not limit scrub hours.

Thank you,
Ray




-Original Message-
From: norman.kern 
Sent: Wednesday, March 9, 2022 7:28 PM
To: Ray Cunningham 
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Scrubbing

Ray,

Can you provide more information about your cluster (hardware and software 
configs)?

On 3/10/22 7:40 AM, Ray Cunningham wrote:

   make any difference. Do



[ceph-users] Re: Scrubbing

2022-03-10 Thread norman.kern

Ray,

Do you know the IOPS/BW of the cluster? The 16TB HDDs are more suitable
for cold data. If the clients' BW/IOPS is too high, you can never finish
the scrub.

And if you adjust the priority, it will have a great impact on the clients.

On 3/10/22 9:59 PM, Ray Cunningham wrote:

We have 16 Storage Servers each with 16TB HDDs and 2TB SSDs for DB/WAL, so we 
are using bluestore. The system is running Nautilus 14.2.19 at the moment, with 
an upgrade scheduled this month. I can't give you a complete ceph config dump 
as this is an offline customer system, but I can get answers for specific 
questions.

Off the top of my head, we have set:

osd_max_scrubs 20
osd_scrub_auto_repair true
osd_scrub_load_threshold 0.6
We do not limit scrub hours.

Thank you,
Ray




-Original Message-
From: norman.kern 
Sent: Wednesday, March 9, 2022 7:28 PM
To: Ray Cunningham 
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Scrubbing

Ray,

Can you provide more information about your cluster (hardware and software 
configs)?

On 3/10/22 7:40 AM, Ray Cunningham wrote:

   make any difference. Do



[ceph-users] Re: mclock and background best effort

2022-03-10 Thread Aishwarya Mathuria
Hello Luis,

Background best effort includes background operations such as scrubbing, PG
deletion, and snapshot trimming.
Hope that answers your question!
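For anyone who wants to inspect or tune how much capacity that class of operations gets, a sketch (option names as in the Pacific mclock scheduler; worth double-checking with `ceph config help` on your release):

```
# Current mclock profile (high_client_ops, high_recovery_ops, balanced or custom)
ceph config get osd.0 osd_mclock_profile

# QoS knobs for background best-effort ops (honoured with the custom profile)
ceph config get osd.0 osd_mclock_scheduler_background_best_effort_wgt
ceph config get osd.0 osd_mclock_scheduler_background_best_effort_res
ceph config get osd.0 osd_mclock_scheduler_background_best_effort_lim
```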

Regards,
Aishwarya Mathuria


[ceph-users] Re: OSD storage not balancing properly when crush map uses multiple device classes

2022-03-10 Thread David DELON
Hi, 

i think i have a similar problem with my Octopus cluster. 

$ ceph osd df | grep ssd 
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META   AVAIL   %USE  VAR  PGS STATUS
31 ssd   0.36400 1.0      373 GiB 289 GiB 258 GiB 91 MiB  31 GiB 84 GiB  77.46 1.20 110 up     ===> not rebalanced, is it normal?
46 ssd   0.36400 1.0      373 GiB 234 GiB 209 GiB 303 KiB 25 GiB 138 GiB 62.86 0.97  88 up
47 ssd   0.36400 1.0      373 GiB 222 GiB 198 GiB 1.9 MiB 24 GiB 151 GiB 59.51 0.92  94 up
30 ssd   0.18199 1.0      186 GiB 111 GiB  99 GiB 16 MiB  13 GiB 75 GiB  59.76 0.92  47 up     ===> smaller disk to ignore.
32 ssd   0.36400 1.0      373 GiB 210 GiB 187 GiB 51 MiB  23 GiB 163 GiB 56.26 0.87  86 up
33 ssd   0.36400 1.0      373 GiB 206 GiB 184 GiB 79 MiB  22 GiB 167 GiB 55.31 0.85  87 up

osd.31 has 110 PGs (and VAR=1.20). 
So I have activated the ceph balancer in upmap mode (on all my pools), but no 
pg_upmap_items have been created for PGs on SSD disks. 
I have 2 pools (size=3) whose primary copies are held on these 6 OSDs, with the 
following hybrid crush rule: 

rule replicated_ibo_one_copy_fast { 
id 6 
type replicated 
min_size 2 
max_size 3 
step take default class ssd 
step chooseleaf firstn -1 type host 
step emit 
step take ibo 
step chooseleaf firstn 1 type host 
step emit 
} 

"ibo" is a distant rack with hosts containing only 8TB hdd disks (for 
resilience only, no performance needed). 
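For what it's worth, a way to check whether the balancer has generated any upmap exceptions at all, and to try one by hand (a sketch; the PG id and OSD ids below are placeholders to replace with real ones):

```
# List any existing upmap exceptions in the osdmap
ceph osd dump | grep pg_upmap

# Hypothetical manual test: remap one replica of PG <pgid> from osd.31 to osd.46
ceph osd pg-upmap-items <pgid> 31 46
```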

Thanks for reading. 
David. 


- On 10 Dec 21, at 14:11, Erik Lindahl erik.lind...@gmail.com wrote:

> Hi, 
> 
> We are experimenting with using manually created crush maps to pick one SSD 
> as primary and two HDD devices. Since all our HDDs have the DB & WAL on 
> NVMe drives, this gives us a nice combination of pretty good write 
> performance, and great read performance while keeping costs manageable for 
> hundreds of TB of storage. 
> 
> We have 16 nodes with ~300 HDDs and four separate nodes with 64 7.6TB SSDs. 
> 
> However, we're noticing that the usage on the SSDs isn't very balanced at 
> all - it's ranging from 26% to 52% for some reason (The balancer is active 
> and seems to be happy). 
> 
> 
> I suspect this might have to do with the placement groups now being mixed 
> (i.e., each pg uses 1x SSD and 2x HDD). Is there anything we can do about 
> this to achieve balanced SSD usage automatically? 
> 
> I've included the crush map below, just in case we/I screwed up something 
> there instead :-) 
> 
> 
> Cheers, 
> 
> Erik 
> 
> 
> { 
> "rule_id": 11, 
> "rule_name": "1ssd_2hdd", 
> "ruleset": 11, 
> "type": 1, 
> "min_size": 1, 
> "max_size": 10, 
> "steps": [ 
> { 
> "op": "take", 
> "item": -52, 
> "item_name": "default~ssd" 
> }, 
> { 
> "op": "chooseleaf_firstn", 
> "num": 1, 
> "type": "host" 
> }, 
> { 
> "op": "emit" 
> }, 
> { 
> "op": "take", 
> "item": -24, 
> "item_name": "default~hdd" 
> }, 
> { 
> "op": "chooseleaf_firstn", 
> "num": -1, 
> "type": "host" 
> }, 
> { 
> "op": "emit" 
> } 
> ] 
> } 
> 
> -- 
> Erik Lindahl  
> Science for Life Laboratory, Box 1031, 17121 Solna, Sweden 


[ceph-users] rbd namespace create - operation not supported

2022-03-10 Thread Kai Stian Olstad

Hi

I'm trying to create a namespace in an rbd pool, but I get "operation not 
supported".

This is on a 16.2.6 Cephadm installed on Ubuntu 20.04.3.

The pool is erasure coded and the commands I ran were the following.

cephadm shell

ceph osd pool create rbd 32 32 erasure ec42-jerasure-blaum_roth-hdd 
--autoscale-mode=warn

ceph osd pool set rbd allow_ec_overwrites true
rbd pool init --pool rbd

rbd namespace create --pool rbd --namespace testspace
rbd: failed to created namespace: (95) Operation not supported
2022-03-11T06:13:30.570+ 7f4a9426e2c0 -1 librbd::api::Namespace: 
create: failed to add namespace: (95) Operation not supported



Aren't namespaces supported with erasure-coded pools?
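For what it's worth, a sketch of the layout usually used for RBD on erasure-coded pools (an assumption based on EC pools not supporting the omap data that RBD metadata and namespaces need, so only the data objects go to the EC pool via --data-pool; pool names here are hypothetical):

```
# Replicated pool for RBD metadata/omap; the existing EC pool "rbd" holds data only
ceph osd pool create rbd-meta 32 32 replicated
rbd pool init rbd-meta

# Namespaces are created in the replicated (metadata) pool
rbd namespace create --pool rbd-meta --namespace testspace

# Images then point at the EC pool for their data objects
rbd create --size 10G --pool rbd-meta --namespace testspace --data-pool rbd image1
```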


--
Kai Stian Olstad


[ceph-users] Re: empty lines in radosgw-admin bucket radoslist (octopus 15.2.16)

2022-03-10 Thread Boris Behrens
After removing some orphan objects (4 million) I pulled the radoslist again
and got the exact same files with the empty line between them.

Can filenames contain a newline / CR character, so that the radosgw-admin tool
just starts a new line in the output?

On Wed, 9 Mar 2022 at 17:50, Boris Behrens wrote:

> Hi,
> I am trying to search for orphan objects, but it looks like the
> rgw-orphan-list tooling cannot handle empty lines.
>
> So I did a radosgw-admin bucket radoslist to check where those empty lines
> come from.
> And they are just in the middle of the output. There are 80 million lines
> in the output and I have 14 empty lines.
>
> As the orphan objects tooling seems not to expect them, I wonder if
> something might be broken.
>
> Anyone ever seen this?
>
>
> root@s3db16:~/orphans/20220305# radosgw-admin bucket radoslist > radoslist
> root@s3db16:~/orphans/20220305# grep -a -B1 -A1 '^$' radoslist
> ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2071065474.2040_21/f6/d2/<part of the filename>_430x430_90-0.jpeg
>
> ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2071065474.2040_21/f6/d4/<part of the filename>_450x450_90-0.jpeg
> --
> ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2071065474.2040_2e/6f/91/<part of the filename>_430x430_90-0.jpeg
>
> ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2071065474.2040_2e/6f/96/<part of the filename>_450x450_90-0.jpeg
> --
> ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2071065474.2040_37/3b/<part of the filename>_1000x1000_90-0.jpeg
>
> ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2071065474.2040_37/3b/ef/<part of the filename>_1000x1000_90-0.jpeg
> --
> ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2071065474.2040_38/a1/fb/<part of the filename>_430x430_90-0.jpeg
>
> ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2071065474.2040_38/a1/fd/<part of the filename>_450x450_90-0.jpeg
>
>
> Cheers
>  Boris
>


-- 
The self-help group "UTF-8 problems" will meet this time, as an exception, in
the large hall.