[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-02-21 Thread Igor Fedotov

Hi Sebastian,

could you please share failing OSD startup log?


Thanks,

Igor

On 2/20/2022 5:10 PM, Sebastian Mazza wrote:

Hi Igor,

it happened again. One of the OSDs that crashed last time has a corrupted 
RocksDB again. Unfortunately I do not have debug logs from the OSDs this time. I 
was collecting hundreds of gigabytes of OSD debug logs over the last two months, 
but this week I disabled the debug logging, because I did some tests with 
rsync to CephFS and RBD images on EC pools and the logs filled up my boot 
drives multiple times.
The corruption happened after I shut down all 3 nodes and booted them again some 
minutes later.

If you are interested, I could share the normal log of the OSD, a log of a 
failed OSD start with debug logging enabled, and also the corrupted RocksDB 
export.

It may be worth noting that no crash happened after hundreds of 
reboots, but now it happened after I gracefully shut down all nodes for around 10 
minutes.
To the best of my knowledge there was no IO on the crashed OSD for several hours. The 
crashed OSD was used by only two pools. Both are EC pools. One is used as the data 
part of an RBD image and one as data storage for a subdirectory of a CephFS. All 
metadata for the CephFS and the RBD pool is stored on replicated NVMes.
One RBD image on the HDD EC pool was mounted by a VM, but not as a boot drive. The 
CephFS was also mounted by this VM and by the 3 cluster nodes themselves. Apart from 
mounting/unmounting, neither the CephFS nor the BTRFS on the RBD image was 
asked to process any IO. So nobody was reading or writing to the failed OSD 
for many hours before the shutdown of the cluster and the OSD failure happened.


I’m now thinking about how I could add more storage space for the log files on 
each node, so that I can leave debug logging on all the time.


Best regards,
Sebastian


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-mgr : ModuleNotFoundError: No module named 'requests'

2022-02-21 Thread Ernesto Puerta
Hi Florent,

Can you please check if the location where the Python Requests package is
installed is the same for Buster and Bullseye?

- https://debian.pkgs.org/10/debian-main-amd64/python3-requests_2.21.0-1_all.deb.html
- https://debian.pkgs.org/11/debian-main-amd64/python3-requests_2.25.1+dfsg-2_all.deb.html
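
For example, you could compare the install location and the interpreter used
with something like this on both releases (just a suggestion, adjust as needed):

# dpkg -L python3-requests | grep '__init__.py'
# python3 -c "import sys, requests; print(sys.executable, requests.__file__)"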


Kind Regards,
Ernesto


On Sat, Feb 19, 2022 at 1:49 PM Florent B.  wrote:

> Hi,
>
> On a fresh Debian Bullseye installation running Ceph Octopus (15.2.15),
> new mgr daemons can't start telemetry & dashboard modules because of
> missing "requests" Python module.
>
> 2022-02-19T12:31:50.884+ 7f30fdaaa040 -1 mgr[py] Traceback (most
> recent call last):
>File "/usr/share/ceph/mgr/dashboard/__init__.py", line 49, in
> 
>  from .module import Module, StandbyModule  # noqa: F401
>File "/usr/share/ceph/mgr/dashboard/module.py", line 38, in 
>  from .grafana import push_local_dashboards
>File "/usr/share/ceph/mgr/dashboard/grafana.py", line 8, in 
>  import requests
> ModuleNotFoundError: No module named 'requests'
>
> 2022-02-19T12:31:50.884+ 7f30fdaaa040 -1 mgr[py] Class not found
> in module 'dashboard'
> 2022-02-19T12:31:50.884+ 7f30fdaaa040 -1 mgr[py] Error loading
> module 'dashboard': (2) No such file or directory
> 2022-02-19T12:31:54.524+ 7f30fdaaa040 -1 mgr[py] Module not
> found: 'telemetry'
> 2022-02-19T12:31:54.524+ 7f30fdaaa040 -1 mgr[py] Traceback (most
> recent call last):
>File "/usr/share/ceph/mgr/telemetry/__init__.py", line 1, in
> 
>  from .module import Module
>File "/usr/share/ceph/mgr/telemetry/module.py", line 12, in 
>  import requests
> ModuleNotFoundError: No module named 'requests'
>
>
> But the requests module is installed:
>
> # echo "import requests; r = requests.get('https://ceph.com/en');
> print(r.status_code)"  | python
> 200
>
> # echo "import requests; r = requests.get('https://ceph.com/en');
> print(r.status_code)"  | python3
> 200
>
> # echo "import requests; r = requests.get('https://ceph.com/en');
> print(r.status_code)"  | python3.9
> 200
>
>
> What is my problem? I don't have this problem on old Buster servers running
> 15.2.14...
>
> Thanks
>
> Florent
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph EC K+M

2022-02-21 Thread Eugen Block

Hi,

it really depends on the resiliency requirements and the use case. We 
have a couple of customers with EC profiles like k=7 m=11. The 
potential waste of space, as Anthony already mentions, has to be 
considered, of course. But with regard to performance we haven't 
heard any complaints yet; the clusters I'm referring to are 
archives with no high performance requirements but rather high 
requirements regarding datacenter resiliency.


Regards,
Eugen



Quoting Anthony D'Atri:


A couple of years ago someone on the list wrote:


3) k should only have small prime factors, power of 2 if possible

I tested k=5,6,8,10,12. Best results in decreasing order: k=8, k=6. All
other choices were poor. The value of m seems not relevant for performance.
Larger k will require more failure domains (more hardware).


I suspect that K being a power of 2 aligns with sharding, for  
efficiency and perhaps even minimizing space wasted due to internal  
fragmentation.


As K increases, one sees diminishing returns in the incremental 
raw:usable ratio, so I would think that for most purposes aligning 
to a power of 2 wouldn’t have much of a downside.  Depending on your 
workload, large values could result in wasted space, analogous to 
e.g. the dynamics of tiny S3 objects vs a large min_alloc_size:


https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit#gid=358760253

I usually recommend a 2,2 profile as a safer alternative to  
replication with size=2, and 4,2 for additional space efficiency  
while blowing up the fault domain factor, rebuild overhead, etc.
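
For illustration, a 4,2 profile would be created with something like the
following (profile and pool names are just examples):

# ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
# ceph osd pool create testpool-ec 32 32 erasure ec-4-2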



ymmocv


On Feb 18, 2022, at 1:13 PM, ash...@amerrick.co.uk wrote:

I have read in a few places that it's recommended to set K to a power of 2.  
Is this still a “thing” with the latest/current Ceph versions  
(quite a few of these articles are from years ago), or is a non  
power of 2 value equally fine performance-wise as a power of 2  
now?



Thanks




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] When is the ceph.conf file evaluated?

2022-02-21 Thread Ackermann, Christoph
Dear all,

I'm on the way to migrate five CentOS 7 monitors to Rocky 8. So when is the
ceph.conf file evaluated? Only on startup of the ceph-xyz daemons or
dynamically? Is it worth generating an intermediate file containing some
old and some new ceph monitor hosts/IPs for client computers?

Thanks for any clarification

Christoph


--
Christoph Ackermann | System Engineer
INFOSERVE GmbH | Am Felsbrunnen 15 | D-66119 Saarbrücken
Fon +49 (0)681 88008-59 | Fax +49 (0)681 88008-33 | c.ackerm...@infoserve.de
| www.infoserve.de
INFOSERVE Datenschutzhinweise: www.infoserve.de/datenschutz
Handelsregister: Amtsgericht Saarbrücken, HRB 11001 | Erfüllungsort:
Saarbrücken
Geschäftsführer: Dr. Stefan Leinenbach | Ust-IdNr.: DE168970599







___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: When is the ceph.conf file evaluated?

2022-02-21 Thread Janne Johansson
On Mon, 21 Feb 2022 at 13:55, Ackermann, Christoph wrote:
>
> Dear all,
>
> I'm on the way to change five CentOs7 monitors to Rocky8. So when is the
> ceph.conf file evaluated? Only on startup of ceph-xyz daemon or
> dynamically? Is it worth generating an intermediate file containing some
> old and some new ceph monitor hosts/IPs for client computers?
>

Anytime you run a ceph command, basically. Even if you only do "ceph
status" it needs to know which mons/mgrs to talk to for status.
All mounts and operations that require any daemon to serve, or any
client to request data from the cluster, need a few basic parts of
ceph.conf, where the fsid and the list of mons are among the absolute
necessities. The rest can probably use defaults or talk to the config
DB for specifics.
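
For reference, a minimal client-side ceph.conf really only needs something
like this (the fsid and addresses below are made up):

[global]
fsid = 12345678-aaaa-bbbb-cccc-1234567890ab
mon_host = 192.168.1.11,192.168.1.12,192.168.1.13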


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Problem with Ceph daemons

2022-02-21 Thread Adam King
I'd say you probably don't need both services. It looks like they're
configured to listen on the same port (80, from the output) and are being
placed on the same hosts (c01-c06). It could be that port conflict that is
causing the rgw daemons to go into error state. Cephadm will try to place 2
daemons on each of these hosts to satisfy both rgw services specified, but if
they both try to use the same port, whichever one gets placed second could
go into error state for that reason.
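
If you conclude that rgw.obj0 is the redundant one, something along these
lines should let you inspect the specs and then remove it (adjust the service
name if needed):

# ceph orch ls rgw --export
# ceph orch rm rgw.obj0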

 - Adam King

On Fri, Feb 18, 2022 at 1:38 PM Ron Gage  wrote:

> All:
>
> I think I found the problem - hence...
>
> [root@c01 ceph]# ceph orch ls
> NAME   PORTSRUNNING  REFRESHED  AGE  PLACEMENT
> alertmanager   ?:9093,9094  1/1  2m ago 9d   count:1
> crash   6/6  2m ago 9d   *
> grafana?:3000   1/1  2m ago 9d   count:1
> mgr 2/2  2m ago 9d   count:2
> mon 5/5  2m ago 9d   count:5
> node-exporter  ?:9100   6/6  2m ago 9d   *
> osd   2  2m ago -
> 
> osd.all-available-devices16  2m ago 2d   *
> prometheus ?:9095   1/1  2m ago 9d   count:1
> rgw.obj0   ?:80 1/6  2m ago 9d
>  c01;c02;c03;c04;c05;c06;count:6
> rgw.obj01  ?:80 5/6  2m ago 5d
>  c01;c02;c03;c04;c05;c06
>
>
> To my untrained eye, it looks like rgw.obj0 is extra and unneeded.  Does
> anyone know a way to prove this out and if needed remove it?
>
> Thanks!
>
> Ron Gage
> Westland, MI
>
> -Original Message-
> From: Eugen Block 
> Sent: Thursday, February 17, 2022 2:32 AM
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: Problem with Ceph daemons
>
> Can you retry after resetting the systemd unit? The message "Start request
> repeated too quickly." should be cleared first, then start it
> again:
>
> systemctl reset-failed
> ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service
> systemctl start
> ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service
>
> Then check the logs again. If there's still nothing in the rgw log then
> you'll need to check the (active) mgr daemon logs for anything suspicious
> and also the syslog on that rgw host. Is the rest of the cluster healthy?
> Are rgw daemons colocated with other services?
>
>
> Zitat von Ron Gage :
>
> > Adam:
> >
> >
> >
> > Not really….
> >
> >
> >
> > -- Unit
> > ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service
> > has begun starting up.
> >
> > Feb 16 15:01:03 c01 podman[426007]:
> >
> > Feb 16 15:01:04 c01 bash[426007]:
> > 915d1e19fa0f213902c666371c8e825480e103f85172f3b15d1d5bf2427a87c9
> >
> > Feb 16 15:01:04 c01 conmon[426038]: debug
> > 2022-02-16T20:01:04.303+ 7f4f72ff6440  0 deferred set uid:gid to
> > 167:167 (ceph:ceph)
> >
> > Feb 16 15:01:04 c01 conmon[426038]: debug
> > 2022-02-16T20:01:04.303+ 7f4f72ff6440  0 ceph version 16.2.7
> > (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (st>
> >
> > Feb 16 15:01:04 c01 conmon[426038]: debug
> > 2022-02-16T20:01:04.303+ 7f4f72ff6440  0 framework: beast
> >
> > Feb 16 15:01:04 c01 conmon[426038]: debug
> > 2022-02-16T20:01:04.303+ 7f4f72ff6440  0 framework conf key:
> > port, val: 80
> >
> > Feb 16 15:01:04 c01 conmon[426038]: debug
> > 2022-02-16T20:01:04.303+ 7f4f72ff6440  1 radosgw_Main not setting
> > numa affinity
> >
> > Feb 16 15:01:04 c01 systemd[1]: Started Ceph rgw.obj0.c01.gpqshk for
> > 35194656-893e-11ec-85c8-005056870dae.
> >
> > -- Subject: Unit
> > ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service
> > has finished start-up
> >
> > -- Defined-By: systemd
> >
> > -- Support: https://access.redhat.com/support
> >
> > --
> >
> > -- Unit
> > ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service
> > has finished starting up.
> >
> > --
> >
> > -- The start-up result is done.
> >
> > Feb 16 15:01:04 c01 systemd[1]:
> > ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service:
> > Main process exited, code=exited, status=98/n/a
> >
> > Feb 16 15:01:05 c01 systemd[1]:
> > ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service:
> > Failed with result 'exit-code'.
> >
> > -- Subject: Unit failed
> >
> > -- Defined-By: systemd
> >
> > -- Support: https://access.redhat.com/support
> >
> > --
> >
> > -- The unit
> > ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service
> > has entered the 'failed' state with result 'exit-code'.
> >
> > Feb 16 15:01:15 c01 systemd[1]:
> > ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service:
> > Service RestartSec=10s expired, scheduling restart.
> >
> > Feb 16 15:01:15 c01 systemd[1]:
> > ceph-35194656-893e-11ec-85c8-005056870dae@rgw.obj0.c01.gpqshk.service:
> > Scheduled re

[ceph-users] Re: When is the ceph.conf file evaluated?

2022-02-21 Thread Ackermann, Christoph
Ok, I think it would be ok not to use a mixed intermediate version, since I 
actually use three old and four new monitors.

Thanks a lot
Christoph

PS: I just use a "minimal" ceph.conf with the FSID and the monitor list.



On Mon, 21 Feb 2022 at 14:03, Janne Johansson <icepic...@gmail.com> wrote:

> Den mån 21 feb. 2022 kl 13:55 skrev Ackermann, Christoph
> :
> >
> > Dear all,
> >
> > I'm on the way to change five CentOs7 monitors to Rocky8. So when is the
> > ceph.conf file evaluated? Only on startup of ceph-xyz daemon or
> > dynamically? Is it worth generating an intermediate file containing some
> > old and some new ceph monitor hosts/IPs for client computers?
> >
>
> Anytime you run a ceph-command basically. Even if you only do "ceph
> status" it needs to know which mons/mgrs to talk to for status.
> All mounts and operations that require any daemon to serve or any
> client to request data off the cluster needs a few basic parts of
> ceph.conf,
> where fsid and the list of mons are among the absolute necessities.
> The rest can probably use defaults or talk to the config DB for
> specifics.
>
>
> --
> May the most significant bit of your life be positive.
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: When is the ceph.conf file evaluated?

2022-02-21 Thread Janne Johansson
On Mon, 21 Feb 2022 at 14:17, Ackermann, Christoph wrote:
>
> Ok..  Think it would be ok not to use a mashed up version, since I use three 
> old and four new monitors actually.

I think it's ok to have all old and new mons in there. Even if one or two out
of all the mons are currently unavailable, the client will just have to
try the list one at a time until it finds a working mon. After that,
the mon DB will have info about which mons are up and which are not.

When the old are all gone, you can ship a new ceph.conf with only the
new mons in it.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-02-21 Thread Sebastian Mazza
Hi Igor,

please find the startup log under the following link: 
https://we.tl/t-E6CadpW1ZL
It also includes the “normal” log of that OSD from the day before the crash 
and the RocksDB sst file with the “Bad table magic number” (db/001922.sst).

Best regards,
Sebastian


> On 21.02.2022, at 12:18, Igor Fedotov  wrote:
> 
> Hi Sebastian,
> 
> could you please share failing OSD startup log?
> 
> 
> Thanks,
> 
> Igor
> 
> On 2/20/2022 5:10 PM, Sebastian Mazza wrote:
>> Hi Igor,
>> 
>> it happened again. One of the OSDs that crashed last time, has a corrupted 
>> RocksDB again. Unfortunately I do not have debug logs from the OSDs again. I 
>> was collecting hundreds of Gigabytes of OSD debug logs in the last two 
>> month. But this week, I disabled the debug logging, because I did some tests 
>> with rsync to cephFS and RBD Images on EC pools and the logs did fill up my 
>> boot drives multiple times.
>> The corruption happened after I did shut down all 3 nodes and booted it some 
>> minutes later.
>> 
>> If you are interested, I could share the normal log of the OSD. A log of a 
>> failed OSD start with debug logging enabled and als the corrupted RocksDB 
>> export.
>> 
>> It is may be worth taking a note that no crash did happen after hundreds of 
>> reboots but now it happens after I gracefully shut down all nodes for around 
>> 10 minutes.
>> Best to my knowledge there was no IO on the crashed OSD for several hours. 
>> The crashed OSD was used by only two pools. Both are EC pools. One is used 
>> as data part for  RBD image and on as data storage for a subdirectory of a 
>> cephFS. All metadata for the cephFS and the RBD pool are stored on 
>> replicated NVMEs.
>> On RBD image on the HDD EC pool was mounted by a VM, but not as boot drive. 
>> The cephFS was mounted also by this VM and the 3 cluster nodes itself. Apart 
>> from mounting/unmounting, neither the cephFS nor the BTRFS on the RBD image 
>> was asked to process any IOs. So nobody was reading or writing to the failed 
>> OSD for many hours before the shutdown of the cluster and OSD failing 
>> happened.
>> 
>> 
>> I’m now thinking of how I could add more storage space for the log files to 
>> each node, so that I can leave on the debug logging all the time.
>> 
>> 
>> Best regards,
>> Sebastian
> 
> -- 
> Igor Fedotov
> Ceph Lead Developer
> 
> Looking for help with your Ceph cluster? Contact us at https://croit.io
> 
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-02-21 Thread Sebastian Mazza
Hi Igor,

today (21-02-2022) at 13:49:28.452+0100, I crashed the OSD 7 again. And this 
time I have logs with “debug bluefs = 20” and “debug bdev = 20” for every OSD 
in the cluster! It was the OSD with the ID 7 again. So the HDD has now failed 
for the third time! Coincidence? Probably not…
The important thing seems to be that a shutdown, and not only a restart, of the 
entire cluster is performed, since this time the OSD failed after just 4 
shutdowns of all nodes in the cluster within 70 minutes.

I redeployed the OSD.7 after the crash from 2 days ago, and I started this new 
shutdown and boot series shortly after ceph had finished writing everything 
back to OSD.7, earlier today. 

The corrupted RocksDB file (crash) is again only 2KB in size.
You can download the RocksDB file with the bad table magic number and the log 
of the OSD.7 under this link: https://we.tl/t-e0NqjpSmaQ
What else do you need?

From the log of the OSD.7:
—
2022-02-21T13:47:39.945+0100 7f6fa3f91700 20 bdev(0x55f088a27400 
/var/lib/ceph/osd/ceph-7/block) _aio_log_finish 1 0x96d000~1000
2022-02-21T13:47:39.945+0100 7f6fa3f91700 10 bdev(0x55f088a27400 
/var/lib/ceph/osd/ceph-7/block) _aio_thread finished aio 0x55f0b8c7c910 r 4096 
ioc 0x55f0b8dbdd18 with 0 aios left
2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 received  signal: Terminated from 
/sbin/init  (PID: 1) UID: 0
2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Got signal 
Terminated ***
2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Immediate shutdown 
(osd_fast_shutdown=true) ***
2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 set uid:gid to 64045:64045 
(ceph:ceph)
2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 ceph version 16.2.6 
(1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable), process ceph-osd, 
pid 1967
2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 pidfile_write: ignore empty 
--pid-file
2022-02-21T13:53:40.459+0100 7fc9645f4f00  1 bdev(0x55bd400a0800 
/var/lib/ceph/osd/ceph-7/block) open path /var/lib/ceph/osd/ceph-7/block
—

To me this looks like the OSD did nothing for nearly 2 minutes before it 
received the termination request. Shouldn't this be enough time to flush 
every imaginable write cache?


I hope this helps you.


Best wishes,
Sebastian

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph EC K+M

2022-02-21 Thread Eugen Block
The customer's requirement was to sustain the loss of one of two 
datacenters plus two additional hosts. The crush failure domain is 
"host". There are 10 hosts in each DC, so we put 9 chunks in each DC 
to be able to recover completely if one host fails. This has already 
worked out nicely: they had a power outage in one DC and were very happy 
after the cluster recovered.
I don't know the details of the decision process anymore; we 
discussed a few options and decided this would fit best with regard to 
resiliency, storage overhead and the number of chunks.



Zitat von "Szabo, Istvan (Agoda)" :


What’s the aim to have soo big m number?
How many servers are in this cluster?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2022. Feb 21., at 19:20, Eugen Block  wrote:




Hi,

it really depends on the resiliency requirements and the use case. We
have a couple of customers with EC profiles like k=7 m=11. The
potential waste of space as Anthony already mentions has to be
considered, of course. But with regards to performance we haven't
heard any complaints yet, but those clusters I'm referring to are
archives with no high performance requirements but rather high
requiremnts regarding datacenter resiliency.

Regards,
Eugen



Zitat von Anthony D'Atri :

A couple of years ago someone suggested on the list wrote:

3) k should only have small prime factors, power of 2 if possible

I tested k=5,6,8,10,12. Best results in decreasing order: k=8, k=6. All
other choices were poor. The value of m seems not relevant for performance.
Larger k will require more failure domains (more hardware).

I suspect that K being a power of 2 aligns with sharding, for
efficiency and perhaps even minimizing space wasted due to internal
fragmentation.

As K increases, one sees diminishing returns for incremental
raw:usable ratio, so I would think that for most purposes aligning
to a power of 2 wouldn’t have much of a downside.  Depending on your
workload, large values could result in wasted space, analagous to
eg. the dynamics of tiny S3 objects vs a large min_alloc_size  :

https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit#gid=358760253

I usually recommend a 2,2 profile as a safer alternative to
replication with size=2, and 4,2 for additional space efficiency
while blowing up the fault domain factor, rebuild overhead, etc.


ymmocv

On Feb 18, 2022, at 1:13 PM, ash...@amerrick.co.uk wrote:

I have read a few places its recommended to set K to a power of 2,
is this still a “thing” with the latest/current CEPH Versions
(quite a few of this articles are from years ago), or is a non
power of 2 value equally as fine performance wise as a power of 2
now.


Thanks




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph os filesystem in read only

2022-02-21 Thread Marc


I have a ceph node whose OS filesystem goes into read-only for whatever 
reason [1]. 

1. How long will ceph continue to run before it starts complaining about this?
It looks like it is fine for a few hours; ceph osd tree and ceph -s seem not to 
notice anything.

2. This is still Nautilus with a majority of ceph-disk and maybe some ceph-volume 
disks.
What would be a good procedure to try and recover data from this drive to use 
on a new OS disk?



[1]
Feb 21 14:41:30 kernel: XFS (dm-0): writeback error on sector 11610872
Feb 21 14:41:30 systemd: ceph-mon@c.service failed.
Feb 21 14:41:31 kernel: XFS (dm-0): metadata I/O error: block 0x2ee001 
("xfs_buf_iodone_callback_error") error 121 numblks 1
Feb 21 14:41:31 kernel: XFS (dm-0): metadata I/O error: block 0x5dd5cd 
("xlog_iodone") error 121 numblks 64
Feb 21 14:41:31 kernel: XFS (dm-0): Log I/O Error Detected. Shutting down 
filesystem
Feb 21 14:41:31 kernel: XFS (dm-0): Please umount the filesystem and rectify 
the problem(s)


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-02-21 Thread Igor Fedotov

Hey Sebastian,

thanks a lot for the new logs - it looks like they provide some insight. 
At this point I think the root cause is apparently a race between 
deferred write replay and some DB maintenance task happening on OSD 
startup. It seems that deferred write replay updates a block extent 
which RocksDB/BlueFS are using. Hence the target BlueFS file gets 
all-zeros content. Evidently it's just a matter of chance whether they 
use a conflicting physical extent or not, hence the occasional nature of 
the issue...


And now I'd like to determine what's wrong with this deferred write replay.

So first of all I'm curious whether you have any particular write patterns 
that could be culprits. E.g. something like a disk wiping procedure which 
writes all-zeros to an object followed by an object truncate or removal 
comes to mind. If you can identify something like that - could you 
please collect an OSD log for such an operation (followed by an OSD restart) 
with debug-bluestore set to 20?
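
In case it helps, the debug level can be raised temporarily with something
like this (osd.7 taken as an example):

# ceph config set osd.7 debug_bluestore 20/20
# ... reproduce the write pattern, restart the OSD, collect the log ...
# ceph config set osd.7 debug_bluestore 1/5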



Thanks,

Igor

On 2/21/2022 5:29 PM, Sebastian Mazza wrote:

Hi Igor,

today (21-02-2022) at 13:49:28.452+0100, I crashed the OSD 7 again. And this time I 
have logs with “debug bluefs = 20” and "debug bdev = 20” for every OSD in the 
cluster! It was the OSD with the ID 7 again. So the HDD has failed now the third 
time! Coincidence? Probably not…
The important thing seams to be that a shutdown and not only a restart of the 
entire cluster is performed. Since, this time the OSD failed after just 4 
shutdowns of all nodes in the cluster within 70 minutes.

I redeployed the OSD.7 after the crash from 2 days ago. And I started this new 
shutdown and boot series shortly after ceph had finished writing everything 
back to OSD.7, earlier today.

The corrupted RocksDB file (crash) is again only 2KB in size.
You can download the RocksDB file with the bad  table magic number and the log 
of the OSD.7 under this link: https://we.tl/t-e0NqjpSmaQ
What else do you want?

 From the log of the OSD.7:
—
2022-02-21T13:47:39.945+0100 7f6fa3f91700 20 bdev(0x55f088a27400 
/var/lib/ceph/osd/ceph-7/block) _aio_log_finish 1 0x96d000~1000
2022-02-21T13:47:39.945+0100 7f6fa3f91700 10 bdev(0x55f088a27400 
/var/lib/ceph/osd/ceph-7/block) _aio_thread finished aio 0x55f0b8c7c910 r 4096 
ioc 0x55f0b8dbdd18 with 0 aios left
2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 received  signal: Terminated from 
/sbin/init  (PID: 1) UID: 0
2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Got signal 
Terminated ***
2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Immediate shutdown 
(osd_fast_shutdown=true) ***
2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 set uid:gid to 64045:64045 
(ceph:ceph)
2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 ceph version 16.2.6 
(1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable), process ceph-osd, 
pid 1967
2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 pidfile_write: ignore empty 
--pid-file
2022-02-21T13:53:40.459+0100 7fc9645f4f00  1 bdev(0x55bd400a0800 
/var/lib/ceph/osd/ceph-7/block) open path /var/lib/ceph/osd/ceph-7/block
—

For me this looks like that the OSD did nothing for nearly 2 minutes before it 
receives the termination request. Shouldn't this be enough time for flushing 
every imaginable write cache?


I hope this helps you.


Best wishes,
Sebastian


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph os filesystem in read only - mgr bug

2022-02-21 Thread Marc


What is interesting about this situation is: 


Feb 21 21:44:46 ceph-mgr: 2022-02-21 20:44:46.913 7f70896f9700 -1 
os_release_parse - failed to open /etc/os-release: (5) Input/output error
Feb 21 21:44:46 ceph-mgr: 2022-02-21 20:44:46.913 7f70896f9700 -1 distro_detect 
- /etc/os-release is required
Feb 21 21:44:46 ceph-mgr: 2022-02-21 20:44:46.913 7f70896f9700 -1 
os_release_parse - failed to open /etc/os-release: (5) Input/output error
Feb 21 21:44:46 ceph-mgr: 2022-02-21 20:44:46.913 7f70896f9700 -1 distro_detect 
- /etc/os-release is required
Feb 21 21:44:46 ceph-mgr: 2022-02-21 20:44:46.913 7f70896f9700 -1 distro_detect 
- can't detect distro
Feb 21 21:44:46 ceph-mgr: 2022-02-21 20:44:46.913 7f70896f9700 -1 distro_detect 
- can't detect distro_description
Feb 21 21:44:46 ceph-mgr: 2022-02-21 20:44:46.913 7f70896f9700 -1 distro_detect 
- can't detect distro
Feb 21 21:44:46 ceph-mgr: 2022-02-21 20:44:46.913 7f70896f9700 -1 distro_detect 
- can't detect distro_description

Why does this mgr even try to open this? It looks like this mgr is ok, so why 
does it want to open this file in a loop? Why not just once at startup? I do 
not think the distro file will change that much. This can only make things worse.




[@]# ceph -s
  cluster:
id: 
health: HEALTH_WARN
1 clients failing to respond to cache pressure
noout,noscrub,nodeep-scrub flag(s) set
2 pool(s) have no replicas configured
1/3 mons down, quorum a,b

  services:
mon: 3 daemons, quorum a,b (age 7h), out of quorum: c
mgr: a(active, since 10w), standbys: b, c
mds: cephfs:1 {0=a=up:active} 1 up:standby
osd: 42 osds: 42 up (since 3w), 42 in (since 4M)
 flags noout,noscrub,nodeep-scrub
rgw: 1 daemon active (rgw1)
rgw-nfs: 1 daemon active (rgwnfs1)
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Reducing ceph cluster size in half

2022-02-21 Thread Jason Borden

Hi all,

I'm looking for some advice on reducing my ceph cluster to half its size. I 
currently have 40 hosts and 160 osds on a cephadm managed Pacific 
cluster. The storage space is only 12% utilized. I want to reduce the 
cluster to 20 hosts and 80 osds while keeping the cluster operational. 
I'd prefer to do this in as few operations as possible instead of 
draining one host at a time and having to rebalance PGs 20 times. I 
think I should probably halve the number of PGs at the same time too. 
Does anyone have any advice on how I can safely achieve this?


Thanks,
Jason
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Reducing ceph cluster size in half

2022-02-21 Thread Matt Vandermeulen
This might be easiest to think about in two steps:  draining hosts, and 
doing a PG merge.  You can do them in either order (though thinking about 
it, doing the merge first will give you more cluster-wide resources to 
do it faster).


Draining the hosts can be done in a few ways, too.  If you want to do it 
in one shot, you can set nobackfill, then set the crush/reweights for 
the OSDs to zero, let the peering storm settle, and unset nobackfill.  
This is probably the easiest option if a brief peering storm and 
backfill_wait isn't a concern.
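
Roughly, that one-shot approach would look like this (the OSD IDs are just
placeholders for the ones being removed):

# ceph osd set nobackfill
# for id in $(seq 80 159); do ceph osd crush reweight osd.$id 0; done
# ... let peering settle ...
# ceph osd unset nobackfill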


If you want to reduce backfill_wait PGs, you can use something like 
`pgremapper drain`, but this will likely involve multiple data 
movements:  The initial drain is fine, but the CRUSH removal of hosts 
will cause the upmaps to be lost (which can be dealt with via `pgremapper 
cancel-backfill`).  Additional data movement will be needed if you 
want to `pgremapper undo-upmaps` to clean up what was canceled (or if 
you use the balancer and it wants to move things).



On 2022-02-21 17:58, Jason Borden wrote:

Hi all,

I'm looking for some advice on reducing my ceph cluster in half. I
currently have 40 hosts and 160 osds on a cephadm managed pacific
cluster. The storage space is only 12% utilized. I want to reduce the
cluster to 20 hosts and 80 osds while keeping the cluster operational.
I'd prefer to do this in as few operations as possible instead of
draining each host at a time and having to rebalance pgs 20 times. I
think I should probably half the number of pgs at the same time too.
Does anyone have any advice on how I can safely achieve this?

Thanks,
Jason

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Reducing ceph cluster size in half

2022-02-21 Thread Etienne Menguy
Hi,

There are different ways, but I would: 

- Change the crush weight (and not reweight) of the OSDs I want to remove to 0
- Wait for cluster health
- Stop the OSDs I want to remove
- If the data is ok, remove the OSDs from the crushmap. There is no reason 
stopping the OSDs should impact your service, as they have no data; it's just 
a safety check.

- Then decrease PGs if needed (see the rough sketch below).
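
A rough sketch of those steps for a single OSD on a cephadm cluster (the OSD
ID, pool name and pg_num are just placeholders; purge also removes the OSD
from the crushmap):

# ceph osd crush reweight osd.123 0
# ceph orch daemon stop osd.123
# ceph osd purge 123 --yes-i-really-mean-it
# ceph osd pool set mypool pg_num 512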

-
Etienne Menguy
etienne.men...@croit.io




> On 21 Feb 2022, at 22:58, Jason Borden  wrote:
> 
> Hi all,
> 
> I'm looking for some advice on reducing my ceph cluster in half. I currently 
> have 40 hosts and 160 osds on a cephadm managed pacific cluster. The storage 
> space is only 12% utilized. I want to reduce the cluster to 20 hosts and 80 
> osds while keeping the cluster operational. I'd prefer to do this in as few 
> operations as possible instead of draining each host at a time and having to 
> rebalance pgs 20 times. I think I should probably half the number of pgs at 
> the same time too. Does anyone have any advice on how I can safely achieve 
> this?
> 
> Thanks,
> Jason
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-02-21 Thread Sebastian Mazza
Hey Igor!


> thanks a lot for the new logs - looks like they provides some insight. 

I'm glad the logs are helpful.


> At this point I think the root cause is apparently a race between deferred 
> writes replay and some DB maintenance task happening on OSD startup. It seems 
> that deferred write replay updates a block extent which RocksDB/BlueFS are 
> using. Hence the target BlueFS file gets all-zeros content. Evidently that's 
> just a matter of chance whether they use conflicting physical extent or not 
> hence the occasional nature of the issue...


Do I understand that correctly: the corruption of the RocksDB (table overwritten 
by zeros) happens at the first start of the OSD after the “*** Immediate shutdown 
(osd_fast_shutdown=true) ***”? Before the system launches the OSD service, the 
RocksDB is still fine?


> So first of all I'm curious if you have any particular write patterns that 
> can be culprits? E.g. something like disk wiping procedure which writes 
> all-zeros to an object followed by object truncate or removal comes to my 
> mind. If you can identify something like that - could you please collect OSD 
> log for such an operation (followed by OSD restart) with debug-bluestore set 
> to 20?

To the best of my knowledge the OSD was hardly doing anything, and I do not see any 
pattern that would fit your explanation. 
However, you certainly understand a lot more about this than I do, so I will try to 
explain everything that could be relevant.

The cluster has 3 nodes. Each has a 240GB NVMe M.2 SSD as boot drive, which 
should not be relevant. Each node has 3 OSDs: one is on a 2TB U.2 NVMe SSD 
and the other two are on 12TB HDDs. 

I have configured two crush rules, ‘c3nvme’ and ‘ec4x2hdd’. The ‘c3nvme’ rule is a 
replicated rule that uses only OSDs with class ’nvme’. The second rule is a 
tricky erasure rule. It selects exactly 2 OSDs on each of exactly 4 hosts with class 
‘hdd’, so it only works for a size of exactly 8. That means that a pool that 
uses this rule always has only “undersized” placement groups, since the cluster 
has only 3 nodes. (I did not add the fourth server after the first crash in 
December, since we want to reproduce the problem.)
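
For illustration, such a rule looks roughly like this in the decompiled
crushmap (this is only a sketch from memory, not the exact rule):

rule ec4x2hdd {
        id 2
        type erasure
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class hdd
        step choose indep 4 type host
        step choose indep 2 type osd
        step emit
}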

The pools device_health_metrics, test-pool, fs.metadata-root-pool, 
fs.data-root-pool, fs.data-nvme.c-pool, and block-nvme.c-pool use the crush 
rule ‘c3nvme’ with a size of 3 and a min size of 2. The pools 
fs.data-hdd.ec-pool and block-hdd.ec-pool use the crush rule ‘ec4x2hdd’ with 
k=5, m=3 and a min size of 6.

The pool fs.data-nvme.c-pool is not used, and the pool test-pool was used for 
rados bench a few months ago.

The pool fs.metadata-root-pool is used as the metadata pool for the CephFS and 
fs.data-root-pool as the root data pool for the CephFS. The pool 
fs.data-hdd.ec-pool is an additional data pool for the CephFS and is specified 
via ceph.dir.layout for some folders of the CephFS. The whole CephFS is mounted 
by each of the 3 nodes.
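
The layout for those folders was set in the usual way with setfattr, i.e.
something like this (the path is just an example):

# setfattr -n ceph.dir.layout.pool -v fs.data-hdd.ec-pool /mnt/cephfs/some-folder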

The pool block-nvme.c-pool hosts two RBD images that are used as boot drives 
for two VMs. The first VM runs Ubuntu Desktop and the second Debian 
as its OS. The pool block-hdd.ec-pool hosts one RBD image (the data part, metadata 
on block-nvme.c-pool) that is attached to the Debian VM as a second drive 
formatted with BTRFS. Furthermore, the Debian VM mounts a subdirectory of the 
CephFS that has the fs.data-hdd.ec-pool set as its layout. Both VMs were doing 
nothing, except being booted, in the last couple of days.

I try to illustrate the pool usage as a tree:
* c3nvme (replicated, size=3, min_size=2)
+ device_health_metrics
+ test-pool
- rados bench
+ fs.metadata-root-pool
- CephFS (metadata)
+ fs.data-root-pool
- CephFS (data root)
+ fs.data-nvme.c-pool
+ block-nvme.c-pool
- RBD (Ubuntu VM, boot disk with ext4)
- RBD (Debian VM, boot disk with ext4)
* ec4x2hdd (ec,  k=5, m=3, min_size=6)
+ fs.data-hdd.ec-pool
- CephFS (layout data)
+ block-hdd.ec-pool
- RBD (Debian VM, disk 2 with BTRFS)


Last week I was writing about 30TB to the CephFS inside the fs.data-hdd.ec-pool 
and around 50GB to the BTRFS volume on the block-hdd.ec-pool. The 50GB written 
to BTRFS contains hundreds of thousands of hard links. There was already nearly the 
same data on the storage and I used rsync to update it. I think something 
between 1 and 4 TB had changed and was updated by rsync.

I think the cluster was totally unused on Friday, but up and running and idling 
around. Then on Saturday I did a graceful shutdown of all cluster nodes. 
Around 5 minutes later, when I booted the servers again, the OSD.7 crashed. I 
copied the logs and exported the RocksDB. Then I deleted everything from the 
HDD and deployed the OSD.7 again. When I checked for the first time today at 
around 12:00, ceph had already finished backfilling to OSD.7 and the 
cluster was idle again. 

I then spent 70 minutes writing 3 small files (one of about 500 bytes and 
two of about