[ceph-users] Re: ceph-volume claiming wrong device

2022-11-01 Thread Eugen Block
As I said, I would recommend really wiping the OSDs clean
(ceph-volume lvm zap --destroy /dev/sdX) and maybe rebooting (on VMs it was
sometimes necessary during my tests if I had too many failed
attempts). Then also check that you don't have any leftovers in the
filesystem (under /var/lib/ceph) so you start from a clean state.
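
For reference, a minimal shell sketch of that cleanup on one node (the device
names are just the examples from this thread, adjust to your setup):

```
# wipe the previously used devices completely, including their LVM metadata
for dev in sdb sdc sdd sde; do
    ceph-volume lvm zap --destroy "/dev/$dev"
done
# look for leftovers from earlier failed attempts before starting over
ls /var/lib/ceph/osd/
systemctl list-units --all 'ceph-osd@*' 'ceph-volume@*'
```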


Zitat von Oleksiy Stashok :


Hey Eugen,

valid points. I first tried to provision OSDs via ceph-ansible (later
excluded), which does run the batch command with all 4 disk devices, but it
often failed with the same issue I mentioned earlier, something like:
```
bluefs _replay 0x0: stop: uuid e2f72ec9-2747-82d7-c7f8-41b7b6d41e1b !=
super.uuid 0110ddb3-d4bf-4c1e-be11-654598c71db0
```
That's why I abandoned that idea and tried to provision OSDs manually, one
by one.
As I mentioned, I used ceph-ansible, not cephadm, for legacy reasons, but I
suspect the problem I'm seeing is in ceph-volume itself, so cephadm probably
wouldn't change anything.

I did more investigation into the 1-by-1 OSD creation flow, and it seems that
the fact that `ceph-volume lvm list` shows two devices belonging to the same
OSD can be explained by the following sequence:

1. ceph-volume lvm create --bluestore --dmcrypt --data /dev/sdd
2. trying to create osd.2
3. fails with uuid != super.uuid issue
4. ceph-volume lvm list returns /dev/sdd as belonging to osd.2 (even though it
failed)
5. ceph-volume lvm create --bluestore --dmcrypt --data /dev/sde
6. trying to create osd.2 (*again*)
7. succeeds
8. ceph-volume lvm list returns both /dev/sdd and /dev/sde belonging to
osd.2

osd.2 is reported to be up and running.
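
One way to see that stale claim on disk is to look at the LVM tags that
ceph-volume sets on the logical volumes; a quick sketch (the tag names below
are the ones ceph-volume normally writes, verify on your version):

```
# show which logical volumes claim to belong to osd.2
lvs -o lv_name,vg_name,lv_tags | grep 'ceph.osd_id=2'
# the same mapping as ceph-volume itself reports it
ceph-volume lvm list --format json
```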

Any idea why this is happening?
Thank you!
Oleksiy

On Thu, Oct 27, 2022 at 12:11 AM Eugen Block  wrote:


Hi,

first of all, if you really need to issue ceph-volume manually,
there's a batch command:

cephadm ceph-volume lvm batch /dev/sdb /dev/sdc /dev/sdd /dev/sde

Second, are you using cephadm? Maybe your manual intervention
conflicts with the automatic osd setup (all available devices). You
could look into /var/log/ceph/cephadm.log on each node and see if
cephadm already tried to setup the OSDs for you. What does 'ceph orch
ls' show?
Did you end up having online OSDs or did it fail? In that case I would
purge all OSDs from the crushmap, then wipe all devices (ceph-volume
lvm zap --destroy /dev/sdX) and either let cephadm create the OSDs for
you or you disable that (unmanaged=true) and run the manual steps
again (although it's not really necessary).
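
A sketch of that clean-up path, assuming the OSD ids created so far are 0-3 and
the device names from above:

```
# remove the (possibly half-created) OSDs from the cluster
for id in 0 1 2 3; do
    ceph osd purge "$id" --yes-i-really-mean-it
done
# wipe the devices, including LVM metadata
for dev in sdb sdc sdd sde; do
    ceph-volume lvm zap --destroy "/dev/$dev"
done
# with cephadm: either let the orchestrator recreate the OSDs, or disable that first
ceph orch apply osd --all-available-devices --unmanaged=true
```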

Regards,
Eugen

Zitat von Oleksiy Stashok :

> Hey guys,
>
> I ran into a weird issue, hope you can explain what I'm observing. I'm
> testing* Ceph 16.2.10* on *Ubuntu 20.04* in *Google Cloud VMs*, I
created 3
> instances and attached 4 persistent SSD disks to each instance. I can see
> these disks attached as `/dev/sdb, /dev/sdc, /dev/sdd, /dev/sde` devices.
>
> As a next step I used ceph-ansible to bootstrap the ceph cluster on 3
> instances, however I intentionally skipped OSD setup. So I ended up with
a
> Ceph cluster w/o any OSD.
>
> I ssh'ed into each VM and ran:
>
> ```
>   sudo -s
>   for dev in sdb sdc sdd sde; do
> /usr/sbin/ceph-volume --cluster ceph lvm create --bluestore
> --dmcrypt --data "/dev/$dev"
>   done
> ```
>
> The operation above randomly fails on random instances/devices with
> something like:
> ```
> bluefs _replay 0x0: stop: uuid e2f72ec9-2747-82d7-c7f8-41b7b6d41e1b !=
> super.uuid 0110ddb3-d4bf-4c1e-be11-654598c71db0
> ```
>
> The interesting thing is that when I do
> ```
> /usr/sbin/ceph-volume lvm ls
> ```
>
> I can see that the device for which OSD creation failed actually belongs
to
> a different OSD that was previously created for a different device. For
> example the failure I mentioned above happened on the `/dev/sde` device,
so
> when I list lvms I see this:
> ```
> == osd.2 ===
>
>   [block]
>
/dev/ceph-103a4373-dbe0-43d6-a9e0-34db4e1b257c/osd-block-9af542ba-fd65-4355-ad17-7293856acaeb
>
>   block device
>
>
/dev/ceph-103a4373-dbe0-43d6-a9e0-34db4e1b257c/osd-block-9af542ba-fd65-4355-ad17-7293856acaeb
>   block uuidFfFnLt-h33F-F73V-tY45-VuZM-scj7-C3dg1K
>   cephx lockbox secret  AQAlelljqNPoMhAA59JwN3wGt0d6Si+nsnxsRQ==
>   cluster fsid  348fff8e-e850-4774-9694-05d5414b1c53
>   cluster name  ceph
>   crush device class
>   encrypted 1
>   osd fsid  9af542ba-fd65-4355-ad17-7293856acaeb
>   osd id2
>   osdspec affinity
>   type  block
>   vdo   0
>   devices   /dev/sdd
>
>   [block]
>
/dev/ceph-df14969f-2dfb-45f1-a579-a8e23ec12e33/osd-block-4686f6fc-8dc1-48fd-a2d9-70a281c8ee64
>
>   block device
>
>
/dev/ceph-df14969f-2dfb-45f1-a579-a8e23ec12e33/osd-block-4686f6fc-8dc1-48fd-a2d9-70a281c8ee64
>   block uuidGEajK3-Tsyf-XZS9-E5ik-M1BB-VIpb-q7D1ET
>   cephx lockbox secret  AQAwell

[ceph-users] OSDs are not utilized evenly

2022-11-01 Thread Denis Polom

Hi

I observed on my Ceph cluster running the latest Pacific that OSDs of the same
size are utilized differently, even though the balancer is running and reports
the distribution as perfectly balanced.


{
    "active": true,
    "last_optimize_duration": "0:00:00.622467",
    "last_optimize_started": "Tue Nov  1 12:49:36 2022",
    "mode": "upmap",
    "optimize_result": "Unable to find further optimization, or pool(s) 
pg_num is decreasing, or distribution is already perfect",

    "plans": []
}

balancer settings for upmap are:

  mgr   advanced 
mgr/balancer/mode   upmap

  mgr   advanced mgr/balancer/upmap_max_deviation    1
  mgr   advanced mgr/balancer/upmap_max_optimizations    20

It's obvious from the output of `ceph osd df` that utilization is not the same
(the difference is about 1 TB). The following is just a partial output:


ID   CLASS  WEIGHT    REWEIGHT  SIZE    RAW USE  DATA    OMAP     META    AVAIL    %USE   VAR   PGS  STATUS
  0    hdd  18.00020   1.0      16 TiB  13 TiB   13 TiB  3.0 MiB  37 GiB  3.6 TiB  78.09  1.05  196  up
124    hdd  18.00020   1.0      16 TiB  12 TiB   12 TiB  0 B      32 GiB  4.7 TiB  71.20  0.96  195  up
157    hdd  18.00020   1.0      16 TiB  13 TiB   13 TiB  5.3 MiB  35 GiB  3.7 TiB  77.67  1.05  195  up
  1    hdd  18.00020   1.0      16 TiB  13 TiB   13 TiB  2.0 MiB  35 GiB  3.7 TiB  77.69  1.05  195  up
243    hdd  18.00020   1.0      16 TiB  12 TiB   12 TiB  0 B      31 GiB  4.7 TiB  71.16  0.96  195  up
244    hdd  18.00020   1.0      16 TiB  12 TiB   12 TiB  0 B      31 GiB  4.7 TiB  71.19  0.96  195  up
245    hdd  18.00020   1.0      16 TiB  12 TiB   12 TiB  0 B      32 GiB  4.7 TiB  71.55  0.96  196  up
246    hdd  18.00020   1.0      16 TiB  12 TiB   12 TiB  0 B      31 GiB  4.7 TiB  71.17  0.96  195  up
249    hdd  18.00020   1.0      16 TiB  12 TiB   12 TiB  0 B      30 GiB  4.7 TiB  71.18  0.96  195  up
500    hdd  18.00020   1.0      16 TiB  12 TiB   12 TiB  0 B      30 GiB  4.7 TiB  71.19  0.96  195  up
501    hdd  18.00020   1.0      16 TiB  12 TiB   12 TiB  0 B      31 GiB  4.7 TiB  71.57  0.96  196  up
502    hdd  18.00020   1.0      16 TiB  12 TiB   12 TiB  0 B      31 GiB  4.7 TiB  71.18  0.96  195  up
532    hdd  18.00020   1.0      16 TiB  12 TiB   12 TiB  0 B      31 GiB  4.7 TiB  71.16  0.96  195  up
549    hdd  18.00020   1.0      16 TiB  13 TiB   13 TiB  576 KiB  36 GiB  3.7 TiB  77.70  1.05  195  up
550    hdd  18.00020   1.0      16 TiB  13 TiB   13 TiB  3.8 MiB  36 GiB  3.7 TiB  77.67  1.05  195  up
551    hdd  18.00020   1.0      16 TiB  13 TiB   13 TiB  2.4 MiB  35 GiB  3.7 TiB  77.68  1.05  195  up
552    hdd  18.00020   1.0      16 TiB  13 TiB   13 TiB  5.5 MiB  35 GiB  3.7 TiB  77.69  1.05  195  up
553    hdd  18.00020   1.0      16 TiB  13 TiB   13 TiB  5.1 MiB  37 GiB  3.6 TiB  77.71  1.05  195  up
554    hdd  18.00020   1.0      16 TiB  13 TiB   13 TiB  967 KiB  36 GiB  3.6 TiB  77.71  1.05  195  up
555    hdd  18.00020   1.0      16 TiB  13 TiB   13 TiB  1.3 MiB  36 GiB  3.6 TiB  78.08  1.05  196  up
556    hdd  18.00020   1.0      16 TiB  13 TiB   13 TiB  4.7 MiB  36 GiB  3.6 TiB  78.10  1.05  196  up
557    hdd  18.00020   1.0      16 TiB  13 TiB   13 TiB  2.4 MiB  36 GiB  3.7 TiB  77.69  1.05  195  up
558    hdd  18.00020   1.0      16 TiB  13 TiB   13 TiB  4.5 MiB  36 GiB  3.6 TiB  77.72  1.05  195  up
559    hdd  18.00020   1.0      16 TiB  13 TiB   13 TiB  1.5 MiB  35 GiB  3.6 TiB  78.09  1.05  196  up
560    hdd  18.00020   1.0      16 TiB  13 TiB   13 TiB  5.2 MiB  35 GiB  3.7 TiB  77.69  1.05  195  up
561    hdd  18.00020   1.0      16 TiB  13 TiB   13 TiB  2.8 MiB  35 GiB  3.7 TiB  77.69  1.05  195  up
562    hdd  18.00020   1.0      16 TiB  13 TiB   13 TiB  1.0 MiB  36 GiB  3.7 TiB  77.68  1.05  195  up
563    hdd  18.00020   1.0      16 TiB  13 TiB   13 TiB  2.6 MiB  36 GiB  3.7 TiB  77.68  1.05  195  up
564    hdd  18.00020   1.0      16 TiB  13 TiB   13 TiB  5.1 MiB  36 GiB  3.6 TiB  78.09  1.05  196  up
567    hdd  18.00020   1.0      16 TiB  13 TiB   13 TiB  4.8 MiB  36 GiB  3.6 TiB  78.11  1.05  196  up
568    hdd  18.00020   1.0      16 TiB  13 TiB   13 TiB  5.2 MiB  35 GiB  3.7 TiB  77.68  1.05  195  up


All OSDs are used by the same pool (EC)

I have the same issue on another Ceph cluster with the same setup, where I
was able to even out OSD utilization by lowering the reweight from 1.0 on
the OSDs with higher utilization, and I got a lot of free space:


before changing reweight:

--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    3.1 PiB  510 TiB  2.6 PiB  2.6 PiB   83.77
ssd    2.6 TiB  2.6 TiB  46 GiB   46 GiB    1.70
TOTAL  3.1 PiB  513 TiB  2.6 Pi

[ceph-users] Re: OSDs are not utilized evenly

2022-11-01 Thread Joseph Mundackal
If the number of GB per PG is high, the balancer module won't be able to help.

Your PG count per OSD also looks low (in the 30s), so increasing the PG count
per pool would help with both problems.

You can use the PG calculator to determine which pools need what.
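
As a rough illustration of what the calculator does (the numbers here are made
up, not taken from the cluster above): with 600 OSDs, a target of roughly 100
PGs per OSD and a single EC pool with k+m = 6, you would aim for about
600 * 100 / 6 = 10000 PGs, rounded to a power of two (8192 or 16384). More PGs
per OSD means fewer GB per PG, which gives the upmap balancer finer granularity
to work with.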

On Tue, Nov 1, 2022, 08:46 Denis Polom  wrote:

> Hi
>
> I observed on my Ceph cluster running latest Pacific that same size OSDs
> are utilized differently even if balancer is running and reports status
> as perfectly balanced.
>
> {
>  "active": true,
>  "last_optimize_duration": "0:00:00.622467",
>  "last_optimize_started": "Tue Nov  1 12:49:36 2022",
>  "mode": "upmap",
>  "optimize_result": "Unable to find further optimization, or pool(s)
> pg_num is decreasing, or distribution is already perfect",
>  "plans": []
> }
>
> balancer settings for upmap are:
>
>mgr   advanced
> mgr/balancer/mode   upmap
>mgr   advanced mgr/balancer/upmap_max_deviation1
>mgr   advanced mgr/balancer/upmap_max_optimizations
> 20
>
> It's obvious that utilization is not same (difference is about 1TB) from
> command `ceph osd df`. Following is just a partial output:
>
> ID   CLASS  WEIGHTREWEIGHT  SIZE RAW USE  DATA OMAP
> META AVAIL%USE   VAR   PGS  STATUS
>0hdd  18.00020   1.0   16 TiB   13 TiB   13 TiB   3.0 MiB
> 37 GiB  3.6 TiB  78.09  1.05  196  up
> 124hdd  18.00020   1.0   16 TiB   12 TiB   12 TiB   0 B   32
> GiB  4.7 TiB  71.20  0.96  195  up
> 157hdd  18.00020   1.0   16 TiB   13 TiB   13 TiB   5.3 MiB   35
> GiB  3.7 TiB  77.67  1.05  195  up
>1hdd  18.00020   1.0   16 TiB   13 TiB   13 TiB   2.0 MiB
> 35 GiB  3.7 TiB  77.69  1.05  195  up
> 243hdd  18.00020   1.0   16 TiB   12 TiB   12 TiB   0 B   31
> GiB  4.7 TiB  71.16  0.96  195  up
> 244hdd  18.00020   1.0   16 TiB   12 TiB   12 TiB   0 B   31
> GiB  4.7 TiB  71.19  0.96  195  up
> 245hdd  18.00020   1.0   16 TiB   12 TiB   12 TiB   0 B   32
> GiB  4.7 TiB  71.55  0.96  196  up
> 246hdd  18.00020   1.0   16 TiB   12 TiB   12 TiB   0 B   31
> GiB  4.7 TiB  71.17  0.96  195  up
> 249hdd  18.00020   1.0   16 TiB   12 TiB   12 TiB   0 B   30
> GiB  4.7 TiB  71.18  0.96  195  up
> 500hdd  18.00020   1.0   16 TiB   12 TiB   12 TiB   0 B   30
> GiB  4.7 TiB  71.19  0.96  195  up
> 501hdd  18.00020   1.0   16 TiB   12 TiB   12 TiB   0 B   31
> GiB  4.7 TiB  71.57  0.96  196  up
> 502hdd  18.00020   1.0   16 TiB   12 TiB   12 TiB   0 B   31
> GiB  4.7 TiB  71.18  0.96  195  up
> 532hdd  18.00020   1.0   16 TiB   12 TiB   12 TiB   0 B   31
> GiB  4.7 TiB  71.16  0.96  195  up
> 549hdd  18.00020   1.0   16 TiB   13 TiB   13 TiB   576 KiB   36
> GiB  3.7 TiB  77.70  1.05  195  up
> 550hdd  18.00020   1.0   16 TiB   13 TiB   13 TiB   3.8 MiB   36
> GiB  3.7 TiB  77.67  1.05  195  up
> 551hdd  18.00020   1.0   16 TiB   13 TiB   13 TiB   2.4 MiB   35
> GiB  3.7 TiB  77.68  1.05  195  up
> 552hdd  18.00020   1.0   16 TiB   13 TiB   13 TiB   5.5 MiB   35
> GiB  3.7 TiB  77.69  1.05  195  up
> 553hdd  18.00020   1.0   16 TiB   13 TiB   13 TiB   5.1 MiB   37
> GiB  3.6 TiB  77.71  1.05  195  up
> 554hdd  18.00020   1.0   16 TiB   13 TiB   13 TiB   967 KiB   36
> GiB  3.6 TiB  77.71  1.05  195  up
> 555hdd  18.00020   1.0   16 TiB   13 TiB   13 TiB   1.3 MiB   36
> GiB  3.6 TiB  78.08  1.05  196  up
> 556hdd  18.00020   1.0   16 TiB   13 TiB   13 TiB   4.7 MiB   36
> GiB  3.6 TiB  78.10  1.05  196  up
> 557hdd  18.00020   1.0   16 TiB   13 TiB   13 TiB   2.4 MiB   36
> GiB  3.7 TiB  77.69  1.05  195  up
> 558hdd  18.00020   1.0   16 TiB   13 TiB   13 TiB   4.5 MiB   36
> GiB  3.6 TiB  77.72  1.05  195  up
> 559hdd  18.00020   1.0   16 TiB   13 TiB   13 TiB   1.5 MiB   35
> GiB  3.6 TiB  78.09  1.05  196  up
> 560hdd  18.00020   1.0   16 TiB   13 TiB   13 TiB   5.2 MiB   35
> GiB  3.7 TiB  77.69  1.05  195  up
> 561hdd  18.00020   1.0   16 TiB   13 TiB   13 TiB   2.8 MiB   35
> GiB  3.7 TiB  77.69  1.05  195  up
> 562hdd  18.00020   1.0   16 TiB   13 TiB   13 TiB   1.0 MiB   36
> GiB  3.7 TiB  77.68  1.05  195  up
> 563hdd  18.00020   1.0   16 TiB   13 TiB   13 TiB   2.6 MiB   36
> GiB  3.7 TiB  77.68  1.05  195  up
> 564hdd  18.00020   1.0   16 TiB   13 TiB   13 TiB   5.1 MiB   36
> GiB  3.6 TiB  78.09  1.05  196  up
> 567hdd  18.00020   1.0   16 TiB   13 TiB   13 TiB   4.8 MiB   36
> GiB  3.6 TiB  78.11  1.05  196  up
> 568hdd  18.00020   1.0   16 TiB   13 TiB   13 TiB   5.2 MiB   35
> GiB  3.7 TiB  77.68  1.05  195  up
>
> All OSDs are used by the same pool (EC)
>
> I have the same issue on another Ceph cluster w

[ceph-users] No active PG; No disk activity

2022-11-01 Thread Murilo Morais
Good morning everyone!

Today there was an atypical situation in our cluster where all three
machines ended up shutting down.

On power-up the cluster came back and formed quorum with no problems, but the
PGs are all stuck peering and I don't see any disk activity on the machines.
No PG is active.




[ceph: root@dcs1 /]# ceph osd tree
ID  CLASS  WEIGHTTYPE NAME   STATUS  REWEIGHT  PRI-AFF
-1 98.24359  root default
-3 32.74786  host dcs1
 0hdd   2.72899  osd.0   up   1.0  1.0
 1hdd   2.72899  osd.1   up   1.0  1.0
 2hdd   2.72899  osd.2   up   1.0  1.0
 3hdd   2.72899  osd.3   up   1.0  1.0
 4hdd   2.72899  osd.4   up   1.0  1.0
 5hdd   2.72899  osd.5   up   1.0  1.0
 6hdd   2.72899  osd.6   up   1.0  1.0
 7hdd   2.72899  osd.7   up   1.0  1.0
 8hdd   2.72899  osd.8   up   1.0  1.0
 9hdd   2.72899  osd.9   up   1.0  1.0
10hdd   2.72899  osd.10  up   1.0  1.0
11hdd   2.72899  osd.11  up   1.0  1.0
-5 32.74786  host dcs2
12hdd   2.72899  osd.12  up   1.0  1.0
13hdd   2.72899  osd.13  up   1.0  1.0
14hdd   2.72899  osd.14  up   1.0  1.0
15hdd   2.72899  osd.15  up   1.0  1.0
16hdd   2.72899  osd.16  up   1.0  1.0
17hdd   2.72899  osd.17  up   1.0  1.0
18hdd   2.72899  osd.18  up   1.0  1.0
19hdd   2.72899  osd.19  up   1.0  1.0
20hdd   2.72899  osd.20  up   1.0  1.0
21hdd   2.72899  osd.21  up   1.0  1.0
22hdd   2.72899  osd.22  up   1.0  1.0
23hdd   2.72899  osd.23  up   1.0  1.0
-7 32.74786  host dcs3
24hdd   2.72899  osd.24  up   1.0  1.0
25hdd   2.72899  osd.25  up   1.0  1.0
26hdd   2.72899  osd.26  up   1.0  1.0
27hdd   2.72899  osd.27  up   1.0  1.0
28hdd   2.72899  osd.28  up   1.0  1.0
29hdd   2.72899  osd.29  up   1.0  1.0
30hdd   2.72899  osd.30  up   1.0  1.0
31hdd   2.72899  osd.31  up   1.0  1.0
32hdd   2.72899  osd.32  up   1.0  1.0
33hdd   2.72899  osd.33  up   1.0  1.0
34hdd   2.72899  osd.34  up   1.0  1.0
35hdd   2.72899  osd.35  up   1.0  1.0




[ceph: root@dcs1 /]# ceph -s
  cluster:
id: 58bbb950-538b-11ed-b237-2c59e53b80cc
health: HEALTH_WARN
4 filesystems are degraded
4 MDSs report slow metadata IOs
Reduced data availability: 1153 pgs inactive, 1101 pgs peering
26 slow ops, oldest one blocked for 563 sec, daemons
[osd.10,osd.13,osd.14,osd.15,osd.16,osd.18,osd.20,osd.21,osd.24,osd.25]...
have slow ops.

  services:
mon: 3 daemons, quorum dcs1.evocorp,dcs2,dcs3 (age 7m)
mgr: dcs1.evocorp.kyqfcd(active, since 15m), standbys: dcs2.rirtyl
mds: 4/4 daemons up, 4 standby
osd: 36 osds: 36 up (since 6m), 36 in (since 47m); 65 remapped pgs

  data:
volumes: 0/4 healthy, 4 recovering
pools:   10 pools, 1153 pgs
objects: 254.72k objects, 994 GiB
usage:   2.8 TiB used, 95 TiB / 98 TiB avail
pgs: 100.000% pgs not active
 1036 peering
 65   remapped+peering
 52   activating




[ceph: root@dcs1 /]# ceph health detail
HEALTH_WARN 4 filesystems are degraded; 4 MDSs report slow metadata IOs;
Reduced data availability: 1153 pgs inactive, 1101 pgs peering; 26 slow
ops, oldest one blocked for 673 sec, daemons
[osd.10,osd.13,osd.14,osd.15,osd.16,osd.18,osd.20,osd.21,osd.24,osd.25]...
have slow ops.
[WRN] FS_DEGRADED: 4 filesystems are degraded
fs dc_ovirt is degraded
fs dc_iso is degraded
fs dc_sas is degraded
fs pool_tester is degraded
[WRN] MDS_SLOW_METADATA_IO: 4 MDSs report slow metadata IOs
mds.dc_sas.dcs1.wbyuik(mds.0): 4 slow metadata IOs are blocked > 30
secs, oldest blocked for 1063 secs
mds.dc_ovirt.dcs1.lpcazs(mds.0): 4 slow metadata IOs are blocked > 30
secs, oldest blocked for 1058 secs
mds.pool_tester.dcs1.ixkkfs(mds.0): 4 slow metadata IOs are blocked >
30 secs, oldest blocked for 1058 secs
mds.dc_iso.dcs1.jxqqjd(mds.0): 4 slow metadata IOs are blocked > 30
secs, oldest blocked for 1058 secs
[WRN] PG_AVAILABILITY: Reduced data availability: 1153 pgs inactive, 1101
pgs peering
pg 6.c3 is stuck inactive for 50m, current state peering, last acting
[30,15,11]
pg 6.c4 is stuck peering for 10h, current state peering, last acting
[12,0,26

[ceph-users] cephadm trouble with OSD db- and wal-device placement (quincy)

2022-11-01 Thread Ulrich Pralle

Hej,

we are using ceph version 17.2.0 on Ubuntu 22.04.1 LTS.

We've got several servers with the same setup and are facing a problem 
with OSD deployment and db-/wal-device placement.


Each server consists of ten rotational disks (10TB each) and two NVME 
devices (3TB each).


We would like to deploy each rotational disk with a db- and wal-device.

We want to place the db and wal devices of an OSD together on the same
NVMe, so that if one NVMe fails only half of the OSDs are affected.


We tried several osd service type specifications to achieve our 
deployment goal.


Our best approach is:

service_type: osd
service_id: osd_spec_10x10tb-dsk_db_and_wal_on_2x3tb-nvme
service_name: osd.osd_spec_10x10tb-dsk_db_and_wal_on_2x3tb-nvme
placement:
  host_pattern: '*'
unmanaged: true
spec:
  data_devices:
model: MG[redacted]
rotational: 1
  db_devices:
limit: 1
model: MZ[redacted]
rotational: 0
  filter_logic: OR
  objectstore: bluestore
  wal_devices:
limit: 1
model: MZ[redacted]
rotational: 0

This service spec deploys ten OSDs with all db-devices on one NVME and 
all wal-devices on the second NVME.


If we omit "limit: 1", cephadm deploys ten OSDs with db-devices equally
distributed across both NVMEs and no wal-devices at all --- although half of
the NVMe capacity remains unused.


What's the best way to do this?

Does that even make sense?

Thank you very much and with kind regards
Uli
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm trouble with OSD db- and wal-device placement (quincy)

2022-11-01 Thread Fox, Kevin M
I haven't done it myself, but I had to read through the documentation a couple
of months ago, and what I gathered was:
1. if you have a db device specified but no wal device, it will put the wal on
the same volume as the db.
2. the recommendation seems to be not to have a separate volume for db and wal
if they are on the same physical device?

So that should give you the failure mode you want, I think?

Can anyone else confirm this, or does anyone know it to be incorrect?

Thanks,
Kevin


From: Ulrich Pralle 
Sent: Tuesday, November 1, 2022 7:25 AM
To: ceph-users@ceph.io
Subject: [ceph-users] cephadm trouble with OSD db- and wal-device placement 
(quincy)



Hej,

we are using ceph version 17.2.0 on Ubuntu 22.04.1 LTS.

We've got several servers with the same setup and are facing a problem
with OSD deployment and db-/wal-device placement.

Each server consists of ten rotational disks (10TB each) and two NVME
devices (3TB each).

We would like to deploy each rotational disk with a db- and wal-device.

We want to place the db and wal devices of an osd together on the same
NVME, to cut the failure of the OSDs in half if one NVME fails.

We tried several osd service type specifications to achieve our
deployment goal.

Our best approach is:

service_type: osd
service_id: osd_spec_10x10tb-dsk_db_and_wal_on_2x3tb-nvme
service_name: osd.osd_spec_10x10tb-dsk_db_and_wal_on_2x3tb-nvme
placement:
   host_pattern: '*'
unmanaged: true
spec:
   data_devices:
 model: MG[redacted]
 rotational: 1
   db_devices:
 limit: 1
 model: MZ[redacted]
 rotational: 0
   filter_logic: OR
   objectstore: bluestore
   wal_devices:
 limit: 1
 model: MZ[redacted]
 rotational: 0

This service spec deploys ten OSDs with all db-devices on one NVME and
all wal-devices on the second NVME.

If we omit "limit: 1", cephadm deploys ten OSDs with db-devices equally
distributed on both NVMEs and no wal-devices at all --- although half of
the NVMEs capacity remains unused.

What's the best way to do it.

Does that even make sense?

Thank you very much and with kind regards
Uli
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Is it a bug that OSD crashed when it's full?

2022-11-01 Thread Tony Liu
The actual question is: is a crash expected when an OSD is full?
My focus is more on how to prevent this from happening.
My expectation is that the OSD rejects write requests when it's full, but does not crash.
Otherwise, there is no point in having the ratio thresholds.
Please let me know if this is the design or a bug.
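
For reference, the thresholds in question can be inspected and adjusted like
this (a sketch; 0.85 and 0.95 are the commonly documented defaults for
nearfull and full):

```
ceph osd dump | grep ratio        # shows full_ratio, backfillfull_ratio, nearfull_ratio
ceph osd set-nearfull-ratio 0.85
ceph osd set-full-ratio 0.95
```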

Thanks!
Tony

From: Tony Liu 
Sent: October 31, 2022 05:46 PM
To: ceph-users@ceph.io; d...@ceph.io
Subject: [ceph-users] Is it a bug that OSD crashed when it's full?

Hi,

Based on the docs, Ceph prevents you from writing to a full OSD so that you don't
lose data.
In my case, with v16.2.10, the OSD crashed when it was full. Is this expected or
some bug?
I'd expect a write failure instead of an OSD crash. It keeps crashing when I try
to bring it up.
Is there any way to bring it back?

-7> 2022-10-31T22:52:57.426+ 7fe37fd94200  4 rocksdb: EVENT_LOG_v1 
{"time_micros": 1667256777427646, "job": 1, "event": "recovery_started", 
"log_files": [23300]}
-6> 2022-10-31T22:52:57.426+ 7fe37fd94200  4 rocksdb: 
[db_impl/db_impl_open.cc:760] Recovering log #23300 mode 2
-5> 2022-10-31T22:52:57.529+ 7fe37fd94200  3 rocksdb: 
[le/block_based/filter_policy.cc:584] Using legacy Bloom filter with high (20) 
bits/key. Dramatic filter space and/or accuracy improvement is available with 
format_version>=5.
-4> 2022-10-31T22:52:57.592+ 7fe37fd94200  1 bluefs _allocate unable to 
allocate 0x9 on bdev 1, allocator name block, allocator type hybrid, 
capacity 0x6fc840, block size 0x1000, free 0x57acbc000, fragmentation 
0.359784, allocated 0x0
-3> 2022-10-31T22:52:57.592+ 7fe37fd94200 -1 bluefs _allocate 
allocation failed, needed 0x8064a
-2> 2022-10-31T22:52:57.592+ 7fe37fd94200 -1 bluefs _flush_range 
allocated: 0x0 offset: 0x0 length: 0x8064a
-1> 2022-10-31T22:52:57.604+ 7fe37fd94200 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueFS.cc:
 In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, 
uint64_t)' thread 7fe37fd94200 time 2022-10-31T22:52:57.593873+
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueFS.cc:
 2768: ceph_abort_msg("bluefs enospc")

 ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific 
(stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, 
std::__cxx11::basic_string, std::allocator > 
const&)+0xe5) [0x55858d7e2e7c]
 2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned 
long)+0x1131) [0x55858dee8cc1]
 3: (BlueFS::_flush(BlueFS::FileWriter*, bool, bool*)+0x90) [0x55858dee8fa0]
 4: (BlueFS::_flush(BlueFS::FileWriter*, bool, 
std::unique_lock&)+0x32) [0x55858defa0b2]
 5: (BlueRocksWritableFile::Append(rocksdb::Slice const&)+0x11b) 
[0x55858df129eb]
 6: (rocksdb::LegacyWritableFileWrapper::Append(rocksdb::Slice const&, 
rocksdb::IOOptions const&, rocksdb::IODebugContext*)+0x1f) [0x55858e3ae55f]
 7: (rocksdb::WritableFileWriter::WriteBuffered(char const*, unsigned 
long)+0x58a) [0x55858e4c02aa]
 8: (rocksdb::WritableFileWriter::Append(rocksdb::Slice const&)+0x2d0) 
[0x55858e4c1700]
 9: (rocksdb::BlockBasedTableBuilder::WriteRawBlock(rocksdb::Slice const&, 
rocksdb::CompressionType, rocksdb::BlockHandle*, bool)+0xb6) [0x55858e5dce86]
 10: (rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::Slice const&, 
rocksdb::BlockHandle*, bool)+0x26c) [0x55858e5dd7cc]
 11: (rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::BlockBuilder*, 
rocksdb::BlockHandle*, bool)+0x3c) [0x55858e5ddecc]
 12: (rocksdb::BlockBasedTableBuilder::Flush()+0x6d) [0x55858e5ddf5d]
 13: (rocksdb::BlockBasedTableBuilder::Add(rocksdb::Slice const&, 
rocksdb::Slice const&)+0x2b8) [0x55858e5e13c8]
 14: (rocksdb::BuildTable(std::__cxx11::basic_string, std::allocator > const&, rocksdb::Env*, 
rocksdb::FileSystem*, rocksdb::ImmutableCFOptions const&, 
rocksdb::MutableCFOptions const&, rocksdb::FileOptions const&, 
rocksdb::TableCache*, rocksdb::InternalIteratorBase*, 
std::vector >, 
std::allocator > > >, 
rocksdb::FileMetaData*, rocksdb::InternalKeyComparator const&, 
std::vector >, 
std::allocator > > > const*, unsigned 
int, std::__cxx11::basic_string, 
std::allocator > const&, std::vector >, unsigned long, rocksdb::SnapshotChecker*, 
rocksdb::CompressionType, unsigned long, rocksdb::CompressionOptions const&, 
bool, rocksdb::InternalStats*, rocksdb::TableFileCreationReason, 
rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, 
rocksdb::TableProperties*, int, unsigned long, unsigned long, 
rocksdb::Env::WriteLifeTimeHint, unsigned long)+0xa45) [0x55858e58be45]
 15: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, 
rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*)+0xcf5) 
[0x5

[ceph-users] Re: ceph-volume claiming wrong device

2022-11-01 Thread Oleksiy Stashok
It looks like I hit some flavour of https://tracker.ceph.com/issues/51034,
since after I set `bluefs_buffered_io=false` the issue (which I could
reproduce pretty consistently) disappeared.

Oleksiy

On Tue, Nov 1, 2022 at 3:02 AM Eugen Block  wrote:

> As I said, I would recommend to really wipe the OSDs clean
> (ceph-volume lvm zap --destroy /dev/sdX), maybe reboot (on VMs it was
> sometimes necessary during my tests if I had too many failed
> attempts). And then also make sure you don't have any leftovers in the
> filesystem (under /var/lib/ceph) just to make sure you have a clean
> start.
>
> Zitat von Oleksiy Stashok :
>
> > Hey Eugen,
> >
> > valid points, I first tried to provision OSDs via ceph-ansible (later
> > excluded), which does run the batch command with all 4 disk devices, but
> it
> > often failed with the same issue I mentioned earlier, something like:
> > ```
> > bluefs _replay 0x0: stop: uuid e2f72ec9-2747-82d7-c7f8-41b7b6d41e1b !=
> > super.uuid 0110ddb3-d4bf-4c1e-be11-654598c71db0
> > ```
> > that's why I abandoned that idea and tried to provision OSDs manually one
> > by one.
> > As I mentioned I used ceph-ansible, not cephadm for legacy reasons, but I
> > suspect the problem I'm seeing is related to ceph-volume, so I suspect
> > cephadm won't change it.
> >
> > I did more investigation in 1-by-1 OSD creation flow and it seems like
> the
> > fact that `ceph-volume lvm list` shows me 2 devices belonging to the same
> > OSD can be explained by the following flow:
> >
> > 1. ceph-volume lvm create --bluestore --dmcrypt --data /dev/sdd
> > 2. trying to create osd.2
> > 3. fails with uuid != super.uuid issue
> > 4. ceph-volume lvm list returns /dev/sdd belong to osd.2 (even though it
> > failed)
> > 5. ceph-volume lvm create --bluestore --dmcrypt --data /dev/sde
> > 6. trying to create osd.2 (*again*)
> > 7. succeeds
> > 8. ceph-volume lvm list returns both /dev/sdd and /dev/sde belonging to
> > osd.2
> >
> > osd.2 is reported to be up and running.
> >
> > Any idea why this is happening?
> > Thank you!
> > Oleksiy
> >
> > On Thu, Oct 27, 2022 at 12:11 AM Eugen Block  wrote:
> >
> >> Hi,
> >>
> >> first of all, if you really need to issue ceph-volume manually,
> >> there's a batch command:
> >>
> >> cephadm ceph-volume lvm batch /dev/sdb /dev/sdc /dev/sdd /dev/sde
> >>
> >> Second, are you using cephadm? Maybe your manual intervention
> >> conflicts with the automatic osd setup (all available devices). You
> >> could look into /var/log/ceph/cephadm.log on each node and see if
> >> cephadm already tried to setup the OSDs for you. What does 'ceph orch
> >> ls' show?
> >> Did you end up having online OSDs or did it fail? In that case I would
> >> purge all OSDs from the crushmap, then wipe all devices (ceph-volume
> >> lvm zap --destroy /dev/sdX) and either let cephadm create the OSDs for
> >> you or you disable that (unmanaged=true) and run the manual steps
> >> again (although it's not really necessary).
> >>
> >> Regards,
> >> Eugen
> >>
> >> Zitat von Oleksiy Stashok :
> >>
> >> > Hey guys,
> >> >
> >> > I ran into a weird issue, hope you can explain what I'm observing. I'm
> >> > testing* Ceph 16.2.10* on *Ubuntu 20.04* in *Google Cloud VMs*, I
> >> created 3
> >> > instances and attached 4 persistent SSD disks to each instance. I can
> see
> >> > these disks attached as `/dev/sdb, /dev/sdc, /dev/sdd, /dev/sde`
> devices.
> >> >
> >> > As a next step I used ceph-ansible to bootstrap the ceph cluster on 3
> >> > instances, however I intentionally skipped OSD setup. So I ended up
> with
> >> a
> >> > Ceph cluster w/o any OSD.
> >> >
> >> > I ssh'ed into each VM and ran:
> >> >
> >> > ```
> >> >   sudo -s
> >> >   for dev in sdb sdc sdd sde; do
> >> > /usr/sbin/ceph-volume --cluster ceph lvm create --bluestore
> >> > --dmcrypt --data "/dev/$dev"
> >> >   done
> >> > ```
> >> >
> >> > The operation above randomly fails on random instances/devices with
> >> > something like:
> >> > ```
> >> > bluefs _replay 0x0: stop: uuid e2f72ec9-2747-82d7-c7f8-41b7b6d41e1b !=
> >> > super.uuid 0110ddb3-d4bf-4c1e-be11-654598c71db0
> >> > ```
> >> >
> >> > The interesting this is that when I do
> >> > ```
> >> > /usr/sbin/ceph-volume lvm ls
> >> > ```
> >> >
> >> > I can see that the device for which OSD creation failed actually
> belongs
> >> to
> >> > a different OSD that was previously created for a different device.
> For
> >> > example the failure I mentioned above happened on the `/dev/sde`
> device,
> >> so
> >> > when I list lvms I see this:
> >> > ```
> >> > == osd.2 ===
> >> >
> >> >   [block]
> >> >
> >>
> /dev/ceph-103a4373-dbe0-43d6-a9e0-34db4e1b257c/osd-block-9af542ba-fd65-4355-ad17-7293856acaeb
> >> >
> >> >   block device
> >> >
> >> >
> >>
> /dev/ceph-103a4373-dbe0-43d6-a9e0-34db4e1b257c/osd-block-9af542ba-fd65-4355-ad17-7293856acaeb
> >> >   block uuidFfFnLt-h33F-F73V-tY45-VuZM-scj7-C3dg1K
> >> >   cephx lockbox secret
> AQAlelljqNPoMhAA5

[ceph-users] Re: cephadm trouble with OSD db- and wal-device placement (quincy)

2022-11-01 Thread Eugen Block
That is correct: just omit the wal_devices section and the WALs will be placed
on the db_devices automatically.
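
For example, a variant of the spec from the original mail with the wal_devices
section simply dropped (a sketch only, untested here; keep your own model
filters and placement):

```
service_type: osd
service_id: osd_spec_10x10tb-dsk_db_on_2x3tb-nvme
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
  objectstore: bluestore
```

With that, the db volumes should be spread across both NVMes and each OSD's WAL
stays inside its DB volume, so a single NVMe failure only affects the OSDs whose
DB lives on it.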


Zitat von "Fox, Kevin M" :

I haven't done it, but had to read through the documentation a  
couple months ago and what I gathered was:
1. if you have a db device specified but no wal device, it will put  
the wal on the same volume as the db.
2. the recommendation seems to be to not have a separate volume for  
db and wal if on the same physical device?


So, that should allow you to have the failure mode you want I think?

Can anyone else confirm this or knows that it is incorrect?

Thanks,
Kevin


From: Ulrich Pralle 
Sent: Tuesday, November 1, 2022 7:25 AM
To: ceph-users@ceph.io
Subject: [ceph-users] cephadm trouble with OSD db- and wal-device  
placement (quincy)




Hej,

we are using ceph version 17.2.0 on Ubuntu 22.04.1 LTS.

We've got several servers with the same setup and are facing a problem
with OSD deployment and db-/wal-device placement.

Each server consists of ten rotational disks (10TB each) and two NVME
devices (3TB each).

We would like to deploy each rotational disk with a db- and wal-device.

We want to place the db and wal devices of an osd together on the same
NVME, to cut the failure of the OSDs in half if one NVME fails.

We tried several osd service type specifications to achieve our
deployment goal.

Our best approach is:

service_type: osd
service_id: osd_spec_10x10tb-dsk_db_and_wal_on_2x3tb-nvme
service_name: osd.osd_spec_10x10tb-dsk_db_and_wal_on_2x3tb-nvme
placement:
   host_pattern: '*'
unmanaged: true
spec:
   data_devices:
 model: MG[redacted]
 rotational: 1
   db_devices:
 limit: 1
 model: MZ[redacted]
 rotational: 0
   filter_logic: OR
   objectstore: bluestore
   wal_devices:
 limit: 1
 model: MZ[redacted]
 rotational: 0

This service spec deploys ten OSDs with all db-devices on one NVME and
all wal-devices on the second NVME.

If we omit "limit: 1", cephadm deploys ten OSDs with db-devices equally
distributed on both NVMEs and no wal-devices at all --- although half of
the NVMEs capacity remains unused.

What's the best way to do it.

Does that even make sense?

Thank you very much and with kind regards
Uli
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph status usage doesn't match bucket totals

2022-11-01 Thread Wilson,Thaddeus C
I have a ceph cluster that shows a different space utilization in its status 
screen than in its bucket stats. When I copy the contents of this cluster to a 
different ceph cluster, the bucket stats totals are as expected and match the 
bucket stats totals.

output of "ceph -s" on the first ceph cluster:

  cluster:
id: 9de06f94-a841-4e17-ac7c-9a0d8d8791b8
health: HEALTH_OK

  services:
mon: 3 daemons, quorum idb-ceph4,idb-ceph5,idb-ceph6
mgr: idb-ceph4(active), standbys: idb-ceph5, idb-ceph6
osd: 172 osds: 168 up, 168 in
rgw: 5 daemons active

  data:
pools:   16 pools, 9430 pgs
objects: 272.4 M objects, 360 TiB
usage:   1.1 PiB used, 333 TiB / 1.4 PiB avail
pgs: 9410 active+clean
 11   active+clean+scrubbing
 9active+clean+scrubbing+deep

  io:
client:   151 MiB/s rd, 838 op/s rd, 0 op/s wr

taking the sum of all the sizes of the buckets gives:

526,773,819,875,727 bytes or 526.7TB.
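
For what it's worth, a sketch of how such a sum is typically computed (assumes
jq is available; the field names are the usual radosgw-admin ones, and whether
you sum `size` or `size_actual` can itself change the total):

```
radosgw-admin bucket stats | \
  jq '[.[] | .usage["rgw.main"].size_actual // 0] | add'
```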

How can the status usage of objects be 333TiB and the bucket totals be 479TiB?


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: No active PG; No disk activity

2022-11-01 Thread Murilo Morais
I managed to solve this problem.

To document the resolution: The firewall was blocking communication. After
disabling everything related to it and restarting the machine everything
went back to normal.
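
For anyone hitting the same symptom, the usual alternative to disabling the
firewall entirely is to open the Ceph ports; a sketch for firewalld (the
service definitions ship with recent firewalld versions, verify on your distro):

```
firewall-cmd --permanent --add-service=ceph-mon   # 3300/tcp and 6789/tcp
firewall-cmd --permanent --add-service=ceph       # 6800-7300/tcp for OSD/MGR/MDS
firewall-cmd --reload
```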

On Tue, Nov 1, 2022 at 10:46 AM Murilo Morais 
wrote:

> Good morning everyone!
>
> Today there was an atypical situation in our Cluster where the three
> machines came to shut down.
>
> On powering up the cluster went up and formed quorum with no problems, but
> the PGs are all in Working, I don't see any disk activity on the machines.
> No PG is active.
>
>
>
>
> [ceph: root@dcs1 /]# ceph osd tree
> ID  CLASS  WEIGHTTYPE NAME   STATUS  REWEIGHT  PRI-AFF
> -1 98.24359  root default
> -3 32.74786  host dcs1
>  0hdd   2.72899  osd.0   up   1.0  1.0
>  1hdd   2.72899  osd.1   up   1.0  1.0
>  2hdd   2.72899  osd.2   up   1.0  1.0
>  3hdd   2.72899  osd.3   up   1.0  1.0
>  4hdd   2.72899  osd.4   up   1.0  1.0
>  5hdd   2.72899  osd.5   up   1.0  1.0
>  6hdd   2.72899  osd.6   up   1.0  1.0
>  7hdd   2.72899  osd.7   up   1.0  1.0
>  8hdd   2.72899  osd.8   up   1.0  1.0
>  9hdd   2.72899  osd.9   up   1.0  1.0
> 10hdd   2.72899  osd.10  up   1.0  1.0
> 11hdd   2.72899  osd.11  up   1.0  1.0
> -5 32.74786  host dcs2
> 12hdd   2.72899  osd.12  up   1.0  1.0
> 13hdd   2.72899  osd.13  up   1.0  1.0
> 14hdd   2.72899  osd.14  up   1.0  1.0
> 15hdd   2.72899  osd.15  up   1.0  1.0
> 16hdd   2.72899  osd.16  up   1.0  1.0
> 17hdd   2.72899  osd.17  up   1.0  1.0
> 18hdd   2.72899  osd.18  up   1.0  1.0
> 19hdd   2.72899  osd.19  up   1.0  1.0
> 20hdd   2.72899  osd.20  up   1.0  1.0
> 21hdd   2.72899  osd.21  up   1.0  1.0
> 22hdd   2.72899  osd.22  up   1.0  1.0
> 23hdd   2.72899  osd.23  up   1.0  1.0
> -7 32.74786  host dcs3
> 24hdd   2.72899  osd.24  up   1.0  1.0
> 25hdd   2.72899  osd.25  up   1.0  1.0
> 26hdd   2.72899  osd.26  up   1.0  1.0
> 27hdd   2.72899  osd.27  up   1.0  1.0
> 28hdd   2.72899  osd.28  up   1.0  1.0
> 29hdd   2.72899  osd.29  up   1.0  1.0
> 30hdd   2.72899  osd.30  up   1.0  1.0
> 31hdd   2.72899  osd.31  up   1.0  1.0
> 32hdd   2.72899  osd.32  up   1.0  1.0
> 33hdd   2.72899  osd.33  up   1.0  1.0
> 34hdd   2.72899  osd.34  up   1.0  1.0
> 35hdd   2.72899  osd.35  up   1.0  1.0
>
>
>
>
> [ceph: root@dcs1 /]# ceph -s
>   cluster:
> id: 58bbb950-538b-11ed-b237-2c59e53b80cc
> health: HEALTH_WARN
> 4 filesystems are degraded
> 4 MDSs report slow metadata IOs
> Reduced data availability: 1153 pgs inactive, 1101 pgs peering
> 26 slow ops, oldest one blocked for 563 sec, daemons
> [osd.10,osd.13,osd.14,osd.15,osd.16,osd.18,osd.20,osd.21,osd.24,osd.25]...
> have slow ops.
>
>   services:
> mon: 3 daemons, quorum dcs1.evocorp,dcs2,dcs3 (age 7m)
> mgr: dcs1.evocorp.kyqfcd(active, since 15m), standbys: dcs2.rirtyl
> mds: 4/4 daemons up, 4 standby
> osd: 36 osds: 36 up (since 6m), 36 in (since 47m); 65 remapped pgs
>
>   data:
> volumes: 0/4 healthy, 4 recovering
> pools:   10 pools, 1153 pgs
> objects: 254.72k objects, 994 GiB
> usage:   2.8 TiB used, 95 TiB / 98 TiB avail
> pgs: 100.000% pgs not active
>  1036 peering
>  65   remapped+peering
>  52   activating
>
>
>
>
> [ceph: root@dcs1 /]# ceph health detail
> HEALTH_WARN 4 filesystems are degraded; 4 MDSs report slow metadata IOs;
> Reduced data availability: 1153 pgs inactive, 1101 pgs peering; 26 slow
> ops, oldest one blocked for 673 sec, daemons
> [osd.10,osd.13,osd.14,osd.15,osd.16,osd.18,osd.20,osd.21,osd.24,osd.25]...
> have slow ops.
> [WRN] FS_DEGRADED: 4 filesystems are degraded
> fs dc_ovirt is degraded
> fs dc_iso is degraded
> fs dc_sas is degraded
> fs pool_tester is degraded
> [WRN] MDS_SLOW_METADATA_IO: 4 MDSs report slow metadata IOs
> mds.dc_sas.dcs1.wbyuik(mds.0): 4 slow metadata IOs are blocked > 30
> secs, oldest blocked for 1063 secs
> mds.dc_ovirt.dcs1.lpcazs(mds.0): 4 slow metadata IOs are blocked > 30
> secs, oldest blocked for 1058 secs
> mds.po

[ceph-users] Re: Is it a bug that OSD crashed when it's full?

2022-11-01 Thread Igor Fedotov

Hi Tony,

first of all, let me share my understanding of the issue you're facing.
This reminds me of an upstream ticket, and I presume my root cause analysis
from there (https://tracker.ceph.com/issues/57672#note-9) is applicable
in your case as well.


So generally speaking your OSD isn't 100% full - from the log output one
can see that 0x57acbc000 of 0x6fc840 bytes are free. But there are
not enough contiguous 64K chunks for BlueFS to keep operating.


As a result the OSD managed to escape all the *full* safeguards and reached
the state where it crashed - these safety measures just weren't designed to
take that additional free-space fragmentation factor into account...


Similarly, the lack of available 64K chunks prevents the OSD from starting up
- it needs to write out some more data to BlueFS during startup recovery.


I'm currently working on enabling BlueFS to work with the default main
device allocation unit (4K), which will hopefully fix the above issue.



Meanwhile you might want to work around the current OSD state by
setting bluefs_shared_alloc_size to 32K - this might have some
operational and performance effects, but the OSD should most likely be able
to start up afterwards. Please do not use 4K for now - it's known to
cause more problems in some circumstances. And I'd highly recommend
redeploying the OSD as soon as you have drained all the data off it - I
presume that's the reason why you want to bring it up instead of letting the
cluster recover using the regular means applied on OSD loss.
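
A sketch of that workaround (assuming the option name bluefs_shared_alloc_size
and that the OSD picks it up on the next start; 32K = 32768 bytes):

```
ceph config set osd.<id> bluefs_shared_alloc_size 32768
systemctl restart ceph-osd@<id>   # or restart via the cephadm/systemd unit in use
```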


An alternative approach would be to add a standalone DB volume and migrate
BlueFS there - ceph-volume should be able to do that even in the current
OSD state. Expanding the main volume (if it's backed by LVM and extra spare
space is available) is apparently a valid option too.



Thanks,

Igor


On 11/1/2022 8:09 PM, Tony Liu wrote:

The actual question is that, is crash expected when OSD is full?
My focus is more on how to prevent this from happening.
My expectation is that OSD rejects write request when it's full, but not crash.
Otherwise, no point to have ratio threshold.
Please let me know if this is the design or a bug.

Thanks!
Tony

From: Tony Liu 
Sent: October 31, 2022 05:46 PM
To: ceph-users@ceph.io; d...@ceph.io
Subject: [ceph-users] Is it a bug that OSD crashed when it's full?

Hi,

Based on doc, Ceph prevents you from writing to a full OSD so that you don’t 
lose data.
In my case, with v16.2.10, OSD crashed when it's full. Is this expected or some 
bug?
I'd expect write failure instead of OSD crash. It keeps crashing when tried to 
bring it up.
Is there any way to bring it back?

 -7> 2022-10-31T22:52:57.426+ 7fe37fd94200  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1667256777427646, 
"job": 1, "event": "recovery_started", "log_files": [23300]}
 -6> 2022-10-31T22:52:57.426+ 7fe37fd94200  4 rocksdb: 
[db_impl/db_impl_open.cc:760] Recovering log #23300 mode 2
 -5> 2022-10-31T22:52:57.529+ 7fe37fd94200  3 rocksdb: 
[le/block_based/filter_policy.cc:584] Using legacy Bloom filter with high (20) 
bits/key. Dramatic filter space and/or accuracy improvement is available with 
format_version>=5.
 -4> 2022-10-31T22:52:57.592+ 7fe37fd94200  1 bluefs _allocate unable 
to allocate 0x9 on bdev 1, allocator name block, allocator type hybrid, 
capacity   , block size 0x1000, free 0x57acbc000, fragmentation 0.359784, 
allocated 0x0
 -3> 2022-10-31T22:52:57.592+ 7fe37fd94200 -1 bluefs _allocate 
allocation failed, needed 0x8064a
 -2> 2022-10-31T22:52:57.592+ 7fe37fd94200 -1 bluefs _flush_range 
allocated: 0x0 offset: 0x0 length: 0x8064a
 -1> 2022-10-31T22:52:57.604+ 7fe37fd94200 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueFS.cc:
 In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' 
thread 7fe37fd94200 time 2022-10-31T22:52:57.593873+
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueFS.cc:
 2768: ceph_abort_msg("bluefs enospc")

  ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific 
(stable)
  1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string, std::allocator > const&)+0xe5) [0x55858d7e2e7c]
  2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned 
long)+0x1131) [0x55858dee8cc1]
  3: (BlueFS::_flush(BlueFS::FileWriter*, bool, bool*)+0x90) [0x55858dee8fa0]
  4: (BlueFS::_flush(BlueFS::FileWriter*, bool, 
std::unique_lock&)+0x32) [0x55858defa0b2]
  5: (BlueRocksWritableFile::Append(rocksdb::Slice const&)+0x11b) 
[0x55858df129eb]
  6: (rocksdb::LegacyWritableFileWrapper::Append(rocksdb::Slice const&, 
rocksdb::IOOptions cons

[ceph-users] Re: Is it a bug that OSD crashed when it's full?

2022-11-01 Thread Fox, Kevin M
If it's the same issue, I'd check the fragmentation score on the entire cluster
asap. You may have other OSDs close to the limit, and it's harder to fix when
all your OSDs cross the line at once. If you drain this one, it may push the
other ones into the red zone if you're too close, making the problem much worse.
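
A sketch for checking a single OSD's score via its admin socket (run on the
host where the OSD lives; command name from the BlueStore admin-socket
interface, verify on your release):

```
ceph daemon osd.0 bluestore allocator score block
```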

Our cluster has been stable after splitting all the DBs onto their own volumes.

Really looking forward to the 4k fix.  :) But the workaround seems solid.

Thanks,
Kevin



From: Igor Fedotov 
Sent: Tuesday, November 1, 2022 4:34 PM
To: Tony Liu; ceph-users@ceph.io; d...@ceph.io
Subject: [ceph-users] Re: Is it a bug that OSD crashed when it's full?



Hi Tony,

first of all let me share my understanding of the issue you're facing.
This recalls me an upstream ticket and I presume my root cause analysis
from there (https://tracker.ceph.com/issues/57672#note-9) is applicable
in your case as well.

So generally speaking your OSD isn't 100% full - from the log output one
can see that 0x57acbc000 of 0x6fc840 bytes are free. But there are
not enough contiguous 64K chunks for BlueFS to proceed operating..

As a result OSD managed to escape any *full* sentries and reached the
state when it's crashed - these safety means just weren't designed to
take that additional free space fragmentation factor into account...

Similarly the lack of available 64K chunks prevents OSD from starting up
- it needs to write out some more data to BlueFS during startup recovery.

I'm currently working on enabling BlueFS functioning with default main
device allocation unit (=4K) which will hopefully fix the above issue.


Meanwhile you might want to workaround the current  OSD's state by
setting bluefs_shared_allocat_size to 32K - this might have some
operational and performance effects but highly likely OSD should be able
to startup afterwards. Please do not use 4K for now - it's known for
causing more problems in some circumstances. And I'd highly recommend to
redeploy the OSD ASAP as you drained all the data off it - I presume
that's the reason why you want to bring it up instead of letting the
cluster to recover using regular means applied on OSD loss.

Alternative approach would be to add standalone DB volume and migrate
BlueFS there - ceph-volume should be able to do that even in the current
OSD state. Expanding main volume (if backed by LVM and extra spare space
is available) is apparently a valid option too


Thanks,

Igor


On 11/1/2022 8:09 PM, Tony Liu wrote:
> The actual question is that, is crash expected when OSD is full?
> My focus is more on how to prevent this from happening.
> My expectation is that OSD rejects write request when it's full, but not 
> crash.
> Otherwise, no point to have ratio threshold.
> Please let me know if this is the design or a bug.
>
> Thanks!
> Tony
> 
> From: Tony Liu 
> Sent: October 31, 2022 05:46 PM
> To: ceph-users@ceph.io; d...@ceph.io
> Subject: [ceph-users] Is it a bug that OSD crashed when it's full?
>
> Hi,
>
> Based on doc, Ceph prevents you from writing to a full OSD so that you don’t 
> lose data.
> In my case, with v16.2.10, OSD crashed when it's full. Is this expected or 
> some bug?
> I'd expect write failure instead of OSD crash. It keeps crashing when tried 
> to bring it up.
> Is there any way to bring it back?
>
>  -7> 2022-10-31T22:52:57.426+ 7fe37fd94200  4 rocksdb: EVENT_LOG_v1 
> {"time_micros": 1667256777427646, "job": 1, "event": "recovery_started", 
> "log_files": [23300]}
>  -6> 2022-10-31T22:52:57.426+ 7fe37fd94200  4 rocksdb: 
> [db_impl/db_impl_open.cc:760] Recovering log #23300 mode 2
>  -5> 2022-10-31T22:52:57.529+ 7fe37fd94200  3 rocksdb: 
> [le/block_based/filter_policy.cc:584] Using legacy Bloom filter with high 
> (20) bits/key. Dramatic filter space and/or accuracy improvement is available 
> with format_version>=5.
>  -4> 2022-10-31T22:52:57.592+ 7fe37fd94200  1 bluefs _allocate unable 
> to allocate 0x9 on bdev 1, allocator name block, allocator type hybrid, 
> capacity, block size 0x1000, free 0x57acbc000, fragmentation 0.359784, 
> allocated 0x0
>  -3> 2022-10-31T22:52:57.592+ 7fe37fd94200 -1 bluefs _allocate 
> allocation failed, needed 0x8064a
>  -2> 2022-10-31T22:52:57.592+ 7fe37fd94200 -1 bluefs _flush_range 
> allocated: 0x0 offset: 0x0 length: 0x8064a
>  -1> 2022-10-31T22:52:57.604+ 7fe37fd94200 -1 
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_6

[ceph-users] Re: Is it a bug that OSD crashed when it's full?

2022-11-01 Thread Tony Liu
Thank you Igor!
Tony

From: Igor Fedotov 
Sent: November 1, 2022 04:34 PM
To: Tony Liu; ceph-users@ceph.io; d...@ceph.io
Subject: Re: [ceph-users] Re: Is it a bug that OSD crashed when it's full?

Hi Tony,

first of all let me share my understanding of the issue you're facing.
This recalls me an upstream ticket and I presume my root cause analysis
from there (https://tracker.ceph.com/issues/57672#note-9) is applicable
in your case as well.

So generally speaking your OSD isn't 100% full - from the log output one
can see that 0x57acbc000 of 0x6fc840 bytes are free. But there are
not enough contiguous 64K chunks for BlueFS to proceed operating..

As a result OSD managed to escape any *full* sentries and reached the
state when it's crashed - these safety means just weren't designed to
take that additional free space fragmentation factor into account...

Similarly the lack of available 64K chunks prevents OSD from starting up
- it needs to write out some more data to BlueFS during startup recovery.

I'm currently working on enabling BlueFS functioning with default main
device allocation unit (=4K) which will hopefully fix the above issue.


Meanwhile you might want to workaround the current  OSD's state by
setting bluefs_shared_allocat_size to 32K - this might have some
operational and performance effects but highly likely OSD should be able
to startup afterwards. Please do not use 4K for now - it's known for
causing more problems in some circumstances. And I'd highly recommend to
redeploy the OSD ASAP as you drained all the data off it - I presume
that's the reason why you want to bring it up instead of letting the
cluster to recover using regular means applied on OSD loss.

Alternative approach would be to add standalone DB volume and migrate
BlueFS there - ceph-volume should be able to do that even in the current
OSD state. Expanding main volume (if backed by LVM and extra spare space
is available) is apparently a valid option too


Thanks,

Igor


On 11/1/2022 8:09 PM, Tony Liu wrote:
> The actual question is that, is crash expected when OSD is full?
> My focus is more on how to prevent this from happening.
> My expectation is that OSD rejects write request when it's full, but not 
> crash.
> Otherwise, no point to have ratio threshold.
> Please let me know if this is the design or a bug.
>
> Thanks!
> Tony
> 
> From: Tony Liu 
> Sent: October 31, 2022 05:46 PM
> To: ceph-users@ceph.io; d...@ceph.io
> Subject: [ceph-users] Is it a bug that OSD crashed when it's full?
>
> Hi,
>
> Based on doc, Ceph prevents you from writing to a full OSD so that you don’t 
> lose data.
> In my case, with v16.2.10, OSD crashed when it's full. Is this expected or 
> some bug?
> I'd expect write failure instead of OSD crash. It keeps crashing when tried 
> to bring it up.
> Is there any way to bring it back?
>
>  -7> 2022-10-31T22:52:57.426+ 7fe37fd94200  4 rocksdb: EVENT_LOG_v1 
> {"time_micros": 1667256777427646, "job": 1, "event": "recovery_started", 
> "log_files": [23300]}
>  -6> 2022-10-31T22:52:57.426+ 7fe37fd94200  4 rocksdb: 
> [db_impl/db_impl_open.cc:760] Recovering log #23300 mode 2
>  -5> 2022-10-31T22:52:57.529+ 7fe37fd94200  3 rocksdb: 
> [le/block_based/filter_policy.cc:584] Using legacy Bloom filter with high 
> (20) bits/key. Dramatic filter space and/or accuracy improvement is available 
> with format_version>=5.
>  -4> 2022-10-31T22:52:57.592+ 7fe37fd94200  1 bluefs _allocate unable 
> to allocate 0x9 on bdev 1, allocator name block, allocator type hybrid, 
> capacity, block size 0x1000, free 0x57acbc000, fragmentation 0.359784, 
> allocated 0x0
>  -3> 2022-10-31T22:52:57.592+ 7fe37fd94200 -1 bluefs _allocate 
> allocation failed, needed 0x8064a
>  -2> 2022-10-31T22:52:57.592+ 7fe37fd94200 -1 bluefs _flush_range 
> allocated: 0x0 offset: 0x0 length: 0x8064a
>  -1> 2022-10-31T22:52:57.604+ 7fe37fd94200 -1 
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueFS.cc:
>  In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, 
> uint64_t)' thread 7fe37fd94200 time 2022-10-31T22:52:57.593873+
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/os/bluestore/BlueFS.cc:
>  2768: ceph_abort_msg("bluefs enospc")
>
>   ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific 
> (stable)
>   1: (ceph::__ceph_abort(char const*, int, char const*, 
> std::__cxx11::basic_string, std::allocator 
> > const&)+0xe5) [0x55858d7e2e7c]
>   2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned 
> long)+0x1131) [0x55858dee8cc1]
>   3: (BlueFS::_flush(BlueFS