[ceph-users] Corrupted and inconsistent reads from CephFS on EC pool

2023-12-15 Thread aschmitz

Hi everyone,

I'm seeing different results when reading files from CephFS on an 
erasure-coded pool, depending on which OSDs are running, including some 
incorrect reads even with all OSDs running. I'm running Ceph 17.2.6.


# More detail

In particular, I have a relatively large backup of some files, combined 
with SHA-256 hashes of the files (which were verified when the backup 
was created, approximately 7 months ago). Verifying these hashes 
currently gives several errors, both in large and small files, but 
somewhat tilted towards larger files.


Investigating which PGs stored the affected files (using 
cephfs-data-scan pg_files) didn't show the problem to be isolated to one 
PG, but did show several affected PGs with OSD 15 in their acting set.
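For reference, the mapping I did was roughly along these lines (the file 
path, pool name, and PG IDs here are placeholders rather than my real ones):

# which PG and OSDs hold the first object of a suspect file
ino_hex=$(printf '%x' "$(stat -c %i /mnt/cephfs/backup/somefile)")
ceph osd map cephfs_data "${ino_hex}.00000000"

# and the reverse direction: which files have objects in a given set of PGs
cephfs-data-scan pg_files /backup 3.1a 3.2f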


Taking OSD 15 offline leads to *better* reads (more files with correct 
SHA-256 hashes), but still not completely correct ones. Further 
investigation implicated OSD 34 as another potential culprit; taking it 
offline also yields more correct files, but again not all of them.


Bringing the stopped OSDs (15 and 34) back online results in the earlier 
(incorrect) hashes when reading files, as might be expected, but this 
seems to demonstrate that the correct information (or at least more 
correct information) is still on the drives.


The hashes I receive for a given corrupted file are consistent from read 
to read (including on different hosts, to avoid caching as an issue), 
but obviously sometimes change if I take an affected OSD offline.


# Recent history

I have Ceph configured with a deep scrub interval of approximately 30 
days, and deep scrubs have completed regularly with no issues 
identified. However, within the past two weeks I added two additional 
drives to the cluster, and rebalancing took about two weeks to complete: 
the placement groups I noticed having issues have not been deep scrubbed 
since the rebalance completed, so it is possible something got corrupted 
during the rebalance.


Neither OSD 15 nor 34 is a new drive, and as far as I have experienced 
(and Ceph's health indications have shown), all of the existing OSDs 
have behaved correctly up to this point.


# Configuration

I created an erasure coding profile for the pool in question using the 
following command:


ceph osd erasure-code-profile set erasure_k4_m2 \
  plugin=jerasure \
  k=4 m=2 \
  technique=blaum_roth \
  crush-device-class=hdd
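The profile as the cluster actually stores it can be double-checked with:

ceph osd erasure-code-profile get erasure_k4_m2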

And the following CRUSH rule is used for the pool:

  rule erasure_k4_m2_hdd_rule {
id 3
type erasure
min_size 4
max_size 6
step take default class hdd
step choose indep 3 type host
step chooseleaf indep 2 type osd
step emit
  }
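For what it's worth, the placements this rule produces can be 
sanity-checked offline with something like:

ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 3 --num-rep 6 --show-mappings | head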

# Questions

1. Does this behavior ring a bell for anyone? Is there something obvious 
I'm missing, or something obvious I should do?


2. Is deep scrubbing likely to help the situation? Hurt it? (Hopefully 
not hurt: I've prioritized deep scrubbing of the PGs on OSD 15 and 34, 
and will likely follow up with the rest of the pool; a rough sketch of 
how I'm triggering that is below question 3.)


3. Is there a way to force "full reads" or otherwise use all of the EC 
chunks (potentially in tandem with on-disk checksums) to identify the 
correct data, rather than relying only on the chunks read from the 
primary OSDs?
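For reference, this is roughly how I've been triggering the scrubs, and 
how I plan to inspect anything they flag (the awk pattern just pulls PG 
IDs out of the listing, and the PG ID in the last two commands is only 
an example):

for osd in 15 34; do
  ceph pg ls-by-osd osd.$osd | awk '/^[0-9]+\./ {print $1}' |
    while read -r pg; do
      ceph pg deep-scrub "$pg"
    done
done

# once a PG is reported inconsistent:
rados list-inconsistent-obj 3.1a --format=json-pretty   # shows which shard(s) fail their checksums
ceph pg repair 3.1a                                     # ask the primary to repair the inconsistent objects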


Thanks for any insights you might have,
aschmitz
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to configure something like osd_deep_scrub_min_interval?

2023-12-15 Thread Frank Schilder
Hi all,

another quick update: please use this link to download the script: 
https://github.com/frans42/ceph-goodies/blob/main/scripts/pool-scrub-report

The link I sent originally does not track the latest version of the script.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] rbd trash: snapshot id is protected from removal

2023-12-15 Thread Eugen Block

Hi,

I've been searching and trying things, but to no avail so far.
This is not critical because it's only a test cluster, but I'd still 
like to have a solution in case this somehow makes it into our 
production clusters.
It's an OpenStack Victoria cloud with a Ceph backend. If one tries to 
remove a Glance image ('openstack image delete {UUID}') which usually 
has a protected snapshot, the delete will fail, but apparently the 
snapshot is actually moved to the trash namespace. And since it is 
protected, I can't remove it:


storage01:~ # rbd -p images snap ls 278ffe2b-67a7-40d0-87b7-903f2fc9c3b4 --all
SNAPID  NAME                                  SIZE    PROTECTED  TIMESTAMP                 NAMESPACE
   159  1a97db13-307e-4820-8dc2-8549e9ba1ad7  39 MiB             Thu Dec 14 08:29:56 2023  trash (snap)


storage01:~ # rbd snap rm --snap-id 159 images/278ffe2b-67a7-40d0-87b7-903f2fc9c3b4

rbd: snapshot id 159 is protected from removal.

storage01:~ # rbd snap ls images/278ffe2b-67a7-40d0-87b7-903f2fc9c3b4
storage01:~ #

This is a small image and only a test environment, but these orphans 
could potentially fill up a lot of space. In a newer OpenStack version 
(I tried with Antelope) this no longer seems to behave that way, which 
is good. But how would I get rid of that trash snapshot in this cluster?


Thanks!
Eugen
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd trash: snapshot id is protected from removal

2023-12-15 Thread Ilya Dryomov
On Fri, Dec 15, 2023 at 12:52 PM Eugen Block  wrote:
>
> Hi,
>
> I've been searching and trying things but to no avail yet.
> This is uncritical because it's a test cluster only, but I'd still
> like to have a solution in case this somehow will make it into our
> production clusters.
> It's an Openstack Victoria Cloud with Ceph backend. If one tries to
> remove a glance image (openstack image delete {UUID}' which usually
> has a protected snapshot it will fail to do so, but apparently the
> snapshot is actually moved to the trash namespace. And since it is
> protected, I can't remove it:
>
> storage01:~ # rbd -p images snap ls 278ffe2b-67a7-40d0-87b7-903f2fc9c3b4 --all
> SNAPID  NAME  SIZEPROTECTED
> TIMESTAMP NAMESPACE
> 159  1a97db13-307e-4820-8dc2-8549e9ba1ad7  39 MiB Thu
> Dec 14 08:29:56 2023  trash (snap)
>
> storage01:~ # rbd snap rm --snap-id 159
> images/278ffe2b-67a7-40d0-87b7-903f2fc9c3b4
> rbd: snapshot id 159 is protected from removal.
>
> storage01:~ # rbd snap ls images/278ffe2b-67a7-40d0-87b7-903f2fc9c3b4
> storage01:~ #
>
> This is a small image and only a test environment, but these orphans
> could potentially fill up lots of space. In a newer openstack version
> (I tried with Antelope) this doesn't seem to work like that anymore,
> so that's good. But how would I get rid of that trash snapshot in this
> cluster?

Hi Eugen,

This means that there is at least one clone based off of that snapshot.
You should be able to identify it with:

$ rbd children --all --snap-id 159 images/278ffe2b-67a7-40d0-87b7-903f2fc9c3b4

Get rid of the clone(s) and the snapshot should get removed
automatically.
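For example (the clone name below is hypothetical), you can either 
delete the clone or flatten it so that it no longer depends on the 
parent snapshot:

$ rbd rm images/child-image          # if the clone is disposable, or
$ rbd flatten images/child-image     # copy the parent's data into the clone, detaching it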

Thanks,

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd trash: snapshot id is protected from removal [solved]

2023-12-15 Thread Eugen Block
Ah of course, thanks for pointing that out, I somehow didn't think of  
the remaining clones.


Thanks a lot!

Quoting Ilya Dryomov:


On Fri, Dec 15, 2023 at 12:52 PM Eugen Block  wrote:


Hi,

I've been searching and trying things but to no avail yet.
This is uncritical because it's a test cluster only, but I'd still
like to have a solution in case this somehow will make it into our
production clusters.
It's an Openstack Victoria Cloud with Ceph backend. If one tries to
remove a glance image (openstack image delete {UUID}' which usually
has a protected snapshot it will fail to do so, but apparently the
snapshot is actually moved to the trash namespace. And since it is
protected, I can't remove it:

storage01:~ # rbd -p images snap ls  
278ffe2b-67a7-40d0-87b7-903f2fc9c3b4 --all

SNAPID  NAME  SIZEPROTECTED
TIMESTAMP NAMESPACE
159  1a97db13-307e-4820-8dc2-8549e9ba1ad7  39 MiB Thu
Dec 14 08:29:56 2023  trash (snap)

storage01:~ # rbd snap rm --snap-id 159
images/278ffe2b-67a7-40d0-87b7-903f2fc9c3b4
rbd: snapshot id 159 is protected from removal.

storage01:~ # rbd snap ls images/278ffe2b-67a7-40d0-87b7-903f2fc9c3b4
storage01:~ #

This is a small image and only a test environment, but these orphans
could potentially fill up lots of space. In a newer openstack version
(I tried with Antelope) this doesn't seem to work like that anymore,
so that's good. But how would I get rid of that trash snapshot in this
cluster?


Hi Eugen,

This means that there is at least one clone based off of that snapshot.
You should be able to identify it with:

$ rbd children --all --snap-id 159  
images/278ffe2b-67a7-40d0-87b7-903f2fc9c3b4


Get rid of the clone(s) and the snapshot should get removed
automatically.

Thanks,

Ilya



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Cephfs MDS tuning for deep-learning data-flow

2023-12-15 Thread mhnx
Hello everyone! How are you doing?
I wasn't around for two years but I'm back and working on a new development.

I deployed 2x Ceph clusters:
1- user_data: 5x nodes [8x 4TB SATA SSD, 2x 25Gbit network],
2- data-gen: 3x nodes [8x 4TB SATA SSD, 2x 25Gbit network],

note: the hardware was not my choice; I know I have a TRIM issue, and I
couldn't use any PCIe NVMe for WAL+DB because these are 1U servers with
no empty slots
-

During the test phase everything was good: I reached 1 GB/s for 18
clients at the same time.
But when I migrated to production (60 GPU server clients + 40 CPU server
clients) the speed issues began, as usual because of the default
parameters, and now I'm working on adapting the setup by debugging my
current data flow and researching how I can improve my environment.

So far I couldn't find a useful guide or the information in one place, so I
just wanted to share my findings, benchmarks and ideas with the community,
and if I'm lucky enough, maybe I will get awesome recommendations from some
old friends and enjoy being in touch again after a while. :)

Starting from here, I will only share technical information about my
environment:

1- Cluster user_data: 5x node [8x4TB Sata SSD, 2x 25Gbit network] =
Replication 2
- A: I only have 1 pool in this cluster and information is below:
- ceph df
--- RAW STORAGE ---
CLASS SIZEAVAILUSED  RAW USED  %RAW USED
ssd    146 TiB  106 TiB  40 TiB  40 TiB      27.50
TOTAL  146 TiB  106 TiB  40 TiB  40 TiB      27.50

--- POOLS ---
POOL ID   PGS   STORED  OBJECTS USED  %USED  MAX AVAIL
.mgr  1 1  286 MiB   73  859 MiB  0 32 TiB
cephfs.ud-data.meta   9   512   65 GiB2.87M  131 GiB   0.13 48 TiB
cephfs.ud-data.data  10  2048   23 TiB   95.34M   40 TiB  29.39 48 TiB


- B: In this cluster, every user (50 of them) has a subvolume, and the
quota is 1 TB for each user
- C: In each subvolume, users have "home" and "data" directories.
- D: The home directory is 5-10 GB and the client uses it as the Docker
home directory at each login
- E: I'm also storing users' personal or development data, around 2 TB
per user
- F: I only have 1x active MDS server and 4x standby, as below.

- ceph fs status

> ud-data - 84 clients
> ===
> RANK  STATE   MDS                   ACTIVITY      DNS    INOS   DIRS   CAPS
>  0    active  ud-data.ud-04.seggyv  Reqs: 372 /s  4343k  4326k  69.7k  2055k
>         POOL           TYPE     USED  AVAIL
> cephfs.ud-data.meta  metadata   130G  47.5T
> cephfs.ud-data.data    data    39.5T  47.5T
> STANDBY MDS
> ud-data.ud-01.uatjle
> ud-data.ud-02.xcoojt
> ud-data.ud-05.rnhcfe
> ud-data.ud-03.lhwkml
> MDS version: ceph version 17.2.6
> (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)



- What is my issue?
2023-12-15T21:07:47.175542+ mon.ud-01 [WRN] Health check failed: 1
clients failing to respond to cache pressure (MDS_CLIENT_RECALL)
2023-12-15T21:09:35.002112+ mon.ud-01 [INF] MDS health message cleared
(mds.?): Client gpu-server-11 failing to respond to cache pressure
2023-12-15T21:09:35.391235+ mon.ud-01 [INF] Health check cleared:
MDS_CLIENT_RECALL (was: 1 clients failing to respond to cache pressure)
2023-12-15T21:09:35.391304+ mon.ud-01 [INF] Cluster is now healthy
2023-12-15T21:10:00.000169+ mon.ud-01 [INF] overall HEALTH_OK

For every read and write, the clients try to reach the Ceph MDS server
and request some data:
issue 1: home data is around 5-10 GB and users need it all the time. I
want it fetched and cached once so it doesn't keep generating new requests.

issue 2: users' processes generate new data by reading some input data
only once, and they write the generated data only once. There is no need
to cache this data at all.

What I want to do ???

1- I want to deploy 2x active MDS servers for only the "home" directory in
each subvolume:
- These 2x home MDS servers must send the data to the client and have it
cached on the client, to reduce new requests even for a simple "ls" command

2- I want to deploy 2x active MDS servers for only the "data" directory in
each subvolume:
- These 2x MDS servers must be configured not to hold any cache if it is
not constantly required. The cache lifetime must be short and must be
independent.
- Constantly requested data from one client must be cached locally on that
client to reduce requests and load on the MDS server.
(a rough sketch of how I'd pin the trees this way follows below)
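The pinning sketch I have in mind for this, with example mount paths,
subvolume names and ranks (untested, just the direction I'm heading):

ceph fs set ud-data max_mds 4

# per subvolume: pin the "home" tree to one of the "home" ranks (0-1)
# and the "data" tree to one of the "data" ranks (2-3)
setfattr -n ceph.dir.pin -v 0 /mnt/ud-data/volumes/_nogroup/user01/home
setfattr -n ceph.dir.pin -v 2 /mnt/ud-data/volumes/_nogroup/user01/data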


I believe you understand my data-flow and my needs. Let's talk about what
we can do about it.

Note: I'm still researching, and these are my findings and my plan so far.
It is not complete, and that is the main reason why I'm writing this mail.
The knobs I'm currently looking at are listed below, followed by a sketch
of how I'd apply some of them.

ceph fs set $MYFS max_mds 4
mds_cache_memory_limit  | default 4GiB --> 16GiB
mds_cache_reservation  | default 0.05 --> ??
mds_health_cache_threshold | default 1.5 --> ??
mds_cache_trim_threshold | default 256KiB --> ??
mds_cache_trim_decay_rate   | default 1.0 --> ??
mds_cache_mid
mds_decay_halflife
mds_client_prealloc_inos
mds_dirstat_min_interval
mds_session_cache_livenes
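A sketch of how I'd apply the first couple of these (the values are just
what I'm considering, not recommendations):

ceph fs set ud-data max_mds 4
ceph config set mds mds_cache_memory_limit 17179869184   # 16 GiB instead of the 4 GiB default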

[ceph-users] Re: Cephfs MDS tuning for deep-learning data-flow

2023-12-15 Thread mhnx
I found something useful and I think I need to dig into this and use it 100%.

https://docs.ceph.com/en/reef/cephfs/multimds/#dynamic-subtree-partitioning-with-balancer-on-specific-ranks

DYNAMIC SUBTREE PARTITIONING WITH BALANCER ON SPECIFIC RANKS

The CephFS file system provides the bal_rank_mask option to enable the
balancer to dynamically rebalance subtrees within particular active
MDS ranks. This allows administrators to employ both the dynamic
subtree partitioning and static pinning schemes in different active MDS
ranks so that metadata loads are optimized based on user demand. For
instance, in realistic cloud storage environments, where a lot of
subvolumes are allotted to multiple computing nodes (e.g., VMs and
containers), some subvolumes that require high performance are managed
by static partitioning, whereas most subvolumes that experience a
moderate workload are managed by the balancer. As the balancer evenly
spreads the metadata workload to all active MDS ranks, performance of
static pinned subvolumes inevitably may be affected or degraded. If
this option is enabled, subtrees managed by the balancer are not
affected by static pinned subtrees.

This option can be configured with the ceph fs set command. For example:

ceph fs set <fs_name> bal_rank_mask <hex>

Each bitfield of the <hex> number represents a dedicated rank. If the
<hex> is set to 0x3, the balancer runs on active ranks 0 and 1. For
example:

ceph fs set <fs_name> bal_rank_mask 0x3

If the bal_rank_mask is set to -1 or all, all active ranks are masked
and utilized by the balancer. As an example:

ceph fs set <fs_name> bal_rank_mask -1

On the other hand, if the balancer needs to be disabled, the
bal_rank_mask should be set to 0x0. For example:

ceph fs set <fs_name> bal_rank_mask 0x0
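If this option is available in my release (the doc above is from Reef and
I'm on 17.2.6 quincy, so I still need to verify that), my plan would be
roughly:

ceph fs set ud-data bal_rank_mask 0xc   # balancer only on ranks 2-3; ranks 0-1 stay statically pinned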



On Sat, 16 Dec 2023 at 03:43, mhnx  wrote:
>
> Hello everyone! How are you doing?
> I wasn't around for two years but I'm back and working on a new development.
>
> I deployed 2x ceph cluster:
> 1- user_data:5x node [8x4TB Sata SSD, 2x 25Gbit network],
> 2- data-gen: 3x node [8x4TB Sata SSD, 2x 25Gbit network],
>
> note: hardware is not my choice and I know I have TRIM issue and also I 
> couldn't use any PCI-E nvme for wal+db because 1u servers and no empty slots
> -
>
> At test phase everything was good, I reached 1GB/s for 18 clients at the same 
> time.
> But when I migrate to production (60 GPU server client + 40 CPU server 
> client) the speed issue begin because of the default parameters as usual and 
> now I'm working on adaptation by debugging current data work flow I have and 
> I'm researching how can I improve my environment.
>
> So far, I couldn't find useful guide or informations in one place and I just 
> wanted to share my findings, benchmarks and ideas with the community and if 
> I'm lucky enough, maybe I will get awesome recommendations from some old 
> friends and enjoy get in touch after a while. :)
>
> Starting from here, I will only share technical information about my 
> environment:
>
> 1- Cluster user_data: 5x node [8x4TB Sata SSD, 2x 25Gbit network] = 
> Replication 2
> - A: I only have 1 pool in this cluster and information is below:
> - ceph df
> --- RAW STORAGE ---
> CLASS SIZEAVAILUSED  RAW USED  %RAW USED
> ssd146 TiB  106 TiB  40 TiB40 TiB  27.50
> TOTAL  146 TiB  106 TiB  40 TiB40 TiB  27.50
>
> --- POOLS ---
> POOL ID   PGS   STORED  OBJECTS USED  %USED  MAX AVAIL
> .mgr  1 1  286 MiB   73  859 MiB  0 32 TiB
> cephfs.ud-data.meta   9   512   65 GiB2.87M  131 GiB   0.13 48 TiB
> cephfs.ud-data.data  10  2048   23 TiB   95.34M   40 TiB  29.39 48 TiB
>
>
> - B: In this cluster, every user(50) has a subvolume and the quota is 1TB/for 
> each users
> - C: In each subvolume, users has "home and data" directory.
> - D: home directory size 5-10GB and client uses it for docker home directory 
> at each login
> - E: I'm also storing users personal or development data around 2TB/each user
> - F: I only have 1x active MDS server and 4x standby as below.
>
> - ceph fs status
>>
>> ud-data - 84 clients
>> ===
>> RANK  STATE   MDS  ACTIVITY DNSINOS   DIRS   CAPS
>>  0active  ud-data.ud-04.seggyv  Reqs:  372 /s  4343k  4326k  69.7k  2055k
>> POOL   TYPE USED  AVAIL
>> cephfs.ud-data.meta  metadata   130G  47.5T
>> cephfs.ud-data.datadata39.5T  47.5T
>> STANDBY MDS
>> ud-data.ud-01.uatjle
>> ud-data.ud-02.xcoojt
>> ud-data.ud-05.rnhcfe
>> ud-data.ud-03.lhwkml
>> MDS version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) 
>> quincy (stable)
>
>
>
> - What is my issue?
> 2023-12-15T21:07:47.175542+ mon.ud-01 [WRN] Health check failed: 1 
> clients failing to respond to cache pressure (MDS_CLIENT_RECALL)
> 2023-12-15T21:09:35.002112+ mon.ud-01 [INF] MDS health message cleared 
> (mds.?): Client gpu-server-11 failing to respond to cache pressure
> 2023-12-15T21:09:35.391235+ mo