[ceph-users] Re: No MDS No FS after update and restart - respectfully request help to rebuild FS and maps

2022-03-14 Thread GoZippy
Ran

sudo systemctl status ceph\*.service ceph\*.target

on all monitor nodes from the CLI. All showed:

root@node7:~# sudo systemctl status ceph\*.service ceph\*.target
● ceph-mds.target - ceph target allowing to start/stop all ceph-mds@.service
instances at once
 Loaded: loaded (/lib/systemd/system/ceph-mds.target; enabled; vendor
preset: enabled)
 Active: active since Mon 2022-03-14 00:34:34 CDT; 1min 58s ago

Mar 14 00:34:34 node7 systemd[1]: Reached target ceph target allowing to
start/stop all ceph-mds@.service instances at once.

● ceph-mgr.target - ceph target allowing to start/stop all ceph-mgr@.service
instances at once
 Loaded: loaded (/lib/systemd/system/ceph-mgr.target; enabled; vendor
preset: enabled)
 Active: active since Mon 2022-03-14 00:34:34 CDT; 1min 59s ago

Mar 14 00:34:34 node7 systemd[1]: Reached target ceph target allowing to
start/stop all ceph-mgr@.service instances at once.

● ceph-mds@node7.service - Ceph metadata server daemon
 Loaded: loaded (/lib/systemd/system/ceph-mds@.service; enabled; vendor
preset: enabled)
Drop-In: /usr/lib/systemd/system/ceph-mds@.service.d
 └─ceph-after-pve-cluster.conf
 Active: active (running) since Mon 2022-03-14 00:34:34 CDT; 1min 59s
ago
   Main PID: 864832 (ceph-mds)
  Tasks: 9
 Memory: 9.7M
CPU: 96ms
 CGroup: /system.slice/system-ceph\x2dmds.slice/ceph-mds@node7.service
 └─864832 /usr/bin/ceph-mds -f --cluster ceph --id node7
--setuser ceph --setgroup ceph

Mar 14 00:34:34 node7 systemd[1]: Started Ceph metadata server daemon.

● ceph-mgr@node7.service - Ceph cluster manager daemon
 Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; enabled; vendor
preset: enabled)
Drop-In: /usr/lib/systemd/system/ceph-mgr@.service.d
 └─ceph-after-pve-cluster.conf
 Active: active (running) since Mon 2022-03-14 00:34:34 CDT; 1min 59s
ago
   Main PID: 864833 (ceph-mgr)
  Tasks: 9 (limit: 19118)
 Memory: 10.2M
CPU: 99ms
 CGroup: /system.slice/system-ceph\x2dmgr.slice/ceph-mgr@node7.service
 └─864833 /usr/bin/ceph-mgr -f --cluster ceph --id node7
--setuser ceph --setgroup ceph

Mar 14 00:34:34 node7 systemd[1]: Started Ceph cluster manager daemon.

● ceph-osd@6.service - Ceph object storage daemon osd.6
 Loaded: loaded (/lib/systemd/system/ceph-osd@.service;
enabled-runtime; vendor preset: enabled)
Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
 └─ceph-after-pve-cluster.conf
 Active: active (running) since Mon 2022-03-14 00:34:34 CDT; 1min 59s
ago
   Main PID: 864839 (ceph-osd)
  Tasks: 9
 Memory: 10.1M
CPU: 100ms
 CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@6.service
 └─864839 /usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser
ceph --setgroup ceph

Mar 14 00:34:34 node7 systemd[1]: Starting Ceph object storage daemon
osd.6...



Found node900 had its mon down and the ceph-osd daemon down.
ceph crash dump was working, however, lol.
osd.11 was active.

Not seeing the other OSD though... wondering what is going on...
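
Current status from node900 is below; thinking of poking at the down daemons
with something like this next (service names are assumed to follow the same
pattern as on the other nodes):

```
# hedged sketch: inspect/restart the daemons that show as down on node900
sudo systemctl status ceph-mon@node900.service
sudo systemctl restart ceph-mon@node900.service
ceph osd tree down        # list only the OSDs currently marked down
ceph crash ls             # crash module still answers, so list recent crashes
```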

root@node900:/etc/ceph# sudo systemctl status ceph\*.service ceph\*.target
● ceph-mgr.target - ceph target allowing to start/stop all ceph-mgr@.service
instances at once
 Loaded: loaded (/lib/systemd/system/ceph-mgr.target; enabled; vendor
preset: enabled)
 Active: active since Mon 2022-03-14 00:34:25 CDT; 3min 37s ago

Mar 14 00:34:25 node900 systemd[1]: Reached target ceph target allowing to
start/stop all ceph-mgr@.service instances at once.

● ceph-o

[ceph-users] Ceph-CSI and OpenCAS

2022-03-14 Thread Martin Plochberger
Hello, ceph-users community

I have watched the recording of "Ceph Performance Meeting 2022-03-03" (in
the Ceph channel, link https://www.youtube.com/watch?v=syq_LTg25T4) about
OpenCAS and block caching yesterday and it was really informative to me (I
especially liked the part where the filtering options are talked about ;)).

My question after watching the meeting:

Is there some documentation on compatibility/problems with ceph-csi (
https://github.com/ceph/ceph-csi)?

I started searching for further reading material (OpenCAS and ceph-csi)
online today but was unsuccessful so far :).

Any pointers/links would be appreciated while I continue my search.

Cheers, M
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Scrubbing

2022-03-14 Thread Ray Cunningham
Thank you, Denis! We have made most of these changes and are waiting to see 
what happens. 

Thank you,
Ray 

-Original Message-
From: Denis Polom  
Sent: Saturday, March 12, 2022 1:40 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Scrubbing

Hi,

I had a similar problem on my large cluster.

What I found that helped me solve it:

Due to bad drives, and drives being replaced too often because of scrub errors, 
there were always some recovery operations going on.

I did set this:

osd_scrub_during_recovery true

and it basically solved my issue.

If not, then you can try changing the interval.

I also changed it from the default of once per week to two weeks:

osd_deep_scrub_interval 1209600

and if you want or need to speed it up to get rid of "not scrubbed in time" PGs, 
take a look at

osd_max_scrubs

The default is 1; if I need to speed it up I set it to 3, and I didn't notice 
any performance impact.
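
If it helps, a minimal sketch of how I'd apply those values at runtime through
the config database (the numbers are just the ones above, adjust to your needs):

```
ceph config set osd osd_scrub_during_recovery true
ceph config set osd osd_deep_scrub_interval 1209600   # two weeks, in seconds
ceph config set osd osd_max_scrubs 3                  # default is 1
```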


dp

On 3/11/22 17:32, Ray Cunningham wrote:
> That's what I thought. We looked at the cluster storage nodes and found them 
> all to be less than .2 normalized maximum load.
>
> Our 'normal' BW for client IO according to ceph -s is around 60MB/s-100MB/s. 
> I don't usually look at the IOPs so I don't have that number right now. We 
> have seen GB/s numbers during repairs, so the cluster can get up there when 
> the workload requires.
>
> We discovered that this system never got the auto repair setting configured 
> to true and since we turned that on, we have been repairing PGs for the past 
> 24 hours. So, maybe we've been bottlenecked by those?
>
> Thank you,
> Ray
>   
>
> -Original Message-
> From: norman.kern 
> Sent: Thursday, March 10, 2022 9:27
> To: Ray Cunningham 
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: Scrubbing
>
> Ray,
>
> You can use node-exporter+prom+grafana  to collect the load of CPUs 
> statistics. You can use uptime command to get the current statistics.
>
> On 3/10/22 10:51 PM, Ray Cunningham wrote:
>> From:
>>
>> osd_scrub_load_threshold
>> The normalized maximum load. Ceph will not scrub when the system load (as 
>> defined by getloadavg() / number of online CPUs) is higher than this number. 
>> Default is 0.5.
>>
>> Does anyone know how I can run getloadavg() / number of online CPUs so I can 
>> see what our load is? Is that a ceph command, or an OS command?
>>
>> Thank you,
>> Ray
>>
>>
>> -Original Message-
>> From: Ray Cunningham
>> Sent: Thursday, March 10, 2022 7:59 AM
>> To: norman.kern 
>> Cc: ceph-users@ceph.io
>> Subject: RE: [ceph-users] Scrubbing
>>
>>
>> We have 16 Storage Servers each with 16TB HDDs and 2TB SSDs for DB/WAL, so 
>> we are using bluestore. The system is running Nautilus 14.2.19 at the 
>> moment, with an upgrade scheduled this month. I can't give you a complete 
>> ceph config dump as this is an offline customer system, but I can get 
>> answers for specific questions.
>>
>> Off the top of my head, we have set:
>>
>> osd_max_scrubs 20
>> osd_scrub_auto_repair true
>> osd _scrub_load_threashold 0.6
>> We do not limit srub hours.
>>
>> Thank you,
>> Ray
>>
>>
>>
>>
>> -Original Message-
>> From: norman.kern 
>> Sent: Wednesday, March 9, 2022 7:28 PM
>> To: Ray Cunningham 
>> Cc: ceph-users@ceph.io
>> Subject: Re: [ceph-users] Scrubbing
>>
>> Ray,
>>
>> Can you  provide more information about your cluster(hardware and software 
>> configs)?
>>
>> On 3/10/22 7:40 AM, Ray Cunningham wrote:
>>> make any difference. Do
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
>> email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-03-14 Thread Sebastian Mazza
Hello Igor,

I'm glad I could be of help. Thank you for your explanation!

>  And I was right this is related to deferred write procedure  and apparently 
> fast shutdown mode.

Does that mean I can prevent the error in the meantime, before you can fix the 
root cause, by disabling osd_fast_shutdown?
Can it be disabled by the following statements in the ceph.conf?
```
[osd]
fast shutdown = false
```
(I wasn’t able to find any documentation for “osd_fast_shutdown”, e.g.: 
https://docs.ceph.com/en/pacific/rados/configuration/osd-config-ref/ does not 
contain “osd_fast_shutdown”)
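
(The built-in config schema might still know it; if I understand the tooling
right, something like this should show the option and its default:)
```
$ ceph config help osd_fast_shutdown
```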


> 1) From the log I can see you're using RBD over EC pool for the repro, right? 
> What's the EC profile?
Yes, one of the two EC pools contains one RBD image and the other EC pool is 
used for cephFS.

Rules from my crush map:
```
rule ec4x2hdd {
id 0
type erasure
min_size 8
max_size 8
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default class hdd
step choose indep 4 type host
step choose indep 2 type osd
step emit
}
rule c3nvme {
id 1
type replicated
min_size 1
max_size 10
step take default class nvme
step chooseleaf firstn 0 type host
step emit
}
```
The EC rule requires 4 hosts, but as I have already mentioned, the cluster only 
has 3 nodes, since I never added the fourth node after the problem occurred the 
first time.

I created the EC profile by: 
```
$ ceph osd erasure-code-profile set ec4x2hdd-profile \
k=5 \
m=3 \
crush-root=ec4x2hdd
```

And the EC pools by:
```
$ ceph osd pool create fs.data-hdd.ec-pool 64 64 erasure ec4x2hdd-profile 
ec4x2hdd warn 36  # 36TB
$ ceph osd pool create block-hdd.ec-pool 16 16 erasure ec4x2hdd-profile 
ec4x2hdd warn 1 #  1TB
```

Replicated pools on NVMes
```
ceph osd pool create block-nvme.c-pool 32 32 replicated c3nvme 0 warn 
549755813888  # 512 GB
ceph osd pool create fs.metadata-root-pool 8 8 replicated c3nvme 0 warn 
4294967296  # 4 GB
ceph osd pool create fs.data-root-pool 16 16 replicated c3nvme 0 warn 
137438953472  # 128 GB
```

enable overwrite support:
```
$ ceph osd pool set fs.data-hdd.ec-pool allow_ec_overwrites true
$ ceph osd pool set block-hdd.ec-pool allow_ec_overwrites true
```

Set Application:
```
$ ceph osd pool application enable fs.data-hdd.ec-pool cephfs
$ ceph osd pool application enable "block-hdd.ec-pool" rbd
```

Use the rbd tool to initialize the pool for use by RBD:
```
$ rbd pool init "block-nvme.c-pool"
$ rbd pool init "block-hdd.ec-pool"
```

Create RBD image:
```
$ rbd --pool "block-nvme.c-pool" create vm-300-disk-1 --data-pool 
"block-hdd.ec-pool" --size 50G
```

Create cephFS:
```
$ ceph osd pool create fs.metadata-root-pool 8 8 replicated c3nvme 0 warn 
4294967296# 4 GB
$ ceph osd pool create fs.data-root-pool 16 16 replicated c3nvme 0 warn 
137438953472# 128 GB
$ ceph fs new cephfs fs.metadata-root-pool fs.data-root-pool
```

Add the EC pool to the cephFS:
```
$ ceph fs add_data_pool cephfs fs.data-hdd.ec-pool
```

Set the Layout of the directory “shared” so that files that are created inside 
the folder “shares” uses the EC pool:
```
$ setfattr -n ceph.dir.layout.pool -v fs.data-hdd.ec-pool 
/mnt/cephfs-olymp/shares/
```
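
(If useful, the layout can be read back afterwards to double-check it:)
```
$ getfattr -n ceph.dir.layout /mnt/cephfs-olymp/shares/
```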



> 2) How did you generate the load during last experiments? Was it some 
> benchmarking tool or any other artificial load generator? If so could you 
> share job desriptions or scripts if any?

I did not use a benchmarking tool or any other artificial load generator. The 
RBD image “vm-300-disk-1” is formatted as BTRFS. I used rsync to copy daily 
website backups to the volume. The backups are versioned and go back 10 days. 
If a new version is created, a new folder is added and all files that did not 
change are hard linked to the file in the folder of the last day. A new or 
updated file is copied from the production server. 
This backup procedure was not done on the ceph cluster. It is done on another 
server; I just copied the web backup directory of that server to the RBD image. 
However, I used rsync with the "-H” flag to copy the data from the “productive” 
backup server to the RBD image, therefore the hard links are preserved.
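
Roughly, the copy looks like this (host and paths below are placeholders, not 
the real ones):
```
# sketch: mirror the versioned backup tree while preserving hard links (-H)
$ rsync -aH backupserver:/srv/web-backups/ /mnt/vm-300-disk-1/web-backups/
```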

I copied shared group folders and Apple Time Machine backups to CephFS, also 
with rsync from another server. The shared user folder contains typical office 
data: text documents, CSV files, JPEG images, and so on. The Apple Time Machine 
backups are basically folders with lots of 8MB or 32MB files that contain 
encrypted data.


In order to reproduce the RocksDB corruption I started the rsync process 
whenever I found some time. I waited until rsync had found at least some files 
that had changed and copied them to the RBD image or CephFS. Then I triggered a 
shutdown on all 3 cluster nodes.

However, to the best of my knowledge, there was no IO at all between the last 3 
resta

[ceph-users] Re: Ceph-CSI and OpenCAS

2022-03-14 Thread Mark Nelson

Hi Martin,


I believe RH's reference architecture team has deployed ceph with CAS 
(and perhaps Open CAS when it was open sourced), but I'm not sure if 
there's been any integration work done yet with ceph-csi. Theoretically 
it should be fairly easy though, since the OSD will just treat it as a 
generic block device. As far as the underlying daemons go, I don't 
think there's anything really preventing it from working. It's 
primarily setup/orchestration/UI/testing that's holding it back. The 
testing piece is probably the biggest hurdle (the same goes for dm-cache 
and other options that might be worth considering).
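
To make the "generic block device" point concrete, here is a rough sketch 
(device names and cache mode are just examples, and I haven't validated this 
with ceph-csi):

```
# hedged sketch: put an Open CAS write-back cache in front of an HDD and hand
# the exported device to ceph-volume like any other disk
casadm -S -d /dev/nvme0n1p1 -c wb     # start a cache instance on the NVMe partition
casadm -A -i 1 -d /dev/sdb            # add the HDD as a core device -> /dev/cas1-1
ceph-volume lvm create --data /dev/cas1-1
```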


One of the reasons I wanted to have that meeting was to get the ball 
rolling and start thinking about what caching will look like if we move 
further away from RADOS-level cache tiering.



Mark


On 3/14/22 07:19, Martin Plochberger wrote:

Hello, ceph-users community

I have watched the recording of "Ceph Performance Meeting 2022-03-03" (in
the Ceph channel, link https://www.youtube.com/watch?v=syq_LTg25T4) about
OpenCAS and block caching yesterday and it was really informative to me (I
especially liked the part where the filtering options are talked about ;)).

My question after watching the meeting:

Is there some documentation on compatibility/problems with ceph-csi (
https://github.com/ceph/ceph-csi)?

I started searching for further reading material (OpenCAS and ceph-csi)
online today but was unsuccessful so far :).

Any pointers/links would be appreciated while I continue my search.

CherrsM
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-03-14 Thread Igor Fedotov

Hi Sebastian,

the proper parameter name is 'osd fast shutdown'.

As with any other OSD config parameter, one can use either ceph.conf or the 
'ceph config set osd.N osd_fast_shutdown false' command to adjust it.


I'd recommend the latter form.
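
For example (osd.6 below is just a placeholder id; use plain 'osd' as the 
target instead to apply it to all OSDs):

```
ceph config set osd.6 osd_fast_shutdown false
ceph config get osd.6 osd_fast_shutdown    # should now report false
```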

And yeah, from my last experiments it looks like setting it to false is indeed 
a workaround...


I've managed to reproduce the bug locally and it doesn't occur without 
fast shutdown.


Thanks for all the information you shared, highly appreciated!

FYI: I've created another ticket (I missed that I already had one): 
https://tracker.ceph.com/issues/54547


So please expect all the related news there.


Kind regards,

Igor



On 3/14/2022 5:54 PM, Sebastian Mazza wrote:

Hallo Igor,

I'm glad I could be of help. Thank you for your explanation!


  And I was right this is related to deferred write procedure  and apparently 
fast shutdown mode.

Does that mean I can prevent the error in the meantime, before you can fix the 
root cause, by disabling osd_fast_shutdown?
Can it be disabled by the following statements in the ceph.conf?
```
[osd]
fast shutdown = false
```
(I wasn’t able to find any documentation for “osd_fast_shutdown”, e.g.: 
https://docs.ceph.com/en/pacific/rados/configuration/osd-config-ref/ does not 
contain “osd_fast_shutdown”)



1) From the log I can see you're using RBD over EC pool for the repro, right? 
What's the EC profile?

Yes, one of the two EC pools contains one RBD image and the other EC pool is 
used for cephFS.

Rules from my crush map:
```
rule ec4x2hdd {
id 0
type erasure
min_size 8
max_size 8
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default class hdd
step choose indep 4 type host
step choose indep 2 type osd
step emit
}
rule c3nvme {
id 1
type replicated
min_size 1
max_size 10
step take default class nvme
step chooseleaf firstn 0 type host
step emit
}
```
The EC rule requires 4 hosts, but as I have already mentioned the cluster does 
only have 3 nodes. Since I did never add the fourth node after the problem 
occurred the first time.

I created the EC profile by:
```
$ ceph osd erasure-code-profile set ec4x2hdd-profile \
k=5 \
m=3 \
crush-root=ec4x2hdd
```

And the EC pools by:
```
$ ceph osd pool create fs.data-hdd.ec-pool 64 64 erasure ec4x2hdd-profile 
ec4x2hdd warn 36  # 36TB
$ ceph osd pool create block-hdd.ec-pool 16 16 erasure ec4x2hdd-profile 
ec4x2hdd warn 1 #  1TB
```

Replicated pools on NVMes
```
ceph osd pool create block-nvme.c-pool 32 32 replicated c3nvme 0 warn 
549755813888  # 512 GB
ceph osd pool create fs.metadata-root-pool 8 8 replicated c3nvme 0 warn 
4294967296  # 4 GB
ceph osd pool create fs.data-root-pool 16 16 replicated c3nvme 0 warn 
137438953472  # 128 GB
```

enable overwrite support:
```
$ ceph osd pool set fs.data-hdd.ec-pool allow_ec_overwrites true
$ ceph osd pool set block-hdd.ec-pool allow_ec_overwrites true
```

Set Application:
```
$ ceph osd pool application enable fs.data-hdd.ec-pool cephfs
$ ceph osd pool application enable "block-hdd.ec-pool" rbd
```

Use the rbd tool to initialize the pool for use by RBD:
```
$ rbd pool init "block-nvme.c-pool"
$ rbd pool init "block-hdd.ec-pool"
```

Create RBD image:
```
$ rbd --pool "block-nvme.c-pool" create vm-300-disk-1 --data-pool 
"block-hdd.ec-pool" --size 50G
```

Create cephFS:
```
$ ceph osd pool create fs.metadata-root-pool 8 8 replicated c3nvme 0 warn 
4294967296# 4 GB
$ ceph osd pool create fs.data-root-pool 16 16 replicated c3nvme 0 warn 
137438953472# 128 GB
$ ceph fs new cephfs fs.metadata-root-pool fs.data-root-pool
```

Add the EC pool to the cephFS:
```
$ ceph fs add_data_pool cephfs fs.data-hdd.ec-pool
```

Set the Layout of the directory “shared” so that files that are created inside 
the folder “shares” uses the EC pool:
```
$ setfattr -n ceph.dir.layout.pool -v fs.data-hdd.ec-pool 
/mnt/cephfs-olymp/shares/
```




2) How did you generate the load during last experiments? Was it some 
benchmarking tool or any other artificial load generator? If so could you share 
job desriptions or scripts if any?

I did not use a benchmarking tool or any other artificial load generator. The 
RBD image “vm-300-disk-1” is formatted as BTRFS. I used rsync to copy daily 
website backups to the volume. The backups are versioned and go back 10 days. 
If a new version is created, a new folder is added and all files that did not 
change are hard linked to the file in the folder of the last day. A new or 
updated file is copied form the production server.
This backup procedure was not done on the ceph cluster. It is done on another 
server, I just copied the web backup directory of that server to the RBD image. 
However, I used rsync with the "-H” flag to copy the data from the “productive” 
backup server to the RBD image, therefore th

[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-03-14 Thread Sebastian Mazza
Hi Igor,

Great that you were able to reproduce it!

I read your comments on issue #54547. Am I right that I probably have hundreds 
of corrupted objects on my EC pools (CephFS and RBD)? But I only ever noticed it 
when a RocksDB was damaged. A deep scrub should find the other errors, right? 
But the PGs of my EC pools were never scrubbed, because they are all 
“undersized” and “degraded”. Is it possible to scrub an “undersized” / 
“degraded” PG? I’m not able to (deep) scrub any of the PGs on my EC pools. I 
already set osd_scrub_during_recovery = true, but this does not help. 
Scrubbing and deep scrubbing work as expected on the replicated pools.
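
For reference, this is roughly what I mean by trying to scrub manually (the PG 
id below is just an example):
```
$ ceph pg deep-scrub 4.12                      # request a deep scrub of one PG
$ ceph pg 4.12 query | grep -E '"state"|last_deep_scrub'
```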


Thank you very much for helping me with my problem so patiently over the last 
3 months!


Thanks,
Sebastian


> On 14.03.2022, at 17:07, Igor Fedotov  wrote:
> 
> Hi Sebastian,
> 
> the proper parameter name is 'osd fast shutdown".
> 
> As with any other OSD config parameter one can use either ceph.conf or 'ceph 
> config set osd.N osd_fast_shutdown false' command to adjust it.
> 
> I'd recommend the latter form.
> 
> And yeah from my last experiments it looks like setting it to false it's a 
> workaround indeed...
> 
> I've managed to reproduce the bug locally and it doesn't occur without fast 
> shutdown.
> 
> Thanks for all the information you shared, highly appreciated!
> 
> FYI: I've created another ticket (missed I already had one): 
> https://tracker.ceph.com/issues/54547
> 
> So please expect all the related new there.
> 
> 
> Kind regards,
> 
> Igor
> 
> 
> 
> On 3/14/2022 5:54 PM, Sebastian Mazza wrote:
>> Hallo Igor,
>> 
>> I'm glad I could be of help. Thank you for your explanation!
>> 
>>>  And I was right this is related to deferred write procedure  and 
>>> apparently fast shutdown mode.
>> Does that mean I can prevent the error in the meantime, before you can fix 
>> the root cause, by disabling osd_fast_shutdown?
>> Can it be disabled by the following statements in the ceph.conf?
>> ```
>> [osd]
>>  fast shutdown = false
>> ```
>> (I wasn’t able to find any documentation for “osd_fast_shutdown”, e.g.: 
>> https://docs.ceph.com/en/pacific/rados/configuration/osd-config-ref/ does 
>> not contain “osd_fast_shutdown”)
>> 
>> 
>>> 1) From the log I can see you're using RBD over EC pool for the repro, 
>>> right? What's the EC profile?
>> Yes, one of the two EC pools contains one RBD image and the other EC pool is 
>> used for cephFS.
>> 
>> Rules from my crush map:
>> ```
>> rule ec4x2hdd {
>>  id 0
>>  type erasure
>>  min_size 8
>>  max_size 8
>>  step set_chooseleaf_tries 5
>>  step set_choose_tries 100
>>  step take default class hdd
>>  step choose indep 4 type host
>>  step choose indep 2 type osd
>>  step emit
>> }
>> rule c3nvme {
>>  id 1
>>  type replicated
>>  min_size 1
>>  max_size 10
>>  step take default class nvme
>>  step chooseleaf firstn 0 type host
>>  step emit
>> }
>> ```
>> The EC rule requires 4 hosts, but as I have already mentioned the cluster 
>> does only have 3 nodes. Since I did never add the fourth node after the 
>> problem occurred the first time.
>> 
>> I created the EC profile by:
>> ```
>> $ ceph osd erasure-code-profile set ec4x2hdd-profile \
>>  k=5 \
>>  m=3 \
>>  crush-root=ec4x2hdd
>> ```
>> 
>> And the EC pools by:
>> ```
>> $ ceph osd pool create fs.data-hdd.ec-pool 64 64 erasure ec4x2hdd-profile 
>> ec4x2hdd warn 36   # 36TB
>> $ ceph osd pool create block-hdd.ec-pool 16 16 erasure ec4x2hdd-profile 
>> ec4x2hdd warn 1  #  1TB
>> ```
>> 
>> Replicated pools on NVMes
>> ```
>> ceph osd pool create block-nvme.c-pool 32 32 replicated c3nvme 0 warn 
>> 549755813888   # 512 GB
>> ceph osd pool create fs.metadata-root-pool 8 8 replicated c3nvme 0 warn 
>> 4294967296   # 4 GB
>> ceph osd pool create fs.data-root-pool 16 16 replicated c3nvme 0 warn 
>> 137438953472   # 128 GB
>> ```
>> 
>> enable overwrite support:
>> ```
>> $ ceph osd pool set fs.data-hdd.ec-pool allow_ec_overwrites true
>> $ ceph osd pool set block-hdd.ec-pool allow_ec_overwrites true
>> ```
>> 
>> Set Application:
>> ```
>> $ ceph osd pool application enable fs.data-hdd.ec-pool cephfs
>> $ ceph osd pool application enable "block-hdd.ec-pool" rbd
>> ```
>> 
>> Use the rbd tool to initialize the pool for use by RBD:
>> ```
>> $ rbd pool init "block-nvme.c-pool"
>> $ rbd pool init "block-hdd.ec-pool"
>> ```
>> 
>> Create RBD image:
>> ```
>> $ rbd --pool "block-nvme.c-pool" create vm-300-disk-1 --data-pool 
>> "block-hdd.ec-pool" --size 50G
>> ```
>> 
>> Create cephFS:
>> ```
>> $ ceph osd pool create fs.metadata-root-pool 8 8 replicated c3nvme 0 warn 
>> 4294967296 # 4 GB
>> $ ceph osd pool create fs.data-root-pool 16 16 replicated c3nvme 0 warn 
>> 137438953472 # 128 GB
>> $ ceph fs new cephfs fs.metadata-root-pool fs.data-root-pool
>> ```
>> 
>> Add the EC pool to the cephFS:
>> ```
>> $ ceph fs add_data_pool cephfs fs.data-hdd

[ceph-users] Re: How often should I scrub the filesystem ?

2022-03-14 Thread Milind Changire
I've created a tracker https://tracker.ceph.com/issues/54557 to track this
issue.
Thanks Chris, for bringing this to my attention.

Regards,
Milind


On Sun, Mar 13, 2022 at 1:11 AM Chris Palmer  wrote:

> Hi Miland (or anyone else who can help...)
>
> Reading this thread made me realise I had overlooked cephfs scrubbing, so
> i tried it on a small 16.2.7 cluster. The normal forward scrub showed
> nothing. However "ceph tell mds.0 scrub start ~mdsdir recursive" did find
> one backtrace error (putting the cluster into HEALTH_ERR). I then did a
> repair which according to the log did rewrite the inode, and subsequent
> scrubs have not found it.
>
> However the cluster health is still ERR, and the MDS still shows the
> damage:
>
> ceph@1:~$ ceph tell mds.0 damage ls
> 2022-03-12T18:42:01.609+ 7f1b817fa700  0 client.173985213 ms_handle_reset 
> on v2:192.168.80.121:6824/939134894
> 2022-03-12T18:42:01.625+ 
>  
> 7f1b817fa700  0 client.173985219 ms_handle_reset on 
> v2:192.168.80.121:6824/939134894
> [
> {
> "damage_type": "backtrace",
> "id": 3308827822,
> "ino": 256,
> "path": "~mds0"
> }
> ]
>
> What are the right steps from here? Has the error actually been corrected
> but just needs clearing or is it still there?
>
> In case it is relevant: there is one active and two standby MDS. The log
> is from the node currently hosting rank 0.
>
> From the mds log:
>
> 2022-03-12T18:13:41.593+ 7f61d30c1700  1 mds.1 asok_command: scrub 
> start {path=~mdsdir,prefix=scrub start,scrubops=[recursive]} (starting...)
> 2022-03-12T18:13:41.593+ 7f61cb0b1700  0 log_channel(cluster) log [INF] : 
> scrub queued for path: ~mds0
> 2022-03-12T18:13:41.593+ 7f61cb0b1700  0 log_channel(cluster) log [INF] : 
> scrub summary: idle+waiting paths [~mds0]
> 2022-03-12T18:13:41.593+ 7f61cb0b1700  0 log_channel(cluster) log [INF] : 
> scrub summary: active paths [~mds0]
> 2022-03-12T18:13:41.601+ 7f61cb0b1700  0 log_channel(cluster) log [WRN] : 
> Scrub error on inode 0x100 (~mds0) see mds.1 log and `damage ls` output 
> for details
> 2022-03-12T18:13:41.601+ 7f61cb0b1700 -1 mds.0.scrubstack 
> _validate_inode_done scrub error on inode [inode 0x100 [...2,head] ~mds0/ 
> auth v6798 ap=1 snaprealm=0x55d59548
> 4800 f(v0 10=0+10) n(v1815 rc2022-03-12T16:01:44.218294+ b1017620718 
> 375=364+11)/n(v0 rc2019-10-29T10:52:34.302967+ 11=0+11) (inest lock) 
> (iversion lock) | dirtysca
> ttered=0 lock=0 dirfrag=1 openingsnapparents=0 dirty=1 authpin=1 scrubqueue=0 
> 0x55d595486000]: 
> {"performed_validation":true,"passed_validation":false,"backtrace":{"checked"
> :true,"passed":false,"read_ret_val":-61,"ondisk_value":"(-1)0x0:[]//[]","memoryvalue":"(11)0x100:[]//[]","error_str":"failed
>  to read off disk; see retval"},"raw_stats":{"ch
> ecked":true,"passed":true,"read_ret_val":0,"ondisk_value.dirstat":"f(v0 
> 10=0+10)","ondisk_value.rstat":"n(v0 rc2022-03-12T16:01:44.218294+ 
> b1017620718 375=364+11)","mem
> ory_value.dirstat":"f(v0 10=0+10)","memory_value.rstat":"n(v1815 
> rc2022-03-12T16:01:44.218294+ b1017620718 
> 375=364+11)","error_str":""},"return_code":-61}
> 2022-03-12T18:13:41.601+ 7f61cb0b1700  0 log_channel(cluster) log [INF] : 
> scrub summary: idle+waiting paths [~mds0]
> 2022-03-12T18:13:45.317+ 7f61cf8ba700  0 log_channel(cluster) log [INF] : 
> scrub summary: idle
>
> 2022-03-12T18:13:52.881+ 7f61d30c1700  1 mds.1 asok_command: scrub 
> start {path=~mdsdir,prefix=scrub start,scrubops=[recursive,repair]} 
> (starting...)
> 2022-03-12T18:13:52.881+ 7f61cb0b1700  0 log_channel(cluster) log [INF] : 
> scrub queued for path: ~mds0
> 2022-03-12T18:13:52.881+ 7f61cb0b1700  0 log_channel(cluster) log [INF] : 
> scrub summary: idle+waiting paths [~mds0]
> 2022-03-12T18:13:52.881+ 7f61cb0b1700  0 log_channel(cluster) log [INF] : 
> scrub summary: active paths [~mds0]
> 2022-03-12T18:13:52.881+ 7f61cb0b1700  0 log_channel(cluster) log [WRN] : 
> bad backtrace on inode 0x100(~mds0), rewriting it
> 2022-03-12T18:13:52.881+ 7f61cb0b1700  0 log_channel(cluster) log [INF] : 
> Scrub repaired inode 0x100 (~mds0)
> 2022-03-12T18:13:52.881+ 7f61cb0b1700 -1 mds.0.scrubstack 
> _validate_inode_done scrub error on inode [inode 0x100 [...2,head] ~mds0/ 
> auth v6798 ap=1 snaprealm=0x55d595484800 DIRTYPARENT f(v0 10=0+10) n(v1815 
> rc2022-03-12T16:01:44.218294+ b1017620718 375=364+11)/n(v0 
> rc2019-10-29T10:52:34.302967+ 11=0+11) (inest lock) (iversion lock) | 
> dirtyscattered=0 lock=0 dirfrag=1 openingsnapparents=0 dirtyparent=1 dirty=1 
> authpin=1 scrubqueue=0 0x55d595486000]: 
> {"performed_validation":true,"passed_validation":false,"backtrace":{"checked":true,"passed":false,"read_ret_val":-61,"ondisk_value":"(-1)0x0:[]//[]","memoryvalue":"(11)0x100:[]//[]","error_str":"failed
>  to read off disk; see 
> retval"},"raw_stats":{"che