[ceph-users] Re: ceph-volume lvm zap destroyes up+in OSD

2022-11-23 Thread Eugen Block

Hi,

I can confirm the behavior for Pacific version 16.2.7. I checked with  
a Nautilus test cluster and there it seems to work as expected. I  
tried to zap a db device and then restarted one of the OSDs,  
successfully. So there seems to be a regression somewhere. I didn't  
search for tracker issues yet, but this seems to be worth one, right?


Zitat von Frank Schilder :


Hi all,

on our octopus-latest cluster I accidentally destroyed an up+in OSD  
with the command line


  ceph-volume lvm zap /dev/DEV

It executed the dd command and then failed at the lvm commands with  
"device busy". Problem number one is that the OSD continued working  
fine; hence, there is no indication of a corruption - it's a silent  
corruption. Problem number two - the real one - is: why is  
ceph-volume not checking whether the OSD that device belongs to is still  
up+in? "ceph osd destroy" does that, for example. I seem to  
remember that "ceph-volume lvm zap --osd-id" also checks, but I'm  
not sure.


Has this been changed in versions later than octopus?

I think it is extremely dangerous to provide a tool that allows the  
silent corruption of an entire ceph cluster. The corruption is only  
discovered on restart, and then it is too late (unless there is  
an unofficial recovery procedure somewhere).


I would prefer that ceph-volume lvm zap employ the same strict  
sanity checks as other ceph commands to avoid accidents. In my case  
it was a typo, one wrong letter.
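
A minimal sketch of the kind of pre-check meant here, written as a wrapper around the
zap call. It assumes jq is available and that the JSON output of "ceph-volume lvm list"
and "ceph osd tree" looks as it does on recent releases, so treat it as an illustration
rather than a tested safeguard:

  #!/bin/bash
  # Refuse to zap a device whose OSD is still up (illustrative sketch only).
  DEV="$1"
  OSD_ID=$(ceph-volume lvm list --format json | jq -r --arg dev "$DEV" \
      '.[][] | select(.devices[]? == $dev) | .tags["ceph.osd_id"]' | head -n1)
  if [ -n "$OSD_ID" ] && [ "$OSD_ID" != "null" ]; then
      STATUS=$(ceph osd tree --format json | jq -r --argjson id "$OSD_ID" \
          '.nodes[] | select(.id == $id) | .status')
      if [ "$STATUS" = "up" ]; then
          echo "refusing to zap $DEV: osd.$OSD_ID is still up" >&2
          exit 1
      fi
  fi
  ceph-volume lvm zap "$DEV"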


Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS internal op exportdir despite ephemeral pinning

2022-11-23 Thread Frank Schilder
Hi Patrick and everybody,

I wrote a small script that pins the immediate children of 3 sub-dirs on our 
file system in a round-robin way to our 8 active ranks. I think the experience 
is worth reporting here. In any case, Patrick, if you can help me get 
distributed ephemeral pinning to work, that would be great, as the automatic pin 
updates when changing the size of the MDS cluster would simplify life as an 
admin a lot.
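
For context, a minimal sketch of such a round-robin pinning loop using the ceph.dir.pin
xattr; the mount point, directory names and rank count below are placeholders and not
the actual script:

  RANKS=8
  rank=0
  for parent in /mnt/cephfs/groups /mnt/cephfs/home /mnt/cephfs/scratch; do
      for d in "$parent"/*/; do
          setfattr -n ceph.dir.pin -v "$rank" "$d"   # pin this child to one rank
          rank=$(( (rank + 1) % RANKS ))
          sleep 1                                    # pace the pin updates
      done
  done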

Before starting the script, the load balancer had created and distributed about 
30K sub-trees over the MDSes. Running the script and setting the pins (with a 
sleep 1 in between) immediately triggered a re-distribution and consolidation 
of sub-trees. They were consolidated on the MDSes they were pinned to. During 
this process there were no health issues. The process took a few minutes to complete.

After that, we ended up with very few sub-trees. Today, the distribution looks 
like this (ceph tell mds.$mds get subtrees | grep '"path":' | wc -l):

ceph-14: 27
ceph-16: 107
ceph-23: 39
ceph-13: 32
ceph-17: 27
ceph-11: 55
ceph-12: 49
ceph-10: 24

Rank 1 (ceph-16) has a few more sub-trees pinned to it by hand, but these are not very 
active.

After the sub-tree consolidation completed, there was suddenly very low load on 
the MDS cluster and the meta-data pools. Also, the MDS daemons went down in CPU 
load to 5-10% compared with the usual 80-140%.

At first I thought things had gone bad, but logging in to a client showed there 
were no problems. I ran a standard benchmark and noticed a 3 to 4 times 
increase in single-thread IOPS performance! What I also see is that the MDS 
cache allocation is very stable now; they need much less RAM compared with 
before and they don't thrash much. No file-system related slow OPS/requests 
warnings in the logs any more! I used to have exportdir/rejoin/behind-on-trimming 
a lot, it's all gone.

Conclusion: The built-in dynamic load balancer seems to have been responsible 
for 90-95% of the FS load - completely artificial internal load that was 
greatly limiting client performance. I think making the internal load balancer 
much less aggressive would help a lot. The default could be a round-robin pin of 
low-depth sub-dirs, and then changing a pin every few hours based on a number of 
activity metrics over, say, 7 days, 1 day and 4 hours, to aim for a long-term 
stable pin distribution.

For example, on our cluster it would be completely sufficient if the 2-3 busiest 
high-level sub-tree pins were considered for moving every 24h. Also, 
considering sub-trees very deep in the hierarchy seems pointless. A balancer 
sub-tree max-depth setting to limit the depth the load balancer looks at would 
probably improve things. I had a high-level sub-dir distributed over 10K 
sub-trees, which really didn't help performance at all.

If anyone has the dynamic balancer in action, intentionally or not, it might be 
worth trying to pin everything up to a depth of 2-3 in the FS tree.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Patrick Donnelly 
Sent: 19 November 2022 01:52:02
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] MDS internal op exportdir despite ephemeral pinning

On Fri, Nov 18, 2022 at 2:32 PM Frank Schilder  wrote:
>
> Hi Patrick,
>
> we plan to upgrade next year. Can't do any faster. However, distributed 
> ephemeral pinning was introduced with octopus. It was one of the major new 
> features and is explained in the octopus documentation in detail.
>
> Are you saying that it is actually not implemented?
> If so, how much of the documentation can I trust?

Generally you can trust the documentation. There are configurations
gating these features, as you're aware. While the documentation didn't
say as much, that indicates they are "previews".
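
For readers following along, a rough sketch of the gating configuration and the
per-directory attribute involved (the path is a placeholder); whether this is sufficient
on a given release is exactly what is being discussed here:

  ceph config set mds mds_export_ephemeral_distributed true
  setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/home
  getfattr -n ceph.dir.pin.distributed /mnt/cephfs/home   # verify the attribute was set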

> If it is implemented, I would like to get it working - if this is possible at 
> all. Would you still take a look at the data?

I'm willing to look.

--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: *****SPAM***** Re: CephFS performance

2022-11-23 Thread Marc
> 
> That said, if you've been happy using CephFS with hard drives and
> gigabit ethernet, it will be much faster if you store the metadata on
> SSD and can increase the size of the MDS cache in memory

Is using multiple adapters already supported? That seems desirable when using 
1Gbit.

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg35474.html

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-volume lvm zap destroyes up+in OSD

2022-11-23 Thread Frank Schilder
Hi Eugen,

can you confirm that the silent corruption also happens on a collocated OSD 
(everything on the same device) on Pacific? The zap command should simply exit 
with "osd not down+out", or at least not do anything.

If this accidentally destructive behaviour is still present, I think it is 
worth a ticket. Since I can't test on versions higher than octopus yet, could 
you then open the ticket?

Thanks!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: 23 November 2022 09:27:22
To: ceph-users@ceph.io
Subject: [ceph-users] Re: ceph-volume lvm zap destroyes up+in OSD

Hi,

I can confirm the behavior for Pacific version 16.2.7. I checked with
a Nautilus test cluster and there it seems to work as expected. I
tried to zap a db device and then restarted one of the OSDs,
successfully. So there seems to be a regression somewhere. I didn't
search for tracker issues yet, but this seems to be worth one, right?

Zitat von Frank Schilder :

> Hi all,
>
> on our octopus-latest cluster I accidentally destroyed an up+in OSD
> with the command line
>
>   ceph-volume lvm zap /dev/DEV
>
> It executed the dd command and then failed at the lvm commands with
> "device busy". Problem number one is, that the OSD continued working
> fine. Hence, there is no indication of a corruption, its a silent
> corruption. Problem number two - the real one - is, why is
> ceph-colume not checking if the OSD that device belongs to is still
> up+in? "ceph osd destroy" does that, for example. I believe to
> remember that "ceph-volume lvm zap --osd-id" also checks, but I'm
> not sure.
>
> Has this been changed in versions later than octopus?
>
> I think it is extremely dangerous to provide a tool that allows the
> silent corruption of an entire ceph cluster. The corruption is only
> discovered on restart, and then it is too late (unless there is
> an unofficial recovery procedure somewhere).
>
> I would prefer that ceph-volume lvm zap employs the same strict
> sanity checks as other ceph-commands to avoid accidents. In my case
> it was a typo, one wrong letter.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] radosgw-admin bucket check --fix returns a lot of errors (unable to find head object data)

2022-11-23 Thread Boris Behrens
Hi,
we have a customer that has some _multipart_ files in his bucket, but the
bucket has no unfinished multipart objects.
So I tried to remove them via

$ radosgw-admin object rm --bucket BUCKET
--object=_multipart_OBJECT.qjqyT8bXiWW5jdbxpVqHxXnLWOG3koUi.1
ERROR: object remove returned: (2) No such file or directory

Doing this with --debug_ms=1 I see this line:
 osd_op_reply(108
ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2297644274.57___multipart_OBJECT.qjqyT8bXiWW5jdbxpVqHxXnLWOG3koUi.1
[getxattrs,stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8
 345+0+0 (crc 0 0 0) 0x7f35ac01c040 con 0x55a8b470fba0

I then tried to remove the leading _ from the object name, but this also
did not work.
Then I proceeded to remove the rados object and just do a
$ radosgw-admin bucket check --fix --bucket BUCKET
...
2022-11-23T11:21:14.214+ 7fd50a8fd980  0 int
RGWRados::check_disk_state(librados::v14_2_0::IoCtx, const RGWBucketInfo&,
rgw_bucket_dir_entry&, rgw_bucket_dir_entry&, ceph::bufferlist&,
optional_yield) WARNING: unable to find head object data pool for
"BUCKET:SOME_AVAILABLE_OBJECT", not updating version pool/epoch
...

And the calculated bucket size is still the same, and the index does not
get updated.

What to do now?

-- 
The self-help group "UTF-8 problems" will meet in the large hall this time,
as an exception.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] filesystem became read only after Quincy upgrade

2022-11-23 Thread Adrien Georget

Hi,

This morning we upgraded a Pacific Ceph cluster to the latest Quincy version.
The cluster was healthy before the upgrade, everything was done 
according to the upgrade procedure (non-cephadm) [1], and all services 
restarted correctly, but the filesystem switched to read-only mode when 
it became active.

HEALTH_WARN 1 MDSs are read only
[WRN] MDS_READ_ONLY: 1 MDSs are read only
    mds.cccephadm32(mds.0): MDS in read-only mode

This is the only warning we got on the cluster.
In the MDS log, this error "failed to commit dir 0x1 object, errno -22" 
seems to be the root cause:

2022-11-23T12:41:09.843+0100 7f930f56d700 -1 log_channel(cluster) log [ERR] : failed to commit dir 0x1 object, errno -22
2022-11-23T12:41:09.843+0100 7f930f56d700 -1 mds.0.11963 unhandled write error (22) Invalid argument, force readonly...
2022-11-23T12:41:09.843+0100 7f930f56d700  1 mds.0.cache force file system read-only
2022-11-23T12:41:09.843+0100 7f930f56d700  0 log_channel(cluster) log [WRN] : force file system read-only
2022-11-23T12:41:09.843+0100 7f930f56d700 10 mds.0.server force_clients_readonly


I couldn't get more info with ceph config set mds.x debug_mds 20

ceph fs status
cephfs - 17 clients
==
RANK  STATE      MDS          ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active  cccephadm32  Reqs:    0 /s  12.9k  12.8k   673   1538
      POOL         TYPE     USED  AVAIL
cephfs_metadata  metadata   513G  48.6T
  cephfs_data      data    2558M  48.6T
  cephfs_data2     data     471G  48.6T
  cephfs_data3     data     433G  48.6T
STANDBY MDS
cccephadm30
cccephadm31
MDS version: ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)


Any idea what could go wrong and how to solve it before starting a 
disaster recovery procedure?
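
For anyone hitting the same thing, a hedged diagnostic sketch only: the log above shows
the error coming back from a write to the metadata pool, so raising the objecter and
messenger debug levels on the active MDS and checking that plain writes to that pool
still work can help narrow it down. The pool and MDS names are taken from the output
above; the test object name is arbitrary.

  ceph tell mds.cccephadm32 config set debug_objecter 20
  ceph tell mds.cccephadm32 config set debug_ms 1
  rados -p cephfs_metadata put readonly-probe /etc/hostname   # does a raw write succeed?
  rados -p cephfs_metadata rm readonly-probe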


Cheers,
Adrien

[1] 
https://ceph.com/en/news/blog/2022/v17-2-0-quincy-released/#upgrading-non-cephadm-clusters

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] RGW Forcing buckets to be encrypted (SSE-S3) by default (via a global bucket encryption policy)?

2022-11-23 Thread Christian Rohmann

Hey ceph-users,

loosely related to my question about client-side encryption in the Cloud 
Sync module 
(https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/I366AIAGWGXG3YQZXP6GDQT4ZX2Y6BXM/)


I am wondering if there are other options to ensure data is encrypted at 
rest and also only replicated as encrypted data ...



My thoughts / findings so far:

AWS S3 supports setting a bucket encryption policy 
(https://docs.aws.amazon.com/AmazonS3/latest/userguide/default-bucket-encryption.html) 
to "ApplyServerSideEncryptionByDefault" - so automatically apply SSE to 
all objects without the clients to explicitly request this per object.


Ceph RGW has received support for such policy via the bucket encryption 
API with 
https://github.com/ceph/ceph/commit/95acefb2f5e5b1a930b263bbc7d18857d476653c.


I am now just wondering if there is any way to not only allow bucket 
creators to apply such a policy themselves, but to apply this as a 
global default in RGW, forcing all buckets to have SSE enabled - 
transparently.


If there is no way to achieve this just yet, what are your thoughts 
about adding such an option to RGW?
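
For reference, the per-bucket route that the commit above enables looks roughly like this
when driven through the AWS CLI against RGW (endpoint and bucket name are placeholders);
a global, operator-enforced default is exactly what this mail is asking about:

  aws --endpoint-url https://rgw.example.com s3api put-bucket-encryption \
      --bucket mybucket \
      --server-side-encryption-configuration \
      '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
  aws --endpoint-url https://rgw.example.com s3api get-bucket-encryption --bucket mybucket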



Regards


Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW Forcing buckets to be encrypted (SSE-S3) by default (via a global bucket encryption policy)?

2022-11-23 Thread Christian Rohmann

On 23/11/2022 13:36, Christian Rohmann wrote:


I am wondering if there are other options to ensure data is encrypted 
at rest and also only replicated as encrypted data ...


I should have referenced thread 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/TNA3MK2C744BN5OJQ4FMLWDK7WJBFH77/#J2VYTUSWZQBQMLN2GQ7L7ZLDYNHVEZZQ 
which muses about enforcing encryption at rest as well.


But as discussed there, the "automatic encryption" 
(https://docs.ceph.com/en/latest/radosgw/encryption/#automatic-encryption-for-testing-only) 
with a static key stored in the config is likely not a good base for 
this endeavor.



Regards


Christian

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-volume lvm zap destroyes up+in OSD

2022-11-23 Thread Eugen Block
I can confirm the same behavior for all-in-one OSDs: it starts to wipe,  
then aborts, but the OSD can't be restarted. I'll create a tracker  
issue, maybe not today though.


Zitat von Frank Schilder :


Hi Eugen,

can you confirm that the silent corruption also happens on a  
collocated OSD (everything on the same device) on Pacific? The zap  
command should simply exit with "osd not down+out", or at least not  
do anything.


If this accidentally destructive behaviour is still present, I think  
it is worth a ticket. Since I can't test on versions higher than  
octopus yet, could you then open the ticket?


Thanks!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: 23 November 2022 09:27:22
To: ceph-users@ceph.io
Subject: [ceph-users] Re: ceph-volume lvm zap destroyes up+in OSD

Hi,

I can confirm the behavior for Pacific version 16.2.7. I checked with
a Nautilus test cluster and there it seems to work as expected. I
tried to zap a db device and then restarted one of the OSDs,
successfully. So there seems to be a regression somewhere. I didn't
search for tracker issues yet, but this seems to be worth one, right?

Zitat von Frank Schilder :


Hi all,

on our octopus-latest cluster I accidentally destroyed an up+in OSD
with the command line

  ceph-volume lvm zap /dev/DEV

It executed the dd command and then failed at the lvm commands with
"device busy". Problem number one is, that the OSD continued working
fine. Hence, there is no indication of a corruption, its a silent
corruption. Problem number two - the real one - is, why is
ceph-colume not checking if the OSD that device belongs to is still
up+in? "ceph osd destroy" does that, for example. I believe to
remember that "ceph-volume lvm zap --osd-id" also checks, but I'm
not sure.

Has this been changed in versions later than octopus?

I think it is extremely dangerous to provide a tool that allows the
silent corruption of an entire ceph cluster. The corruption is only
discovered on restart, and then it is too late (unless there is
an unofficial recovery procedure somewhere).

I would prefer that ceph-volume lvm zap employs the same strict
sanity checks as other ceph-commands to avoid accidents. In my case
it was a typo, one wrong letter.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Requesting recommendations for Ceph multi-cluster management

2022-11-23 Thread Thomas Eckert
I'm looking for guidance/recommendations on how to approach the topic below. As I'm 
fairly new to Ceph as a whole, I might be using or looking for terms/solutions 
incorrectly or simply missing some obvious puzzle pieces. Please do not assume 
advanced Ceph knowledge on my part (-:

We are looking at building up multiple Ceph clusters, each possibly consisting 
of multiple different pools / different configurations (repair prio, etc.). 
This is not about multi-site clusters but about multiple individual clusters 
which have no direct knowledge about each other whatsoever.

Although we are still in an investigation/research phase, we are already realizing 
that we will (and already do) need a system to maintain our clusters' 
high-level information such as which nodes are (currently) associated with 
which cluster and general meta-information for each cluster like stage 
(live/qa/dev), name/id, Ceph version, etc.

Seeing how we will want to connect to this service from multiple other systems 
(Puppet/Ansible/etc) we are looking for a service with a sensible API.

As any such undertaking is prone to have, there are plenty of additional 
requirements we have in mind such as, full encryption (in-transport, at-rest), 
exchangeable storage layer (not hardwired to one DB/etc), versioned data 
storage (so we can query "the past" and not just the current state) and (a 
possibly fine-grained) access permission system. The entire list is quite 
lengthy and it probably won't help to list out each and every item here. 
Suffice it to say we are looking for a "holistic multi-cluster management" 
solution.

One important note is that we need to be able to run it ourselves, "as a 
service" offerings are not an option for us. I suppose we are looking for an 
OSS project, though it might also be several ones pieced together.

One particularly noteworthy find while searching for Ceph multi-cluster 
management: https://ceph.io/en/news/blog/2022/multi-cluster-mgmt-survey/
Unfortunately, I could not find any derivations or similar following this 
survey and, at time of writing, it is the only article on ceph.io labeled 
"multi-cluster".

Any recommendations or pointers where to look would be appreciated!

Regards,
  Thomas

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Issues during Nautilus Pacific upgrade

2022-11-23 Thread Ana Aviles

Hi,

We would like to share our experience upgrading one of our clusters from 
Nautilus (14.2.22-1bionic) to Pacific (16.2.10-1bionic) a few weeks ago.
To start with, we had to convert our monitors' databases to RocksDB in 
order to continue with the upgrade. Also, we had to migrate all our OSDs
to BlueStore, otherwise they wouldn't come up after upgrading. That 
being said and done, once we finalized the upgrade, we bumped
into big performance issues with snaptrims. The I/O of the cluster was 
nearly stalled when our regular snaptrim tasks ran. IcePic
pointed us to try compacting the OSDs. This solved it for us. It seems 
this was an already known issue and a fix is already integrated
in the upgrade process; however, it didn't work for us.

https://tracker.ceph.com/issues/51710
https://github.com/ceph/ceph/pull/42426

We just wanted to share in case other people bump into same issues.

Greetings,

--
Ana Avilés
Greenhost - sustainable hosting & digital security
E: a...@greenhost.nl
T: +31 20 4890444
W: https://greenhost.nl
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Multi site alternative

2022-11-23 Thread Szabo, Istvan (Agoda)
Hi,

Due to the lack of documentation and issues with multisite bucket sync I’m 
looking for an alternative solution where I can put some SLA around the sync, 
e.g. so I can guarantee that the file will be available within x minutes.

Which solution are you guys using that works fine with a huge amount of objects?

Cloud sync?
Rclone?
…
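
As an illustration of the rclone option above (a hedged sketch only; the remote names,
bucket and tuning flags are assumptions): two S3 remotes pointing at the source and
destination RGW endpoints, synced on a schedule so the lag stays within the desired bound.

  rclone sync src-rgw:mybucket dst-rgw:mybucket \
      --transfers 32 --checkers 64 --fast-list --s3-chunk-size 64M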

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: filesystem became read only after Quincy upgrade

2022-11-23 Thread Adrien Georget
This bug looks very similar to this issue opened last year and closed 
without any solution : https://tracker.ceph.com/issues/52260


Adrien

Le 23/11/2022 à 12:49, Adrien Georget a écrit :

Hi,

This morning we upgraded a Pacific Ceph cluster to the latest Quincy 
version.
The cluster was healthy before the upgrade, everything was done 
according to the upgrade procedure (non-cephadm) [1], and all services 
restarted correctly, but the filesystem switched to read-only mode 
when it became active.

HEALTH_WARN 1 MDSs are read only
[WRN] MDS_READ_ONLY: 1 MDSs are read only
    mds.cccephadm32(mds.0): MDS in read-only mode

This is the only warning we got on the cluster.
In the MDS log, this error "failed to commit dir 0x1 object, errno -22" 
seems to be the root cause:

2022-11-23T12:41:09.843+0100 7f930f56d700 -1 log_channel(cluster) log [ERR] : failed to commit dir 0x1 object, errno -22
2022-11-23T12:41:09.843+0100 7f930f56d700 -1 mds.0.11963 unhandled write error (22) Invalid argument, force readonly...
2022-11-23T12:41:09.843+0100 7f930f56d700  1 mds.0.cache force file system read-only
2022-11-23T12:41:09.843+0100 7f930f56d700  0 log_channel(cluster) log [WRN] : force file system read-only
2022-11-23T12:41:09.843+0100 7f930f56d700 10 mds.0.server force_clients_readonly


I couldn't get more info with ceph config set mds.x debug_mds 20

ceph fs status
cephfs - 17 clients
==
RANK  STATE      MDS          ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active  cccephadm32  Reqs:    0 /s  12.9k  12.8k   673   1538
      POOL         TYPE     USED  AVAIL
cephfs_metadata  metadata   513G  48.6T
  cephfs_data      data    2558M  48.6T
  cephfs_data2     data     471G  48.6T
  cephfs_data3     data     433G  48.6T
STANDBY MDS
cccephadm30
cccephadm31
MDS version: ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)


Any idea what could go wrong and how to solve it before starting a 
disaster recovery procedure?


Cheers,
Adrien

[1] 
https://ceph.com/en/news/blog/2022/v17-2-0-quincy-released/#upgrading-non-cephadm-clusters



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: hw failure, osd lost, stale+active+clean, pool size 1, recreate lost pgs?

2022-11-23 Thread Clyso GmbH - Ceph Foundation Member

Hi Jelle,

did you try:

  ceph osd force-create-pg

https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-pg/#pool-size-1

Regards, Joachim


___
Clyso GmbH - Ceph Foundation Member

Am 22.11.22 um 11:33 schrieb Jelle de Jong:

Hello everybody,

Someone that can help me in the right direction?

Kind regards,

Jelle

On 11/21/22 17:14, Jelle de Jong wrote:

Hello everybody,

I had a HW failure and had to take an OSD out; however, I now have PGs stuck in 
stale+active+clean.


I am okay with having zeros as a replacement for the lost blocks; I 
want the filesystem of the virtual machine that is using the pool to 
recover if possible.


I can not find what to do in the documentation. Can someone help me out?

https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-pg/ 



systemctl stop ceph-osd@11
ceph osd out 11
ceph osd lost 11 --yes-i-really-mean-it
ceph osd crush remove osd.11
ceph auth del osd.11
ceph osd rm 11
ceph pg dump_stuck stale

root@ceph03:~# ceph pg dump_stuck stale
ok
PG_STAT STATE  UP   UP_PRIMARY ACTING ACTING_PRIMARY
2.91    stale+active+clean [11] 11   [11] 11
2.ca    stale+active+clean [11] 11   [11] 11
2.3e    stale+active+clean [11] 11   [11] 11
2.e0    stale+active+clean [11] 11   [11] 11
2.57    stale+active+clean [11] 11   [11] 11
2.59    stale+active+clean [11] 11   [11] 11
2.89    stale+active+clean [11] 11   [11] 11

# ceph osd pool get libvirt-pool-backup size
size: 1

root@ceph03:~# ceph pg map 2.91
osdmap e105091 pg 2.91 (2.91) -> up [17] acting [17]
root@ceph03:~# ceph pg map 2.ca
osdmap e105091 pg 2.ca (2.ca) -> up [8] acting [8]
root@ceph03:~# ceph pg map 2.3e
osdmap e105091 pg 2.3e (2.3e) -> up [14] acting [14]
root@ceph03:~# ceph pg map 2.e0
osdmap e105091 pg 2.e0 (2.e0) -> up [14] acting [14]
root@ceph03:~# ceph pg map 2.57
osdmap e105091 pg 2.57 (2.57) -> up [17] acting [17]
root@ceph03:~# ceph pg map 2.59
osdmap e105091 pg 2.59 (2.59) -> up [8] acting [8]
root@ceph03:~# ceph pg map 2.89
osdmap e105091 pg 2.89 (2.89) -> up [2] acting [2]

root@ceph03:~# ceph pg 2.91 query
Error ENOENT: i don't have pgid 2.91

root@ceph03:~# ceph pg force_create_pg 2.91
Error ENOTSUP: this command is obsolete

Kind regards,

Jelle

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Multi site alternative

2022-11-23 Thread Matthew Leonard (BLOOMBERG/ 120 PARK)
Hey Istvan,

I think the answer would be multisite. I know there is a lot of effort 
currently to work out the last few kinks.

This tracker might be of interest as it sounds like an already identified 
issue, https://tracker.ceph.com/issues/57562#change-228263

Matt

From: istvan.sz...@agoda.com At: 11/23/22 10:49:00 UTC-5:00To:  
ceph-users@ceph.io
Subject: [ceph-users] Multi site alternative

Hi,

Due to the lack of documentation and issues with multisite bucket sync I’m 
looking for an alternative solution where I can put some SLA around the sync, 
e.g. so I can guarantee that the file will be available within x minutes.

Which solution are you guys using that works fine with a huge amount of objects?

Cloud sync?
Rclone?
…

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS performance

2022-11-23 Thread quag...@bol.com.br
Hi David,
 First of all, thanks for your reply!

     The resiliency of BeeGFS lies in doing RAID on the disks (in hardware or software) on the same node as the storage. If greater resilience is needed, the most you can get is buddy mirroring (another storage machine acting as a failover).
     The Ceph case is different, as the concept of replicas is more advantageous in my view. Replicas can be on any other machine in the cluster (and not just on specific machines).

     But on the other hand, the performance of BeeGFS was MUCH higher. Where I work (where performance is needed), BeeGFS has that valued "feature".
 I'm not defending BeeGFS. I see several advantages in CephFS (which I configured in smaller clusters and also for administrative storage).

 But in the larger cluster, Ceph performance has been a bottleneck.
     That's why I ran this test in another environment and got the numbers that I posted in the other email.

     Another point, if I may mention my requirements for Ceph...
     Are there any plans for native InfiniBand (RDMA) support in Ceph?
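
As a hedged side note, an experimental RDMA-capable messenger has existed in Ceph for
some time; how production-ready it is on a given release is a separate question. A
minimal ceph.conf sketch, with the device name as a placeholder:

  [global]
  ms_type = async+rdma
  ms_async_rdma_device_name = mlx5_0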

     Regarding the replica issue you commented on, I had already mentioned that the cluster is configured with size=2 and min_size=1 for the data and metadata pools.

 If I have any more information to contribute, please let me know!

Thanks
Rafael
 


De: "David C" 
Enviada: 2022/11/22 12:27:24
Para: quag...@bol.com.br
Cc:  ceph-users@ceph.io
Assunto:  Re: [ceph-users] CephFS performance
 
My understanding is BeeGFS doesn't offer data redundancy by default,
you have to configure mirroring. You've not said how your Ceph cluster
is configured but my guess is you have the recommended 3x replication
- I wouldn't be surprised if BeeGFS was much faster than Ceph in this
case. I'd be interested to see your results after ensuring equivalent
data redundancy between the platforms.

On Thu, Oct 20, 2022 at 9:02 PM quag...@bol.com.br  wrote:
>
> Hello everyone,
> I have some considerations and doubts to ask...
>
> I work at an HPC center and my doubts stem from performance in this environment. All clusters here were suffering from NFS performance and also from the single point of failure it has.
>
> At that time, we decided to evaluate some available SDS and the chosen one was Ceph (first for its resilience and later for its performance).
> I deployed CephFS in a small cluster: 6 nodes and 1 HDD per machine with 1Gpbs connection.
> The performance was as good as a large NFS we have on another cluster (spending much less). In addition, I was able to evaluate all the benefits of resiliency that Ceph offers (such as activating an OSD, MDS, MON or MGR server) and having the objects/services settle on other nodes. All this in a way that the user did not even notice.
>
> Given this information, a new storage cluster was acquired last year with 6 machines and 22 disks (HDDs) per machine. The need was for the amount of available GBs. The amount of IOPs was not so important at that time.
>
> Right at the beginning, I had a lot of work to optimize the performance in the cluster (the main deficiency was in the performance of metadata access/writes). The problem was not with job execution, but with the user's perception of slowness when executing interactive commands (my perception was that the slowness was in Ceph's metadata).
> There were a few months of high loads in which storage was the bottleneck of the environment.
>
> After a lot of research in documentation, I made several optimizations on the available parameters and currently CephFS is able to reach around 10k IOPS (using size=2).
>
> Anyway, my boss asked for other solutions to be evaluated to verify the performance issue.
> First of all, it was suggested to put the metadata on SSD disks for a higher amount of IOPS.
> In addition, a test environment was set up and the solution that made the most difference in performance was with BeeGFS.
>
> In some situations, BeeGFS is many times faster than Ceph in the same tests and under the same hardware conditions. This happens in both throughput (BW) and IOPS.
>
> We tested it using io500 as follows:
> 1-) An individual process
> 2-) 8 processes (4 processes on 2 different machines)
> 3-) 16 processes (8 processes on 2 different machines)
>
> I did tests configuring CephFS to use:
> * HDD only (for both data and metadata)
> * Metadata on SSD
> * Using Linux FSCache features
> * With some optimizations (increasing MDS memory, client memory, inflight parameters, etc)
> * Cache tier with SSD
>
> Even so, the benchmark score was lower than the BeeGFS installed without any optimization. This difference becomes even more evident as the number of simultaneous accesses increases.
>
> The two best results of CephFS were using metadata on SSD and also doing TIER on SSD.
>
> Here is the result of Ceph's performance when compared to BeeGFS:
>
> Bandwit

[ceph-users] Re: CephFS performance

2022-11-23 Thread quaglio
Hi Gregory,
 Thanks for your reply!
 We are evaluating possibilities to increase storage performance.

 I understand that Ceph has better capability in data resiliency. This has been one of the arguments I use to keep this tool in our storage.
 I say this mainly for failure events (in the case of disk or even machine crashes). In the case of BeeGFS, if there is a problem on any machine, the whole cluster becomes inconsistent (at this point of my tests, I'm not working with that).

 One of the foreseen situations is precisely to put the metadata in SSD (as you said).
 Another situation is to put an entire filesystem on SSD (scratch area for the HPC area) or even a cache tier.

 With that in mind, my manager is weighing the costs of maintaining Ceph.

 However, in all tests, CephFS performance is inferior to BeeGFS.
 I'm running out of arguments to keep the Ceph storage solution here where I work.

 The benchmark tests I did were as follows:

1-) Ceph with data and metadata on HDD
2-) Ceph with data on HDD and metadata on SSD
3-) Ceph with data and metadata on SSD
4-) Nowsync, fscache mount parameters
5-) Ceph with tier cache enabled (cache on SSD)
6-) Ceph with OS adjustments and configuration optimizations in OSD, MDS, MON, client, more buffering in the network interface.

On the larger cluster configuration, we have 50G interfaces on each of the 6 disk servers (each with 22 disks).

Thanks
Rafael.
 



De: "Gregory Farnum" 
Enviada: 2022/11/22 14:49:12
Para: dcsysengin...@gmail.com
Cc:  quag...@bol.com.br, ceph-users@ceph.io
Assunto:  Re: [ceph-users] Re: CephFS performance
 
In addition to not having resiliency by default, my recollection is
that BeeGFS also doesn't guarantee metadata durability in the event of
a crash or hardware failure like CephFS does. There's not really a way
for us to catch up to their "in-memory metadata IOPS" with our
"on-disk metadata IOPS". :(

If that kind of cached performance is your main concern, CephFS is
probably not going to make you happy.

That said, if you've been happy using CephFS with hard drives and
gigabit ethernet, it will be much faster if you store the metadata on
SSD and can increase the size of the MDS cache in memory. More
specific tuning options than that would depend on your workload.
-Greg
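
A rough sketch of the two knobs mentioned above, assuming a CRUSH device class named
"ssd" exists and the metadata pool is called cephfs_metadata (both assumptions); the
cache value is only an example:

  ceph osd crush rule create-replicated ssd-rule default host ssd
  ceph osd pool set cephfs_metadata crush_rule ssd-rule        # move metadata onto SSD OSDs
  ceph config set mds mds_cache_memory_limit 17179869184       # ~16 GiB of MDS cache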

On Tue, Nov 22, 2022 at 7:28 AM David C  wrote:
>
> My understanding is BeeGFS doesn't offer data redundancy by default,
> you have to configure mirroring. You've not said how your Ceph cluster
> is configured but my guess is you have the recommended 3x replication
> - I wouldn't be surprised if BeeGFS was much faster than Ceph in this
> case. I'd be interested to see your results after ensuring equivalent
> data redundancy between the platforms.
>
> On Thu, Oct 20, 2022 at 9:02 PM quag...@bol.com.br  wrote:
> >
> > Hello everyone,
> > I have some considerations and doubts to ask...
> >
> > I work at an HPC center and my doubts stem from performance in this environment. All clusters here were suffering from NFS performance and also from the single point of failure it has.
> >
> > At that time, we decided to evaluate some available SDS and the chosen one was Ceph (first for its resilience and later for its performance).
> > I deployed CephFS in a small cluster: 6 nodes and 1 HDD per machine with 1Gpbs connection.
> > The performance was as good as a large NFS we have on another cluster (spending much less). In addition, it was able to evaluate all the benefits of resiliency that Ceph offers (such as activating an OSD, MDS, MON or MGR server) and the objects/services to settle on other nodes. All this in a way that the user did not even notice.
> >
> > Given this information, a new storage cluster was acquired last year with 6 machines and 22 disks (HDDs) per machine. The need was for the amount of available GBs. The amount of IOPs was not so important at that time.
> >
> > Right at the beginning, I had a lot of work to optimize the performance in the cluster (the main deficiency was in the performance in the access/write of metadata). The problem was not at the job execution, but the user's perception of slowness when executing interactive commands (my perception was in the slowness of Ceph metadata).
> > There were a few months of high loads in which storage was the bottleneck of the environment.
> >
> > After a lot of research in documentation, I made several optimizations on the available parameters and currently CephFS is able to reach around 10k IOPS (using size=2).
> >
> > Anyway, my boss asked for other solutions to be evaluated to verify the performance issue.
> > First of all, it was suggested to put the metadata on SSD disks for a higher amount of IOPS.
> > In addition, a test environment was set up and the solution that made the most difference in performance was with BeeGFS.
> >
> > In some situations, BeeGFS is many tim

[ceph-users] failure resharding radosgw bucket

2022-11-23 Thread Jan Horstmann
Hi list,
I am completely lost trying to reshard a radosgw bucket which fails
with the error:

process_single_logshard: Error during resharding bucket
68ddc61c613a4e3096ca8c349ee37f56/snapshotnfs:(2) No such file or
directory

But let me start from the beginning. We are running a ceph cluster
version 15.2.17. Recently we received a health warning because of
"large omap objects". So I grepped through the logs to get more
information about the object and then mapped that to a radosgw bucket
instance ([1]).
I believe this should normally be handled by dynamic resharding of the
bucket, which has already been done 23 times for this bucket ([2]).
For recent resharding tries the radosgw is logging the error mentioned
at the beginning. I tried to reshard manually by following the process
in [3], but that consequently leads to the same error.
When running the reshard with debug options ( --debug-rgw=20 --debug-
ms=1) I can get some additional insight on where exactly the failure
occurs:

2022-11-23T10:41:20.754+ 7f58cf9d2080  1 --
10.38.128.3:0/1221656497 -->
[v2:10.38.128.6:6880/44286,v1:10.38.128.6:6881/44286] --
osd_op(unknown.0.0:46 5.6 5:66924383:reshard::reshard.05:head
[call rgw.reshard_get in=149b] snapc 0=[]
ondisk+read+known_if_redirected e44374) v8 -- 0x56092dd46a10 con
0x56092dcfd7a0
2022-11-23T10:41:20.754+ 7f58bb889700  1 --
10.38.128.3:0/1221656497 <== osd.210 v2:10.38.128.6:6880/44286 4 
osd_op_reply(46 reshard.05 [call] v0'0 uv1180019 ondisk = -2
((2) No such file or directory)) v8  162+0+0 (crc 0 0 0)
0x7f58b00dc020 con 0x56092dcfd7a0


I am not sure how to interpret this and how to debug this any further.
Of course I can provide the full output if that helps.
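
For completeness, the manual steps referred to as [3] above look roughly like this; the
shard count is only an example, and the tenant/bucket form matches the stats output below:

  radosgw-admin reshard add --bucket 68ddc61c613a4e3096ca8c349ee37f56/snapshotnfs --num-shards 31
  radosgw-admin reshard list
  radosgw-admin reshard process
  radosgw-admin reshard status --bucket 68ddc61c613a4e3096ca8c349ee37f56/snapshotnfs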

Thanks and regards,
Jan

[1]
root@ceph-mon1:~# grep -r 'Large omap object found. Object'
/var/log/ceph/ceph.log 
2022-11-15T14:47:28.900679+ osd.47 (osd.47) 10890 : cluster [WRN]
Large omap object found. Object: 3:9660022b:::.dir.ee3fa6a3-4af3-4ac2-
86c2-d2c374080b54.63073818.19.9:head PG: 3.d4400669 (3.29) Key count:
336457 Size (bytes): 117560231
2022-11-17T04:51:43.593811+ osd.50 (osd.50) 90 : cluster [WRN]
Large omap object found. Object: 3:0de49b75:::.dir.ee3fa6a3-4af3-4ac2-
86c2-d2c374080b54.63073818.19.10:head PG: 3.aed927b0 (3.30) Key count:
205346 Size (bytes): 71669614
2022-11-18T02:55:07.182419+ osd.47 (osd.47) 10917 : cluster [WRN]
Large omap object found. Object: 3:9660022b:::.dir.ee3fa6a3-4af3-4ac2-
86c2-d2c374080b54.63073818.19.9:head PG: 3.d4400669 (3.29) Key count:
449776 Size (bytes): 157310435
2022-11-19T09:56:47.630679+ osd.29 (osd.29) 114 : cluster [WRN]
Large omap object found. Object: 3:61ad76c5:::.dir.ee3fa6a3-4af3-4ac2-
86c2-d2c374080b54.63073818.19.12:head PG: 3.a36eb586 (3.6) Key count:
213843 Size (bytes): 74703544
2022-11-20T13:04:39.979349+ osd.72 (osd.72) 83 : cluster [WRN]
Large omap object found. Object: 3:2b3227e7:::.dir.ee3fa6a3-4af3-4ac2-
86c2-d2c374080b54.63073818.19.22:head PG: 3.e7e44cd4 (3.14) Key count:
326676 Size (bytes): 114453145
2022-11-21T02:53:32.410698+ osd.50 (osd.50) 151 : cluster [WRN]
Large omap object found. Object: 3:0de49b75:::.dir.ee3fa6a3-4af3-4ac2-
86c2-d2c374080b54.63073818.19.10:head PG: 3.aed927b0 (3.30) Key count:
216764 Size (bytes): 75674839
2022-11-22T18:04:09.757825+ osd.47 (osd.47) 10964 : cluster [WRN]
Large omap object found. Object: 3:9660022b:::.dir.ee3fa6a3-4af3-4ac2-
86c2-d2c374080b54.63073818.19.9:head PG: 3.d4400669 (3.29) Key count:
449776 Size (bytes): 157310435
2022-11-23T00:44:55.316254+ osd.29 (osd.29) 163 : cluster [WRN]
Large omap object found. Object: 3:61ad76c5:::.dir.ee3fa6a3-4af3-4ac2-
86c2-d2c374080b54.63073818.19.12:head PG: 3.a36eb586 (3.6) Key count:
213843 Size (bytes): 74703544
2022-11-23T09:10:07.842425+ osd.55 (osd.55) 13968 : cluster [WRN]
Large omap object found. Object: 3:3fa378c9:::.dir.ee3fa6a3-4af3-4ac2-
86c2-d2c374080b54.63073818.19.20:head PG: 3.931ec5fc (3.3c) Key count:
219204 Size (bytes): 76509687
2022-11-23T09:11:15.516973+ osd.72 (osd.72) 112 : cluster [WRN]
Large omap object found. Object: 3:2b3227e7:::.dir.ee3fa6a3-4af3-4ac2-
86c2-d2c374080b54.63073818.19.22:head PG: 3.e7e44cd4 (3.14) Key count:
326676 Size (bytes): 114453145
root@ceph-mon1:~# radosgw-admin metadata list "bucket.instance" | grep
ee3fa6a3-4af3-4ac2-86c2-d2c374080b54.63073818.19
"68ddc61c613a4e3096ca8c349ee37f56/snapshotnfs:ee3fa6a3-4af3-4ac2-
86c2-d2c374080b54.63073818.19",

[2]
root@ceph-mon1:~# radosgw-admin bucket stats --bucket
68ddc61c613a4e3096ca8c349ee37f56/snapshotnfs 
{
"bucket": "snapshotnfs",
"num_shards": 23,
"tenant": "68ddc61c613a4e3096ca8c349ee37f56",
"zonegroup": "bf22bf53-c135-450b-946f-97e16d1bc326",
"placement_rule": "default-placement",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
},
"id": "ee3fa6a3-4af3-4ac2-86c2-d2c374080b54.63073818.19",
"marker": "ee3fa6a3-4af3-4ac2-86c2-d2c

[ceph-users] Re: *****SPAM***** Re: CephFS performance

2022-11-23 Thread Marc


> crashes). In the case of BeeGFS, if there is a problem on any machine, the
> whole cluster becomes inconsistent (at the point of my tests, I'm not working
> with that).
> 

But the first question you should ask yourself is: can you afford to be having 
these down hours, or do you want to have a solution that still offers 
availability after a node failure? The second question could be: do you even 
need this faster local performance? Then there is the available-source license 
(like Elasticsearch?), so will you have issues with cloud providers or 
long-term availability? Do you want to convert PBs of storage in 5 years' time 
because BeeGFS changed their earning model?




 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: failure resharding radosgw bucket

2022-11-23 Thread Casey Bodley
hi Jan,

On Wed, Nov 23, 2022 at 12:45 PM Jan Horstmann  wrote:
>
> Hi list,
> I am completely lost trying to reshard a radosgw bucket which fails
> with the error:
>
> process_single_logshard: Error during resharding bucket
> 68ddc61c613a4e3096ca8c349ee37f56/snapshotnfs:(2) No such file or
> directory
>
> But let me start from the beginning. We are running a ceph cluster
> version 15.2.17. Recently we received a health warning because of
> "large omap objects". So I grepped through the logs to get more
> information about the object and then mapped that to a radosgw bucket
> instance ([1]).
> I believe this should normally be handled by dynamic resharding of the
> bucket, which has already been done 23 times for this bucket ([2]).
> For recent resharding tries the radosgw is logging the error mentioned
> at the beginning. I tried to reshard manually by following the process
> in [3], but that consequently leads to the same error.
> When running the reshard with debug options ( --debug-rgw=20 --debug-
> ms=1) I can get some additional insight on where exactly the failure
> occurs:
>
> 2022-11-23T10:41:20.754+ 7f58cf9d2080  1 --
> 10.38.128.3:0/1221656497 -->
> [v2:10.38.128.6:6880/44286,v1:10.38.128.6:6881/44286] --
> osd_op(unknown.0.0:46 5.6 5:66924383:reshard::reshard.05:head
> [call rgw.reshard_get in=149b] snapc 0=[]
> ondisk+read+known_if_redirected e44374) v8 -- 0x56092dd46a10 con
> 0x56092dcfd7a0
> 2022-11-23T10:41:20.754+ 7f58bb889700  1 --
> 10.38.128.3:0/1221656497 <== osd.210 v2:10.38.128.6:6880/44286 4 
> osd_op_reply(46 reshard.05 [call] v0'0 uv1180019 ondisk = -2
> ((2) No such file or directory)) v8  162+0+0 (crc 0 0 0)
> 0x7f58b00dc020 con 0x56092dcfd7a0
>
>
> I am not sure how to interpret this and how to debug this any further.
> Of course I can provide the full output if that helps.
>
> Thanks and regards,
> Jan
>
> [1]
> root@ceph-mon1:~# grep -r 'Large omap object found. Object'
> /var/log/ceph/ceph.log
> 2022-11-15T14:47:28.900679+ osd.47 (osd.47) 10890 : cluster [WRN]
> Large omap object found. Object: 3:9660022b:::.dir.ee3fa6a3-4af3-4ac2-
> 86c2-d2c374080b54.63073818.19.9:head PG: 3.d4400669 (3.29) Key count:
> 336457 Size (bytes): 117560231
> 2022-11-17T04:51:43.593811+ osd.50 (osd.50) 90 : cluster [WRN]
> Large omap object found. Object: 3:0de49b75:::.dir.ee3fa6a3-4af3-4ac2-
> 86c2-d2c374080b54.63073818.19.10:head PG: 3.aed927b0 (3.30) Key count:
> 205346 Size (bytes): 71669614
> 2022-11-18T02:55:07.182419+ osd.47 (osd.47) 10917 : cluster [WRN]
> Large omap object found. Object: 3:9660022b:::.dir.ee3fa6a3-4af3-4ac2-
> 86c2-d2c374080b54.63073818.19.9:head PG: 3.d4400669 (3.29) Key count:
> 449776 Size (bytes): 157310435
> 2022-11-19T09:56:47.630679+ osd.29 (osd.29) 114 : cluster [WRN]
> Large omap object found. Object: 3:61ad76c5:::.dir.ee3fa6a3-4af3-4ac2-
> 86c2-d2c374080b54.63073818.19.12:head PG: 3.a36eb586 (3.6) Key count:
> 213843 Size (bytes): 74703544
> 2022-11-20T13:04:39.979349+ osd.72 (osd.72) 83 : cluster [WRN]
> Large omap object found. Object: 3:2b3227e7:::.dir.ee3fa6a3-4af3-4ac2-
> 86c2-d2c374080b54.63073818.19.22:head PG: 3.e7e44cd4 (3.14) Key count:
> 326676 Size (bytes): 114453145
> 2022-11-21T02:53:32.410698+ osd.50 (osd.50) 151 : cluster [WRN]
> Large omap object found. Object: 3:0de49b75:::.dir.ee3fa6a3-4af3-4ac2-
> 86c2-d2c374080b54.63073818.19.10:head PG: 3.aed927b0 (3.30) Key count:
> 216764 Size (bytes): 75674839
> 2022-11-22T18:04:09.757825+ osd.47 (osd.47) 10964 : cluster [WRN]
> Large omap object found. Object: 3:9660022b:::.dir.ee3fa6a3-4af3-4ac2-
> 86c2-d2c374080b54.63073818.19.9:head PG: 3.d4400669 (3.29) Key count:
> 449776 Size (bytes): 157310435
> 2022-11-23T00:44:55.316254+ osd.29 (osd.29) 163 : cluster [WRN]
> Large omap object found. Object: 3:61ad76c5:::.dir.ee3fa6a3-4af3-4ac2-
> 86c2-d2c374080b54.63073818.19.12:head PG: 3.a36eb586 (3.6) Key count:
> 213843 Size (bytes): 74703544
> 2022-11-23T09:10:07.842425+ osd.55 (osd.55) 13968 : cluster [WRN]
> Large omap object found. Object: 3:3fa378c9:::.dir.ee3fa6a3-4af3-4ac2-
> 86c2-d2c374080b54.63073818.19.20:head PG: 3.931ec5fc (3.3c) Key count:
> 219204 Size (bytes): 76509687
> 2022-11-23T09:11:15.516973+ osd.72 (osd.72) 112 : cluster [WRN]
> Large omap object found. Object: 3:2b3227e7:::.dir.ee3fa6a3-4af3-4ac2-
> 86c2-d2c374080b54.63073818.19.22:head PG: 3.e7e44cd4 (3.14) Key count:
> 326676 Size (bytes): 114453145
> root@ceph-mon1:~# radosgw-admin metadata list "bucket.instance" | grep
> ee3fa6a3-4af3-4ac2-86c2-d2c374080b54.63073818.19
> "68ddc61c613a4e3096ca8c349ee37f56/snapshotnfs:ee3fa6a3-4af3-4ac2-
> 86c2-d2c374080b54.63073818.19",
>
> [2]
> root@ceph-mon1:~# radosgw-admin bucket stats --bucket
> 68ddc61c613a4e3096ca8c349ee37f56/snapshotnfs
> {
> "bucket": "snapshotnfs",
> "num_shards": 23,
> "tenant": "68ddc61c613a4e3096ca8c349ee37f56",
> "zonegroup": "bf22bf53-c135-450b-946f-97e16d1bc326",
> "plac

[ceph-users] Ceph Leadership Team Meeting 11-23-2022

2022-11-23 Thread Ernesto Puerta
Hi Cephers,

Short meeting today:

   - The Sepia Lab is gradually coming back to life! Dan Mick & others
   managed to restore the testing environment (some issues could still remain,
   please ping Dan if you experience any).
   - Thanks to that, the release process for the Pacific 16.2.11 version
   has been unblocked.
   - Also, Patrick Donnelly has been working to deduplicate release notes
   from the different release branches and place them into a single location
   (main branch).

More details at: https://pad.ceph.com/p/clt-weekly-minutes

Kind Regards,


Ernesto Puerta

He / Him / His

Principal Software Engineer, Ceph

Red Hat 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Issues during Nautilus Pacific upgrade

2022-11-23 Thread Marc


> We would like to share our experience upgrading one of our clusters from
> Nautilus (14.2.22-1bionic) to Pacific (16.2.10-1bionic) a few weeks ago.
> To start with, we had to convert our monitors databases to rockdb in

Weirdly, I still have just one monitor DB in LevelDB. Is it still recommended 
to remove and re-add the monitor? Or can this be converted?
cat /var/lib/ceph/mon/ceph-b/kv_backend


> into big performance issues with snaptrims. The I/O of the cluster was
> nearly stalled when our regular snaptrim tasks run. IcePic
> pointed us to try compacting the OSDs. This solved it for us. It seems

How did you do this? Can it be done upfront, or should it be done after the 
upgrade?
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-15 compact

I tried getting the status, but it failed because the OSD was running. Should I 
prepare to stop/start all the OSD daemons to do this compacting?
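
As a hedged note on the last question: compaction can also be triggered online through
the tell interface, which avoids stopping the daemons that ceph-kvstore-tool needs
exclusive access to; whether the online route is sufficient for the snaptrim problem
described above is not something this thread confirms.

  ceph tell osd.15 compact                                         # one running OSD
  for id in $(ceph osd ls); do ceph tell osd."$id" compact; done   # all OSDs, one at a time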


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS performance

2022-11-23 Thread Robert W. Eckert
Have you tested having the block.db and WAL for each OSD on a faster SSD/NVMe 
device/partition?

I have a somewhat smaller environment, but I was able to take a 2 TB SSD, split it 
into 4 partitions and use it for the db and WAL for the 4 drives. By 
default, if you move the block.db to a different device, the WAL moves there 
too, but I see you can also have the block, block.db and WAL on separate 
devices.


The generic method is outlined here - one thing I found I had to do was to 
use the ceph-volume commands from inside "cephadm shell", since I am using the 
containers for running ceph. This also required me to copy the ceph keyring 
from the host /var/lib/ceph/bootstrap-osd to the same virtual location in the 
container. (Using scp :/var/lib/ceph/bootstrap-osd/ceph.keyring 
/var/lib/ceph/bootstrap-osd)


https://docs.ceph.com/en/quincy/rados/configuration/bluestore-config-ref/
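
A sketch of what that layout can look like when creating the OSDs (device names are
placeholders; the WAL lands on the block.db partition unless --block.wal is given
separately):

  cephadm shell        # run ceph-volume inside the container, as described above
  ceph-volume lvm create --bluestore --data /dev/sda --block.db /dev/nvme0n1p1
  ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p2
  ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/nvme0n1p3
  ceph-volume lvm create --bluestore --data /dev/sdd --block.db /dev/nvme0n1p4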

-Rob

From: quag...@bol.com.br 
Sent: Wednesday, November 23, 2022 12:28 PM
To: gfar...@redhat.com; dcsysengin...@gmail.com
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: CephFS performance

Hi Gregory,
 Thanks for your reply!

 We are evaluating possibilities to increase storage performance.

 I understand that Ceph has better capability in data resiliency. This 
has been one of the arguments I use to keep this tool in our storage.
 I say this mainly in failure events (in the case of disks or even machines 
crashes). In the case of BeeGFS, if there is a problem on any machine, the 
whole cluster becomes inconsistent (at the point of my tests, I'm not working 
with that).

 One of the foreseen situations is precisely to put the metadata in SSD (as 
you said).
 Another situation is to put an entire filesystem on SSD (scratch area for 
the HPC area) or even a cache tier.

 With that in mind, my manager is weighing the costs of maintaining Ceph.

 However, in all tests, CephFS performance is inferior to BeeGFS.
 I'm running out of arguments to keep Ceph storage solution here where I 
work.

 The benchmark tests I did were as follows:

1-) Ceph with data and metadata on HDD
2-) Ceph with data on HDD and metadata on SSD
3-) Ceph with data and metadata on SSD
4-) Nowsync, fscache mount parameters
5-) Ceph with tier cache enabled (cache on SSD)
6-) Ceph with OS adjustments and configuration optimizations in OSD, MDS, MON, 
client, more buffering in the network interface.

On the larger cluster configuration, we have 50G interfaces on each of the 6 
disk servers (each with 22 disks).

Thanks
Rafael.



De: "Gregory Farnum" mailto:gfar...@redhat.com>>
Enviada: 2022/11/22 14:49:12
Para: dcsysengin...@gmail.com
Cc: quag...@bol.com.br, 
ceph-users@ceph.io
Assunto: Re: [ceph-users] Re: CephFS performance

In addition to not having resiliency by default, my recollection is
that BeeGFS also doesn't guarantee metadata durability in the event of
a crash or hardware failure like CephFS does. There's not really a way
for us to catch up to their "in-memory metadata IOPS" with our
"on-disk metadata IOPS". :(

If that kind of cached performance is your main concern, CephFS is
probably not going to make you happy.

That said, if you've been happy using CephFS with hard drives and
gigabit ethernet, it will be much faster if you store the metadata on
SSD and can increase the size of the MDS cache in memory. More
specific tuning options than that would depend on your workload.
-Greg

On Tue, Nov 22, 2022 at 7:28 AM David C  wrote:
>
> My understanding is BeeGFS doesn't offer data redundancy by default,
> you have to configure mirroring. You've not said how your Ceph cluster
> is configured but my guess is you have the recommended 3x replication
> - I wouldn't be surprised if BeeGFS was much faster than Ceph in this
> case. I'd be interested to see your results after ensuring equivalent
> data redundancy between the platforms.
>
> On Thu, Oct 20, 2022 at 9:02 PM quag...@bol.com.br  wrote:
> >
> > Hello everyone,
> > I have some considerations and doubts to ask...
> >
> > I work at an HPC center and my doubts stem from performance in this
> > environment. All clusters here were suffering from NFS performance and from
> > the single point of failure it has.
> >
> > At that time, we decided to evaluate some of the available SDS solutions, and
> > the chosen one was Ceph (first for its resilience and later for its performance).
> > I deployed CephFS on a small cluster: 6 nodes and 1 HDD per machine, with a
> > 1Gbps connection.
> > The performance was as good as a large NFS we have on another cluster
> > (spending much less). In addition, we were able to evaluate all the benefits
> > of resiliency that Ceph offers 

[ceph-users] Re: radosgw-admin bucket check --fix returns a lot of errors (unable to find head object data)

2022-11-23 Thread Boris Behrens
Hi,
I was able to clean up the objects by hand. I leave my breadcrumbs here in
case someone finds it useful.

1. Get all rados objects via `radosgw-admin bucket radoslist --bucket
$BUCKET` and filter the ones that you need to remove
2. Remove the rados objects via `rados -p $RGWDATAPOOL rm $RADOSOBJECT`
3. Check which of the removed files sits in which omap key of which index
shard ($BUCKETID is in `radosgw-admin bucket stats`)
# for index in `rados -p $RGWINDEXPOOL ls | grep $BUCKETID`; do
>   for omap in `rados -p $RGWINDEXPOOL listomapkeys ${index} | grep $FILESYOUREMOVED`; do
> echo "Index: ${index} - OMAP ${omap}"
>   done
> done

4. Remove the omap keys after you have checked that the list is exactly what
should be removed from the index
# for index in `rados -p $RGWINDEXPOOL ls | grep $BUCKETID`; do
>   for omap in `rados -p $RGWINDEXPOOL listomapkeys ${index} | grep $FILESYOUREMOVED`; do
> rados -p $RGWINDEXPOOL rmomapkey ${index} ${omap}
>   done
> done

5. run `radosgw-admin bucket check --fix --bucket $BUCKET` to have the
correct statistics

The message "unable to find head object data pool for" is still there, but
now I don't care. (I also get it for a healthy bucket, where I test things like
this beforehand, and which gets recreated periodically.)
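
As a sanity check after steps 4 and 5, something like the following (same shell
variables as above) can confirm the omap keys are really gone and the bucket
still behaves:

for index in `rados -p $RGWINDEXPOOL ls | grep $BUCKETID`; do
  rados -p $RGWINDEXPOOL listomapkeys ${index} | grep $FILESYOUREMOVED && echo "still present in ${index}"
done
radosgw-admin bucket stats --bucket $BUCKET                   # recalculated size/num_objects
radosgw-admin bucket list --bucket $BUCKET --max-entries 10   # listing should not error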


Am Mi., 23. Nov. 2022 um 12:22 Uhr schrieb Boris Behrens :

> Hi,
> we have a customer with some _multipart_ files in his bucket, but the
> bucket has no unfinished multipart uploads.
> So I tried to remove them via
>
> $ radosgw-admin object rm --bucket BUCKET
> --object=_multipart_OBJECT.qjqyT8bXiWW5jdbxpVqHxXnLWOG3koUi.1
> ERROR: object remove returned: (2) No such file or directory
>
> Doing this with --debug_ms=1 I see this line:
>  osd_op_reply(108
> ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2297644274.57___multipart_OBJECT.qjqyT8bXiWW5jdbxpVqHxXnLWOG3koUi.1
> [getxattrs,stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8
>  345+0+0 (crc 0 0 0) 0x7f35ac01c040 con 0x55a8b470fba0
>
> I then tried to remove the leading _ from the object name, but this also
> did not work.
> Then I proceeded to remove the rados object and just ran
> $ radosgw-admin bucket check --fix --bucket BUCKET
> ...
> 2022-11-23T11:21:14.214+ 7fd50a8fd980  0 int
> RGWRados::check_disk_state(librados::v14_2_0::IoCtx, const RGWBucketInfo&,
> rgw_bucket_dir_entry&, rgw_bucket_dir_entry&, ceph::bufferlist&,
> optional_yield) WARNING: unable to find head object data pool for
> "BUCKET:SOME_AVAILABLE_OBJECT", not updating version pool/epoch
> ...
>
> And the calculated bucket size is still the same; the index does not
> get updated.
>
> What to do now?
>
> --
> Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
> groüen Saal.
>


-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: filesystem became read only after Quincy upgrade

2022-11-23 Thread Xiubo Li

Hi Adrien,

On 23/11/2022 19:49, Adrien Georget wrote:

Hi,

This morning we upgraded a Pacific Ceph cluster to the latest Quincy 
version.
The cluster was healthy before the upgrade, everything was done 
according to the upgrade procedure (non-cephadm) [1], and all services 
restarted correctly, but the filesystem switched to read-only mode 
when it became active.

HEALTH_WARN 1 MDSs are read only
[WRN] MDS_READ_ONLY: 1 MDSs are read only
    mds.cccephadm32(mds.0): MDS in read-only mode

This is the only warning we got on the cluster.
In the MDS log, this error "failed to commit dir 0x1 object, errno 
-22" seems to be the root cause:

|
||2022-11-23T12:41:09.843+0100 7f930f56d700 -1 log_channel(cluster) 
log [ERR] : failed to commit dir 0x1 object, errno -22||
||2022-11-23T12:41:09.843+0100 7f930f56d700 -1 mds.0.11963 unhandled 
write error (22) Invalid argument, force readonly...||
||2022-11-23T12:41:09.843+0100 7f930f56d700  1 mds.0.cache force file 
system read-only||
||2022-11-23T12:41:09.843+0100 7f930f56d700  0 log_channel(cluster) 
log [WRN] : force file system read-only||
||2022-11-23T12:41:09.843+0100 7f930f56d700 10 mds.0.server 
force_clients_readonly|


This could happen when the corresponding object in the metadata pool was 
lost or corrupted for some reason during the upgrade.
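
One way to probe that object directly (assuming the default dirfrag object
naming, where the root directory inode 0x1 maps to object 1.00000000, and
using the cephfs_metadata pool name from the fs status output below):

rados -p cephfs_metadata stat 1.00000000
rados -p cephfs_metadata listomapkeys 1.00000000 | head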


BTW, did you unmount all the clients when upgrading?

Thanks!

- Xiubo


I couldn't get more info with ceph config set mds.x debug_mds 20

ceph fs status
cephfs - 17 clients
==================
RANK  STATE       MDS          ACTIVITY      DNS    INOS   DIRS  CAPS
 0    active   cccephadm32   Reqs:    0 /s  12.9k  12.8k   673   1538
      POOL         TYPE      USED   AVAIL
cephfs_metadata  metadata    513G   48.6T
  cephfs_data      data     2558M   48.6T
  cephfs_data2     data      471G   48.6T
  cephfs_data3     data      433G   48.6T
STANDBY MDS
 cccephadm30
 cccephadm31
MDS version: ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)


Any idea what could go wrong and how to solve it before starting a 
disaster recovery procedure?


Cheers,
Adrien

[1] 
https://ceph.com/en/news/blog/2022/v17-2-0-quincy-released/#upgrading-non-cephadm-clusters

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: filesystem became read only after Quincy upgrade

2022-11-23 Thread Xiubo Li


On 23/11/2022 19:49, Adrien Georget wrote:

Hi,

This morning we upgraded a Pacific Ceph cluster to the latest Quincy 
version.
The cluster was healthy before the upgrade, everything was done 
according to the upgrade procedure (non-cephadm) [1], and all services 
restarted correctly, but the filesystem switched to read-only mode 
when it became active.

HEALTH_WARN 1 MDSs are read only
[WRN] MDS_READ_ONLY: 1 MDSs are read only
    mds.cccephadm32(mds.0): MDS in read-only mode

This is the only warning we got on the cluster.
In the MDS log, this error "failed to commit dir 0x1 object, errno 
-22" seems to be the root cause:

|
||2022-11-23T12:41:09.843+0100 7f930f56d700 -1 log_channel(cluster) 
log [ERR] : failed to commit dir 0x1 object, errno -22||
||2022-11-23T12:41:09.843+0100 7f930f56d700 -1 mds.0.11963 unhandled 
write error (22) Invalid argument, force readonly...||
||2022-11-23T12:41:09.843+0100 7f930f56d700  1 mds.0.cache force file 
system read-only||
||2022-11-23T12:41:09.843+0100 7f930f56d700  0 log_channel(cluster) 
log [WRN] : force file system read-only||
||2022-11-23T12:41:09.843+0100 7f930f56d700 10 mds.0.server 
force_clients_readonly|


I couldn't get more info with ceph config set mds.x debug_mds 20


If you can reproduce it, please try again after raising the debug levels:

debug_mds 25

debug_ms 1
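
For example, via the central config (mds name taken from the fs status output
below; `ceph tell mds.<name> config set ...` works as well):

ceph config set mds.cccephadm32 debug_mds 25
ceph config set mds.cccephadm32 debug_ms 1
# ...reproduce the read-only event and collect the MDS log, then revert:
ceph config set mds.cccephadm32 debug_mds 1/5
ceph config set mds.cccephadm32 debug_ms 0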

And please also check whether there are any errors in the OSDs' log files.

Thanks!



ceph fs status
cephfs - 17 clients
==================
RANK  STATE       MDS          ACTIVITY      DNS    INOS   DIRS  CAPS
 0    active   cccephadm32   Reqs:    0 /s  12.9k  12.8k   673   1538
      POOL         TYPE      USED   AVAIL
cephfs_metadata  metadata    513G   48.6T
  cephfs_data      data     2558M   48.6T
  cephfs_data2     data      471G   48.6T
  cephfs_data3     data      433G   48.6T
STANDBY MDS
 cccephadm30
 cccephadm31
MDS version: ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)


Any idea what could go wrong and how to solve it before starting a 
disaster recovery procedure?


Cheers,
Adrien

[1] 
https://ceph.com/en/news/blog/2022/v17-2-0-quincy-released/#upgrading-non-cephadm-clusters

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io