[ceph-users] Re: filesystem became read only after Quincy upgrade

2022-11-24 Thread Adrien Georget

Hi Xiubo,

We did the upgrade in rolling mode as always, with only a few Kubernetes 
pods as clients accessing their PVCs on CephFS.


I can reproduce the problem every time I restart the MDS daemon.
You can find the MDS log with debug_mds 25 and debug_ms 1 here : 
https://filesender.renater.fr/?s=download&token=4b413a71-480c-4c1a-b80a-7c9984e4decd
(The last timestamp : 2022-11-24T09:18:12.965+0100 7fe02ffe2700 10 
mds.0.server force_clients_readonly)


I couldn't find any errors in the OSD logs; is there anything specific I 
should be looking for?


Best,
Adrien

On 24/11/2022 at 04:05, Xiubo Li wrote:


On 23/11/2022 19:49, Adrien Georget wrote:

Hi,

We upgraded this morning a Pacific Ceph cluster to the last Quincy 
version.
The cluster was healthy before the upgrade, everything was done 
according to the upgrade procedure (non-cephadm) [1], all services 
have restarted correctly but the filesystem switched to read only 
mode when it became active.

HEALTH_WARN 1 MDSs are read only
[WRN] MDS_READ_ONLY: 1 MDSs are read only
    mds.cccephadm32(mds.0): MDS in read-only mode

This is the only warning we got on the cluster.
In the MDS log, this error "failed to commit dir 0x1 object, errno 
-22" seems to be the root cause :

2022-11-23T12:41:09.843+0100 7f930f56d700 -1 log_channel(cluster) log [ERR] : failed to commit dir 0x1 object, errno -22
2022-11-23T12:41:09.843+0100 7f930f56d700 -1 mds.0.11963 unhandled write error (22) Invalid argument, force readonly...
2022-11-23T12:41:09.843+0100 7f930f56d700  1 mds.0.cache force file system read-only
2022-11-23T12:41:09.843+0100 7f930f56d700  0 log_channel(cluster) log [WRN] : force file system read-only
2022-11-23T12:41:09.843+0100 7f930f56d700 10 mds.0.server force_clients_readonly


I couldn't get more info with ceph config set mds.x debug_mds 20


If you can reproduce it, please try it again with the following debug settings:

debug_mds 25

debug_ms 1
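
For example, something like this (with mds.x being the MDS name, as in the 
command you already used):

ceph config set mds.x debug_mds 25
ceph config set mds.x debug_ms 1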

And please also check whether there are any errors in the OSD log files.

Thanks!



ceph fs status
cephfs - 17 clients
===================
RANK  STATE   MDS          ACTIVITY      DNS    INOS   DIRS   CAPS
 0    active  cccephadm32  Reqs: 0 /s    12.9k  12.8k   673   1538
      POOL         TYPE      USED   AVAIL
cephfs_metadata   metadata    513G  48.6T
  cephfs_data       data     2558M  48.6T
  cephfs_data2      data      471G  48.6T
  cephfs_data3      data      433G  48.6T
STANDBY MDS
cccephadm30
cccephadm31
MDS version: ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)


Any idea what could have gone wrong and how to solve it before starting a 
disaster recovery procedure?


Cheers,
Adrien

[1] 
https://ceph.com/en/news/blog/2022/v17-2-0-quincy-released/#upgrading-non-cephadm-clusters




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Best practice taking cluster down

2022-11-24 Thread Dominique Ramaekers
Hi,

We are going to do some maintenance on our power grid. I'll need to take my ceph 
cluster down.

My cluster is a simple three-node cluster. Before shutting down the systems, 
I'll take down all virtual machines and other services that depend on the 
cluster storage.

Is it sufficient to set the 'noout' flag before I shut down the physical servers?

I plan to shut down in reverse order (one after the other): server 3, server 2, 
server 1.
I'll start up the cluster in order: server 1, server 2, server 3.

Dominique.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] SSE-KMS vs SSE-S3 with per-object-data-keys

2022-11-24 Thread Stefan Schueffler
Hi,

I very much appreciate the recently added SSE-S3 encryption in radosgw. As far as I 
know, this encryption works very similarly to the „original“ design in Amazon S3:

- it uses a per-bucket master key (used solely to encrypt the data-keys), 
stored in rgw_crypt_sse_s3_vault_prefix.
- and it creates a per-object data-key to encrypt the individual uploaded 
objects, stored encrypted in the object metadata.

In order to do this, ceph depends on hashicorp vault’s transit engine, which 
supports exactly this master-key/data-key scenario.

In contrast to this, the somewhat older implementation of SSE-KMS lacks this 
support for individual data-keys per object. It even lacks support for an 
„undefined“ key-id - which is a totally fine use case in Amazon S3.

Now, since the new SSE-S3 implementation is done, I would like to ask whether it 
would be possible to rewrite/enhance the SSE-KMS implementation (at least when 
combined with vault’s transit engine) to behave like the SSE-S3 implementation 
(in terms of master-key/data-key, and in terms of generating its own 
per-bucket master key when no key-id is given).

This way, the implementation would be nearly identical to the design 
specification of Amazon S3, and it could be 100% backwards compatible without 
impact on existing setups and already stored data. As an implementation note, 
the „new“ KMS implementation would simply need to use the same 
functionality/code as the SSE-S3 implementation, extended to support both 
use cases: a given key-id and an undefined one.

So, in pseudo-code, the KMS implementation could look like this:

- no key-id given:
currently, it throws an unsupported-operation exception. In the future, it 
could simply do the same magic as with SSE-S3 (at least when combined with vault 
transit): get (or create, on the first request) the per-bucket key 
(stored in rgw_crypt_vault_prefix - this is the difference to SSE-S3). Then 
continue as if the key-id had been given.

- key-id given in the request:
currently, it pulls the key by id from vault and encrypts the data. In the 
future, it could create a new data-key based on the given key-id, and use this 
to encrypt the data (exactly as in the case of SSE-S3). 
In case vault’s transit engine is not available (e.g. the kv/kv2 engine, or another 
crypto backend not supporting data-keys), simply continue with the old behavior: 
pull the key and encrypt the data. 
In the case of an already stored object: check the object metadata for a 
data-key stored alongside; if present, use the SSE-S3-like workflow of decrypting the 
data-key and then decrypting the object data. If there is no data-key alongside 
the object metadata, then the „old-workflow“ key-id should be stored there. In 
this case, use the old workflow of pulling the key from vault, and use this to 
decrypt the data.
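
To make the two cases above a bit more concrete, here is a rough, self-contained 
Python sketch of the proposed flow. None of this is RGW code - the "vault" is a 
plain dict and the XOR "cipher" is only a stand-in to keep it runnable - it is 
just meant to show the control flow (per-bucket master key created on demand, 
per-object data-key wrapped and stored with the object metadata, legacy objects 
decrypted directly with the named key):

import os

vault = {}  # stands in for the master keys kept in vault


def get_or_create_bucket_master_key(bucket):
    # SSE-KMS without a key-id: fall back to a per-bucket master key,
    # created on first use (the proposal would store it under rgw_crypt_vault_prefix)
    return vault.setdefault("bucket/" + bucket, os.urandom(32))


def xor(key, data):
    # placeholder "cipher", NOT real encryption - only keeps the sketch runnable
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def encrypt_object(bucket, data, key_id=None):
    if key_id is None:
        key_id = "bucket/" + bucket
        get_or_create_bucket_master_key(bucket)
    master = vault.setdefault(key_id, os.urandom(32))
    # transit-style behaviour: a fresh per-object data-key, wrapped by the
    # master key and stored alongside the object metadata (as SSE-S3 does)
    data_key = os.urandom(32)
    wrapped = xor(master, data_key)
    return xor(data_key, data), {"key_id": key_id, "wrapped_key": wrapped}


def decrypt_object(ciphertext, meta):
    master = vault[meta["key_id"]]
    if "wrapped_key" in meta:
        # new-style object: unwrap the per-object data-key first
        data_key = xor(master, meta["wrapped_key"])
        return xor(data_key, ciphertext)
    # legacy object: the named key was used directly (old SSE-KMS workflow)
    return xor(master, ciphertext)


obj, meta = encrypt_object("photos", b"hello world")  # no key-id given
assert decrypt_object(obj, meta) == b"hello world"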

The changes would not be too complex, and the gains would be that ceph always 
uses a master-key/data-key (instead of just „the key“ given by key-id), and it 
would add an implementation of SSE-KMS without a given key-id (Amazon calls 
this SSE-KMS with customer provided key (when the key-id is given) or SSE-KMS 
with Amazon managed key (when there is no key-id given) - in both cases the 
user’s vault will be used to store/retrieve the master-keys, in contrast to 
Amazon’s own internal vault in the case of SSE-S3).

I would like to help here with ideas.

Best
Stefan


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] rook 1.10.6 problem with rgw

2022-11-24 Thread Oğuz Yarımtepe
Hi,

I am not sure whether others are having problems with the latest rook
version:  CephObjectStore:
failed to commit RGW configuration period changes (helm install) · Issue
#11333 · rook/rook (github.com) 

I would like to know whether there is any workaround for it. It seems some changes
on the RGW side have been done already.



-- 
Oğuz Yarımtepe
http://about.me/oguzy
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Issues during Nautilus Pacific upgrade

2022-11-24 Thread Ana Aviles



On 11/23/22 19:49, Marc wrote:

We would like to share our experience upgrading one of our clusters from
Nautilus (14.2.22-1bionic) to Pacific (16.2.10-1bionic) a few weeks ago.
To start with, we had to convert our monitors' databases to RocksDB in

Weirdly, I still have just one monitor DB on leveldb. Is it still recommended to 
remove and re-add the monitor? Or can this be converted?
cat /var/lib/ceph/mon/ceph-b/kv_backend

Yes. We didn't find any mention anywhere that the monitors' database should be 
RocksDB for Pacific, but in practice, after
we upgraded the monitors they would sit idle with increasing memory usage until 
OOMs happened.
To migrate the database we just recreated them, removing and adding them 
back again. It proved to be the easiest
way and it worked smoothly. Before that we tried modifying kv_backend, I 
believe, but it was not successful.
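
For reference, recreating a monitor is roughly the standard manual remove/re-add 
procedure, one monitor at a time and waiting for the cluster to be healthy again 
in between; mon id "b" and the paths below are only examples:

systemctl stop ceph-mon@b
ceph mon remove b
mv /var/lib/ceph/mon/ceph-b /var/lib/ceph/mon/ceph-b.old
ceph auth get mon. -o /tmp/mon.keyring
ceph mon getmap -o /tmp/monmap
mkdir /var/lib/ceph/mon/ceph-b
ceph-mon -i b --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
chown -R ceph:ceph /var/lib/ceph/mon/ceph-b
systemctl start ceph-mon@b

The freshly created store will then use the default RocksDB backend.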



into big performance issues with snaptrims. The I/O of the cluster was
nearly stalled when our regular snaptrim tasks ran. IcePic
pointed us to try compacting the OSDs. This solved it for us. It seems

How did you do this? Can it be done upfront, or should it be done after the 
upgrade?
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-15 compact

I tried getting the status, but it failed because the OSD was running. Should I 
prepare to stop/start all the OSD daemons to do this compacting?

We compacted the OSDs after the upgrade, and we did it live, so there was 
no need to stop the OSDs.


ceph daemon osd.0 compact


This task increases the CPU usage of the OSD at first. It can last a 
while, on average around 10 minutes in our case.
We compacted several OSDs in parallel, but not all OSDs at 
once, out of precaution. For us it didn't seem to have any 
impact on cluster performance.
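
If it helps, running the online compaction over several OSDs on one host can be 
as simple as something like the following (OSD ids are just examples; 'ceph daemon' 
talks to the local admin socket, while 'ceph tell osd.<id> compact' can be used 
from any node):

for id in 0 1 2 3; do
    ceph daemon osd.$id compact &
done
wait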
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph cluster shutdown procedure

2022-11-24 Thread Steven Goodliff
Hi,

Thanks Eugen, I found some similar docs on the Red Hat site as well and made
an Ansible playbook to follow the steps.


Cheers


On Thu, 17 Nov 2022 at 13:28, Steven Goodliff  wrote:

> Hi,
>
> Is there a recommended way of shutting a cephadm cluster down completely?
>
> I tried using cephadm to stop all the services but hit the following
> message.
>
> "Stopping entire osd.osd service is prohibited"
>
> Thanks
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Persistent Bucket Notification performance

2022-11-24 Thread Steven Goodliff
Hi,

I'm really struggling with persistent bucket notifications running 17.2.3.
I can't get much more than 600 notifications a second, but when changing to
async I see higher rates, using the following metric:

sum(rate(ceph_rgw_pubsub_push_ok[$__rate_interval]))

I believe this is mainly down to being throttled by using one RGW rather than
all the RGWs the async method allows.

We would prefer to use persistent notifications but can't get the throughput we
need; any suggestions would be much appreciated.

Thanks

Steven
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Best practice taking cluster down

2022-11-24 Thread Murilo Morais
Hi Dominique!

On this list, there was recently a thread discussing the same subject. [1]
You can follow SUSE's recommendations, which work well! [2]

Have a good day!

[1]
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/QN4GUPPZ5IZYLQ4PD4KV737L5M6DJ4CI/
[2]
https://documentation.suse.com/ses/7.1/single-html/ses-admin/#sec-salt-cluster-reboot
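
In short, those recommendations boil down to something like the following before 
powering off (stop clients first, then OSD nodes, then MON/MGR nodes last, and 
power on in the reverse order); this is only a rough sketch, see [2] for the full 
procedure:

ceph osd set noout
ceph osd set norebalance
ceph osd set norecover
ceph osd set nobackfill
ceph osd set nodown
ceph osd set pause

# after powering everything back on and regaining quorum:
ceph osd unset pause
# ...and unset the remaining flags the same way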

On Thu, 24 Nov 2022 at 05:57, Dominique Ramaekers <
dominique.ramaek...@cometal.be> wrote:

> Hi,
>
> We are going to do some maintenance on our power grid. I'll need to put my
> ceph cluster down.
>
> My cluster is a simple tree node cluster. Before shutting down the
> systems, I'll take down all virtual machines and other services who depend
> on the cluster storage.
>
> Is it sufficient I set the 'noout' flag before I shutdown the physical
> servers?
>
> I plan to shutdown in reverse order (one after the other): Server 3,
> server 2, server 1.
> I startup the cluster in order: Server 1, server 2, server 3.
>
> Dominique.
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Persistent Bucket Notification performance

2022-11-24 Thread Yuval Lifshitz
Hi Steven,
When using synchronous (=non-persistent) notifications, the overall rate is
dependent on the latency between the RGW and the endpoint to which you are
sending the notifications. The protocols for sending the
notifications (kafka/amqp) are using batches and are usually very
efficient. However, if there is latency, this would slow down the RGW.

When using asynchronous (=persistent) notifications, they are written to a
RADOS-backed queue by the RGW that received the request, and then pulled
from that queue by some other (or sometimes the same) RGW that sends the
notifications to the endpoint.
Pulling from the queue, and sending the notifications is usually very fast,
however, writing to the notification queue is a RADOS operation. The amount
of information written to the queue is usually small but still has the
RADOS overhead, as the notifications are written one-by-one. So, in this
case, the limiting factor would be the RADOS IOpS.
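
For completeness, whether a topic is persistent or not is chosen when the topic 
is created. A minimal sketch using boto3 against the RGW SNS-compatible API 
(endpoint, credentials and receiver URL below are placeholders):

import boto3

sns = boto3.client(
    "sns",
    endpoint_url="http://rgw.example.com:8000",   # RGW endpoint (placeholder)
    region_name="default",
    aws_access_key_id="ACCESS",
    aws_secret_access_key="SECRET",
)

# persistent=true: notifications are queued in RADOS and delivered asynchronously
sns.create_topic(
    Name="events-persistent",
    Attributes={
        "push-endpoint": "http://receiver.example.com:8080",
        "persistent": "true",
    },
)

With "persistent": "false" (or the attribute omitted) the notifications are sent 
synchronously from the RGW that handled the request.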

Please let me know if this clarifies the behavior you observe?

Yuval

On Thu, Nov 24, 2022 at 1:27 PM Steven Goodliff  wrote:

> Hi,
>
> I'm really struggling with persistent bucket notifications running 17.2.3.
> I can't get much more than 600 notifications a second but when changing to
> async then i see higher rates using the following metric
>
> sum(rate(ceph_rgw_pubsub_push_ok[$__rate_interval]))
>
> I believe this is mainly down to being throttled by using 1 rgw rather than
> all the rgw's the async method allows.
>
> We would prefer to use persistent but can't get the throughput we need, any
> suggestions would be much appreciated.
>
> Thanks
>
> Steven
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Persistent Bucket Notification performance

2022-11-24 Thread Steven Goodliff
Hi,

Thanks for the quick response. I have the notifications going to an HTTP
endpoint running on one of the RGW machines, so the latency is as low as I
can make it for both methods. If the limiting factor is at the RADOS layer,
is my only tuning option to put the rgw log pool on the fastest media I
have available?
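
If so, my rough plan would be to pin that pool to SSD-class devices with a 
dedicated CRUSH rule, something like the following (rule name arbitrary, assuming 
an 'ssd' device class and the default pool name default.rgw.log):

ceph osd crush rule create-replicated rgw-log-ssd default host ssd
ceph osd pool set default.rgw.log crush_rule rgw-log-ssd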



On Thu, 24 Nov 2022 at 13:37, Yuval Lifshitz  wrote:

> Hi Steven,
> When using synchronous (=non-persistent) notifications, the overall rate
> is dependent on the latency between the RGW and the endpoint to which you
> are sending the notifications. The protocols for sending the
> notifications (kafka/amqp) are using batches and are usually very
> efficient. However, if there is latency, this would slow down the RGW.
>
> When using asynchronous (=persistent) notifications, they are written to a
> RADOS-backed queue by the RGW that received the request, and then pulled
> from that queue by some other (or sometimes the same) RGW that sends the
> notifications to the endpoint.
> Pulling from the queue, and sending the notifications is usually very
> fast, however, writing to the notification queue is a RADOS operation. The
> amount of information written to the queue is usually small but still has
> the RADOS overhead, as the notifications are written one-by-one. So, in
> this case, the limiting factor would be the RADOS IOpS.
>
> Please let me know if this clarifies the behavior you observe?
>
> Yuval
>
> On Thu, Nov 24, 2022 at 1:27 PM Steven Goodliff 
> wrote:
>
>> Hi,
>>
>> I'm really struggling with persistent bucket notifications running 17.2.3.
>> I can't get much more than 600 notifications a second but when changing to
>> async then i see higher rates using the following metric
>>
>> sum(rate(ceph_rgw_pubsub_push_ok[$__rate_interval]))
>>
>> I believe this is mainly down to being throttled by using 1 rgw rather
>> than
>> all the rgw's the async method allows.
>>
>> We would prefer to use persistent but can't get the throughput we need,
>> any
>> suggestions would be much appreciated.
>>
>> Thanks
>>
>> Steven
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Clean prometheus files in /var/lib/ceph

2022-11-24 Thread Mevludin Blazevic

Hi all,

On my Ceph admin machine, a lot of large files are produced by 
Prometheus, e.g.:


./var/lib/ceph/8c774934-1535-11ec-973e-525400130e4f/prometheus.cephadm/data/wal/00026165
./var/lib/ceph/8c774934-1535-11ec-973e-525400130e4f/prometheus.cephadm/data/wal/00026166
./var/lib/ceph/8c774934-1535-11ec-973e-525400130e4f/prometheus.cephadm/data/01GJFXMYMVE3ESDG8PR9VAK05N/chunks/01
./var/lib/ceph/8c774934-1535-11ec-973e-525400130e4f/prometheus.cephadm/data/01GJFXMYMVE3ESDG8PR9VAK05N/chunks/02
./var/lib/ceph/8c774934-1535-11ec-973e-525400130e4f/prometheus.cephadm/data/01GJ0F8MS0MT1ENPQZFSQ4SPN9/chunks/01

Is it possible to simply run the rm command on these files, or is there a 
ceph command for this?


Regards,

Mevludin

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Configuring rgw connection timeouts

2022-11-24 Thread Thilo-Alexander Ginkel
Hi Kevin, all,

I tried what you suggested, but AFAICS (and judging from the error
message) supplying these config parameters via the RGW service spec is
not supported right now. Applying it causes an error:

Error EINVAL: Invalid config option request_timeout_ms in spec

The spec looks like this (X.509 cert redacted):

-- 8< --
placement:
  count_per_host: 2
  label: rgw
service_id: rgw
service_type: rgw
spec:
  config:
    request_timeout_ms: 12
  rgw_frontend_port: 8000
  rgw_frontend_ssl_certificate: ''
  rgw_realm: myrealm
  rgw_zone: myzone
  ssl: true
-- 8< --

Having a look at how rgw_frontends is constructed in cephadmservice.py
[1] confirms that this might be unsupported.

Would it make sense to extend the spec so that rgw_frontends can be
configured via cephadm?
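
For now the only workaround I can see is Casey's suggestion below, i.e. overriding
rgw_frontends on each daemon individually after deployment, roughly like this
(the timeout value is just an example, and the complete frontend line that cephadm
generated - including the SSL port and certificate options - has to be repeated,
since the whole string gets replaced):

ceph config set client.rgw.rgw.ceph-5.yjgdea rgw_frontends "beast port=8000 request_timeout_ms=120000"

which is cumbersome with several differently-configured instances per host.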

Thanks & kind regards,
Thilo

[1] 
https://github.com/ceph/ceph/blob/57111af8f155e00431f30e0be183eb8f4e6c9eac/src/pybind/mgr/cephadm/services/cephadmservice.py#L881


On Thu, Nov 17, 2022 at 7:27 PM Fox, Kevin M  wrote:
>
> I think you can do it like:
> ```
> service_type: rgw
> service_id: main
> service_name: rgw.main
> placement:
>   label: rgwmain
> spec:
>   config:
>     rgw_keystone_admin_user: swift
> ```
>
> ?
>
> 
> From: Thilo-Alexander Ginkel 
> Sent: Thursday, November 17, 2022 10:21 AM
> To: Casey Bodley
> Cc: ceph-users@ceph.io
> Subject: [ceph-users] Re: Configuring rgw connection timeouts
>
>
>
> Hello Casey,
>
> On Thu, Nov 17, 2022 at 6:52 PM Casey Bodley  wrote:
>
> > it doesn't look like cephadm supports extra frontend options during
> > deployment. but these are stored as part of the `rgw_frontends` config
> > option, so you can use a command like 'ceph config set' after
> > deployment to add request_timeout_ms
>
>
> unfortunately that doesn't really seem to work as cephadm is setting the
> config on a service instance level (e.g., client.rgw.rgw.ceph-5.yjgdea), so
> we can't simply override this on a higher hierarchical level. In addition,
> we deploy multiple rgw instances per node (to better utilize available
> resources) which get assigned different HTTP(S) ports by cephadm so they
> can coexist on the same host.
>
> Regards,
> Thilo
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: failure resharding radosgw bucket

2022-11-24 Thread Jan Horstmann
On Wed, 2022-11-23 at 12:57 -0500, Casey Bodley wrote:
> hi Jan,
> 
> On Wed, Nov 23, 2022 at 12:45 PM Jan Horstmann  
> wrote:
> > 
> > Hi list,
> > I am completely lost trying to reshard a radosgw bucket which fails
> > with the error:
> > 
> > process_single_logshard: Error during resharding bucket
> > 68ddc61c613a4e3096ca8c349ee37f56/snapshotnfs:(2) No such file or
> > directory
> > 
> > But let me start from the beginning. We are running a ceph cluster
> > version 15.2.17. Recently we received a health warning because of
> > "large omap objects". So I grepped through the logs to get more
> > information about the object and then mapped that to a radosgw bucket
> > instance ([1]).
> > I believe this should normally be handled by dynamic resharding of the
> > bucket, which has already been done 23 times for this bucket ([2]).
> > For recent resharding tries the radosgw is logging the error mentioned
> > at the beginning. I tried to reshard manually by following the process
> > in [3], but that consequently leads to the same error.
> > When running the reshard with debug options ( --debug-rgw=20 --debug-
> > ms=1) I can get some additional insight on where exactly the failure
> > occurs:
> > 
> > 2022-11-23T10:41:20.754+ 7f58cf9d2080  1 --
> > 10.38.128.3:0/1221656497 -->
> > [v2:10.38.128.6:6880/44286,v1:10.38.128.6:6881/44286] --
> > osd_op(unknown.0.0:46 5.6 5:66924383:reshard::reshard.05:head
> > [call rgw.reshard_get in=149b] snapc 0=[]
> > ondisk+read+known_if_redirected e44374) v8 -- 0x56092dd46a10 con
> > 0x56092dcfd7a0
> > 2022-11-23T10:41:20.754+ 7f58bb889700  1 --
> > 10.38.128.3:0/1221656497 <== osd.210 v2:10.38.128.6:6880/44286 4 
> > osd_op_reply(46 reshard.05 [call] v0'0 uv1180019 ondisk = -2
> > ((2) No such file or directory)) v8  162+0+0 (crc 0 0 0)
> > 0x7f58b00dc020 con 0x56092dcfd7a0
> > 
> > 
> > I am not sure how to interpret this and how to debug this any further.
> > Of course I can provide the full output if that helps.
> > 
> > Thanks and regards,
> > Jan
> > 
> > [1]
> > root@ceph-mon1:~# grep -r 'Large omap object found. Object'
> > /var/log/ceph/ceph.log
> > 2022-11-15T14:47:28.900679+ osd.47 (osd.47) 10890 : cluster [WRN]
> > Large omap object found. Object: 3:9660022b:::.dir.ee3fa6a3-4af3-4ac2-
> > 86c2-d2c374080b54.63073818.19.9:head PG: 3.d4400669 (3.29) Key count:
> > 336457 Size (bytes): 117560231
> > 2022-11-17T04:51:43.593811+ osd.50 (osd.50) 90 : cluster [WRN]
> > Large omap object found. Object: 3:0de49b75:::.dir.ee3fa6a3-4af3-4ac2-
> > 86c2-d2c374080b54.63073818.19.10:head PG: 3.aed927b0 (3.30) Key count:
> > 205346 Size (bytes): 71669614
> > 2022-11-18T02:55:07.182419+ osd.47 (osd.47) 10917 : cluster [WRN]
> > Large omap object found. Object: 3:9660022b:::.dir.ee3fa6a3-4af3-4ac2-
> > 86c2-d2c374080b54.63073818.19.9:head PG: 3.d4400669 (3.29) Key count:
> > 449776 Size (bytes): 157310435
> > 2022-11-19T09:56:47.630679+ osd.29 (osd.29) 114 : cluster [WRN]
> > Large omap object found. Object: 3:61ad76c5:::.dir.ee3fa6a3-4af3-4ac2-
> > 86c2-d2c374080b54.63073818.19.12:head PG: 3.a36eb586 (3.6) Key count:
> > 213843 Size (bytes): 74703544
> > 2022-11-20T13:04:39.979349+ osd.72 (osd.72) 83 : cluster [WRN]
> > Large omap object found. Object: 3:2b3227e7:::.dir.ee3fa6a3-4af3-4ac2-
> > 86c2-d2c374080b54.63073818.19.22:head PG: 3.e7e44cd4 (3.14) Key count:
> > 326676 Size (bytes): 114453145
> > 2022-11-21T02:53:32.410698+ osd.50 (osd.50) 151 : cluster [WRN]
> > Large omap object found. Object: 3:0de49b75:::.dir.ee3fa6a3-4af3-4ac2-
> > 86c2-d2c374080b54.63073818.19.10:head PG: 3.aed927b0 (3.30) Key count:
> > 216764 Size (bytes): 75674839
> > 2022-11-22T18:04:09.757825+ osd.47 (osd.47) 10964 : cluster [WRN]
> > Large omap object found. Object: 3:9660022b:::.dir.ee3fa6a3-4af3-4ac2-
> > 86c2-d2c374080b54.63073818.19.9:head PG: 3.d4400669 (3.29) Key count:
> > 449776 Size (bytes): 157310435
> > 2022-11-23T00:44:55.316254+ osd.29 (osd.29) 163 : cluster [WRN]
> > Large omap object found. Object: 3:61ad76c5:::.dir.ee3fa6a3-4af3-4ac2-
> > 86c2-d2c374080b54.63073818.19.12:head PG: 3.a36eb586 (3.6) Key count:
> > 213843 Size (bytes): 74703544
> > 2022-11-23T09:10:07.842425+ osd.55 (osd.55) 13968 : cluster [WRN]
> > Large omap object found. Object: 3:3fa378c9:::.dir.ee3fa6a3-4af3-4ac2-
> > 86c2-d2c374080b54.63073818.19.20:head PG: 3.931ec5fc (3.3c) Key count:
> > 219204 Size (bytes): 76509687
> > 2022-11-23T09:11:15.516973+ osd.72 (osd.72) 112 : cluster [WRN]
> > Large omap object found. Object: 3:2b3227e7:::.dir.ee3fa6a3-4af3-4ac2-
> > 86c2-d2c374080b54.63073818.19.22:head PG: 3.e7e44cd4 (3.14) Key count:
> > 326676 Size (bytes): 114453145
> > root@ceph-mon1:~# radosgw-admin metadata list "bucket.instance" | grep
> > ee3fa6a3-4af3-4ac2-86c2-d2c374080b54.63073818.19
> > "68ddc61c613a4e3096ca8c349ee37f56/snapshotnfs:ee3fa6a3-4af3-4ac2-
> > 86c2-d2c374080b54.63073818.19",
> > 
> > [2]
> > root@ceph-mon1:~# rados

[ceph-users] Upgrade 16.2.10 to 17.2.x: any caveats?

2022-11-24 Thread Zakhar Kirpichenko
Hi!

I'm planning a service window to make some network upgrades, and would like
to use the same window to upgrade our Ceph cluster from 16.2.10 to the
latest 17.2.x available on that date.

The cluster is a fairly simple 6-node setup with a mix of NVME (WAL/DB) and
HDD (block) drives, several replicated pools and 1 testing EC pool. Are
there any known issues or caveats I should consider to better prepare for
such an upgrade?

I would very much appreciate any advice!

Best regards,
Zakhar
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: filesystem became read only after Quincy upgrade

2022-11-24 Thread Xiubo Li

Hi Adrien,

Thank you for your logs.

From your logs I found one bug. I have raised a new tracker [1] to 
follow it, and a Ceph PR [2] to fix it.


For more detail, please see my analysis in the tracker [1].

[1] https://tracker.ceph.com/issues/58082
[2] https://github.com/ceph/ceph/pull/49048

Thanks

- Xiubo


On 24/11/2022 16:33, Adrien Georget wrote:

Hi Xiubo,

We did the upgrade in rolling mode as always, with only few kubernetes 
pods as clients accessing their PVC on CephFS.


I can reproduce the problem everytime I restart the MDS daemon.
You can find the MDS log with debug_mds 25 and debug_ms 1 here : 
https://filesender.renater.fr/?s=download&token=4b413a71-480c-4c1a-b80a-7c9984e4decd 

(The last timestamp : 2022-11-24T09:18:12.965+0100 7fe02ffe2700 10 
mds.0.server force_clients_readonly)


I couldn't find any errors in the OSD logs, anything specific should I 
looking for?


Best,
Adrien 


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrade 16.2.10 to 17.2.x: any caveats?

2022-11-24 Thread Zakhar Kirpichenko
Thanks, Stefan. I've read those thoroughly.

My question is whether there's anything not covered by the available
documentation that we should be aware of.

/Z

On Fri, 25 Nov 2022 at 09:11, Stefan Kooman  wrote:

> On 11/24/22 18:53, Zakhar Kirpichenko wrote:
> > Hi!
> >
> > I'm planning a service window to make some network upgrades, and would
> like
> > to use the same window to upgrade our Ceph cluster from 16.2.10 to the
> > latest 17.2.x available on that date.
> >
> > The cluster is a fairly simple 6-node setup with a mix of NVME (WAL/DB)
> and
> > HDD (block) drives, several replicated pools and 1 testing EC pool. Are
> > there any known issues or caveats I should consider to better prepare for
> > such an upgrade?
> >
> > I would very much appreciate any advice!
>
> Read the release notes (also from previous quincy releases, especially
> the "notable changes"), and the upgrade procedure to quincy to the
> letter [1].
>
> Gr. Stefan
>
> [1]:
>
> https://docs.ceph.com/en/latest/releases/quincy/#upgrading-from-octopus-or-pacific
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io