[ceph-users] Re: EC cluster cascade failures and performance problems

2020-11-19 Thread Paul Kramme
Hi Igor, we store 400 TB of backups (RBD snapshots) on the cluster; depending on the schedule, we replace all data every one to two weeks, so we are deleting data every day. Yes, the OSDs are killed with messages like "heartbeat_check: no reply from 10.244.0.27:6852 osd.37 ever...", if that is what yo

[ceph-users] Re: newbie Cephfs auth permissions issues

2020-11-19 Thread Frank Schilder
That's a known issue. You probably did "enable application cephfs" on the pools. This prevents a metadata tag from being applied correctly. If you google your problem, you will find threads on this with fixes. There was at least one this year. Also, you could just start from scratch one more ti
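
A rough sketch of the kind of fix those threads describe, assuming the usual cephfs_data/cephfs_metadata pool names and a filesystem called "myfs" (all placeholders, not taken from this thread):

  # inspect the application tags currently set on the pools
  $ ceph osd pool application get cephfs_data
  # re-apply the data/metadata tags pointing at the filesystem name
  $ ceph osd pool application set cephfs_data cephfs data myfs
  $ ceph osd pool application set cephfs_metadata cephfs metadata myfs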

[ceph-users] Mon's falling out of quorum, require rebuilding. Rebuilt with only V2 address.

2020-11-19 Thread Wesley Dillingham
We have had multiple clusters experiencing the following situation over the past few months on both 14.2.6 and 14.2.11. In a few instances it seemed random; in a second situation we had a temporary networking disruption; in a third situation we accidentally made some OSD changes which caused certain
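
One way such monitors are usually given both address families again, sketched with placeholder mon IDs and IPs (the mon must be stopped, and this is not necessarily the procedure used here):

  $ ceph mon dump                      # see which addresses each mon advertises
  $ ceph-mon -i a --extract-monmap /tmp/monmap
  $ monmaptool --rm a /tmp/monmap
  $ monmaptool --addv a '[v2:10.0.0.1:3300,v1:10.0.0.1:6789]' /tmp/monmap
  $ ceph-mon -i a --inject-monmap /tmp/monmap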

[ceph-users] Re: Unable to reshard bucket

2020-11-19 Thread Eric Ivancich
Hey Timothy, Did you ever resolve this issue, and if so, how? > Thank you..I looked through both logs and noticed this in the cancel one: > > osd_op(unknown.0.0:4164 41.2 41:55b0279d:reshard::reshard.09:head > [call > rgw.reshard_remove] snapc 0=[] ondisk+write+known_if_redirected e2498
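
For context, the usual commands for inspecting and cancelling a stuck reshard look roughly like this (bucket name and shard count are placeholders):

  $ radosgw-admin reshard list
  $ radosgw-admin reshard status --bucket=my-bucket
  $ radosgw-admin reshard cancel --bucket=my-bucket
  # retry with an explicit shard count once the stale entry is gone
  $ radosgw-admin bucket reshard --bucket=my-bucket --num-shards=64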

[ceph-users] Re: EC cluster cascade failures and performance problems

2020-11-19 Thread Igor Fedotov
Hi Paul, any chance you initiated a massive data removal recently? Are there any suicide timeouts in OSD logs prior to the OSD failures? Any log output containing "slow operation observed" there? Please also note the following PR and tracker comments which might be relevant for your case. https
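
A quick way to look for the symptoms Igor mentions, assuming the default log location:

  $ grep -E 'suicide timeout|slow operation observed' /var/log/ceph/ceph-osd.*.log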

[ceph-users] Re: BLUEFS_SPILLOVER BlueFS spillover detected

2020-11-19 Thread Igor Fedotov
This is a known issue with RocksDB/BlueFS. Discussed multiple times on this mailing list... This should improve starting with Nautilus v14.2.12 thanks to the following PRs: https://github.com/ceph/ceph/pull/33889 https://github.com/ceph/ceph/pull/37091 Please note these PRs don't fix existing sp
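
To see which OSDs are affected and by how much, something along these lines is commonly used (osd.7 is a placeholder); a manual compaction after upgrading is the remediation usually suggested for existing spillover:

  $ ceph health detail | grep -i spillover
  $ ceph daemon osd.7 perf dump bluefs | grep -E 'db_used_bytes|slow_used_bytes'
  # after upgrading to a release with the fixes
  $ ceph tell osd.7 compact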

[ceph-users] Re: Can't upgrade from 15.2.5 to 15.2.6... (Cannot calculate service_id: daemon_id='cephfs....')

2020-11-19 Thread Gencer Genç
Hi, I don't know how this happened, but it seems the second node's hosts file (/etc/hosts) was broken and "host-1" identified itself as "host". Fixing /etc/hosts also fixed this issue. Thanks, Gencer. On 19.11.2020 17:33:52, "Gencer Genç" wrote: Hi, I ran those commands as usual: $ ceph orch hos
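
For anyone hitting the same thing, the check is roughly this (host name and IP are illustrative):

  $ hostname
  host-1
  $ grep host-1 /etc/hosts
  10.0.0.11   host-1        # must point at the node's own address
  $ ceph orch host ls       # names here should match the real hostnames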

[ceph-users] Slow OSDs

2020-11-19 Thread Kristof Coucke
Hi, We are having slow OSDs... a hot topic to search for... I've tried to dig as deep as I can, but I need to know which debug settings will help me dig even deeper... Okay, the situation: - After expansion, lots of backfill operations are running, spread over the OSDs. - max_backfills is set
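
A few starting points for digging into individual slow OSDs (osd.12 is a placeholder; these are generic commands, not a fix for this particular cluster):

  $ ceph daemon osd.12 dump_ops_in_flight   # run on the OSD's host
  $ ceph daemon osd.12 dump_historic_ops    # recent slowest ops with per-stage timings
  # temporarily raise debug levels on that OSD only
  $ ceph tell osd.12 config set debug_osd 10
  $ ceph tell osd.12 config set debug_bluestore 10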

[ceph-users] Can't upgrade from 15.2.5 to 15.2.6... (Cannot calculate service_id: daemon_id='cephfs....')

2020-11-19 Thread Gencer Genç
Hi, I ran those commands as usual: $ ceph orch host ls Result is as expected, with host names and addresses. $ ceph orch ls Again, expected result as before. Then I started the upgrade via this command: $ ceph orch upgrade start --ceph-version 15.2.6 It failed with the attached logs. Please see log
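
When an orchestrated upgrade fails like this, the state is usually inspected with something like:

  $ ceph orch upgrade status
  $ ceph log last cephadm          # or: ceph -W cephadm, to follow the cephadm log live
  $ ceph orch upgrade pause        # hold the upgrade while investigating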

[ceph-users] newbie Cephfs auth permissions issues

2020-11-19 Thread Jonathan D. Proulx
Hi All, I've been using Ceph block and object storage for years but am just wandering into CephFS now (Nautilus, all servers on 14.2.9). I created small data and metadata pools, a new filesystem and used: ceph fs authorize client. / rw creating two new users to mount it, both can one using fuse (
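
For reference, the fully spelled-out form of that command looks roughly like this (filesystem and client names are placeholders):

  $ ceph fs authorize myfs client.jon / rw
  $ ceph auth get client.jon       # inspect the caps that were actually generated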

[ceph-users] Re: osd_pglog memory hoarding - another case

2020-11-19 Thread Kalle Happonen
Hello, I thought I'd post an update. Setting the pg_log size to 500 and running the offline trim operation sequentially on all OSDs seems to help. With our current setup, it takes about 12-48h per node, depending on the PGs per OSD. The PG counts per OSD we have range from ~180 to ~750, with a majority
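
A sketch of the approach described above, with placeholder OSD ids and paths (the offline trim must be run with the OSD stopped):

  # cap the pg log length cluster-wide
  $ ceph config set osd osd_min_pg_log_entries 500
  $ ceph config set osd osd_max_pg_log_entries 500
  # offline trim, one OSD at a time
  $ systemctl stop ceph-osd@12
  $ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --op trim-pg-log
  $ systemctl start ceph-osd@12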

[ceph-users] Re: MGR restart loop

2020-11-19 Thread Frank Schilder
Hi all, there seems to be a bug in how beacon time-outs are computed. After waiting for a full time-out period of 86400s=24h, the problem disappeared. It looks like received beacons are only counted properly after a MON was up for the grace period. I have no other explanation. Best regards, ==

[ceph-users] Re: v15.2.6 Octopus released

2020-11-19 Thread Ilya Dryomov
On Thu, Nov 19, 2020 at 3:39 AM David Galloway wrote: > > This is the 6th backport release in the Octopus series. This release > fixes a security flaw affecting Messenger V2 for Octopus & Nautilus. We > recommend users update to this release. > > Notable Changes > --- > * CVE 2020-

[ceph-users] Re: EC overwrite

2020-11-19 Thread Konstantin Shalygin
I don't think it's a problem, but it is also useless. k Sent from my iPhone > On 18 Nov 2020, at 07:06, Szabo, Istvan (Agoda) > wrote: > > Is it a problem if ec_overwrite is enabled in the data pool? > https://docs.ceph.com/en/latest/rados/operations/erasure-code/#erasure-coding-with-overwr
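
For reference, checking and setting the flag looks roughly like this (pool name is a placeholder); RBD and CephFS on EC pools need it, RGW does not:

  $ ceph osd pool ls detail | grep ec-data-pool    # look for the ec_overwrites flag
  $ ceph osd pool set ec-data-pool allow_ec_overwrites true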

[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-19 Thread Janek Bevendorff
We are doing that as well. But we need to be able to check specific buckets additionally. For that we use this second approach. Since we double-check all output from our script anyway (to see if NoSuchKey actually happens), we can rule out false positives. So far all the files detected this wa

[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-19 Thread Janek Bevendorff
I would recommend you get a dump with rados ls -p poolname (can be several GB, mine is 61GB) and grep (or ack, which is faster) for the names there to get an overview of what is there and what isn't. Looking up the names directly can easily give you the wrong picture, because it is kinda compli
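
A sketch of that approach, assuming the default RGW data pool name:

  $ rados ls -p default.rgw.buckets.data > /tmp/rados-objects.txt
  $ grep 'my-object-key' /tmp/rados-objects.txt    # object key is a placeholder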

[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-19 Thread Denis Krienbühl
Thanks, we are currently scanning our object storage. It looks like we can detect the missing objects that return “No Such Key” by looking at all “__multipart_” objects returned by radosgw-admin bucket radoslist, and checking whether they exist using rados stat. We are currently not looking at shadow ob
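
A sketch of that check, with placeholder bucket and pool names:

  radosgw-admin bucket radoslist --bucket=my-bucket > /tmp/radoslist.txt
  grep '__multipart_' /tmp/radoslist.txt | while read -r obj; do
    rados -p default.rgw.buckets.data stat "$obj" || echo "MISSING: $obj"
  done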

[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-19 Thread Janek Bevendorff
- The head object had a size of 0. - There was an object with a ’shadow’ in its name, belonging to that path. That is normal. What is not normal is if there are NO shadow objects. On 18/11/2020 10:06, Denis Krienbühl wrote: It looks like a single-part object. But we did replace that object last
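
Checking a single object for this pattern might look roughly like the following, where HEAD_OBJ stands for the raw head object name as reported by radoslist, and the pool and key names are placeholders:

  $ rados -p default.rgw.buckets.data stat "$HEAD_OBJ"                            # head size
  $ rados ls -p default.rgw.buckets.data | grep shadow | grep 'my/object/key'     # tail pieces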