[ceph-users] Re: Mismatched object counts between "rados df" and "rados ls" after rbd images removal

2020-05-20 Thread Eugen Block
The rbd_info, rbd_directory objects will remain until you delete the pool, you don't need to clean that up, e.g. if you decide to create new rbd images in there. The number of remaining objects usually slowly decreases depending on the amount of data that was deleted. Just last week I deleted
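For anyone wanting to reproduce the comparison this thread is about, a minimal check (assuming the pool is simply named "rbd"; substitute your pool name) is:

  # per-pool object counts as Ceph accounts them (these can lag behind deletions)
  rados df | grep '^rbd '
  # objects actually listable right now
  rados -p rbd ls | wc -l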

[ceph-users] Pool full but the user cleaned it up already

2020-05-20 Thread Szabo, Istvan (Agoda)
Hi, I have a health warn regarding pool full: health: HEALTH_WARN 1 pool(s) full This is the pool that is complaining: Ceph df: NAME ID USED %USED MAX AVAIL OBJECTS k8s 8 200GiB 0.22

[ceph-users] Large omap

2020-05-20 Thread Szabo, Istvan (Agoda)
Hi, I have in one of my cluster a large omap object under luminous 12.2.8. HEALTH_WARN 1 large omap objects LARGE_OMAP_OBJECTS 1 large omap objects 1 large objects found in pool 'default.rgw.log' Search the cluster log for 'Large omap object found' for more details. In my setup the ha-pr

[ceph-users] Possible bug in op path?

2020-05-20 Thread Robert LeBlanc
We upgraded our Jewel cluster to Nautilus a few months ago and I've noticed that op behavior has changed. This is an HDD cluster (NVMe journals and NVMe CephFS metadata pool) with about 800 OSDs. When on Jewel and running WPQ with the high cut-off, it was rock solid. When we had recoveries going on

[ceph-users] Re: Reweighting OSD while down results in undersized+degraded PGs

2020-05-20 Thread Dan van der Ster
Hi Andras, To me it looks like osd.0 is not peering when it starts with crush weight 0. I would try forcing the re-peering with `ceph osd down osd.0` when the PGs are unexpectedly degraded. (e.g. start the osd when crush weight is 0, then observe the PGs are still degraded, then force the re-p
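The sequence Dan describes, spelled out as commands (osd.0 is just the example from this thread):

  # with the OSD started at crush weight 0, check whether PGs remain degraded
  ceph pg dump pgs_brief | grep -c degraded
  # then force a re-peer of that OSD and watch the cluster state
  ceph osd down osd.0
  ceph -s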

[ceph-users] Re: Large omap

2020-05-20 Thread Janne Johansson
On Wed 20 May 2020 at 05:23, Szabo, Istvan (Agoda) < istvan.sz...@agoda.com> wrote: > LARGE_OMAP_OBJECTS 1 large omap objects > 1 large objects found in pool 'default.rgw.log' > When I look for this large omap object, this is the one: > for i in `ceph pg ls-by-pool default.rgw.log | tail -n +2
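The loop quoted above is cut off; the general technique for locating a large omap object is to count omap keys per object, roughly along these lines (a sketch, not the poster's exact script):

  for obj in $(rados -p default.rgw.log ls); do
    echo "$(rados -p default.rgw.log listomapkeys "$obj" | wc -l) $obj"
  done | sort -n | tail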

[ceph-users] Re: Pool full but the user cleaned it up already

2020-05-20 Thread Eugen Block
Okay, so the OSDs are in fact not full, it's strange that the pool still is reported as full. Maybe restart the mgr services? Zitat von "Szabo, Istvan (Agoda)" : Yeah, sorry: ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS 12 ssd 2.29799 1.0 2.30TiB 2.67GiB 2.30TiB 0.

[ceph-users] Re: Possible bug in op path?

2020-05-20 Thread Dan van der Ster
Hi Robert, Since you didn't mention -- are you using osd_op_queue_cut_off low or high? I know you are usually advocating high, but the default is still low and most users don't change this setting. Cheers, Dan On Wed, May 20, 2020 at 9:41 AM Robert LeBlanc wrote: > > We upgraded our Jewel clus
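For reference, checking and changing the setting being discussed (on Nautilus with the centralized config; the OSDs need a restart before it takes effect):

  ceph config get osd osd_op_queue_cut_off
  ceph config set osd osd_op_queue_cut_off high
  # confirm what a running OSD actually uses (run on that OSD's host)
  ceph daemon osd.0 config get osd_op_queue_cut_off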

[ceph-users] 15.2.2 Upgrade - Corruption: error in middle of record

2020-05-20 Thread Ashley Merrick
I just upgraded a cephadm cluster from 15.2.1 to 15.2.2. Everything went fine on the upgrade, however after restarting one node that has 3 OSD's for ecmeta, two of the 3 OSD's now won't boot with the following error: May 20 08:29:42 sn-m01 bash[6833]: debug 2020-05-20T08:29:42.598+ 7fbcc4

[ceph-users] Re: Aging in S3 or Moving old data to slow OSDs

2020-05-20 Thread Khodayar Doustar
Anyone knows anything about this? ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: Aging in S3 or Moving old data to slow OSDs

2020-05-20 Thread Janne Johansson
Tiered pools should be able to do this for you. It has been discouraged as a performance gain (i.e. the reverse case, where you have spinning drives and want to put an SSD pool in front of them to get SSD performance at HDD price/capacity) in some cases, but if you do it for migrations it should probably be worth it

[ceph-users] Re: Pool full but the user cleaned it up already

2020-05-20 Thread Szabo, Istvan (Agoda)
Hello, No, haven't deleted, this warning is quite long time ago. ceph health detail HEALTH_WARN 1 pool(s) full POOL_FULL 1 pool(s) full pool 'k8s' is full (no quota) ceph df GLOBAL: SIZE AVAIL RAW USED %RAW USED 315TiB 313TiB 2.27TiB 0.72 POOLS: NA
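A few checks that help narrow down a stale POOL_FULL warning like this one (generic commands; 'k8s' is the pool from this thread):

  ceph df detail
  ceph osd pool get-quota k8s
  ceph osd dump | grep -E 'full|nearfull'
  ceph health detail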

[ceph-users] Re: Aging in S3 or Moving old data to slow OSDs

2020-05-20 Thread Thomas Bennett
Hi Khodayar, Setting placement policies is probably not what you're looking for. I've used placement policies successfully to separate an HDD pool from an SSD pool. However, this policy only applies to new data if it is set. You would have to read it out and write it back in at the s3 level using
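For context, a placement target of the kind Thomas describes is defined with radosgw-admin; a rough sketch, with an illustrative placement id and pool names (adjust to your zone/zonegroup, and commit the period if you run multisite):

  radosgw-admin zonegroup placement add --rgw-zonegroup default --placement-id slow-hdd
  radosgw-admin zone placement add --rgw-zone default --placement-id slow-hdd \
      --data-pool default.rgw.slow.data --index-pool default.rgw.slow.index
  # restart the radosgw daemons afterwards so the new placement is picked up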

[ceph-users] [ceph][nautilus] performances with db/wal on nvme

2020-05-20 Thread Ignazio Cassano
Hello All, We have 6 servers. Configuration for each server: 1 ssd for mon (only on three servers) 1 ssd 1.9 TB for db/wal 1 nvme 1.6 TB for db/wal 10 SAS hdd 3.6 TB for osd We decided to create a pool of 30 osd (5x6) with db/wal on ssd and a pool of 30 (5x6) osd with db/wal on nvme. S

[ceph-users] Re: [ceph][nautilus] performances with db/wal on nvme

2020-05-20 Thread Janne Johansson
On Wed 20 May 2020 at 12:00, Ignazio Cassano wrote: > Hello All, > We have 6 servers. > Configuration for each server: > 1 ssd for mon (only on three servers) > 1 ssd 1.9 TB for db/wal > 1 nvme 1.6 TB for db/wal > 10 SAS hdd 3.6 TB for osd > We decided to create a pool of 30 osd (5x6) with db/w

[ceph-users] Re: [ceph][nautilus] performances with db/wal on nvme

2020-05-20 Thread Ignazio Cassano
Hello Janne, so do you think we must move from 10Gbs to 40 or 100GBs to make the most of nvme? Thanks Ignazio On Wed 20 May 2020 at 12:06, Janne Johansson < icepic...@gmail.com> wrote: > On Wed 20 May 2020 at 12:00, Ignazio Cassano < > ignaziocass...@gmail.com> wrote: > >> H

[ceph-users] Re: Pool full but the user cleaned it up already

2020-05-20 Thread Szabo, Istvan (Agoda)
Yeah, sorry: ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS 12 ssd 2.29799 1.0 2.30TiB 2.67GiB 2.30TiB 0.11 0.16 24 13 ssd 2.29799 1.0 2.30TiB 2.33GiB 2.30TiB 0.10 0.14 21 14 ssd 3.49300 1.0 3.49TiB 2.71GiB 3.49TiB 0.08 0.11 27 27 ssd 2.29799 1.0 2.3

[ceph-users] Re: 15.2.2 Upgrade - Corruption: error in middle of record

2020-05-20 Thread Ashley Merrick
So reading online it looked like a dead-end error, so I recreated the 3 OSD's on that node and they are now working fine after a reboot. However I restarted the next server with 3 OSD's and one of them is now facing the same issue. Let me know if you need any more logs. Thanks On Wed, 20 May 2

[ceph-users] Re: [ceph][nautilus] performances with db/wal on nvme

2020-05-20 Thread Janne Johansson
On Wed 20 May 2020 at 12:14, Ignazio Cassano wrote: > Hello Janne, so do you think we must move from 10Gbs to 40 or 100GBs to > to make the most of nvme ? > I think there are several factors to weigh in when you need to maximize performance, from putting the BIOS into performance mode, having as fa

[ceph-users] Re: [ceph][nautilus] performances with db/wal on nvme

2020-05-20 Thread Ignazio Cassano
Many thanks, Janne Ignazio On Wed 20 May 2020 at 12:32, Janne Johansson < icepic...@gmail.com> wrote: > On Wed 20 May 2020 at 12:14, Ignazio Cassano < > ignaziocass...@gmail.com> wrote: > >> Hello Janne, so do you think we must move from 10Gbs to 40 or 100GBs to >> to make the mo

[ceph-users] Re: Large omap

2020-05-20 Thread Thomas Bennett
Hi, Have you looked at the omap keys to see what's listed there? In our configuration, the radosgw garbage collector uses the *default.rgw.logs* pool for garbage collection (radosgw-admin zone get default | jq .gc_pool). I've seen large omaps in my *default.rgw.logs* pool before when I've deleted l
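To confirm whether such an object is one of the RGW garbage-collection shards, something along these lines works (the shard name gc.31 is illustrative; GC shards usually live in the "gc" namespace of the log pool):

  radosgw-admin zone get default | jq .gc_pool
  radosgw-admin gc list --include-all | head
  # count the omap keys on the reported object
  rados -p default.rgw.log -N gc listomapkeys gc.31 | wc -l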

[ceph-users] OSDs taking too much memory, for buffer_anon

2020-05-20 Thread Harald Staub
As a follow-up to our recent memory problems with OSDs (with high pglog values: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/LJPJZPBSQRJN5EFE632CWWPK3UMGG3VF/#XHIWAIFX4AXZK5VEFOEBPS5TGTH33JZO ), we also see high buffer_anon values. E.g. more than 4 GB, with "osd memory target
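The figures discussed in this thread come from the OSD admin socket; to collect them yourself (run on the OSD's host, osd.0 is a placeholder):

  ceph daemon osd.0 dump_mempools | jq .mempool.by_pool.buffer_anon
  ceph daemon osd.0 perf dump | jq .prioritycache
  ceph config get osd.0 osd_memory_target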

[ceph-users] Re: Aging in S3 or Moving old data to slow OSDs

2020-05-20 Thread Khodayar Doustar
Thomas, Yes, you are correct. I would have to move objects manually between (more than one) buckets if I use "Pool placements and Storage classes" So you have successfully used this method and it was OK? I may be forced to use this method because clients need more features than mere cache tierin

[ceph-users] total ceph outage again, need help

2020-05-20 Thread Frank Schilder
Dear cephers, I'm sitting with a major ceph outage again. The mon/mgr hosts suffer from a packet storm of ceph traffic between ceph fs clients and the mons. No idea why this is happening. Main problem is, that I can't get through to the cluster. Admin commands hang forever: [root@gnosis ~]# c

[ceph-users] Re: total ceph outage again, need help

2020-05-20 Thread Frank Schilder
Looks like the immediate danger has passed by: [root@gnosis ~]# ceph status cluster: id: e4ece518-f2cb-4708-b00f-b6bf511e91d9 health: HEALTH_WARN nodown,noout flag(s) set 735 slow ops, oldest one blocked for 3573 sec, daemons [mon.ceph-02,mon.ceph-03] have sl
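Once the cluster is reachable again, the flags left over from the firefighting can be cleared the usual way (the slow-ops warning should then drain; restarting the affected mons is a common way to clear a stuck counter):

  ceph osd unset nodown
  ceph osd unset noout
  ceph -s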

[ceph-users] Re: OSDs taking too much memory, for buffer_anon

2020-05-20 Thread Mark Nelson
Hi Harald, Any idea what the priority_cache_manager perf counters show? (or you can also enable debug osd / debug priority_cache_manager) The osd memory autotuning works by shrinking the bluestore and rocksdb caches to some target value to try and keep the mapped memory of the process below

[ceph-users] Re: 15.2.2 Upgrade - Corruption: error in middle of record

2020-05-20 Thread Igor Fedotov
Hi Ashley, looks like this is a regression. Neha observed similar error(s) during here QA run, see https://tracker.ceph.com/issues/45613 Please preserve broken OSDs for a while if possible, likely I'll come back to you for more information to troubleshoot. Thanks, Igor On 5/20/2020 1:26

[ceph-users] Re: 15.2.2 Upgrade - Corruption: error in middle of record

2020-05-20 Thread Ashley Merrick
Thanks, FYI the OSD's that went down back two pools, an Erasure code Meta (RBD) and cephFS Meta. The cephFS pool does have compression enabled (I noticed it mentioned in the ceph tracker). Thanks On Wed, 20 May 2020 20:17:33 +0800 Igor Fedotov wrote Hi Ashley, looks like

[ceph-users] Re: 15.2.2 Upgrade - Corruption: error in middle of record

2020-05-20 Thread Igor Fedotov
I don't believe compression is related to be honest. Wondering if these OSDs have standalone WAL and/or DB devices or just a single shared main device. Also could you please set debug-bluefs/debug-bluestore to 20 and collect startup log for broken OSD. Kind regards, Igor On 5/20/2020 3:2
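One way to produce the verbose startup log Igor asks for (a sketch; adjust the OSD id, and under cephadm the systemd unit is named ceph-<fsid>@osd.<id> instead):

  ceph config set osd.2 debug_bluestore 20/20
  ceph config set osd.2 debug_bluefs 20/20
  # try starting the OSD again and capture its log
  systemctl restart ceph-osd@2
  journalctl -u ceph-osd@2 --since "10 min ago" > osd.2-startup.log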

[ceph-users] Re: 15.2.2 Upgrade - Corruption: error in middle of record

2020-05-20 Thread Dan van der Ster
lz4 ? It's not obviously related, but I've seen it involved in really non-obvious ways: https://tracker.ceph.com/issues/39525 -- dan On Wed, May 20, 2020 at 2:27 PM Ashley Merrick wrote: > > Thanks, fyi the OSD's that went down back two pools, an Erasure code Meta > (RBD) and cephFS Meta. The c

[ceph-users] Re: OSDs taking too much memory, for buffer_anon

2020-05-20 Thread Harald Staub
Hi Mark Thank you for you explanations! Some numbers of this example osd below. Cheers Harry From dump mempools: "buffer_anon": { "items": 29012, "bytes": 4584503367 }, From perf dump: "prioritycache": { "target_bytes": 375

[ceph-users] Re: 15.2.2 Upgrade - Corruption: error in middle of record

2020-05-20 Thread Ashley Merrick
It's a single shared main device. Sadly I had already rebuilt the failed OSD's to bring me back in the green after a while. I have just tried a few restarts and none are failing (seems after a rebuild using 15.2.2 they are stable?). I don't have any other servers/OSD's I am willing to risk no

[ceph-users] Re: 15.2.2 Upgrade - Corruption: error in middle of record

2020-05-20 Thread Igor Fedotov
Do you still have any original failure logs? On 5/20/2020 3:45 PM, Ashley Merrick wrote: Is a single shared main device. Sadly I had already rebuilt the failed OSD's to bring me back in the green after a while. I have just tried a few restarts and none are failing (seems after a rebuild usin

[ceph-users] ceph orch upgrade stuck at the beginning.

2020-05-20 Thread Gencer W . Genç
Hi, I've 15.2.1 installed on all machines. On the primary machine I executed the ceph upgrade command: $ ceph orch upgrade start --ceph-version 15.2.2 When I check ceph -s I see this:   progress:     Upgrade to docker.io/ceph/ceph:v15.2.2 (30m)       [=...] (remaining: 8h) It

[ceph-users] Re: 15.2.2 Upgrade - Corruption: error in middle of record

2020-05-20 Thread Igor Fedotov
Dan, thanks for the info. Good to know. The failed QA run in the ticket uses snappy though. And in fact any stuff writing to process memory can introduce data corruption in a similar manner. So I will keep that in mind, but IMO the relation to compression is still not evident... Kind regards, Ig

[ceph-users] Re: ceph orch upgrade stuck at the beginning.

2020-05-20 Thread Ashley Merrick
What does ceph orch upgrade status show? On Wed, 20 May 2020 20:52:39 +0800 Gencer W. Genç wrote Hi, I've 15.2.1 installed on all machines. On primary machine I executed ceph upgrade command: $ ceph orch upgrade start --ceph-version 15.2.2 When I check ceph -s I see

[ceph-users] Re: 15.2.2 Upgrade - Corruption: error in middle of record

2020-05-20 Thread Ashley Merrick
I attached the log but was too big and got moderated. Here is it in a paste bin : https://pastebin.pl/view/69b2beb9 I have cut the log to start from the point of the original upgrade. Thanks On Wed, 20 May 2020 20:55:51 +0800 Igor Fedotov wrote Dan, thanks for the info. Go

[ceph-users] Re: ceph orch upgrade stuck at the beginning.

2020-05-20 Thread Gencer W . Genç
Hi Ashley, $ ceph orch upgrade status {     "target_image": "docker.io/ceph/ceph:v15.2.2",     "in_progress": true,     "services_complete": [],     "message": "" } Thanks, Gencer. On 20.05.2020 15:58:34, Ashley Merrick wrote: What does ceph orch upgrade status show? On Wed, 20 May 20

[ceph-users] Re: ceph orch upgrade stuck at the beginning.

2020-05-20 Thread Ashley Merrick
Does: ceph versions show any services yet running on 15.2.2? On Wed, 20 May 2020 21:01:12 +0800 Gencer W. Genç wrote Hi Ashley,$ ceph orch upgrade status {     "target_image": "docker.io/ceph/ceph:v15.2.2",     "in_progress": true,     "services_complete": [],     "m

[ceph-users] Re: ceph orch upgrade stuck at the beginning.

2020-05-20 Thread Gencer W . Genç
Ah yes, {     "mon": {         "ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)": 2     },     "mgr": {         "ceph version 15.2.2 (0c857e985a29d90501a285f242ea9c008df49eb8) octopus (stable)": 2     },     "osd": {         "ceph version 15.2.1 (9fd2f65f91d9246fa

[ceph-users] Re: ceph orch upgrade stuck at the beginning.

2020-05-20 Thread Ashley Merrick
ceph config set mgr mgr/cephadm/log_to_cluster_level debug ceph -W cephadm --watch-debug See if you see anything that stands out as an issue with the update, seems it has completed only the two MGR instances If not: ceph orch upgrade stop ceph orch upgrade start --ceph-version 15.2.2

[ceph-users] [ceph-users][ceph-dev] Upgrade Luminous to Nautilus 14.2.8 mon service crash

2020-05-20 Thread Amit Ghadge
Hi All, While we enable ceph mon enable-msgr2 after the gateway service upgrade, one of the mon services crashes and never comes back. It shows: /usr/bin/ceph-mon -f --cluster ceph --id mon01 --setuser ceph --setgroup ceph --debug_monc 20 --debug_ms 5 global_init: error reading config fil

[ceph-users] Re: OSDs taking too much memory, for buffer_anon

2020-05-20 Thread Mark Nelson
Hi Harald, Thanks!  So you can see from the perf dump that the target bytes are a little below 4GB, but the mapped bytes are around 7GB.  The priority cache manager has reacted by setting the "cache_bytes" to 128MB which is the minimum global value and each cache is getting 64MB (the local m

[ceph-users] Re: ceph orch upgrade stuck at the beginning.

2020-05-20 Thread Gencer W . Genç
Hi Ashley, I see this: [INF] Upgrade: Target is docker.io/ceph/ceph:v15.2.2 with id 4569944bbW86c3f9b5286057a558a3f852156079f759c9734e54d4f64092be9fa [INF] Upgrade: It is NOT safe to stop mon.vx-rg23-rk65-u43-130 Does this mean anything to you? I've also attached the full log. See especially af

[ceph-users] Re: ceph orch upgrade stuck at the beginning.

2020-05-20 Thread Ashley Merrick
Yes, I think it's because you're only running two mons, so the script is halting at a check to stop you from being in the position of just one running (no backup). I had the same issue with a single MGR instance and had to add a second to allow the upgrade to continue. Can you bring up an extra MON?
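For reference, adding a mon on another host under cephadm is roughly (hostname and IP are placeholders):

  ceph orch host add newhost
  ceph orch daemon add mon newhost:192.168.1.23
  # or let the orchestrator place a fixed number of mons
  ceph orch apply mon 3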

[ceph-users] Re: ceph orch upgrade stuck at the beginning.

2020-05-20 Thread Gencer W . Genç
I have 2 mons and 2 mgrs.   cluster:     id:     7d308992-8899-11ea-8537-7d489fa7c193     health: HEALTH_OK   services:     mon: 2 daemons, quorum vx-rg23-rk65-u43-130,vx-rg23-rk65-u43-130-1 (age 91s)     mgr: vx-rg23-rk65-u43-130.arnvag(active, since 28m), standbys: vx-rg23-rk65-u43-130-1.pxmyi

[ceph-users] Re: ceph orch upgrade stuck at the beginning.

2020-05-20 Thread Ashley Merrick
Correct, however it will need to stop one to do the upgrade, leaving you with only one working MON (this is what I would suggest the error means, seeing I had the same thing when I only had a single MGR); normally it is suggested to have 3 MONs due to quorum. Do you not have a node you can run a m

[ceph-users] Re: ceph orch upgrade stuck at the beginning.

2020-05-20 Thread Gencer W . Genç
This is 2 node setup. I have no third node :( I am planning to add more in the future but currently 2 nodes only. At the moment, is there a --force command for such usage? On 20.05.2020 16:32:15, Ashley Merrick wrote: Correct, however it will need to stop one to do the upgrade leaving you with

[ceph-users] Re: 15.2.2 Upgrade - Corruption: error in middle of record

2020-05-20 Thread Igor Fedotov
Thanks! So for now I can see the following similarities between your case and the ticket: 1) Single main spinner as an OSD backing device. 2) Corruption happens to the RocksDB WAL file. 3) OSD has user data compression enabled. And one more question. From the following line: May 20 06:05:14 sn-m

[ceph-users] Re: 15.2.2 Upgrade - Corruption: error in middle of record

2020-05-20 Thread Ashley Merrick
Hey Igor, The OSDs only back two metadata pools, so they only hold a couple of MB of data (hence they were easy and quick to rebuild); they're actually NVMe LVM devices passed through QEMU into a VM (hence only 10GB and showing as rotational). I have large 10TB disks that back the EC(RBD/FS) them se

[ceph-users] Re: 15.2.2 Upgrade - Corruption: error in middle of record

2020-05-20 Thread Igor Fedotov
Hi Chris, could you please share the full log prior to the first failure? Also if possible please set debug-bluestore/debug-bluefs to 20 and collect another one for the failed OSD startup. Thanks, Igor On 5/20/2020 4:39 PM, Chris Palmer wrote: I'm getting similar errors after rebooting a node.

[ceph-users] Re: Aging in S3 or Moving old data to slow OSDs

2020-05-20 Thread Thomas Bennett
Hi Khodayar, Yes, you are correct. I would have to move objects manually between (more > than one) buckets if I use "Pool placements and Storage classes" > > So you have successfully used this method and it was OK? > After we set up the new placement rule in the zone and zonegroups we modified us

[ceph-users] Re: ceph orch upgrade stuck at the beginning.

2020-05-20 Thread Sebastian Wagner
Hi Gencer, I'm going to need the full mgr log file. Best, Sebastian Am 20.05.20 um 15:07 schrieb Gencer W. Genç: > Ah yes, > > { >     "mon": { >         "ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) > octopus (stable)": 2 >     }, >     "mgr": { >         "ceph version 15.2.

[ceph-users] OSD crashes regularly

2020-05-20 Thread Thomas
Hello, I have a pool of 300+ OSDs of the identical model (Seagate model: ST1800MM0129, size: 1.64 TiB). Only 1 OSD crashes regularly, however I cannot identify a root cause. Based on the output of smartctl the disk is ok. # smartctl -a -d megaraid,1 /dev/sda

[ceph-users] Re: OSD crashes regularly

2020-05-20 Thread Serkan Çoban
The disk is not ok; look at the output below: SMART Health Status: HARDWARE IMPENDING FAILURE GENERAL HARD DRIVE You should replace the disk. On Wed, May 20, 2020 at 5:11 PM Thomas <74cmo...@gmail.com> wrote: > > Hello, > > I have a pool of +300 OSDs that are identical model (Seagate model: > ST1800M
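Besides smartctl, Nautilus and later can track device health on their own; a quick cross-check (the device id format is illustrative, taken from ceph device ls):

  ceph device ls
  ceph device get-health-metrics SEAGATE_ST1800MM0129_SERIAL
  # map an OSD back to its physical device(s)
  ceph osd metadata 1 | jq .devices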

[ceph-users] Re: 15.2.2 Upgrade - Corruption: error in middle of record

2020-05-20 Thread Igor Fedotov
Chris, got them, thanks! Investigating Thanks, Igor On 5/20/2020 5:23 PM, Chris Palmer wrote: Hi Igor I've sent you these directly as they're a bit chunky. Let me know if you haven't got them. Thx, Chris On 20/05/2020 14:43, Igor Fedotov wrote: Hi Cris, could you please share the f

[ceph-users] diskprediction_local prediction granularity

2020-05-20 Thread Vytenis A
Hi list, Looking into the diskprediction_local module, I see that it only predicts a few states: good, warning and bad: ceph/src/pybind/mgr/diskprediction_local/predictor.py: if score > 10: return "Bad" if score > 4: return "Warning" return "Good" The predicted fail date is just a deriva
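For anyone trying it out, the module is driven from the CLI; enabling it and requesting a prediction looks roughly like this (the device id is a placeholder taken from ceph device ls):

  ceph mgr module enable diskprediction_local
  ceph config set global device_failure_prediction_mode local
  ceph device predict-life-expectancy SEAGATE_ST1800MM0129_SERIAL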

[ceph-users] Re: diskprediction_local prediction granularity

2020-05-20 Thread Paul Emmerich
On Wed, May 20, 2020 at 5:36 PM Vytenis A wrote: > Is it possible to get any finer prediction date? > related question: did anyone actually observe any correlation between the predicted failure time and the actual time until a failure occurs? Paul -- Paul Emmerich Looking for help with you

[ceph-users] Re: Possible bug in op path?

2020-05-20 Thread Robert LeBlanc
We are using high, and the people on the list who have also changed it have not seen the improvements that I would expect. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Wed, May 20, 2020 at 1:38 AM Dan van der Ster wrote: > > Hi Robert, > > Sin

[ceph-users] 15.2.2 bluestore issue

2020-05-20 Thread Josh Durgin
Hi folks, at this time we recommend pausing OSD upgrades to 15.2.2. There have been a couple reports of OSDs crashing due to rocksdb corruption after upgrading to 15.2.2 [1] [2]. It's safe to upgrade monitors and mgr, but OSDs and everything else should wait. We're investigating and will get a f

[ceph-users] Re: 15.2.2 Upgrade - Corruption: error in middle of record

2020-05-20 Thread Chris Palmer
I'm getting similar errors after rebooting a node. Cluster was upgraded 15.2.1 -> 15.2.2 yesterday. No problems after rebooting during upgrade. On the node I just rebooted, 2/4 OSDs won't restart. Similar logs from both. Logs from one below. Neither OSD has compression enabled, although there

[ceph-users] Re: Reweighting OSD while down results in undersized+degraded PGs

2020-05-20 Thread Andras Pataki
Hi Frank, Thanks for the explanation - I wasn't aware of this subtle point. So when some OSDs are down, one has to be very careful with changing the cluster then.  I guess one could even end up with incomplete PGs this way that ceph can't recover from in an automated fashion? Andras On 5/1

[ceph-users] Re: 15.2.2 Upgrade - Corruption: error in middle of record

2020-05-20 Thread Chris Palmer
Hi Igor, I've sent you these directly as they're a bit chunky. Let me know if you haven't got them. Thx, Chris On 20/05/2020 14:43, Igor Fedotov wrote: Hi Chris, could you please share the full log prior to the first failure? Also if possible please set debug-bluestore/debug-bluefs to 20 and

[ceph-users] Re: Reweighting OSD while down results in undersized+degraded PGs

2020-05-20 Thread Andras Pataki
Hi Dan, Unfortunately 'ceph osd down osd.0' doesn't help - it is marked down and soon after back up, but it doesn't peer still.  I tried reweighting the OSD to half its weight, 4.0 instead of 0.0, and that results in about half the PGs staying degraded.  So this is not specific to zero weight.

[ceph-users] Re: Possible bug in op path?

2020-05-20 Thread Robert LeBlanc
Adding the right dev list. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Wed, May 20, 2020 at 12:40 AM Robert LeBlanc wrote: > > We upgraded our Jewel cluster to Nautilus a few months ago and I've noticed > that op behavior has changed. Thi

[ceph-users] PGS INCONSISTENT - read_error - replace disk or pg repair then replace disk

2020-05-20 Thread Peter Lewis
Hello, I came across a section of the documentation that I don't quite understand. In the section about inconsistent PGs it says if one of the shards listed in `rados list-inconsistent-obj` has a read_error the disk is probably bad. Quote from documentation: https://docs.ceph.com/docs/master/ra
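The commands that documentation section refers to, for context (the pg id 8.1f is illustrative):

  rados list-inconsistent-obj 8.1f --format=json-pretty | jq '.inconsistents[].shards'
  # once the bad shard/disk has been identified (or the disk replaced), trigger the repair
  ceph pg repair 8.1f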

[ceph-users] Re: Mismatched object counts between "rados df" and "rados ls" after rbd images removal

2020-05-20 Thread icy chan
Hi Eugen, Thanks for the suggestion. The object count of the rbd pool still stays at 430.11K (all images were deleted 3+ days ago). I will keep monitoring it and post the results here. Regs, Icy On Wed, 20 May 2020 at 15:12, Eugen Block wrote: > The rbd_info, rbd_directory objects will remain unt

[ceph-users] Re: Large omap

2020-05-20 Thread Szabo, Istvan (Agoda)
Hello, Yes it is, this is the output: "default.rgw.log:gc" From: Thomas Bennett Sent: Wednesday, May 20, 2020 5:44 PM To: Szabo, Istvan (Agoda) Cc: ceph-users Subject: Re: [ceph-users] Re: Large omap Email received from outside the company. If in doubt don't click links nor open attachments

[ceph-users] Re: Pool full but the user cleaned it up already

2020-05-20 Thread Szabo, Istvan (Agoda)
Restarted mgr and mon services, nothing helped :/ -Original Message- From: Eugen Block Sent: Wednesday, May 20, 2020 3:05 PM To: Szabo, Istvan (Agoda) Cc: ceph-users@ceph.io Subject: [ceph-users] Re: Pool full but the user cleaned it up already Email received from outside the company.

[ceph-users] Re: diskprediction_local prediction granularity

2020-05-20 Thread Vytenis A
And a more broader question: is anyone using diskpredictor (local or cloud) ? On Wed, May 20, 2020 at 7:35 PM Paul Emmerich wrote: > > > > On Wed, May 20, 2020 at 5:36 PM Vytenis A wrote: >> >> Is it possible to get any finer prediction date? > > > related question: did anyone actually observe a