My experience was that placing DB+WAL on NVMe provided a much better and
much more consistent boost to an HDD-backed pool than a cache tier. My
biggest grief with the cache tier was its unpredictable write performance,
when it would cache some writes and then immediately not cache some others
seemingly
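For reference, creating an HDD OSD with its DB (and therefore WAL) on an NVMe LV looks roughly like this; device and LV names below are placeholders, not anything from this thread:
# /dev/sdb is the data HDD, ceph-db/db-sdb is a pre-created LV on the NVMe;
# without a separate --block.wal the WAL is colocated on the DB device
ceph-volume lvm create --bluestore --data /dev/sdb --block.db ceph-db/db-sdb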
On Sun, Sep 19, 2021 at 4:48 PM Andrej Filipcic wrote:
> I have attached a part of the osd log.
Hi Andrej. Did you mean to attach more than the snippets?
Could you also send the log of the first startup in 16.2.6 of a
now-corrupted OSD?
Cheers, dan
I attached it, but it did not work, so here it is:
https://www-f9.ijs.si/~andrej/ceph/ceph-osd.1049.log-20210920.gz
Cheers,
Andrej
On 9/20/21 9:41 AM, Dan van der Ster wrote:
On Sun, Sep 19, 2021 at 4:48 PM Andrej Filipcic wrote:
I have attached a part of the osd log.
Hi Andrej. Did you mean to
Ah found it.
It was an SSL certificate that was invalid (some PoC which had started to mold).
Now the sync is running fine, but there is one bucket that got a ton of
data in the mdlog.
[root@s3db16 ~]# radosgw-admin mdlog list | grep temonitor | wc -l
No --period given, using current period=e8fc96f1-ae
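If it keeps growing, the per-shard backlog can be inspected and trimmed roughly like this (a sketch only: <period-id> and <marker> are placeholders, and the exact trim options vary a bit between releases):
radosgw-admin mdlog status --period=<period-id>
# trim one shard up to a marker reported by mdlog status
radosgw-admin mdlog trim --period=<period-id> --shard-id=0 --end-marker=<marker>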
On 17.09.2021 16:10, Eugen Block wrote:
Since I'm trying to test different erasure coding plugins and
techniques, I don't want the balancer active.
So I tried setting it to none as Eugen suggested, and to my surprise
I did not get any degraded messages at all, and the cluster was in
HEALTH_OK
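For reference, turning the balancer off for this kind of test is roughly:
ceph balancer mode none    # or: ceph balancer off
ceph balancer status       # verify mode and active state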
Hi,
I'm running out of ideas why my WAL+DB NVMes are always maxed out, so I'm thinking
I might have missed the cache tiering in front of my 4:2 EC pool. Is it possible to
add it later?
There are 9 nodes with 6x 15.3TB SAS SSDs and 3x NVMe drives. Currently, out of the
3 NVMe drives, 1 is used for the index pool and me
On 9/20/21 07:51, Mr. Gecko wrote:
Hello,
I'll start by explaining what I have done. I was adding some new storage
in an attempt to set up a cache pool according to
https://docs.ceph.com/en/latest/dev/cache-pool/ by doing the following.
1. I upgraded all servers in the cluster to ceph 15.2.14 which
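For what it's worth, a tier can also be added to an existing pool later; the steps from that page boil down to roughly the following (pool names here are just examples):
ceph osd tier add my-ec-pool my-cache-pool
ceph osd tier cache-mode my-cache-pool writeback
ceph osd tier set-overlay my-ec-pool my-cache-pool
ceph osd pool set my-cache-pool hit_set_type bloom            # hit set config is needed for the tiering agent
ceph osd pool set my-cache-pool target_max_bytes 1000000000000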
On 9/20/21 06:15, Szabo, Istvan (Agoda) wrote:
Hi,
I'm running out of ideas why my WAL+DB NVMes are always maxed out, so I'm thinking
I might have missed the cache tiering in front of my 4:2 EC pool. Is it possible to
add it later?
Maybe I missed a post where you talked about WAL+DB being maxed out.
These are the processes in iotop on one node. I think it's compacting, but it
is always like this and never finishes.
59936 be/4 ceph  0.00 B/s  10.08 M/s  0.00 %  53.07 %  ceph-osd -f
--cluster ceph --id 46 --setuser ceph --setgroup ceph [bstore_kv_sync]
66097 be/4 ceph  0.00 B/s
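To see whether compaction ever finishes, it can also be triggered explicitly, e.g. for osd.46 from the output above (run on that OSD's host):
ceph daemon osd.46 compact                                # online, via the admin socket
# or offline, with the OSD stopped:
systemctl stop ceph-osd@46
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-46 compact
systemctl start ceph-osd@46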
Thank you Stefan!
My problem was that the ruleset I built had the failure domain set to
rack, when I do not have any racks defined. I changed the failure domain
to host as this is just a home lab environment. I reverted the ruleset
on the pool, and it immediately started to recover and storage
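In case it helps anyone else, a replicated rule with a host failure domain can be created and applied roughly like this (rule and pool names are just examples):
ceph osd crush rule create-replicated rule-by-host default host
ceph osd pool set my-cache-pool crush_rule rule-by-host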
We tested Ceph 16.2.6, and indeed, the performance came back to what we expect
for this cluster.
Luis Domingues
‐‐‐ Original Message ‐‐‐
On Saturday, September 11th, 2021 at 9:55 AM, Luis Domingues
wrote:
> Hi Igor,
>
> I have an SSD for the physical DB volume. And indeed it has very
On 9/16/21 13:42, Davíð Steinn Geirsson wrote:
The 4 affected drives are of 3 different types from 2 different vendors:
ST16000NM001G-2KK103
ST12000VN0007-2GS116
WD60EFRX-68MYMN1
They are all connected through an LSI2308 SAS controller in IT mode. Other
drives that did not fail are also connected
I got the exact same error on one of my OSDs when upgrading to 16. I
used it as an exercise in trying to fix a corrupt rocksdb. I spent a few
days of poking with no success. I got mostly tool crashes like you are
seeing, with no forward progress.
I eventually just gave up, purged the OSD, did
I also ran into this with v16. In my case, trying to run a repair totally
exhausted the RAM on the box, and it was unable to complete.
After removing/recreating the OSD, I did notice that it has a drastically
smaller OMAP size than the other OSDs. I don’t know if that actually means
anything, but j
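For anyone searching later, an offline fsck/repair attempt on a BlueStore OSD generally looks something like the following; the OSD id is a placeholder, and as noted above the repair can eat a lot of RAM:
systemctl stop ceph-osd@123                                    # 123 is a placeholder OSD id
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-123
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123   # can use a lot of RAM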
On Mon, Sep 20, 2021 at 10:38:37AM +0200, Stefan Kooman wrote:
> On 9/16/21 13:42, Davíð Steinn Geirsson wrote:
>
> >
> > The 4 affected drives are of 3 different types from 2 different vendors:
> > ST16000NM001G-2KK103
> > ST12000VN0007-2GS116
> > WD60EFRX-68MYMN1
> >
> > They are all connected
For clarity, was this on upgrading to 16.2.6 from 16.2.5? Or upgrading
from some other release?
On Mon, Sep 20, 2021 at 8:33 AM Paul Mezzanini wrote:
>
> I got the exact same error on one of my OSDs when upgrading to 16. I
> used it as an exercise in trying to fix a corrupt rocksdb. I spent a fe
Same question here, for clarity, was this on upgrading to 16.2.6 from
16.2.5? Or upgrading
from some other release?
On Mon, Sep 20, 2021 at 8:57 AM Sean wrote:
>
> I also ran into this with v16. In my case, trying to run a repair totally
> exhausted the RAM on the box, and was unable to complete
On 9/20/21 12:00, Davíð Steinn Geirsson wrote:
Does the SAS controller run the latest firmware?
As far as I can tell yes. Avago's website does not seem to list these
anymore, but they are running firmware version 20 which is the latest I
can find references to in a web search.
This machine h
In my case it happened after upgrading from v16.2.4 to v16.2.5 a couple
months ago.
~ Sean
On Sep 20, 2021 at 9:02:45 AM, David Orman wrote:
> Same question here, for clarity, was this on upgrading to 16.2.6 from
> 16.2.5? Or upgrading
> from some other release?
>
> On Mon, Sep 20, 2021 at 8:
On 20/09/2021 16:02, David Orman wrote:
Same question here, for clarity, was this on upgrading to 16.2.6 from
16.2.5? Or upgrading
from some other release?
From 16.2.5, but the OSD services were never restarted after the upgrade to
.5, so it could be a leftover of previous issues.
Cheers,
Andrej
Can we please create a bluestore tracker issue for this
(if one does not exist already), where we can start capturing all the
relevant information needed to debug this? Given that this has been
encountered in previous 16.2.* versions, it doesn't sound like a
regression in 16.2.6 to me, rather an issue
Hi- after the upgrade to 16.2.6, I am now seeing this error:
9/20/21 10:45:00 AM [ERR] cephadm exited with an error code: 1, stderr: Inferring
config
/var/lib/ceph/fe3a7cb0-69ca-11eb-8d45-c86000d08867/mon.rhel1.robeckert.us/config
ERROR: [Errno 2] No such file or directory:
'/var/lib/ceph/fe3a7cb
FWIW, we've had similar reports in the past:
https://tracker.ceph.com/issues/37282
https://tracker.ceph.com/issues/48002
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/2GBK5NJFOSQGMN25GQ3CZNX4W2ZGQV5U/?sort=date
https://www.spinics.net/lists/ceph-users/msg59466.html
https://
Just after I sent, the error message started again:
9/20/21 11:30:00 AM [WRN]
ERROR: [Errno 2] No such file or directory:
'/var/lib/ceph/fe3a7cb0-69ca-11eb-8d45-c86000d08867/mon.rhel1.robeckert.us/config'
9/20/21 11:30:00 AM [WRN]
host rhel1.robeckert.us `cephadm ceph-volume` failed: cephadm exit
I was doing a rolling upgrade from 14.2.x -> 15.2.x (wait a week) ->
16.2.5. It was the last jump that had the hiccup. I'm doing the 16.2.5
-> .6 upgrade as I type this. So far, so good.
-paul
On 9/20/21 10:02 AM, David Orman wrote:
For clarity, was this on upgrading to 16.2.6 from 16.2.5?
Hi,
I wonder if anyone could share some experience with etcd backed by Ceph.
My users build Kubernetes clusters in VMs on OpenStack with Ceph.
With an HDD (DB/WAL on SSD) volume, the etcd performance test sometimes fails
because of latency. With an SSD (all-SSD) volume, it works fine.
I wonder if there is an
Hi!
It looks exactly the same as the problem I had.
Try the `cephadm ls` command on the `rhel1.robeckert.us` node.
- Original Message -
> From: "Robert W. Eckert"
> To: "ceph-users"
> Sent: Monday, 20 September, 2021 18:28:08
> Subject: [ceph-users] Getting cephadm "stderr:Inferring
That may be pointing in the right direction - I see
{
"style": "legacy",
"name": "mon.rhel1.robeckert.us",
"fsid": "fe3a7cb0-69ca-11eb-8d45-c86000d08867",
"systemd_unit": "ceph-...@rhel1.robeckert.us",
"enabled": false,
"state": "stopped",
"host_
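If that legacy, stopped mon entry is really just a leftover (please double-check before touching anything), one possible cleanup is to move its stale data directory aside on that host so cephadm stops trying to infer a config from it; this is only a guess on my side, and the backup path is arbitrary:
# on rhel1.robeckert.us, only if the legacy mon is definitely unused
mv /var/lib/ceph/mon/ceph-rhel1.robeckert.us /root/legacy-mon-backup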
Hi,
7 nodes, EC 4:2 with host-based CRUSH, SSD OSDs with NVMe WAL+DB; what shouldn't
cause any issues with these values?
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_op_priority = 1
I want to speed it up but haven't really found any reference.
Ty
Okay - I've finally got full debug logs from the flapping OSDs. The raw logs
are both 100M each - I can email them directly if necessary. (Igor, I've already
sent these your way.)
Both flapping OSDs are reporting the same "bluefs _allocate failed to allocate"
errors as before. I've also noticed
Hi,
in general I see nothing you can do except providing SSDs (like having a
pool of SSDs in your Ceph cluster, or using other shared storage with SSDs).
One option we give (not production-wise, just for lab testing for those who
suffer from a lack of hardware) is using memory (of the VM) for etcd.
This way the p
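If the cluster has some SSD OSDs, the pool used for the etcd volumes can be pinned to them with a device-class rule, roughly like this (rule and pool names are examples):
ceph osd crush rule create-replicated replicated-ssd default host ssd
ceph osd pool set etcd-volumes crush_rule replicated-ssd    # etcd-volumes is a placeholder pool name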
You can see an example below of changing it on the fly:
sudo ceph tell osd.\* injectargs '--osd_max_backfills 4'
sudo ceph tell osd.\* injectargs '--osd_heartbeat_interval 15'
sudo ceph tell osd.\* injectargs '--osd_recovery_max_active 4'
sudo ceph tell osd.\* injectargs '--osd_recovery_op_priority 63'
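If you want the change to persist across restarts, on recent releases the same settings can also go into the mon config database instead of injectargs, e.g.:
ceph config set osd osd_max_backfills 4
ceph config set osd osd_recovery_max_active 4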
Hello everyone!
I want to understand the concept and tune my RocksDB options on Nautilus
14.2.16.
osd.178 spilled over 102 GiB metadata from 'db' device (24 GiB used of
50 GiB) to slow device
osd.180 spilled over 91 GiB metadata from 'db' device (33 GiB used of
50 GiB) to slow device
Th
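One approach often suggested for spillover (just a sketch, the LV name is made up): grow the LV backing block.db on the fast device, let BlueFS expand into it, then compact so metadata can migrate back.
systemctl stop ceph-osd@178
lvextend -L 300G /dev/vg_nvme/db-178              # placeholder LV backing block.db of osd.178
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-178
systemctl start ceph-osd@178
ceph daemon osd.178 compact                       # trigger compaction so spilled data can move back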
Hi,
Some further investigation on the failed OSDs:
1 out of 8 OSDs actually has a hardware issue:
[16841006.029332] sd 0:0:10:0: [sdj] tag#96 FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=2s
[16841006.037917] sd 0:0:10:0: [sdj] tag#34 FAILED Result:
hostbyte=DID_SOFT_ERROR dri
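For drives throwing errors like that, a quick SMART check (assuming smartmontools is installed) usually confirms whether the disk itself is failing:
smartctl -a /dev/sdj    # full SMART health, attributes and error log for the suspect disk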
Den mån 20 sep. 2021 kl 18:02 skrev Dave Piper :
> Okay - I've finally got full debug logs from the flapping OSDs. The raw logs
> are both 100M each - I can email them directly if necessary. (Igor I've
> already sent these your way.)
> Both flapping OSDs are reporting the same "bluefs _allocate f
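Something that might be worth checking on the flapping OSDs while they are up (if I remember the admin socket command right) is the BlueStore allocator fragmentation:
ceph daemon osd.<id> bluestore allocator score block    # fragmentation score, 0 means none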