Hi,
This does not seem very relevant since all Ceph components are running in
containers. Any ideas on how to get past this issue? Any other ideas or
suggestions on this kind of deployment?
sudo ./cephadm --image 10.21.22.1:5000/ceph:v17.2.5-20230316 --docker
bootstrap --mon-ip 10.21.22.1 --skip-monitoring-stack
On April 25, 2023 9:03 pm, Peter wrote:
> Dear all,
>
> We are experiencing issues with Ceph after deploying it via PVE, with the network
> backed by a 10G Cisco switch with the vPC feature enabled. We are encountering slow
> OSD heartbeats and have not been able to identify any network traffic issues.
>
> Upon
Hi, can you share the exact command you used to block the watcher? To
get the lock list, run:
rbd lock list <pool>/<image>
There is 1 exclusive lock on this image.
Locker          ID                      Address
client.1211875  auto 139643345791728    192.168.3.12:0/2259335316
To blacklist the client run:
ceph o
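In case it helps, a minimal sketch of the usual procedure (the client address
and lock ID are the ones shown in the lock list above; double-check they match
your stuck watcher):

  # older releases use "blacklist", newer ones "blocklist"
  ceph osd blacklist add 192.168.3.12:0/2259335316
  # or: ceph osd blocklist add 192.168.3.12:0/2259335316
  # then drop the stale lock:
  rbd lock rm <pool>/<image> "auto 139643345791728" client.1211875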
>
> I observed that on an otherwise idle cluster, scrubbing cannot fully
> utilise the speed of my HDDs.
Maybe the configured limit is set like this because, once (a part of)
the scrubbing process has started, it is not possible/easy to automatically scale
the performance down to benefit
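If you want to see which scrub throttles currently apply, a quick check along
these lines (option names as in recent releases) can help:

  ceph config get osd osd_max_scrubs
  ceph config get osd osd_scrub_sleep
  ceph config get osd osd_scrub_load_threshold
  # e.g. to temporarily allow more parallel scrubs on an idle cluster:
  ceph config set osd osd_max_scrubs 2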
Hi Peter,
2% packet loss is a lot, especially on such expensive hardware. We have observed
the problems you describe with defective networking hardware where NIC/switch
ports were in active-active LACP bonding mode. We had periodically failing
transceivers, and these failures are not immediately detected by
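For what it's worth, a rough sketch of what we check in such cases (interface
names are just examples):

  ethtool -S eth0 | grep -iE 'err|drop|crc'
  cat /proc/net/bonding/bond0     # LACP actor/partner state per slave
  ip -s link show eth0            # RX/TX error and drop counters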
We know very little about the whole cluster, can you add the usual
information like 'ceph -s' and 'ceph osd df tree'? Scrubbing has
nothing to do with the undersized PGs. Is the balancer and/or
autoscaler on? Please also add 'ceph balancer status' and 'ceph osd
pool autoscale-status'.
Thanks
Hi,
I don't think increasing the mon_osd_down_out_interval timeout alone
will really help you in this situation, I remember an older thread
about that but couldn't find it. What you could test is setting the
nodown flag (ceph osd set nodown) to prevent flapping OSDs, but that's
not a real solution.
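A rough sketch of what that test could look like (the value is only an example,
and remember to unset the flag afterwards):

  ceph osd set nodown
  # ... observe whether the flapping reports stop ...
  ceph osd unset nodown
  # the timeout mentioned above, if you still want to raise it:
  ceph config set mon mon_osd_down_out_interval 3600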
Hi all,
Over the last 2 weeks we have experienced several OSD_TOO_MANY_REPAIRS errors
that we struggle to handle in a non-intrusive manner. Restarting MDS +
hypervisor that accessed the object in question seems to be the only way we can
clear the error so we can repair the PG and recover access
On 26.04.23 13:24, Thomas Hukkelberg wrote:
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 1 OSDs
osd.34 had 9936 reads repaired
Are there any messages in the kernel log that indicate this device has
read errors? Have you considered replacing the disk?
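For reference, a minimal sketch of how to check that on the OSD host (the
device name is an example):

  dmesg -T | grep -iE 'medium error|i/o error|blk_update_request'
  smartctl -a /dev/sdx | grep -iE 'reallocated|pending|uncorrect'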
Regards
--
Robert Sander
Hello Thomas,
I would strongly recommend that you read the messages on the mailing list
regarding Ceph versions 16.2.11, 16.2.12 and 16.2.13.
Joachim
___
ceph ambassador DACH
ceph consultant since 2012
Clyso GmbH - Premier Ceph Foundation Member
https://www.clyso.com
Hi!
There are no kernel log messages that indicate read errors on the disk, and the
error is not tied to one specific OSD. The errors so far have been on 7
different OSDs, and when we restart the OSD with errors, the errors appear on
one of the other OSDs in the same PG, as you can see when res
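In case it helps narrow things down, a sketch of how to see which shard/OSD is
actually reporting the bad reads (the PG id is a placeholder):

  ceph health detail
  rados list-inconsistent-obj <pgid> --format=json-pretty
  ceph pg <pgid> query | less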
"bucket does not exist" or "permission denied".
I had received similar error messages with another client program. The default
region did not match the region of the cluster.
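A rough sketch of how to compare the two, assuming an S3 client such as the
AWS CLI (the client and placeholder names are just examples):

  radosgw-admin zonegroup get | grep -E '"name"|"api_name"'
  # the client-side region must match the zonegroup's api_name, e.g.:
  aws configure set default.region <api_name>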
___
ceph ambassador DACH
ceph consultant since 2012
Clyso GmbH - Premier Ceph Foundation Member
Hello all,
today I moved ceph to HEALTH_OK state :-)
1) I had to restart the MGR node, then my old c-osdx hostnames finally went
away and all of the OSDs from the old machines are now
orchestrated by the 'ceph orch' command.
2) I've updated the ceph* packages on the osd2 node to version
17.2.6, then I tried 'cep
Hi,
can you paste the following output:
ceph orch ls osd --export
Maybe you have the "all-available-devices" service set to managed? You
can disable that with [1]:
ceph orch apply osd --all-available-devices --unmanaged=true
Please also add your osd yaml configuration, you can test that wi
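For reference, a minimal OSD spec sketch plus a dry run (the service_id,
filename and device filter are just examples):

  # osd_spec.yaml
  service_type: osd
  service_id: hdd_osds
  placement:
    host_pattern: '*'
  spec:
    data_devices:
      rotational: 1

  ceph orch apply -i osd_spec.yaml --dry-run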
I would second Joachim's suggestion - this is exactly what we're in the
process of doing for a client, i.e. migrating from Luminous to Quincy.
However, the below would also work if you're moving to Nautilus.
The only catch with this plan would be if you plan to reuse any hardware -
i.e. the hosts running
Good morning, I found a bug in Ceph Reef.
After installing Ceph and deploying 9 OSDs with a CephFS layer, I got
this error after many write and read operations on the CephFS I
deployed.
```{
"assert_condition": "pg_upmap_primaries.empty()",
"assert_file":
"/home/jenkins-build/build
# ceph windows tests
PR check will be made required once regressions are fixed
windows build currently depends on gcc11 which limits use of c++20
features. investigating newer gcc or clang toolchain
# 16.2.13 release
final testing in progress
# prometheus metric regressions
https://tracker.ceph.c
Looks like you've somehow managed to enable the upmap balancer while
allowing a client that's too old to understand it to mount.
Radek, this is a commit from yesterday; is it a known issue?
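If it is the min-compat-client angle, a quick sketch of what to check (the
'reef' value only matters if you really want the new primary balancing; afaik
pg_upmap_primaries needs reef-capable clients):

  ceph osd get-require-min-compat-client
  ceph features      # shows what the connected clients actually support
  ceph osd set-require-min-compat-client reef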
On Wed, Apr 26, 2023 at 7:49 AM Nguetchouang Ngongang Kevin
wrote:
>
> Good morning, i found a bug on cep
are there any volunteers willing to help make these python packages
available upstream?
On Tue, Mar 28, 2023 at 5:34 AM Ernesto Puerta wrote:
>
> Hey Ken,
>
> This change doesn't involve any further internet access other than what is
> already required for the "make dist" stage (e.g.: npm packag
Hi Ben,
Are you compacting the relevant OSDs periodically? 'ceph tell osd.x
compact' (for the three OSDs holding the bilog) would help reshape the
rocksdb levels so that they at least perform better for a little while until the
next round of bilog trims.
Otherwise, I have experience deleting ~50M object indices
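A small sketch of that, assuming osd.11, osd.12 and osd.13 are the ones holding
the bilog (the IDs are placeholders):

  for id in 11 12 13; do ceph tell osd.$id compact; done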
Thanks Tom, this is a very useful post!
I've added our docs guy Zac in cc: IMHO this would be useful in a
"Tips & Tricks" section of the docs.
-- dan
__
Clyso GmbH | https://www.clyso.com
On Wed, Apr 26, 2023 at 7:46 AM Thomas Bennett wrote:
>
> I would second Joa
Hi,
Simplest solution would be to add a few OSDs.
-- dan
__
Clyso GmbH | https://www.clyso.com
On Tue, Apr 25, 2023 at 2:58 PM WeiGuo Ren wrote:
>
> I have two OSDs. These OSDs are used for the RGW index pool. After a lot of
> stress tests, these two OSDs were written t
Hi,
Your cluster probably has dns-style buckets enabled.
In that case the path does not include the bucket name, and neither
does the rgw log.
Do you have a frontend lb like haproxy? You'll find the bucket names there.
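Roughly, assuming an haproxy HTTP frontend, capturing the Host header makes the
bucket name show up in each log line (the frontend name is an example):

  frontend rgw_frontend
      capture request header Host len 64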
-- Dan
__
Clyso GmbH | https://www.clyso.com
Hi,
I have a Ceph 16.2.12 cluster with uniform hardware, same drive make/model,
etc. A particular OSD is showing higher latency than usual in `ceph osd
perf`, usually mid to high tens of milliseconds while other OSDs show low
single digits, although its drive's I/O stats don't look different from
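For reference, the kind of checks worth comparing here (OSD id and device are
placeholders):

  ceph osd perf | sort -nk3 | tail
  iostat -x 1 /dev/sdx       # await/util on the underlying drive
  ceph tell osd.<id> bench   # measures the OSD itself, independent of client I/O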
Hi,
I have a Ceph 16.2.12 cluster with hybrid OSDs (HDD block storage, DB/WAL
on NVME). All OSD settings are default except, cache-related settings are
as follows:
osd.14   dev   bluestore_cache_autotune   true
osd.14   dev   bluestore_cache_size_hdd   4294967
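A short sketch of how to see what is actually applied and used at runtime
(osd.14 taken from the snippet above; the daemon command must run on the OSD's
host):

  ceph config show osd.14 | grep bluestore_cache
  ceph daemon osd.14 dump_mempools | less   # current bluestore cache usage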
Hi Marc,
thanks for your reply.
100MB/s is sequential, your scrubbing is random. afaik everything is random.
Are there any docs that explain this, any code, or another definitive answer?
Also, wouldn't it make sense for scrubbing to be able to read the disk
linearly, at least to some signi
Hi Niklas,
>
> > 100MB/s is sequential, your scrubbing is random. afaik everything is
> random.
>
> Is there any docs that explain this, any code, or other definitive
> answer?
do a fio[1] test on a disk to see how it performs under certain conditions. Or
look at atop during scrubbing, it will
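For example, a rough fio sketch comparing sequential and random reads on a
spare or unused disk (device, runtime and queue depths are only examples):

  fio --name=seq  --filename=/dev/sdx --rw=read     --bs=1M --direct=1 \
      --ioengine=libaio --iodepth=8  --runtime=30 --time_based
  fio --name=rand --filename=/dev/sdx --rw=randread --bs=4k --direct=1 \
      --ioengine=libaio --iodepth=16 --runtime=30 --time_based

The gap between the two numbers is roughly the gap being referred to.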
Hi Richard! Thanks a lot for your answer!
Indeed I’m so dumb…
I just performed a K/M migration on another service that involved RBD last
week, and it completely eluded me that you can, of course, change
the crush_rule of a pool without having to copy anything!!
OMFG… I can be so sil
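For anyone finding this later, a minimal sketch (pool and rule names are
placeholders):

  ceph osd crush rule ls
  ceph osd pool set <pool> crush_rule <rule-name>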
The question you should ask yourself, why you want to change/investigate this?
Because if scrubbing takes 10x longer due to thrashing seeks, my scrubs never finish
in time (the default deadline is 1 week).
I end with e.g.
267 pgs not deep-scrubbed in time
On a 38 TB cluster, if you scrub 8 MB/s on 10 disks
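Roughly, assuming the whole 38 TB has to be read once per scrub cycle:
10 disks x 8 MB/s = 80 MB/s aggregate, and 38 TB / 80 MB/s is about 475,000 s,
i.e. around 5.5 days - already close to the 1-week deadline before any client
load or seek thrashing is taken into account.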
As suggested by someone, I tried `dump_historic_slow_ops`. There aren't
many, and they're somewhat difficult to interpret:
"description": "osd_op(client.250533532.0:56821 13.16f
13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write
3518464~8192] snapc 0=[] ondisk+writ
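For context, a sketch of how such entries can be pulled and read (the OSD id is
a placeholder; run this on that OSD's host):

  ceph daemon osd.<id> dump_historic_slow_ops | less
  ceph daemon osd.<id> dump_ops_in_flight

The per-event timestamps in each op ("queued_for_pg", "reached_pg", "started",
...) usually show where the time was actually spent.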
On Wed, 26 Apr 2023 at 21:20, Niklas Hambüchen wrote:
> > 100MB/s is sequential, your scrubbing is random. afaik everything is random.
>
> Is there any docs that explain this, any code, or other definitive answer?
> Also wouldn't it make sense that for scrubbing to be able to read the disk
> line