Hi,
We're running latest Pacific on our production cluster and we've been
seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had timed out
after 15.00954s' error. We have reasons to believe this happens each
time the RocksDB compaction process is launched on an OSD. My question
is,
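(A rough way to check the suspected correlation, assuming the OSD logs go to journald as in the log excerpt further down this digest; the OSD id 39 is only a placeholder:)

    # see whether RocksDB compaction activity lines up with the op thread timeouts
    journalctl -u ceph-osd@39 --since "1 hour ago" | grep -iE 'compact|timed out'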
Hi, I did something like that in the past. If you have a sufficient amount of
cold data in general and you can bring the OSDs back with their original IDs,
recovery is significantly faster than rebalancing. It really depends on how
trivial the version update per object is. In my case it could re-u
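(For reference, a rough sketch of re-creating an OSD under its original ID so the cluster recovers instead of rebalancing; the OSD id and device path are placeholders, and the ceph-volume flags should be checked against your release:)

    ceph osd set noout                                    # avoid rebalancing while the OSD is down
    ceph osd destroy 12 --yes-i-really-mean-it            # keeps the ID and CRUSH position reserved
    ceph-volume lvm create --osd-id 12 --data /dev/sdX    # rebuild the OSD under the same ID
    ceph osd unset noout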
Dear Ceph users,
I just upgraded my cluster to Reef, and the new version also came with a
revamped dashboard. Unfortunately the new dashboard is really awful to me:
1) it's no longer possible to see the status of the PGs: in the old
dashboard it was very easy to see e.g. how many PGs were rec
Hi,
> On 7 Sep 2023, at 10:05, J-P Methot wrote:
>
> We're running latest Pacific on our production cluster and we've been seeing
> the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had timed out after
> 15.00954s' error. We have reasons to believe this happens each time the
> RocksDB co
On Thu, 7 Sept 2023 at 18:05, Nicola Mori wrote:
> Is it just me or maybe my impressions are shared by someone else? Is
> there anything that can be done to improve the situation?
>
I wonder about the implementation choice for this dashboard. I find with
our Reef cluster it seems to get stuck du
My cluster has 104 OSDs, so I don't think this can be a factor in the
malfunction.
We're talking about automatic online compaction here, not running the
command.
On 9/7/23 04:04, Konstantin Shalygin wrote:
Hi,
On 7 Sep 2023, at 10:05, J-P Methot wrote:
We're running latest Pacific on our production cluster and we've been
seeing the dreaded 'OSD::osd_op_tp thread 0x7f346a
Hi,
we have also experienced several ceph-mgr OOM kills on Ceph v16.2.13, with
120T/200T of data.
Is there any tracker about the problem?
Does upgrading to 17.x "solve" the problem?
Kind regards,
Rok
On Wed, Sep 6, 2023 at 9:36 PM Ernesto Puerta wrote:
> Dear Cephers,
>
> Today brought us an even
Hi,
On 21/08/2023 17:16, Josh Durgin wrote:
We stopped targeting bullseye once we discovered the compiler version
problem; the focus shifted to bookworm. If anyone would like to help
maintaining debian builds, or looking into these issues, it would be
welcome:
https://bugs.debian.org/cgi-bin
On an HDD-based Quincy 17.2.5 cluster (with DB/WAL on datacenter-class
NVMe with enhanced power loss protection), I sometimes (once or twice
per week) see log entries similar to what I reproduced below (a bit
trimmed):
Wed 2023-09-06 22:41:54 UTC ceph-osd09 ceph-osd@39.service[5574]:
2023-09-06T22
Thanks all for the advice, very helpful!
The node also had a mon, which happily slotted right back into the cluster. The
node's been up and running for a number of days now, but the systemd OSD
processes don't seem to be trying continuously; they're never progressing or
getting a newer map.
As
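(One way to confirm whether those OSDs really are stuck on an old map is to compare the cluster's current osdmap epoch with what the daemon itself reports; osd.12 is a placeholder id:)

    ceph osd dump | head -1       # current osdmap epoch according to the monitors
    ceph daemon osd.12 status     # run on the OSD host: shows the oldest/newest map the daemon holds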
Your description seems to match my observations trying to create
cephfs snapshots via dashboard. In latest Octopus it works, in Pacific
16.2.13 and Quincy 17.2.6 it doesn't, in Reef 18.2.0 it works again.
Zitat von MARTEL Arnaud :
Hi Eugen,
We have a lot of shared directories in cephfs
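(As a workaround while the dashboard behaviour varies between releases, snapshots can still be created directly on a mounted filesystem; a sketch, with the mount point and directory names as placeholders:)

    mkdir /mnt/cephfs/shared/projects/.snap/before-migration    # create a snapshot of this directory
    ls /mnt/cephfs/shared/projects/.snap                        # list existing snapshots
    rmdir /mnt/cephfs/shared/projects/.snap/before-migration    # remove it again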
Hello,
What is the best strategy regarding failure domain and rack awareness when
there are only 2 physical racks and we need 3 replicas of data?
In this scenario, what is your point of view on creating 4 artificial racks,
at least to be able to manage deliberate node maintenance in a more
efficie
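(One approach often suggested instead of inventing fake racks is a CRUSH rule that picks both racks and then two hosts per rack, so that with size=3 you end up with two copies in one rack and one in the other. A rough sketch only, assuming a root named "default"; the rule name and id are arbitrary, and the placements should be verified with crushtool before injecting the map:)

    ceph osd getcrushmap -o crushmap.bin          # export the current CRUSH map
    crushtool -d crushmap.bin -o crushmap.txt     # decompile it for editing

    # add a rule along these lines to crushmap.txt:
    #   rule replicated_two_racks {
    #       id 10
    #       type replicated
    #       step take default
    #       step choose firstn 2 type rack
    #       step chooseleaf firstn 2 type host
    #       step emit
    #   }

    crushtool -c crushmap.txt -o crushmap.new                               # recompile
    crushtool -i crushmap.new --test --rule 10 --num-rep 3 --show-mappings  # sanity-check placements
    ceph osd setcrushmap -i crushmap.new                                    # inject the new map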
On 07-09-2023 09:05, J-P Methot wrote:
Hi,
We're running latest Pacific on our production cluster and we've been
seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had timed out
after 15.00954s' error. We have reasons to believe this happens each
time the RocksDB compaction process
Hi,
Since my post, we've been speaking with a member of the Ceph dev team.
He did, at first, believe it was an issue linked to the common
performance degradation after huge delete operations. So we did do
offline compactions on all our OSDs. It fixed nothing and we are going
through the logs
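(For readers following along, offline compaction as described above is typically done with the OSD stopped, roughly like this; the OSD id is a placeholder:)

    systemctl stop ceph-osd@39
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-39 compact
    systemctl start ceph-osd@39
    # the online, per-OSD equivalent would be: ceph tell osd.39 compact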
Hello all,
We've had a Nautilus [latest releases] cluster for some years now, and are
planning the upgrade process - both moving off Centos7 [ideally to a RHEL9
compatible spin like Alma 9 or Rocky 9] and also moving to a newer Ceph release
[ideally Pacific or higher to avoid too many later upg
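(A minimal sketch of the bookkeeping commands around such an upgrade; it leaves out the actual OS migration and package work, and should be checked against the official upgrade notes for the target release:)

    ceph versions                          # confirm every daemon runs the same Nautilus point release
    ceph osd set noout                     # avoid rebalancing while daemons restart
    # ... upgrade packages / hosts, restarting mons, then mgrs, then OSDs ...
    ceph osd require-osd-release pacific   # once every OSD runs the new release
    ceph osd unset noout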
Hi,
> On 7 Sep 2023, at 18:21, J-P Methot wrote:
>
> Since my post, we've been speaking with a member of the Ceph dev team. He
> did, at first, believe it was an issue linked to the common performance
degradation after huge delete operations. So we did do offline compactions on
> all our OS
Hello,
There are two things that might help you here. One is to try the new
"rocksdb_cf_compaction_on_deletion" feature that I added in Reef and we
backported to Pacific in 16.2.13. So far this appears to be a huge win
for avoiding tombstone accumulation during iteration which is often the
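(If anyone wants to experiment with this on 16.2.13+: the option name below is taken as quoted in the post above and may differ slightly on your release, so verify it first:)

    ceph config ls | grep -i deletion       # find the exact option name on your release
    # then enable it for OSDs, e.g. (name as quoted above, verify it first):
    ceph config set osd rocksdb_cf_compaction_on_deletion true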
We went from 16.2.13 to 16.2.14
Also, the timeout is 15 seconds because that's the default in Ceph. Basically,
it's 15 seconds before Ceph shows a warning that the OSD is timing out.
We may have found the solution, but it would be, in fact, related to
bluestore_allocator and not the compaction process. I'll
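(For anyone wanting to check their own cluster: the allocator in question is controlled by the bluestore_allocator option; a sketch of inspecting and, if needed, overriding it. The value shown is only an example, and OSDs must be restarted for a change to take effect:)

    ceph config get osd bluestore_allocator              # what the cluster configuration says
    ceph daemon osd.39 config get bluestore_allocator    # what a running OSD actually uses
    ceph config set osd bluestore_allocator bitmap       # example override; restart the OSDs afterwards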
Hi,
By this point, we're 95% sure that, contrary to our previous beliefs,
it's an issue with changes to the bluestore_allocator and not the
compaction process. That said, I will keep this email in mind as we will
want to test optimizations to compaction on our test environment.
On 9/7/23 12:
Ok, good to know. Please feel free to update us here with what you are
seeing in the allocator. It might be worth opening a tracker
ticket as well. I did some work in the AVL allocator a while back where
we were repeating the linear search from the same offset every
allocation, getting
To be quite honest, I will not pretend I have a low-level understanding
of what was going on. There is very little documentation as to what the
bluestore allocator actually does and we had to rely on Igor's help to
find the solution, so my understanding of the situation is limited. What
I under
Oh that's very good to know. I'm sure Igor will respond here, but do
you know which PR this was related to? (possibly
https://github.com/ceph/ceph/pull/50321)
If we think there's a regression here we should get it into the tracker
ASAP.
Mark
On 9/7/23 13:45, J-P Methot wrote:
To be quite
Hi Rok,
We're still trying to catch what's causing the memory growth, so it's hard
to guess which releases are affected. We know it's happening
intermittently on a live Pacific cluster at least. If you have the
ability to catch it while it's happening, there are several
approaches/tools tha
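(One low-tech way to at least catch when the growth happens, before reaching for heavier profiling tools, is to log the mgr's resident set size over time; a sketch that assumes a single ceph-mgr process per host and may need adjusting for containerized deployments:)

    # log the ceph-mgr RSS once a minute
    while true; do
        echo "$(date -Is) rss_kb=$(ps -o rss= -C ceph-mgr | head -1)"
        sleep 60
    done >> /tmp/ceph-mgr-rss.log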
I also see the dreaded error. I found this to be a bcache problem. You can use the blktrace tools to capture I/O data for analysis.
Sent from my Xiaomi phone. On 7 Sep 2023 at 22:52, Stefan Kooman wrote: On 07-09-2023 09:05, J-P Methot wrote:
> Hi,
>
> We're running latest Pacific on our production cluster and we've been
> seeing the dreaded 'O
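(In case it helps others, a minimal sketch of capturing I/O with blktrace; the bcache device name is a placeholder:)

    blktrace -d /dev/bcache0 -w 60 -o osd-trace    # capture 60 seconds of block-layer I/O
    blkparse -i osd-trace | less                   # inspect request sizes and latencies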
On 07-09-2023 19:20, J-P Methot wrote:
We went from 16.2.13 to 16.2.14
Also, the timeout is 15 seconds because that's the default in Ceph. Basically,
it's 15 seconds before Ceph shows a warning that the OSD is timing out.
We may have found the solution, but it would be, in fact, related to
bluestore_alloca
Does this cluster use the default settings, or was something changed for Bluestore?
You can check this via `ceph config diff`.
As Mark said, it would be nice to have a tracker if this really is a release problem.
Thanks,
k
Sent from my iPhone
> On 7 Sep 2023, at 20:22, J-P Methot wrote:
>
> We went from
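(For completeness: the config diff mentioned above is available per daemon, e.g. as below; the exact command set varies a bit between releases, and the OSD id is a placeholder:)

    ceph daemon osd.39 config diff | less   # on the OSD host: every option that differs from its default
    ceph tell osd.39 config diff            # the same, remotely, on recent releases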