On 2020-10-06 15:27, Igor Fedotov wrote:
> I'm working on improving PG removal in master, see:
> https://github.com/ceph/ceph/pull/37496
>
> Hopefully this will help in case of "cleanup after rebalancing" issue
> which you presumably had.
That would be great. Does the offline compaction with the
On 2020-10-06 14:18, Kristof Coucke wrote:
> Ok, I did the compact on 1 osd.
> The utilization is back to normal, so that's good... Thumbs up to you guys!
We learned this the hard way, but I'm happy we spotted the issue and can share the info.
> Though, one thing I want to get out of the way before adapting the
I'm working on improving PG removal in master, see:
https://github.com/ceph/ceph/pull/37496
Hopefully this will help in case of "cleanup after rebalancing" issue
which you presumably had.
On 10/6/2020 4:24 PM, Kristof Coucke wrote:
> Hi Igor and Stefan,
> Everything seems okay, so we'll now create a script to automate this on
> all the nodes and we will also review the monitoring possibilities.
Hi Igor and Stefan,
Everything seems okay, so we'll now create a script to automate this on all
the nodes and we will also review the monitoring possibilities.
Thanks for your help, it was a time saver.
Does anyone know if this issue is better handled in the newer versions or
if this is planned in a future release?
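A minimal sketch of such a script (assuming the OSDs expose their admin
sockets under the default /var/run/ceph path and that the online "compact"
admin socket command is available on this release):

  #!/bin/bash
  # Compact the RocksDB of every OSD running on this node, one OSD at a time.
  for sock in /var/run/ceph/ceph-osd.*.asok; do
      id=$(basename "$sock" .asok | cut -d. -f2)
      echo "Compacting osd.${id} ..."
      ceph daemon "osd.${id}" compact
  done

If the online command is not available, an offline alternative is to stop
each OSD and run "ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<id>
compact" instead.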
I've seen similar reports after manual compactions as well. But it looks
like a presentation bug in RocksDB to me.
You can check if all the data is spilled over (as it ought to be for L4)
in bluefs section of OSD perf counters dump...
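For example, something along these lines (counter names assumed from a
recent BlueStore build; a non-zero slow_used_bytes would indicate spillover
onto the main device):

  # Pull the bluefs counters out of one OSD's perf dump
  ceph daemon osd.0 perf dump | \
    python3 -c 'import json,sys; b=json.load(sys.stdin)["bluefs"]; print("db_used_bytes:", b["db_used_bytes"]); print("slow_used_bytes:", b["slow_used_bytes"])'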
On 10/6/2020 3:18 PM, Kristof Coucke wrote:
> Ok, I did the compact on 1 osd.
We had a similar issue last week. We have sluggish disks (10TB SAS in
RAID 0 mode) in half of the nodes, which affected the performance of the
cluster. These disks had high CPU usage and very high latency. It turned
out there is a *patrol read* process from the RAID card that runs
automatically every week. W
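On a Broadcom/LSI controller, checking and disabling that would look
roughly like this (exact syntax may vary per storcli version; rescheduling
it to an off-peak window is another option):

  # Check the current patrol read mode and schedule on controller 0
  storcli /c0 show patrolread

  # Disable automatic patrol read
  storcli /c0 set patrolread=off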
Ok, I did the compact on 1 osd.
The utilization is back to normal, so that's good... Thumbs up to you guys!
Though, one thing I want to get out of the way before adapting the other
OSDs:
When I now get the RocksDB stats, my L1, L2 and L3 are gone:
db_statistics {
"rocksdb_compaction_statistics
On Tue, 6 Oct 2020 at 11:13, Kristof Coucke wrote:
> I'm now wondering what my options are to improve the performance... The
> main goal is to use the system again, and make sure write operations are
> not affected.
> - Putting weight on 0 for the slow OSDs (temporarily)? This way the recovery
> c
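Temporarily draining an OSD that way would look roughly like this (osd.12
and the weight value are just placeholders):

  # Set the CRUSH weight of a slow OSD to 0 so data migrates off it
  ceph osd crush reweight osd.12 0

  # ... and restore the original weight afterwards (value from "ceph osd tree")
  ceph osd crush reweight osd.12 9.09560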
Unfortunately currently available Ceph releases lack any means to
monitor KV data removal. The only way is to set debug_bluestore to 20
(for a short period of time, e.g. 1 min) and inspect OSD log for
_remove/_do_remove/_omap_clear calls. Plenty of them within the
inspected period means ongoing KV data removal.
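Roughly, assuming osd.12 is the OSD being inspected and default log
locations:

  # Raise BlueStore logging for about a minute, then restore the default
  ceph tell osd.12 injectargs '--debug_bluestore 20'
  sleep 60
  ceph tell osd.12 injectargs '--debug_bluestore 1/5'

  # Count removal-related calls logged during that window
  grep -cE '_remove|_do_remove|_omap_clear' /var/log/ceph/ceph-osd.12.log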
Is there a way that I can check if this process is causing the performance
issues?
On Tue, 6 Oct 2020 at 13:05, Igor Fedotov wrote:
>
> On 10/6/2020 1:04 PM, Kristof Coucke wrote:
>
> Another strange thing is going on:
>
> No client software
On 10/6/2020 1:04 PM, Kristof Coucke wrote:
> Another strange thing is going on:
> No client software is using the system any longer, so we would expect
> that all IOs are related to the recovery (fixing of the degraded PG).
> However, the disks that are reaching high IO are not a member of the
> PGs that are being fixed.
I presume that this might be caused by massive KV data removal which was
initiated after (or during) the data rebalance. We've seen multiple
complaints about RocksDB's performance being negatively affected by
pool/PG removal, and I expect data rebalance might suffer from the same...
You might want to run
Another strange thing is going on:
No client software is using the system any longer, so we would expect that
all IOs are related to the recovery (fixing of the degraded PG).
However, the disks that are reaching high IO are not a member of the PGs
that are being fixed.
So, something is heavily using these disks.
Yes, some disks are spiking near 100%... The delay I see with the iostat
(r_await) seems to be synchronised with the delays between queued_for_pg
and reached_pg events.
The NVMe disks are not spiking, just the spinner disks.
I know the RocksDB is only partially on the NVMe. The read-ahead is also
12
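For reference, the queued_for_pg / reached_pg timings come from the per-op
event list exposed via the OSD admin socket, e.g.:

  # Show the slowest recent ops on one OSD; each op carries an "events"
  # list with timestamps for queued_for_pg, reached_pg, etc.
  ceph daemon osd.12 dump_historic_ops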
Hi Kristof,
are you seeing high (around 100%) OSDs' disks (main or DB ones)
utilization along with slow ops?
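A quick way to check that on one of the nodes, assuming sysstat is
installed:

  # Extended per-device stats every 5 seconds; %util close to 100 together
  # with a high r_await points at a saturated or slow disk
  iostat -x 5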
Thanks,
Igor
On 10/6/2020 11:09 AM, Kristof Coucke wrote:
> Hi all,
> We have a Ceph cluster which has been expanded from 10 to 16 nodes.
> Each node has between 14 and 16 OSDs of which
Thanks to @Anthony:
Diving further I see that I probably was blinded by the CPU load...
I see that some disks are very slow (so my first observations were
incorrect), and the latency seen using iostat seems more or less the same
as what we see in the dump_historic_ops. (+ 3s for r_await)
So, it l
Hi Anthony,
Thanks for the reply.
Average values:
User: 3.5
Idle: 78.4
Wait: 20
System: 1.2
/K.
On Tue, 6 Oct 2020 at 10:18, Anthony D'Atri wrote:
>
>
> >
> > Diving onto the nodes we could see that the OSD daemons are consuming the
> > CPU power, resulting in average CPU loads going near 10 (!)