Wait …  1 gigabit??  That sure isn’t doing you any favors.  Remember that RADOS 
sends replication sub-ops over that, though you mentioned a size=1 pool.  
You’ll have mon <-> OSD traffic and OSD <-> OSD heartbeats going over that
link as well.
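
For a qd=1 4k test, raw bandwidth is rarely the limiter (400 IOPS x 4 KiB 
is only ~1.6 MB/s), so I'd check that link for errors and drops rather than 
saturation. A quick sketch, with eth0 standing in for whatever interface 
carries your public/cluster traffic:

   ethtool eth0 | grep -i speed                 # negotiated link speed
   ethtool -S eth0 | grep -iE 'err|drop|miss'   # NIC error/drop counters
   sar -n DEV 1                                 # live per-interface throughput

If the error counters climb during a bench run, you've found your place to 
dig.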

> On Nov 26, 2024, at 5:22 AM, Martin Gerhard Loschwitz 
> <martin.loschw...@true-west.com> wrote:
> 
> Hi Anthony,
> 
> I think the problems have always been there, although these setups are 
> already a bit older. We’ve specifically set the MTU to 9000 on both 
> switches and all affected machines, but MTU 1500 vs. MTU 9000 makes 
> literally no difference.
> 
> The network is non-LACP on one of the test clusters (the HDD cluster with 
> the worst hardware). It’s a single 1G link, but that shouldn’t be a problem 
> for an idling cluster during a plain 4k IOPS test, should it?
> 
> Best regards
> Martin
> 
>> On Nov 26, 2024, at 04:48, Anthony D'Atri <anthony.da...@gmail.com> wrote:
>> 
>> Good insights from Alex.  
>> 
>> Are these clusters all new? Or have they been around a while, previously 
>> happier?
>> 
>> One idea that comes to mind is an MTU mismatch between hosts and switches, 
>> or some manner of bonding misalignment.  What does `netstat -i` show?  
>> `ethtool -S`?  I’m thinking that maybe, just maybe, bonding (if present) is 
>> awry in some fashion such that half of the packets in/out disappear into 
>> the twilight zone. Like if LACP appears up on the host but a switch issue 
>> dooms all packets on one link, in or out.  
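>> 
>> If bonding is in play, something like this (bond0 standing in for your 
>> actual bond device) shows whether both legs and the LACP partner look 
>> healthy:
>> 
>>    cat /proc/net/bonding/bond0    # slave link state, LACP partner info
>>    netstat -i                     # RX-ERR/TX-ERR/RX-DRP per interface
>> 
>> A leg that is up on the host but black-holed on the switch tends to show 
>> up as wildly lopsided traffic counters between the two slaves.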
>> 
>>> On Nov 25, 2024, at 9:45 PM, Alex Gorbachev <a...@iss-integration.com> 
>>> wrote:
>>> 
>>> Hi Martin,
>>> 
>>> This is a bit of a generic recommendation, but I would go down the path of
>>> reducing complexity, i.e., first test the drive locally on the OSD node and
>>> see if there's anything going on with, e.g., drive firmware, cables, HBA, or
>>> power.
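>>> 
>>> A minimal sketch of such a local test, assuming fio is available (sdX is a
>>> placeholder, and note that a raw randwrite run destroys the OSD's data, so
>>> only use a drive you can redeploy):
>>> 
>>> fio --name=local4k --filename=/dev/sdX --direct=1 --sync=1 \
>>>     --rw=randwrite --bs=4k --iodepth=1 --runtime=60 --time_based
>>> 
>>> An enterprise SAS SSD should sustain thousands of sync 4k write IOPS here;
>>> if a single drive can't, no amount of Ceph tuning will help.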
>>> 
>>> Then do fio from another host, and this would incorporate networking.
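>>> 
>>> For that, fio's rbd engine is handy if you have a test pool and image to
>>> sacrifice (names here are placeholders):
>>> 
>>> fio --name=net4k --ioengine=rbd --clientname=admin --pool=testpool \
>>>     --rbdname=testimage --rw=randwrite --bs=4k --iodepth=1 \
>>>     --runtime=60 --time_based
>>> 
>>> The delta between this and the local number is what the network plus the
>>> OSD write path are costing you.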
>>> 
>>> If those look fine, I would do something crazy with Ceph, such as a huge
>>> number of PGs or a CRUSH failure domain of OSD, and deploy just a handful
>>> of OSDs to see if you can bring the problem out into the open.  I would use
>>> a default setup, with no tweaks to the scheduler etc.  Hopefully you'll get
>>> some error messages in the logs (Ceph logs, syslog, dmesg).  Maybe at that
>>> point it will become more obvious, or at least some messages will come
>>> through that make sense (to you or someone else on the list).
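>>> 
>>> For the log side, something along these lines usually surfaces drive
>>> resets and slow-op warnings quickly (unit names vary by deployment;
>>> cephadm wraps the daemons differently):
>>> 
>>> journalctl -u 'ceph-osd@*' --since '1 hour ago' | grep -iE 'error|fail|slow'
>>> dmesg -T | grep -iE 'error|reset|timeout'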
>>> 
>>> In other words, it seems you have to break this a bit more to get proper
>>> diagnostics.  I know you guys have played with Ceph before and can do the
>>> math on what the IOPS values should be; three clusters all seeing the same
>>> problem most likely indicates a non-default configuration value that is
>>> incorrect.
>>> --
>>> Alex Gorbachev
>>> ISS
>>> 
>>> 
>>> 
>>>> On Mon, Nov 25, 2024 at 9:34 PM Martin Gerhard Loschwitz <
>>>> martin.loschw...@true-west.com> wrote:
>>>> 
>>>> Folks,
>>>> 
>>>> I am getting somewhat desperate debugging multiple setups here within the
>>>> same environment. Three clusters, two SSD-only, one HDD-only, and what they
>>>> all have in common is abysmal 4k IOPS performance when measuring with
>>>> "rados bench". Abysmal means: in an all-SSD cluster I get roughly 400 IOPS
>>>> over more than 250 devices. I know SAS SSDs are not ideal, but 400 looks a
>>>> bit on the low side of things to me.
>>>> 
>>>> In the second cluster, also all-SSD, I get roughly 120 4k IOPS, and the
>>>> HDD-only cluster delivers 60 4k IOPS. Both of the latter have substantially
>>>> fewer devices, granted. But even with 20 HDDs, 68 4k IOPS seems like a very
>>>> bad value to me.
>>>> 
>>>> I’ve tried to rule out everything I know of: BIOS misconfiguration, HBA
>>>> problems, networking trouble (I am seeing comparably bad values with a
>>>> size=1 pool), and so on and so forth. But to no avail. Has anybody dealt
>>>> with something similar on Dell hardware, or in general? What could cause
>>>> such extremely bad benchmark results?
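>>>> 
>>>> (For the size=1 test, this is roughly what I mean; the pool name is a
>>>> placeholder, and newer releases also want mon_allow_pool_size_one=true:
>>>> 
>>>> ceph osd pool create bench-pool 128
>>>> ceph osd pool set bench-pool size 1 --yes-i-really-mean-it )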
>>>> 
>>>> I measure with rados bench and qd=1 at 4k block size. "ceph tell osd
>>>> bench" with 4k blocks yields 30k+ IOPS for every single device in the big
>>>> cluster, and all that leads to is 400 IOPS in total when writing to it?
>>>> Even with no replication in place? That looks a bit off, doesn't it? Any
>>>> help will be greatly appreciated; even a pointer in the right direction
>>>> would be held in high esteem right now. Thank you very much in advance!
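>>>> 
>>>> For reference, a qd=1 4k run of the kind I mean looks roughly like this
>>>> (pool name is a placeholder):
>>>> 
>>>> rados bench -p bench-pool 60 write -t 1 -b 4096
>>>> ceph tell osd.0 bench 12288000 4096
>>>> 
>>>> At qd=1, 400 IOPS comes out to about 2.5 ms per 4k write, which seems far
>>>> too high for an idle all-SSD cluster.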
>>>> 
>>>> Best regards
>>>> Martin
> 
> -- 
> 
> Martin Gerhard Loschwitz
> Geschäftsführer / CEO, True West IT Services GmbH
> P +49 2433 5253130
> M +49 176 61832178
> A Schmiedegasse 24a, 41836 Hückelhoven, Deutschland
> R HRB 21985, Amtsgericht Mönchengladbach
> True West IT Services GmbH is compliant with the GDPR regulation on data 
> protection and privacy in the European Union and the European Economic Area. 
> You can request the information on how we collect and process your private 
> data according to the law by contacting the email sender.
> 
> 
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
