Wait … 1 gigabit?? That sure isn’t doing you any favors. Remember that RADOS sends replication sub-ops over that, though you mentioned a size=1 pool. You’ll have mon <-> OSD traffic and OSD <-> OSD heartbeats going over that link as well.
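Some napkin math: at qd=1, IOPS is simply 1/latency, so 400 IOPS works out
to about 2.5 ms per 4k write. A drive that does 30k+ IOPS in "ceph tell osd
bench" accounts for only ~0.03 ms of that, so nearly the entire 2.5 ms is
going to network round trips plus the OSD software path. For a size=1 pool
on SSDs I'd expect well under a millisecond.

It's also worth verifying that the 9000 MTU actually survives the whole
path and that nothing silently drops frames under load. A rough sketch from
one OSD host (interface and peer names are placeholders):

  # 8972 = 9000 minus 28 bytes of IP + ICMP headers; -M do forbids fragmentation
  ping -M do -s 8972 <peer-osd-host>

  # error/drop counters should stay flat while rados bench is running
  ethtool -S eth0 | grep -iE 'err|drop|miss'

  # raw throughput between two OSD hosts, independent of Ceph
  iperf3 -s                          # on the peer
  iperf3 -c <peer-osd-host> -t 30    # on this host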
> On Nov 26, 2024, at 5:22 AM, Martin Gerhard Loschwitz
> <martin.loschw...@true-west.com> wrote:
>
> Hi Anthony,
>
> I think the problems have always been like this, albeit these setups are
> a bit older already. We specifically set the MTU to 9000 on both switches
> and all affected machines, but MTU 1500 versus MTU 9000 literally makes
> no difference.
>
> The network is non-LACP on one of the test clusters (the HDD cluster with
> the worst hardware). It's a single 1G link, but that should not be a
> problem for an idling cluster during a normal 4k IOPS test, should it?
>
> Best regards
> Martin
>
>> On Nov 26, 2024, at 4:48 AM, Anthony D'Atri <anthony.da...@gmail.com> wrote:
>>
>> Good insights from Alex.
>>
>> Are these clusters all new? Or have they been around a while, previously
>> happier?
>>
>> One idea that comes to mind is an MTU mismatch between hosts and
>> switches, or some manner of bonding misalignment. What does `netstat -I`
>> show? `ethtool -S`? I'm thinking that maybe, just maybe, bonding (if
>> present) is awry in some fashion such that half of the packets in/out
>> disappear into the twilight zone, as when LACP appears up on the host
>> but a switch issue dooms all packets on one link, in or out.
>>
>>> On Nov 25, 2024, at 9:45 PM, Alex Gorbachev <a...@iss-integration.com> wrote:
>>>
>>> Hi Martin,
>>>
>>> This is a bit of a generic recommendation, but I would go down the path
>>> of reducing complexity, i.e. first test the drive locally on the OSD
>>> node and see if there's anything going on with e.g. drive firmware,
>>> cables, HBA, or power.
>>>
>>> Then run fio from another host, which brings networking into the
>>> picture.
>>>
>>> If those look fine, I would do something crazy with Ceph, such as a
>>> huge number of PGs or a failure domain of OSD, and deploy just a
>>> handful of OSDs to see if you can bring the problem out into the open.
>>> I would use a default setup, with no tweaks to the scheduler etc.
>>> Hopefully you'll get some error messages in the logs (Ceph logs,
>>> syslog, dmesg). Maybe at that point it will become more obvious, or at
>>> least some messages will come through that make sense (to you or
>>> someone else on the list).
>>>
>>> In other words, it seems you have to break this a bit more to get
>>> proper diagnostics. I know you guys have played with Ceph before and
>>> can do the math of what the IOPS values should be; three clusters all
>>> seeing the same problem would most likely indicate a non-default
>>> configuration value that is not correct.
>>> --
>>> Alex Gorbachev
>>> ISS
>>>
>>>> On Mon, Nov 25, 2024 at 9:34 PM Martin Gerhard Loschwitz
>>>> <martin.loschw...@true-west.com> wrote:
>>>>
>>>> Folks,
>>>>
>>>> I am getting somewhat desperate debugging multiple setups here within
>>>> the same environment: three clusters, two SSD-only, one HDD-only, and
>>>> what they all have in common is abysmal 4k IOPS performance when
>>>> measuring with "rados bench". Abysmal means: in an all-SSD cluster I
>>>> get roughly 400 IOPS across more than 250 devices. I know SAS SSDs are
>>>> not ideal, but 400 looks a bit on the low side of things to me.
>>>>
>>>> In the second cluster, also all-SSD, I get roughly 120 4k IOPS, and
>>>> the HDD-only cluster delivers 60 4k IOPS. The latter both with
>>>> substantially fewer devices, granted. But even with 20 HDDs, 68 4k
>>>> IOPS seems like a very bad value to me.
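>>>>
>>>> For reference, the benchmark invocations were along these lines, with
>>>> the pool name and OSD id as placeholders:
>>>>
>>>>   rados bench -p bench 60 write -b 4096 -t 1    # 60 s of 4k writes, qd=1
>>>>   ceph tell osd.0 bench 12288000 4096           # per-OSD 4k write test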
>>>>
>>>> I've tried to rule out everything I know of: BIOS misconfigurations,
>>>> HBA problems, networking trouble (I am seeing comparably bad values
>>>> with a size=1 pool), and so on and so forth. But to no avail. Has
>>>> anybody dealt with something similar on Dell hardware, or in general?
>>>> What could cause such extremely bad benchmark results?
>>>>
>>>> I measure with rados bench and qd=1 at 4k block size. "ceph tell osd
>>>> bench" with 4k blocks yields 30k+ IOPS for every single device in the
>>>> big cluster, and all that leads to is 400 IOPS in total when writing
>>>> to it? Even with no replication in place? That looks a bit off,
>>>> doesn't it? Any help will be greatly appreciated; even a pointer in
>>>> the right direction would be held in high esteem right now. Thank you
>>>> very much in advance!
>>>>
>>>> Best regards
>>>> Martin
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io