On Wed, 27 May 2015 15:38:26 +0200 Xavier Serrano wrote:

> Hello,
>
> On Wed May 27 21:20:49 2015, Christian Balzer wrote:
>
> >
> > Hello,
> >
> > On Wed, 27 May 2015 12:54:04 +0200 Xavier Serrano wrote:
> >
> > > Hello,
> > >
> > > Slow requests, blocked requests and blocked ops occur quite often
> > > in our cluster; too often, I'd say: several times during one day.
> > > I must say we are running some tests, but we are far from pushing
> > > the cluster to the limit (or at least, that's what I believe).
> > >
> > > Every time a blocked request/operation happened, restarting the
> > > affected OSD solved the problem.
> > >
> > You should open a bug with that description and a way to reproduce
> > things, even if only sometimes.
> > Having slow disks instead of an overloaded network causing permanently
> > blocked requests definitely shouldn't happen.
> >
> I totally agree. I'll try to reproduce and definitely open a bug.
> I'll let you know.
>
> > > Yesterday, we wanted to see if it was possible to minimize the
> > > impact that backfills and recovery have on normal cluster
> > > performance. In our case, performance dropped from 1000 cluster
> > > IOPS (approx) to 10 IOPS (approx) when doing some kind of recovery.
> > >
> > > Thus, we reduced the parameters "osd max backfills" and "osd
> > > recovery max active" to 1 (defaults are 10 and 15, respectively).
> > > Cluster performance during recovery improved to 500-600 IOPS
> > > (approx), and overall recovery time stayed approximately the same
> > > (surprisingly).
> > >
> > There are some "sleep" values for recovery and scrub as well; these
> > help a LOT with loaded clusters, too.
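
To put numbers on that, the sort of thing I mean looks like this; treat
the option names and values as a starting point only and verify what
your release actually has with "ceph daemon osd.<id> config show":

  [osd]
  osd max backfills = 1
  osd recovery max active = 1
  # the "sleep" throttles for recovery and scrub; even small values
  # take a lot of pressure off slow spindles
  osd recovery sleep = 0.1
  osd scrub sleep = 0.1

The backfill/recovery limits can also be changed at runtime, so you can
experiment without restarting OSDs:

  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'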

> > > Since then, we have had no more slow/blocked requests/ops
> > > (and our tests are still running). It is too soon to say this, but
> > > my guess is that the OSDs/disks in our cluster cannot cope with
> > > all the I/O: network bandwidth is not an issue (10 GbE
> > > interconnection, graphs show network usage is under control all
> > > the time), but the spindles are not high-performance (WD Green).
> > > Eventually, this might lead to slow/blocked requests/ops (which
> > > shouldn't occur that often).
> > >
> > Ah yes, I was going to comment on your HDDs earlier.
> > As Dan van der Ster at CERN will happily admit, using green, slow HDDs
> > with Ceph (and no SSD journals) is a bad idea.
> >
> > You're likely to see a VAST improvement with even just 1 journal SSD
> > (of sufficient speed and durability) for 10 of your HDDs; a 1:5 ratio
> > would of course be better.
>
> We do have SSDs, but we are not using them right now.
> We have 4 SSDs per OSD host (24 SSDs at the moment).
> SSD model is Intel DC S3700 (400 GB).
>
That's a nice one. ^^

> We are testing different scenarios before making our final decision
> (cache-tiering, journaling, separate pool, ...).
>
Definitely a good idea to test things out and get an idea of what Ceph
and your hardware can do.

From my experience and from reading this ML, however, I think your best
bet (overall performance) is to use those 4 SSDs as 1:5 journal SSDs
for your 20 HDD OSDs.

Currently cache-tiering is probably the worst use for those SSD
resources, though the code and strategy are of course improving.

Dedicated SSD pools may be a good fit depending on your use case.
However, I'd advise against mixing SSD and HDD OSDs on the same node.
To fully utilize those SSDs you'll need a LOT more CPU power than is
required by HDD OSDs or SSD journal/HDD OSD systems, and you already
have 20 OSDs in that box.
What CPUs do you have in those storage nodes anyway?

If you have the budget, I'd deploy the current storage nodes in classic
(SSDs for journals) mode and add a small pair (2x 8-12 SSDs) of pure SSD
nodes, optimized for their task (more CPU power, faster network).
Then use those SSD nodes to experiment with cache tiers and pure SSD
pools, and switch things over once you're comfortable with this and
happy with the performance.

> > However with 20 OSDs per node, you're likely to go from being
> > bottlenecked by your HDDs to being CPU limited (when dealing with
> > lots of small IOPS at least).
> > Still, better than now for sure.
>
> This is very interesting, thanks for pointing it out!
> What would you suggest to use in order to identify the actual
> bottleneck? (disk, CPU, RAM, etc.) Tools like munin?
>
Munin might work; I use collectd to gather all those values (and, even
more importantly, all the Ceph counters) and Graphite to visualize them.

For ad-hoc, on-the-spot analysis I really like atop (in a huge window),
which will make it very clear what is going on.

> In addition, there are some kernel tunables that may be helpful
> to improve overall performance. Maybe we are filling some kernel
> internals and that limits our results (for instance, we had to increase
> fs.aio-max-nr in sysctl.d to 262144 to be able to use 20 disks per
> host). Which tunables should we observe?
>
I'm no expert on large (not even medium) clusters, so you'll have to
research the archives and the net (the CERN Ceph slides are nice).
One thing I remember is "kernel.pid_max", which is something you're
likely to run into at some point with your dense storage nodes:
http://ceph.com/docs/master/start/hardware-recommendations/#additional-considerations
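
To give you a concrete starting point for a sysctl.d snippet, something
like the following covers the one you already bumped plus pid_max; the
file name is arbitrary and the pid_max value is just the one commonly
suggested for dense OSD boxes, so adjust to taste:

  # /etc/sysctl.d/90-ceph.conf
  # async I/O contexts, needed for 20+ OSD disks per host
  fs.aio-max-nr = 262144
  # OSDs spawn a lot of threads, dense nodes can exhaust the default
  kernel.pid_max = 4194303

Apply it with "sysctl --system" (or a reboot) and keep an eye on
/proc/sys/fs/aio-nr and the thread counts in atop when the cluster is
busy.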

Christian

> Thank you very much again for your time.
>
> Best regards,
> - Xavier Serrano
> - LCAC, Laboratori de Càlcul
> - Departament d'Arquitectura de Computadors, UPC
>
> > BTW, if your monitors are just used for that function, 128GB is total
> > and utter overkill.
> > They will be fine with 16-32GB; your storage nodes will be much better
> > served (pagecache for hot read objects) with more RAM.
> > And with 20 OSDs per node, 32GB is pretty close to the minimum I'd
> > recommend anyway.
> >
> > > Reducing I/O pressure caused by recovery and backfill undoubtedly
> > > helped improve cluster performance during recovery; that was
> > > expected. But we did not expect that recovery time stayed the
> > > same... The only explanation for this is that, during recovery,
> > > there are lots of operations that fail due to a timeout, are
> > > retried several times, etc.
> > >
> > > So if disks are the bottleneck, reducing such values may help as
> > > well in normal cluster operation (when propagating the replicas,
> > > for instance). And slow/blocked requests/ops do not occur (or at
> > > least, occur less frequently).
> > >
> > > Does this make sense to you? Any other thoughts?
> > >
> > Very much so, see above for more thoughts.
> >
> > Christian
> >

-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/