Hello,

On Thu, 28 May 2015 12:05:03 +0200 Xavier Serrano wrote:

> On Thu May 28 11:22:52 2015, Christian Balzer wrote:
> 
> > > We are testing different scenarios before making our final decision
> > > (cache-tiering, journaling, separate pool,...).
> > >
> > Definitely a good idea to test things out and get an idea of what Ceph
> > and your hardware can do.
> > 
> > From my experience and from reading this ML, however, I think your best
> > bet (overall performance) is to use those 4 SSDs as journal SSDs at a
> > 1:5 ratio for your 20 HDD OSDs.
> > 
> > Currently cache-tiering is probably the worst use for those SSD
> > resources, though the code and strategy are of course improving.
> > 
> I agree: in our particular environment, our tests also conclude that
> SSD journaling performs far better than cache-tiering, especially when
> the cache gets close to its capacity and data movement between cache
> and backing storage occurs frequently.
>
Precisely.
 
> We also want to test whether it is possible to use SSD disks as a
> "transparent" cache for the HDDs at the system (Linux kernel) level, and
> how reliable/good that is.
> 
There are quite a number of threads about this here, some quite
recent/current. 
They range from "not worth it" (i.e. about the same performance as journal
SSDs) to "xyz-cache destroyed my data, ate my babies and set the house on
fire" (i.e. massive reliability problems).

Which is a pity, as in theory they look like a nice fit/addition to Ceph.

> > Dedicated SSD pools may be a good fit depending on your use case.
> > However I'd advise against mixing SSD and HDD OSDs on the same node.
> > To fully utilize those SSDs you'll need a LOT more CPU power than
> > HDD-only OSDs or SSD-journal/HDD-OSD systems require.
> > And you already have 20 OSDs in that box.
> 
> Good point! We did not consider that, thanks for pointing it out.
> 
> > What CPUs do you have in those storage nodes anyway?
> > 
> Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz, according to /proc/cpuinfo.
> We have only 1 CPU per osd node, so I'm afraid we have another
> potential bottleneck here.
> 
Oh dear, that is only about 10GHz in total (the E5-2609 v2 is a quad-core
at 2.5GHz, with no turbo and no hyper-threading) for 20 OSDs.
The usual recommendation for HDD-only OSDs is about 1GHz per OSD, so that
box would already want roughly 20GHz on paper.

Fire up atop (large window so you can see all the details and devices) on
one of your storage nodes.
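For example (the two-second interval is just a suggestion, use whatever
refresh rate you find readable):
---
atop 2
---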

Then from a client (VM) run this:
---
fio --size=8G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 \
    --rw=randwrite --name=fiojob --blocksize=4M --iodepth=32
---
This should drive your disks (OSDs) busy to the point of 100% utilization,
while your CPU still has some idle headroom (that's idle AND wait
combined).

If you change the blocksize to 4K (and just ctrl-c fio after 30 or so
seconds) you should see a very different picture, with the CPU being much
busier and the HDDs seeing less than 100% usage.
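For reference, that is the same fio invocation with only the block size
changed:
---
fio --size=8G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 \
    --rw=randwrite --name=fiojob --blocksize=4K --iodepth=32
---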

That will become even more pronounced with faster HDDs and/or journal SSDs.

And pure SSD clusters/pools are way above that in terms of CPU hunger.

> > If you have the budget, I'd deploy the current storage nodes in classic
> > (SSDs for journals) mode and add a small (2x 8-12 SSDs) pair of pure
> > SSD nodes, optimized for their task (more CPU power, faster network).
> > 
> > Then use those SSD nodes to experiment with cache-tiers and pure SSD
> > pools and switch over things when you're comfortable with this and
> > happy with the performance. 
> >  
> > > 
> > > > However with 20 OSDs per node, you're likely to go from being
> > > > bottlenecked by your HDDs to being CPU limited (when dealing with
> > > > lots of small IOPS at least).
> > > > Still, better than now for sure.
> > > > 
> > > This is very interesting, thanks for pointing it out!
> > > What would you suggest to use in order to identify the actual
> > > bottleneck? (disk, CPU, RAM, etc.). Tools like munin?
> > > 
> > Munin might work; I use collectd to gather all those values (and, even
> > more importantly, all the Ceph counters) and graphite to visualize them.
> > For ad-hoc, on-the-spot analysis I really like atop (in a huge window),
> > which will make it very clear what is going on.
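(To expand on my own point about the counters: they are exposed via the
daemons' admin sockets, so collectd, or any other collector, can poll
something along the lines of the following, with osd.0 merely being an
example daemon name:)
---
ceph daemon osd.0 perf dump
---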
> > 
> > > In addition, there are some kernel tunables that may be helpful
> > > for improving overall performance. Maybe we are hitting some kernel
> > > limits that constrain our results (for instance, we had to
> > > increase fs.aio-max-nr in sysctl.d to 262144 to be able to use 20
> > > disks per host). Which tunables should we watch?
> > > 
> > I'm no expert on large (or even medium) clusters, so you'll have to
> > research the archives and the net (the CERN Ceph slides are nice).
> > One thing I remember is "kernel.pid_max", which is something you're
> > likely to run into at some point with your dense storage nodes:
> > http://ceph.com/docs/master/start/hardware-recommendations/#additional-considerations
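(As a rough sketch of how both tunables could be set persistently via
sysctl.d; 262144 is simply the value you quoted above and 4194303 is a
commonly used large value for pid_max, so adjust both to your own nodes:)
---
# /etc/sysctl.d/90-ceph.conf (example values only)
fs.aio-max-nr = 262144
kernel.pid_max = 4194303
---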
> > 
> > Christian
> 
> Everything you say is really interesting. Thanks for your valuable
> advice. We surely still have plenty of things to learn and test before
> going to production.
> 
As long as you have the time to test things out, you'll be fine. ^_^

Christian

> Thanks again for your time and help.
> 
> Best regards,
> - Xavier Serrano
> - LCAC, Laboratori de Càlcul
> - Departament d'Arquitectura de Computadors, UPC
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
