Christian, you mention single socket systems for storage servers.
I often thought that the Xeon-D would be ideal as a building block for
storage servers
https://www.intel.com/content/www/us/en/products/processors/xeon/d-processors.html
Low power, and a complete System-On-Chip with 10gig Ethernet.

I haven't been following these processors lately. Is anyone building Ceph
clusters using them?

On 2 April 2018 at 02:59, Christian Balzer <ch...@gol.com> wrote:

>
> Hello,
>
> firstly, Jack pretty much correctly correlated my issues to Mark's points,
> more below.
>
> On Sat, 31 Mar 2018 08:24:45 -0500 Mark Nelson wrote:
>
> > On 03/29/2018 08:59 PM, Christian Balzer wrote:
> >
> > > Hello,
> > >
> > > my crappy test cluster was rendered inoperable by an IP renumbering
> > > that was unplanned and forced on me during a DC move, so I decided to
> > > start from scratch and explore the fascinating world of
> > > Luminous/bluestore and all the assorted bugs. ^_-
> > > (Yes, I could have recovered the cluster by setting up a local VLAN with
> > > the old IPs, extracting the monmap, etc., but I consider the need for a
> > > running monitor a flaw, since all the relevant data was present in the
> > > leveldb.)
> > >
> > > Anyway, while I've read about the bluestore OSD cache in passing here,
> > > the back of my brain was clearly still hoping that it would use
> > > pagecache/SLAB like other filesystems.
> > > Which, after my first round of playing with things, clearly isn't the
> > > case.
> > >
> > > This strikes me as a design flaw and regression because:
> >
> > Bluestore's cache is not broken by design.
> >
>
> During further tests I verified something that caught my attention out of
> the corner of my eye when glancing at the atop output of the OSDs during my
> fio runs.
>
> Consider this fio run, after having done an identical run with --rw=write to
> populate the file and the caches (1GB per OSD default on the test cluster,
> 20 OSDs total on 5 nodes):
> ---
> $ fio --size=8G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 \
>   --rw=randread --name=fiojob --blocksize=4M --iodepth=32
> ---
>
> This is being run against a kernel mounted RBD image.
> On the Luminous test cluster it will read the data from the disks,
> completely ignoring the pagecache on the host (as expected and desired)
> AND the bluestore cache.
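>
> ("Kernel mounted" meaning the usual krbd path; roughly, assuming a
> hypothetical image rbd/testimg:
> ---
> $ rbd map rbd/testimg
> $ mkfs.xfs /dev/rbd0 && mount /dev/rbd0 /mnt
> ---
> with the fio file living on that mount.)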
>
> On a Jewel based test cluster with filestore the reads will be served from
> the pagecaches of the OSD nodes, not only massively improving speed but,
> more importantly, reducing spindle contention.
>
> My guess is that bluestore treats "direct" differently than the kernel does
> when accessing a filestore based OSD, and I'm not sure what the "correct"
> behavior here is.
> But somebody migrating to bluestore with such a use case and plenty of RAM
> on their OSD nodes is likely to notice this and is not going to be happy
> about it.
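>
> (For anyone wanting to check what the bluestore cache actually holds during
> such a run, the admin socket is handy; a minimal sketch, assuming OSD id 0
> and run on the OSD node itself:
> ---
> $ ceph daemon osd.0 dump_mempools | grep -A2 bluestore_cache
> ---
> which reports the items/bytes currently held by the onode and data caches.)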
>
>
> > I'm not totally convinced that some of the trade-offs we've made with
> > bluestore's cache implementation are optimal, but I think you should
> > consider cooling your rhetoric down.
> >
> > > 1. Completely new users may think that bluestore defaults are fine and
> > > waste all that RAM in their machines.
> >
> > What does "wasting" RAM mean in the context of a node running ceph? Are
> > you upset that other applications can't come in and evict bluestore
> > onode, OMAP, or object data from cache?
> >
> As Jack pointed out, unless you go around and start tuning things,
> all the available free RAM won't be used for caching.
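>
> For reference, a minimal sketch of the knobs involved (option names as of
> Luminous; exact defaults and the related meta/kv ratio options vary between
> releases, so treat this as illustrative rather than authoritative):
> ---
> [osd]
> # per-OSD bluestore cache; stock defaults are roughly 1GB for HDD OSDs
> # and 3GB for SSD OSDs
> bluestore_cache_size_hdd = 1073741824
> bluestore_cache_size_ssd = 3221225472
> # or one size regardless of media type
> #bluestore_cache_size = 2147483648
> ---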
>
> This raises another point: the cache is per-process data, and judging from
> some bluestore threads here, if you raise the cache to use most of the RAM
> during normal ops you're likely to be visited by the evil OOM witch during
> heavy recovery ops.
>
> Whereas the good ole pagecache would just get evicted in that scenario.
>
> > > 2. Having a per OSD cache is inefficient compared to a common cache
> > > like pagecache, since an OSD that is busier than others would benefit
> > > from a shared cache more.
> >
> > It's only "inefficient" if you assume that using the pagecache, and more
> > generally, kernel syscalls, is free.  Yes the pagecache is convenient
> > and yes it gives you a lot of flexibility, but you pay for that
> > flexibility if you are trying to do anything fast.
> >
> > For instance, take the new KPTI patches in the kernel for Meltdown. Look
> > at how badly they can hurt MyISAM database performance in MariaDB:
> >
> I, like many others here, have decided that all the Meltdown and Spectre
> patches are a bit pointless on pure OSD nodes, because if somebody on the
> node is running random code you're already in deep doodoo.
>
> That being said, I will totally concur that syscalls aren't free.
> However, given the latencies induced by the rather long/complex code path
> IOPS have to traverse within Ceph, how much of a gain would you say
> eliminating these particular calls achieved?
>
> > https://mariadb.org/myisam-table-scan-performance-kpti/
> >
> > MyISAM does not have a dedicated row cache and instead caches row data
> > in the page cache, as you suggest Bluestore should do for its data.
> > Look at how badly KPTI hurts performance (~40%). Now look at Aria with a
> > dedicated 128MB cache (less than 1%).  KPTI is a really good example of
> > how much this stuff can hurt you, but syscalls, context switches, and
> > page faults were already expensive even before Meltdown.  Not to mention
> > that right now bluestore keeps onodes and buffers stored in its cache
> > in an unencoded form.
> >
> That last bit is quite relevant of course.
>
> > Here's a couple of other articles worth looking at:
> >
> > https://eng.uber.com/mysql-migration/
> > https://www.scylladb.com/2018/01/07/cost-of-avoiding-a-meltdown/
> > http://www.brendangregg.com/blog/2018-02-09/kpti-kaiser-meltdown-performance.html
> >
> > > 3. A uniform OSD cache size will of course be a nightmare with
> > > non-uniform HW, whether in RAM or in the number of OSDs.
> >
> > Non-Uniform hardware is a big reason that pinning dedicated memory to
> > specific cores/sockets is really nice vs relying on potentially remote
> > memory page cache reads.  A long time ago I was responsible for
> > validating the performance of CXFS on an SGI Altix UV distributed
> > shared-memory supercomputer.  As it turns out, we could achieve about
> > 22GB/s writes with XFS (a huge number at the time), but CXFS was 5-10x
> > slower.  A big part of that turned out to be the kernel distributing
> > page cache across the Numalink5 interconnects to remote memory.  The
> > problem can potentially happen on any NUMA system to varying degrees.
> >
> I could regale you with even more ancient stories when I was working with
> DEC VMSclusters. ^o^ But that's not here and now.
>
> As for pinning, I'm doing this mostly on compute nodes, basically to keep
> a good number of cores free on the NUMA node(s) that handle HW interrupts;
> the kernel tends to do a decent enough job from there on.
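>
> As an illustration of the kind of pinning I mean (purely a sketch; the NUMA
> node numbers and OSD id are made up, adjust to your topology and init
> system):
> ---
> # keep an OSD daemon and its memory off NUMA node 0, which services the
> # NIC interrupts on this hypothetical box
> $ numactl --cpunodebind=1 --membind=1 /usr/bin/ceph-osd -f --cluster ceph --id 12
> ---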
>
> And while you definitely have a point, modern designs like Epyc pretty much
> negate NUMA issues.
> Never mind that performance/latency-conscious Ceph users already do stick
> to single-NUMA CPUs for SSD/NVMe servers if possible.
>
> > Personally I have two primary issues with bluestore's memory
> > configuration right now:
> >
> > 1) It's too complicated for users to figure out where to assign memory
> > and in what ratios.  I'm attempting to improve this by making
> > bluestore's cache self-tuning, so the user just gives it a number and
> > bluestore will try to work out where it should assign memory.
> >
> This would be very helpful (as in a ratio of total RAM to the number of
> OSDs); otherwise you wind up with non-uniformity issues again.
>
> And especially _if_ it can also voluntarily drop caches in low-memory
> situations.
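>
> As a rough worked example of the ratio I have in mind (the numbers and the
> 16GB OS/recovery headroom are purely illustrative):
> ---
> # divide the usable RAM on this node evenly across its local OSDs
> osds=$(ls /var/lib/ceph/osd | wc -l)
> total_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
> reserve_kb=$((16 * 1024 * 1024))   # headroom for the OS, recovery, etc.
> echo "bluestore_cache_size = $(( (total_kb - reserve_kb) * 1024 / osds ))"
> ---
> i.e. a 64GB node with 12 OSDs would end up with roughly 4GB of cache per
> OSD, which only works out if every node has the same RAM/OSD ratio.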
>
> > 2) In the case where a subset of OSDs are really hot (maybe RGW bucket
> > accesses) you might want some OSDs to get more memory than others.  I
> > think we can tackle this better if we migrate to a one-osd-per-node
> > sharded architecture (likely based on seastar), though we'll still need
> > to be very aware of remote memory.  Given that this is fairly difficult
> > to do well, we're probably going to be better off just dedicating a
> > static pool to each shard initially.
> >
> I'm wondering if and how such a sharding can be realized while still
> keeping the OSD (storage device really) the smallest failure domain and
> not just the host.
> Because I'm betting you that some people have specialty use cases
> depending on that (not me for a change).
>
> Christian
>
> > Mark
>
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Rakuten Communications
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
