Christian, you mention single-socket systems for storage servers. I have often thought that the Xeon-D would be ideal as a building block for storage servers: https://www.intel.com/content/www/us/en/products/processors/xeon/d-processors.html
Low power, and a complete system-on-chip with 10 Gbit Ethernet.
I haven't been following these processors lately. Is anyone building Ceph
clusters using them?

On 2 April 2018 at 02:59, Christian Balzer <ch...@gol.com> wrote:
>
> Hello,
>
> firstly, Jack pretty much correctly correlated my issues to Mark's points,
> more below.
>
> On Sat, 31 Mar 2018 08:24:45 -0500 Mark Nelson wrote:
>
> > On 03/29/2018 08:59 PM, Christian Balzer wrote:
> >
> > > Hello,
> > >
> > > my crappy test cluster was rendered inoperational by an IP renumbering
> > > that wasn't planned and was forced on me during a DC move, so I decided
> > > to start from scratch and explore the fascinating world of
> > > Luminous/bluestore and all the assorted bugs. ^_-
> > > (Yes, I could have recovered the cluster by setting up a local VLAN with
> > > the old IPs, extracting the monmap, etc., but I consider the need for a
> > > running monitor a flaw, since all the relevant data was present in the
> > > leveldb.)
> > >
> > > Anyway, while I've read about the bluestore OSD cache in passing here,
> > > the back of my brain was clearly still hoping that it would use
> > > pagecache/SLAB like other filesystems.
> > > Which, after my first round of playing with things, clearly isn't the
> > > case.
> > >
> > > This strikes me as a design flaw and regression because:
> >
> > Bluestore's cache is not broken by design.
> >
>
> During further tests I verified something that caught my attention out of
> the corner of my eye when glancing at atop output of the OSDs during my
> fio runs.
>
> Consider this fio run, after having done the same with writes to populate
> the file and the caches (1GB per OSD default on the test cluster, 20 OSDs
> total on 5 nodes):
> ---
> $ fio --size=8G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
> --rw=randread --name=fiojob --blocksize=4M --iodepth=32
> ---
>
> This is being run against a kernel-mounted RBD image.
> On the Luminous test cluster it will read the data from the disks,
> completely ignoring the pagecache on the host (as expected and desired)
> AND the bluestore cache.
>
> On a Jewel-based test cluster with filestore the reads will be served from
> the pagecaches of the OSD nodes, not only massively improving speed but,
> more importantly, reducing spindle contention.
>
> My guess is that bluestore treats "direct" differently than the kernel
> accessing a filestore-based OSD, and I'm not sure what the "correct"
> behavior here is.
> But somebody migrating to bluestore with such a use case and plenty of RAM
> on their OSD nodes is likely to notice this and not be happy about it.
>
> > I'm not totally convinced that some of the trade-offs we've made with
> > bluestore's cache implementation are optimal, but I think you should
> > consider cooling your rhetoric down.
> >
> > > 1. Completely new users may think that bluestore defaults are fine and
> > > waste all that RAM in their machines.
> >
> > What does "wasting" RAM mean in the context of a node running ceph? Are
> > you upset that other applications can't come in and evict bluestore
> > onode, OMAP, or object data from cache?
> >
> As Jack pointed out, unless you go around and start tuning things, all
> available free RAM won't be used for caching.
>
> This raises another point, it being per-process data: from skimming over
> some bluestore threads here, if you go and raise the cache to use most
> RAM during normal ops, you're likely to be visited by the evil OOM witch
> during heavy recovery ops.
>
> Whereas the good ole pagecache would just get evicted in that scenario.
>
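As an aside for anyone following along: giving BlueStore more cache than the
1GB-per-HDD-OSD default Christian mentions is done per OSD in ceph.conf. A
minimal sketch, with made-up sizes and the option names as I read the Luminous
docs (so double-check against your release), would be:

---
# ceph.conf on the OSD nodes -- illustrative values only
[osd]
# per-OSD BlueStore cache instead of the default (~1GB on HDD, more on SSD)
bluestore_cache_size_hdd = 4294967296
bluestore_cache_size_ssd = 4294967296
# restart the OSDs on the node afterwards, e.g.:
# systemctl restart ceph-osd.target
---

Unlike pagecache this memory is not handed back under pressure, so keeping
(OSDs per node) x (cache size plus per-OSD overhead) well below the installed
RAM seems wise, or heavy recovery may indeed summon the OOM killer.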
> > > 2. Having a per-OSD cache is inefficient compared to a common cache
> > > like pagecache, since an OSD that is busier than others would benefit
> > > from a shared cache more.
> >
> > It's only "inefficient" if you assume that using the pagecache, and more
> > generally, kernel syscalls, is free. Yes, the pagecache is convenient
> > and yes, it gives you a lot of flexibility, but you pay for that
> > flexibility if you are trying to do anything fast.
> >
> > For instance, take the new KPTI patches in the kernel for Meltdown. Look
> > at how badly they can hurt MyISAM database performance in MariaDB:
> >
> I, like many others here, have decided that all the Meltdown and Spectre
> patches are a bit pointless on pure OSD nodes, because if somebody on the
> node is running random code you're already in deep doodoo.
>
> That being said, I will totally concur that syscalls aren't free.
> However, given the latencies induced by the rather long/complex code path
> IOPS have to traverse within Ceph, how much of a gain would you say
> eliminating these particular calls achieved?
>
> > https://mariadb.org/myisam-table-scan-performance-kpti/
> >
> > MyISAM does not have a dedicated row cache and instead caches row data
> > in the page cache, as you suggest Bluestore should do for its data.
> > Look at how badly KPTI hurts performance (~40%). Now look at Aria with a
> > dedicated 128MB cache (less than 1%). KPTI is a really good example of
> > how much this stuff can hurt you, but syscalls, context switches, and
> > page faults were already expensive even before Meltdown. Not to mention
> > that right now bluestore keeps onodes and buffers stored in its cache
> > in an unencoded form.
> >
> That last bit is quite relevant, of course.
>
> > Here's a couple of other articles worth looking at:
> >
> > https://eng.uber.com/mysql-migration/
> > https://www.scylladb.com/2018/01/07/cost-of-avoiding-a-meltdown/
> > http://www.brendangregg.com/blog/2018-02-09/kpti-kaiser-meltdown-performance.html
> >
> > > 3. A uniform OSD cache size of course will be a nightmare when having
> > > non-uniform HW, either with RAM or number of OSDs.
> >
> > Non-uniform hardware is a big reason that pinning dedicated memory to
> > specific cores/sockets is really nice vs. relying on potentially remote
> > memory pagecache reads. A long time ago I was responsible for validating
> > the performance of CXFS on an SGI Altix UV distributed shared-memory
> > supercomputer. As it turns out, we could achieve about 22GB/s writes
> > with XFS (a huge number at the time), but CXFS was 5-10x slower. A big
> > part of that turned out to be the kernel distributing page cache across
> > the NUMAlink 5 interconnects to remote memory. The problem can
> > potentially happen on any NUMA system to varying degrees.
> >
> I could regale you with even more ancient stories from when I was working
> with DEC VMSclusters. ^o^ But that's not here and now.
>
> As for pinning, I'm doing this mostly on compute nodes, basically to keep
> a good number of cores free on the NUMA node(s) that handle HW interrupts;
> the kernel tends to do a decent enough job from there on.
>
> And while you definitely have a point, modern designs like Epyc pretty
> much negate NUMA issues.
> Never mind that performance/latency-conscious Ceph users already do stick
> to single-NUMA-node CPUs for SSD/NVMe servers if possible.
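On the pinning point: before confining OSDs it is worth checking which NUMA
node the NIC and the flash devices actually hang off. Something along these
lines has worked for me (eth0/nvme0n1/osd.12 are placeholders and the sysfs
paths can differ per kernel, so treat it as a sketch):

---
numactl --hardware                              # list NUMA nodes and their memory
cat /sys/class/net/eth0/device/numa_node        # node the NIC is attached to
cat /sys/block/nvme0n1/device/device/numa_node  # same for an NVMe drive
# run an OSD confined to node 0 -- in practice you'd rather use a systemd
# drop-in with CPUAffinity= than start it by hand like this:
numactl --cpunodebind=0 --membind=0 ceph-osd -f --cluster ceph --id 12
---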
>
> > Personally I have two primary issues with bluestore's memory
> > configuration right now:
> >
> > 1) It's too complicated for users to figure out where to assign memory
> > and in what ratios. I'm attempting to improve this by making bluestore's
> > cache autotune, so the user just gives it a number and bluestore will
> > try to work out where it should assign memory.
> >
> This would be very helpful (as in a ratio of # of OSDs to total RAM).
> Otherwise you wind up with non-uniformity issues again.
>
> And especially _if_ it can also drop caches voluntarily in low-memory
> situations.
>
> > 2) In the case where a subset of OSDs are really hot (maybe RGW bucket
> > accesses) you might want some OSDs to get more memory than others. I
> > think we can tackle this better if we migrate to a one-osd-per-node
> > sharded architecture (likely based on seastar), though we'll still need
> > to be very aware of remote memory. Given that this is fairly difficult
> > to do well, we're probably going to be better off just dedicating a
> > static pool to each shard initially.
> >
> I'm wondering if and how such sharding can be realized while still keeping
> the OSD (the storage device, really) the smallest failure domain and not
> just the host.
> Because I'm betting that some people have specialty use cases depending on
> that (not me, for a change).
>
> Christian
>
> > Mark
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Rakuten Communications
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
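Regarding the "# of OSDs to total RAM" ratio Christian asks for above: until
that autotuning lands, a back-of-the-envelope calculation per node is about
all there is. A rough sketch (the 4GB reserve is an arbitrary placeholder for
the OS and whatever else runs on the box):

---
RAM_MB=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)
OSDS=$(ls /var/lib/ceph/osd | wc -l)         # OSD directories on this node
RESERVE_MB=4096                              # OS, other daemons, headroom
echo $(( (RAM_MB - RESERVE_MB) / OSDS ))     # rough per-OSD cache budget in MB
---

Whatever number comes out, leaving extra headroom for recovery seems prudent,
since per-OSD memory use spikes well above the cache target in that situation.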