On Mon, May 02, 2016 at 10:53:25AM -0700, Dan Williams wrote:
> On Mon, May 2, 2016 at 8:18 AM, Jeff Moyer <jmo...@redhat.com> wrote:
> > Dave Chinner <da...@fromorbit.com> writes:
> [..]
> >> We need some form of redundancy and correction in the PMEM stack to
> >> prevent single sector errors from taking down services until an
> >> administrator can correct the problem. I'm trying to understand
> >> where this is supposed to fit into the picture - at this point I
> >> really don't think userspace applications are going to be able to do
> >> this reliably....
> >
> > Not all storage is configured into a RAID volume, and in some instances,
> > the application is better positioned to recover the data (gluster/ceph,
> > for example). It really comes down to whether applications or libraries
> > will want to implement redundancy themselves in order to get a bump in
> > performance by not going through the kernel. And I think I know what
> > your opinion is on that front. :-)
> >
> > Speaking of which, did you see the numbers Dan shared at LSF on how much
> > overhead there is in calling into the kernel for syncing? Dan, can/did
> > you publish that spreadsheet somewhere?
>
> Here it is:
>
> https://docs.google.com/spreadsheets/d/1pwr9psy6vtB9DOsc2bUdXevJRz5Guf6laZ4DaZlkhoo/edit?usp=sharing
>
> On the "Filtered" tab I have some of the comparisons where:

Those numbers are really wacky - the inconsistent decimal place
representation makes it really, really hard to read the differences
in orders of magnitude, too.

Let's take the first numbers - noop, 64 byte ops are:

	threads		ops/s
	   1		 90M
	   2		310M
	   4		 65M
	   8		175M
	  16		426M

Why aren't these linear? And if the test is not running in an
environment where these are controlled and linear, how valid are the
rest of the tests and hence the comparison.

> noop => don't call msync and don't flush caches in userspace
>
> persist => cache flushing only in userspace and only on individual cache lines

So these look a lot more linear than the no-op behaviour, so I'll
just ignore the no-op results for now.

> persist_4k => cache flushing only in userspace, but flushing is
> performed in 4K aligned units

Urg, your "vs persist" percentages are all wrong. You can't have a
"-1000%" difference, you have "persist 4k" running at 10% of the
speed of "persist".

So, with that in mind, the "persist_4k" speed is:

	        ops/s           single thread
	Size    vs "persist"    4k flush rate
	  64       10%              834k
	 128       13%              849k
	 256       15%              410k (one off variation?)
	 512       20%              860k
	1024       25%              850k
	2048       50%              840k
	4096       none             836k
	8192       none             410k

What we see here is that the CPU(s) can flush the 4k pages at a rate
of roughly 850,000 flushes/s, whilst the 64 byte flush rate is
around 8.8M flushes/s. This is clearly demonstrated in the numbers -
as the dirty object size approaches the cache flush granularity, the
speed approaches single cacheline flush granularity speed.

Comparing 4k vs 64b flushes, we have 63 clean cache line flushes
taking roughly the same time as 9 dirty cache line flushes. Nice
numbers - that means a clean cache line flush has ~14% of the
overhead of dirty cache line flush. Seems rather high - it's tens of
CPU cycles to determine that the flush is a no-op for that
cacheline.

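To make the clean-vs-dirty arithmetic explicit, here is a minimal
sketch in Python. It assumes 64 byte cachelines and 4k pages, and it
plugs in the rounded rates quoted above (roughly 850k 4k flushes/s and
8.8M 64 byte flushes/s). The function and variable names are purely
illustrative and do not come from the spreadsheet.

```python
# Back-of-envelope estimate of clean vs dirty cacheline flush cost.
# Assumptions (not taken from the spreadsheet itself): 64 byte
# cachelines, 4k pages, and the rounded rates quoted in the text.

CACHELINE_BYTES = 64
PAGE_BYTES = 4096


def clean_flush_cost_ratio(dirty_line_rate, page_flush_rate):
    """Cost of flushing one clean cacheline, as a fraction of the cost
    of flushing one dirty cacheline.

    Model: a 4k flush of a page holding a single dirty 64 byte object
    costs one dirty line flush plus (lines_per_page - 1) clean ones.
    """
    lines_per_page = PAGE_BYTES // CACHELINE_BYTES   # 64 lines per page
    t_dirty = 1.0 / dirty_line_rate                  # seconds per dirty line flush
    t_page = 1.0 / page_flush_rate                   # seconds per 4k flush
    t_clean = (t_page - t_dirty) / (lines_per_page - 1)
    return t_clean / t_dirty


ratio = clean_flush_cost_ratio(dirty_line_rate=8.8e6, page_flush_rate=850e3)
print(f"clean flush cost is about {ratio:.1%} of a dirty flush")
```

With these inputs the ratio lands between 14% and 15%, in line with
the ~14% estimate above. The model ignores any fixed per-call cost, so
treat it as a sanity check on the reasoning rather than a measurement.
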
Fixing this seems like a hardware optimisation issue to me, but I
still have to question how many applications are going to have such
fine-grained random synchronous memory writes that this actually
matters in practice? If we are doing such small writes across
multiple different 4k pages, then TLB overhead for all the page
faults is going to be as much of an issue as 4k cache flushes...

> msync => same granularity flushing as the 'persist' case, but the
> kernel internally promotes this to a 4K sized / aligned flush

So you're calling msync for every modification that is made? What
application needs to do that?

Anyway, page flush rates paint an interesting picture:

	        single thread     versus
	Size    4k flush rate     persist_4k
	  64        655k             78%
	 128        655k             81%
	 256        670k            163% (* persist 4k number low)
	 512        681k             79%
	1024        666k             78%
	2048        650k             77%
	4096        652k             78%
	8192        390k             95%

msync adds relatively little overhead (~20% extra overhead) compared
to the performance loss from the 4k flush granularity change. And
given this appears to be a worst case test scenario (and I'm sure
msync could be improved), I don't think this demonstrates a problem
with using msync.

IMO, these numbers don't support the argument that the *msync model*
for data integrity for DAX is flawed, unworkable, or too slow. What
I see is a performance problem resulting from the overhead of
flushing clean cachelines. i.e. there's data here that supports the
argument for reducing the overhead of flushing clean cachelines in
the hardware and/or better tracking of dirty cachelines within the
kernel, but not data that says the msync() based data integrity
model is the source of the problem.

i.e. separate the programming model from the performance issue, and
we can see that the performance problem is not caused by the
programming model - it's caused by the kernel implementation of the
model.

> The takeaway is that msync() is 9-10x slower than userspace cache management.

An alternative viewpoint: that flushing clean cachelines is
extremely expensive on Intel CPUs. ;)

i.e. Same numbers, different analysis from a different PoV, and that
gives a *completely different conclusion*.

Think about it for the moment. The hardware inefficiency being
demonstrated could be fixed/optimised in the next hardware product
cycle(s) and so will eventually go away. OTOH, we'll be stuck with
whatever programming model we come up with for the next 30-40 years,
and we'll never be able to fix flaws in it because applications will
be depending on them. Do we really want to be stuck with a pmem
model that is designed around the flaws and deficiencies of ~1st
generation hardware?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com