On 6/3/20 1:56 PM, Jan Kara wrote:
On Tue 02-06-20 17:59:08, Williams, Dan J wrote:
[ forgive formatting, a series of unfortunate events has me using Outlook for 
the moment ]

From: Jan Kara <j...@suse.cz>
These flags are device properties that affect the kernel and
userspace's handling of persistence.


That will not handle the scenario with multiple applications using
the same fsdax mount point where one is updated to use the new
instruction and the other is not.

Right, it needs to be a global setting / flag day to switch from one
regime to another. Per-process control is a recipe for disaster.

First I'd like to mention that hopefully the concern is mostly theoretical since
as Aneesh wrote above, real persistent memory never shipped for PPC and
so there are very few apps (if any) using the old way to ensure cache
flushing.

But I'd like to understand why you think per-process control is a recipe for
disaster. From my POV the sysfs interface you propose is actually difficult
to use in practice. As a distributor, you have a hard time picking the
default: the safe option is going to confuse users because MAP_SYNC fails,
while the unsafe option keeps everyone happy until someone loses data because
some ancient application uses the wrong instructions to persist data. It is a
poor experience for users either way. And when the distro defaults to the
"safe option", the burden is on the sysadmin to toggle the switch, but how is
he supposed to decide when that is safe? First he has to understand what the
problem actually is, then he has to audit all the applications using pmem to
check whether they use the new instructions, which is IMO a lot of effort if
you have a couple of applications and practically infeasible if you have more.
So IMO the burden should be *on the application* to declare that it is aware
of the new instructions to flush pmem on the platform, and the kernel should
trust only such applications with MAP_SYNC mappings.

The "disaster" in my mind is this need to globally change the ABI for
persistence semantics for all of Linux because one CPU wants a do-over.
What does a generic "MAP_SYNC_ENABLE" knob even mean to the existing
deployed base of persistent memory applications? Yes, sysfs is awkward,
but it's trying to provide some relief without imposing unexplainable
semantics on everyone else. I think a comprehensive (overengineered)
solution would involve not introducing another "I know what I'm doing"
flag to the interface, but maybe requiring applications to call a pmem
sync API in something like a vsyscall. Or, also overengineered, some
binary translation / interpretation to actively detect and kill
applications that deploy the old instructions. Something horrid like: on
the first write fault to a MAP_SYNC mapping, try to look ahead in the binary
for the correct sync sequence and kill the application otherwise. That would at
least provide some enforcement and safety without requiring other
architectures to consider what MAP_SYNC_ENABLE means to them.

Thanks for the explanation. So I absolutely agree that other architectures (and
even older versions of POWER architecture) must not be influenced by the new
tunable. That's why I wrote in my reply to Aneesh that I'd be for checking
during mmap(2) with MAP_SYNC, whether we are in a situation where new PPC
flush instructions are required and *only in that case* decide based on the
prctl value whether MAP_SYNC should be allowed or not.
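
The check described above can be sketched as a small pure function. This is only an illustration of the decision, not code from the patch series: `struct map_sync_request`, its fields, and `map_sync_allowed()` are hypothetical names, and the sketch assumes the platform requirement and the per-process opt-in are each available as a simple boolean at mmap(2) time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical inputs to the check; these names do not exist in the kernel. */
struct map_sync_request {
    /* True only where the new PPC flush instructions are required. */
    bool platform_needs_new_flush;
    /* True if the process declared (via a prctl-style call) that it uses them. */
    bool task_opted_in;
};

/* Returns true if a MAP_SYNC mapping should be allowed for this request. */
static bool map_sync_allowed(const struct map_sync_request *req)
{
    assert(req != NULL);

    /* Other architectures and older POWER: behaviour is unchanged. */
    if (!req->platform_needs_new_flush)
        return true;

    /* Only in the new-instruction case does the per-process value matter. */
    return req->task_opted_in;
}
```

The point of this shape is that the opt-in is consulted in exactly one case, so platforms that do not need the new instructions never see a behaviour change.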


The v2 version of the patch series does that:

https://lore.kernel.org/linuxppc-dev/20200602074909.36738-1-aneesh.ku...@linux.ibm.com/

Whether this solution is overengineering or not depends on how likely you
think it is that there will be applications trying to use the old flush
instructions with MAP_SYNC on POWER10 platforms...


Now, considering that ppc64 never had a real persistent memory device available for end users to try, and that the new instructions are only needed on newer hardware, can we assume we have enough time to get userspace to use the new instructions?

As a safety net, we can keep the dax device-specific sysfs control. But realistically, by the time newer hardware is released, can we get the distributions updated to flip to CONFIG_ARCH_MAP_SYNC_DISABLE=n?

With this:
1) vPMEM continues to work. Since it is a volatile region, it doesn't need any flush instructions.

2) We get pmdk and other user applications updated to use new instructions and make sure updated packages are made available to all distributions

3) On newer hardware, the device will appear with a new compat string. Hence older distributions won't initialize pmem on newer hardware.

4) If we have a newer kernel with an older distro, we use the per namespace sysfs knob that prevents the usage of MAP_SYNC.

5) After a year or so, we set CONFIG_ARCH_MAP_SYNC_DISABLE=n
on ppc64 once we are confident that everybody is using the new flush instructions.
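
A rough sketch of how steps 3) to 5) could combine into one decision follows. It is a model for discussion only: the struct, its fields, and `evaluate()` are hypothetical names, and it assumes that a value written to the per-namespace sysfs knob takes precedence over the build-time default, which the plan above does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum pmem_outcome { PMEM_NOT_INITIALIZED, MAP_SYNC_REFUSED, MAP_SYNC_ALLOWED };

/* Hypothetical description of one namespace; not a kernel structure. */
struct pmem_setup {
    bool device_uses_new_compat;  /* newer hardware, new compat string (step 3) */
    bool kernel_knows_new_compat; /* false for an older distribution kernel */
    bool config_default_disable;  /* stands in for CONFIG_ARCH_MAP_SYNC_DISABLE=y */
    bool sysfs_knob_written;      /* admin wrote the per-namespace knob (step 4) */
    bool sysfs_disable;           /* value written; used only if the knob was written */
};

static enum pmem_outcome evaluate(const struct pmem_setup *s)
{
    assert(s != NULL);

    /* Step 3: an older kernel does not match the new compat string at all. */
    if (s->device_uses_new_compat && !s->kernel_knows_new_compat)
        return PMEM_NOT_INITIALIZED;

    /* Steps 4 and 5: assumed precedence, sysfs value over the config default. */
    bool disable = s->sysfs_knob_written ? s->sysfs_disable
                                         : s->config_default_disable;
    return disable ? MAP_SYNC_REFUSED : MAP_SYNC_ALLOWED;
}
```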

-aneesh
