On Tue, Feb 11, 2020 at 6:57 AM Aneesh Kumar K.V <aneesh.ku...@linux.ibm.com> wrote: > > On 2/10/20 11:48 PM, Dan Williams wrote: > > On Mon, Feb 10, 2020 at 6:20 AM Aneesh Kumar K.V > > <aneesh.ku...@linux.ibm.com> wrote: > >> > >> Dan Williams <dan.j.willi...@intel.com> writes: > >> > >>> On Tue, Feb 4, 2020 at 9:21 PM Aneesh Kumar K.V > >>> <aneesh.ku...@linux.ibm.com> wrote: > >>>> > >>>> Currently, kernel shows the below values > >>>> "persistence_domain":"cpu_cache" > >>>> "persistence_domain":"memory_controller" > >>>> "persistence_domain":"unknown" > >>>> > >>>> "cpu_cache" indicates no extra instructions is needed to ensure the > >>>> persistence > >>>> of data in the pmem media on power failure. > >>>> > >>>> "memory_controller" indicates platform provided instructions need to be > >>>> issued > >>> > >>> No, it does not. The only requirement implied by "memory_controller" > >>> is global visibility outside the cpu cache. If there are special > >>> instructions beyond that then it isn't persistent memory, at least not > >>> pmem that is safe for dax. virtio-pmem is an example of pmem-like > >>> memory that is not enabled for userspace flushing (MAP_SYNC disabled). > >>> > >> > >> Can you explain this more? The way I was expecting the application to > >> interpret the value was, a regular store instruction doesn't guarantee > >> persistence if you find the "memory_controller" value for > >> persistence_domain. Instead, we need to make sure we flush data to the > >> controller at which point the platform will take care of the persistence in > >> case of power loss. How we flush data to the controller will also be > >> defined by the platform. > > > > If the platform requires any flush mechanism outside of the base cpu > > ISA of cache flushes and memory barriers then MAP_SYNC needs to be > > explicitly disabled to force the application to call fsync()/msync(). > > Then those platform specific mechanisms need to be triggered through a > > platform-aware driver. > > > > > Agreed. I was thinking we mark the persistence_domain: "Unknown" in that > case. virtio-pmem mark it that way.
I would say the driver requirement case is persistence_domain "None", not "Unknown". I.e. the platform provides no mechanism to flush data to the persistence domain on power loss, it's back to typical storage semantics. > > > >> > >> > >>>> as per documented sequence to make sure data get flushed so that it is > >>>> guaranteed to be on pmem media in case of system power loss. > >>>> > >>>> Based on the above use memory_controller for non volatile regions on > >>>> ppc64. > >>>> > >>>> Signed-off-by: Aneesh Kumar K.V <aneesh.ku...@linux.ibm.com> > >>>> --- > >>>> arch/powerpc/platforms/pseries/papr_scm.c | 7 ++++++- > >>>> drivers/nvdimm/of_pmem.c | 4 +++- > >>>> include/linux/libnvdimm.h | 1 - > >>>> 3 files changed, 9 insertions(+), 3 deletions(-) > >>>> > >>>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c > >>>> b/arch/powerpc/platforms/pseries/papr_scm.c > >>>> index 7525635a8536..ffcd0d7a867c 100644 > >>>> --- a/arch/powerpc/platforms/pseries/papr_scm.c > >>>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c > >>>> @@ -359,8 +359,13 @@ static int papr_scm_nvdimm_init(struct > >>>> papr_scm_priv *p) > >>>> > >>>> if (p->is_volatile) > >>>> p->region = nvdimm_volatile_region_create(p->bus, > >>>> &ndr_desc); > >>>> - else > >>>> + else { > >>>> + /* > >>>> + * We need to flush things correctly to guarantee > >>>> persistance > >>>> + */ > >>> > >>> There are never guarantees. If you're going to comment what does > >>> software need to flush, and how? > >> > >> Can you explain why you say there are never guarantees? If you follow the > >> platform > >> recommended instruction sequence to flush data, we can be sure of data > >> persistence in the pmem media. > > > > Because storage can always fail. You can reduce risk, but never > > eliminate it. This is similar to SSDs that use latent capacitance to > > flush their write caches on driver power loss. Even if the application > > successfully flushes its writes to buffers that are protected by that > > capacitance that power source can still (and in practice does) fail. > > > > ok guarantee is not the right term there. Can we say > > /* We need to flush tings correctly to ensure persistence */ The definition of the "memory_controller" persistence domain is: "the platform takes care to flush writes to media once they are globally visible outside the cache". > > > What I was trying to understand/clarify was the detail an application > can infer looking at the value of persistence_domain ? > > Do you agree that below can be inferred from the "memory_controller" > value of persistence_domain > > 1) Application needs to use cache flush instructions and that ensures > data is persistent across power failure. > > > Or are you suggesting that application should not infer any of those > details looking at persistence_domain value? If so what is the purpose > of exporting that attribute? The way the patch was worded I thought it was referring to an explicit mechanism outside cpu cache flushes, i.e. a mechanism that required a driver call. > > > >> > >> > >>> > >>>> + set_bit(ND_REGION_PERSIST_MEMCTRL, &ndr_desc.flags); > >>>> p->region = nvdimm_pmem_region_create(p->bus, > >>>> &ndr_desc); > >>>> + } > >>>> if (!p->region) { > >>>> dev_err(dev, "Error registering region %pR from %pOF\n", > >>>> ndr_desc.res, p->dn); > >>>> diff --git a/drivers/nvdimm/of_pmem.c b/drivers/nvdimm/of_pmem.c > >>>> index 8224d1431ea9..6826a274a1f1 100644 > >>>> --- a/drivers/nvdimm/of_pmem.c > >>>> +++ b/drivers/nvdimm/of_pmem.c > >>>> @@ -62,8 +62,10 @@ static int of_pmem_region_probe(struct > >>>> platform_device *pdev) > >>>> > >>>> if (is_volatile) > >>>> region = nvdimm_volatile_region_create(bus, > >>>> &ndr_desc); > >>>> - else > >>>> + else { > >>>> + set_bit(ND_REGION_PERSIST_MEMCTRL, > >>>> &ndr_desc.flags); > >>>> region = nvdimm_pmem_region_create(bus, > >>>> &ndr_desc); > >>>> + } > >>>> > >>>> if (!region) > >>>> dev_warn(&pdev->dev, "Unable to register region > >>>> %pR from %pOF\n", > >>>> diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h > >>>> index 0f366706b0aa..771d888a5ed7 100644 > >>>> --- a/include/linux/libnvdimm.h > >>>> +++ b/include/linux/libnvdimm.h > >>>> @@ -54,7 +54,6 @@ enum { > >>>> /* > >>>> * Platform provides mechanisms to automatically flush > >>>> outstanding > >>>> * write data from memory controler to pmem on system power > >>>> loss. > >>>> - * (ADR) > >>> > >>> I'd rather not delete critical terminology for a developer / platform > >>> owner to be able to consult documentation, or their vendor. Can you > >>> instead add the PowerPC equivalent term for this capability? I.e. list > >>> (x86: ADR PowerPC: foo ...). > >> > >> Power ISA doesn't clearly call out what mechanism will be used to ensure > >> that a load following power loss will return the previously flushed > >> data. Hence there is no description of details like Asynchronous DRAM > >> Refresh. Only details specified is with respect to flush sequence that > >> ensures > >> that a load following power loss will return the value stored. > > > > What is this "flush sequence"? > > > > cpu cache flush instructions "dcbf; hwsync" Looks good, as long as the flush mechanism is defined by the cpu ISA then MAP_SYNC is viable.