<snip>

> >>
> >> +TO: @Honnappa, we need input from ARM
> >>
> >>> From: Konstantin Ananyev [mailto:konstantin.anan...@huawei.com]
> >>> Sent: Friday, 29 July 2022 21.49
> >>>>
> >>>>> From: Konstantin Ananyev [mailto:konstantin.anan...@huawei.com]
> >>>>> Sent: Friday, 29 July 2022 14.14
> >>>>>
> >>>>>
> >>>>> Sorry, missed that part.
> >>>>>
> >>>>>>
> >>>>>>> Another question - who will do 'sfence' after the copying?
> >>>>>>> Would it be inside memcpy_nt (seems quite costly), or would it
> >>>>>>> be another API function for that: memcpy_nt_flush() or so?
> >>>>>>
> >>>>>> Outside. Only the developer knows when it is required, so it wouldn't
> >>>>>> make any sense to add the cost inside memcpy_nt().
> >>>>>>
> >>>>>> I don't think we should add a flush function; it would just be
> >>>>>> another name for an already existing function. Referring to the required
> >>>>>> operation in the memcpy_nt() function documentation should suffice.
> >>>>>>
> >>>>>
> >>>>> Ok, but again wouldn't it be arch specific?
> >>>>> AFAIK for x86 it needs to boil down to sfence, for other architectures
> >>>>> - I don't know.
> >>>>> If you think there already is some generic one (rte_wmb?) that would
> >>>>> always produce correct instructions - sure let's use it.
> >>>>>
> >>>>
> >>>> DPDK has generic functions to wrap architecture specific stuff like
> >>>> memory barriers.
> >>>>
> >>>> Because they are non-temporal stores, I suspect that rte_mb() is
> >>>> required before reading the data from the location it was copied to.
> >>>> Ensuring that STORE operations are ordered (rte_wmb) might not
> >>>> suffice. However, I'm not a CPU expert, so I will seek advice from
> >>>> more qualified people in the community on this.
> >>>
> >>> I think for IA sfence is enough, see citation below, for other
> >>> architectures - no idea.
> >>> What I am trying to say - it needs to be the *same* function on all
> >>> archs we support.
> >>
> >> Now I get it: rte_wmb() might be appropriate on x86, but if any other
> >> architecture requires something else, we should add a new common
> >> function for flushing, e.g. rte_memcpy_nt_flush().
> >>
> >>>
> >>> IA SW optimization manual:
> >>> 9.4.2 Streaming Store Usage Models
> >>> The two primary usage domains for streaming store are coherent
> >>> requests and non-coherent requests.
> >>> 9.4.2.1 Coherent Requests
> >>> Coherent requests are normal loads and stores to system memory,
> >>> which may also hit cache lines present in another processor in a
> >>> multiprocessor environment. With coherent requests, a streaming
> >>> store can be used in the same way as a regular store that has been
> >>> mapped with a WC memory type (PAT or MTRR). An SFENCE instruction
> >>> must be used within a producer-consumer usage model in order to
> >>> ensure coherency and visibility of data between processors.
> >>> Within a single-processor system, the CPU can also re-read the same
> >>> memory location and be assured of coherence (that is, a single,
> >>> consistent view of this memory location).
> >>> The same is true for a multiprocessor
> >>> (MP) system, assuming an accepted MP software producer-consumer
> >>> synchronization policy is employed.
> >>>
> >>
> >> With this reference, I am convinced that you are right about the
> >> SFENCE. This puts a checkmark on this item on my TODO list for the
> >> patch. Thank you, Konstantin!
> >>
> >> Any ARM CPU experts on the mailing list seeing this, not on vacation?
> >> @Honnappa, I'm looking at you. :-)
> >>
> >> Summing up, the question is:
> >>
> >> After a bunch of *non-temporal* stores (STNP instruction) on ARM
> >> architecture, does calling rte_wmb() suffice to ensure the data is
> >> visible across the system?
> >
> > Apologies for the late response, the docs did not have enough information.
> > The internal dialogue is still going on, but I have some information now.
> > There is some information in the Armv8 programmer's guide [1], though it is not
> > complete.
> >
> > In summary, rte_wmb()/rte_mb() would not suffice; we need new APIs.
> >
> > From my perspective, I see several scenarios:
> > 1) Need for ordering before the memcpy_nt. Here there are several cases:
> >    a. LD – LDNP/STNP – DMB NSHLD
> >    b. ST – LDNP/STNP – DMB NSH
> > 2) Need for ordering after the memcpy. Again, we have similar use cases:
> >    a. LDNP/STNP – LD – DMB NSH
> >    b. LDNP/STNP – ST – DMB NSH
> >
> > The 'ST - STNP' and 'STNP - ST' cases do not apply here, but it would be
> > good to add an API for completeness.
> >
> > So, maybe we could have rte_[r|w]mb_nt() APIs.
>
> Is rte_smp_rmb()/rte_smp_wmb() also not enough on ARM?

No, they are not, as they fall under the inner shareable domain, whereas
non-temporal loads/stores fall under the non-shareable domain.
> >
> > [1] https://developer.arm.com/documentation/den0024/a/The-A64-instruction-set/Memory-access-instructions/Non-temporal-load-and-store-pair
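
To make the rte_[r|w]mb_nt() idea above concrete, here is a rough sketch of
how such per-architecture barriers might be wrapped behind one common name.
Everything prefixed sketch_ is hypothetical and not existing DPDK API. The
barrier choices are taken at face value from this thread (SFENCE on x86,
DMB NSH / DMB NSHLD on ARM) and may change once the ARM discussion settles.
A plain memcpy() stands in for the proposed memcpy_nt(), and the
__atomic builtins assume GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical barrier for cases 2a/2b in the list above: order a preceding
 * non-temporal copy against later loads and stores.
 * Assumption: SFENCE on x86 and DMB NSH on ARM, as suggested in this thread.
 */
static inline void sketch_mb_nt_after(void)
{
#if defined(__aarch64__)
	__asm__ volatile("dmb nsh" ::: "memory");
#elif defined(__x86_64__)
	__asm__ volatile("sfence" ::: "memory");
#else
	/* Conservative placeholder for architectures not discussed here. */
	__atomic_thread_fence(__ATOMIC_SEQ_CST);
#endif
}

/*
 * Hypothetical barrier for case 1a: order earlier loads before the
 * non-temporal copy. Assumption: DMB NSHLD on ARM; a full fence is used
 * elsewhere as a conservative placeholder.
 */
static inline void sketch_rmb_nt_before(void)
{
#if defined(__aarch64__)
	__asm__ volatile("dmb nshld" ::: "memory");
#else
	__atomic_thread_fence(__ATOMIC_SEQ_CST);
#endif
}

/* Illustrative producer/consumer slot; the layout is made up for the sketch. */
struct sketch_slot {
	uint8_t data[64];
	uint32_t ready;
};

/* Producer side: copy the payload, issue the barrier, then publish the flag. */
static void sketch_publish(struct sketch_slot *slot, const uint8_t *src,
			   size_t len)
{
	if (len > sizeof(slot->data))
		len = sizeof(slot->data);
	memcpy(slot->data, src, len);	/* stand-in for memcpy_nt() */
	sketch_mb_nt_after();		/* caller-side barrier, outside the copy */
	__atomic_store_n(&slot->ready, 1, __ATOMIC_RELEASE);
}
```

This keeps the barrier outside the copy function, as argued earlier in the
thread, so only callers that actually hand the data to another core pay for
it.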