On Wed, 2019-01-23 at 16:29 +0000, Ola Liljedahl wrote:
> On Wed, 2019-01-23 at 16:02 +0000, Jerin Jacob Kollanukkaran wrote:
> > On Tue, 2019-01-22 at 09:27 +0000, Ola Liljedahl wrote:
> > > On Fri, 2019-01-18 at 09:23 -0600, Gage Eads wrote:
> > > > v3:
> > > >  - Avoid the ABI break by putting 64-bit head and tail values in
> > > >    the same cacheline as struct rte_ring's prod and cons members.
> > > >  - Don't attempt to compile rte_atomic128_cmpset without
> > > >    ALLOW_EXPERIMENTAL_API, as this would break a large number of
> > > >    libraries.
> > > >  - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case
> > > >    someone tries to use RING_F_NB without the
> > > >    ALLOW_EXPERIMENTAL_API flag.
> > > >  - Update the ring mempool to use experimental APIs
> > > >  - Clarify that RING_F_NB is only limited to x86_64 currently;
> > > >    ARMv8.1-A builds can eventually support it with the CASP
> > > >    instruction.
> > > ARMv8.0 should be able to implement a 128-bit atomic compare
> > > exchange operation using LDXP/STXP.
> > Just wondering what would the performance difference be between
> > CASP and LDXP/STXP on an LSE-supported machine?
> I think that is up to the microarchitecture. But one of the ideas behind

Yes. This is where it gets a little messy to have generic code, since a
lot of this behaviour depends on the microarchitecture or is
IMPLEMENTATION DEFINED per the Arm spec. At least, I am dealing with three
different microarchitectures now, and they differ a lot. Including the Arm
cores and Qualcomm cores, there could be around six or more different
microarchitectures.

> introducing the LSE atomics was that they should be "better" than the
> equivalent code using exclusives. I think non-conditional LDxxx and
> STxxx atomics could be better than using exclusives while conditional
> atomics (CAS, CASP) might not be so different (the reason has to do
> with cache coherency, a core can speculatively snoop-unique the cache
> line which is targeted by an atomic instruction but to what extent that
> provides a benefit could depend on whether the atomic actually performs
> a store or not).
>
> > I think we cannot detect the presence of LSE support at compile
> > time. Right?
> Unfortunately, AFAIK GCC doesn't notify the source code that it is
> targeting v8.1+ with LSE support. If there were intrinsics for (certain)
> LSE instructions (e.g. those not generated by the compiler, e.g. STxxx
> and CASP), we could use some corresponding preprocessor define to detect
> the presence of such intrinsics (they exist for other intrinsics, e.g.
> __ARM_FEATURE_QRDMX for SQRDMLAH/SQRDMLSH instructions and corresponding
> intrinsics).
>
> I have tried to interest the Arm GCC developers in this but have not yet
> succeeded. Perhaps if we have more use cases where atomics intrinsics
> would be useful, we could convince them to add such intrinsics to the
> ACLE (ARM C Language Extensions). But we will never get intrinsics for
> exclusives, they are deemed unsafe for explicit use from C. Instead we
> need to provide inline assembler that contains the complete exclusives
> sequence.
> But in practice it seems to work when using inline assembler for LDXR
> and STXR, as I do in the lockfree code linked below.
>
> > The dynamic one will be costly like,
> Do you think so? Shouldn't this branch be perfectly predictable? Once

It is not just branch prediction, right? There is also the corresponding
load and the need for more I-cache, etc. I think that for the generic
build we can either have run-time detection or stick with LDXR/STXR. We
can provide a compile-time option for CASP-based code, so that a given
microarchitecture can use it if CASP is well optimized there. (This can
easily be expressed in the meson build using the MIDR value.)

> in a while it will fall out of the branch history table but doesn't that
> mean the application hasn't been executing this code for some time so
> not really performance critical?
>
> > if (hwcaps & HWCAP_ATOMICS) {
> >     casp
> > } else {
> >     ldxp
> >     stxp
> > }
> >
> > > From an ARM perspective, I want all atomic operations to take
> > > memory ordering arguments (e.g. acquire, release). Not all usages
> > > of e.g.
> > +1
> >
> > > atomic compare exchange require sequential consistency (which I
> > > think is what the x86 cmpxchg instruction provides). DPDK functions
> > > should not be modelled after x86 behaviour.
> > >
> > > Lock-free 128-bit atomics implementations for ARM/AArch64 and
> > > x86-64 are available here:
> > > https://github.com/ARM-software/progress64/blob/master/src/lockfree.h
> >
> --
> Ola Liljedahl, Networking System Architect, Arm
> Phone +46706866373, Skype ola.liljedahl
>
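
To make the two options discussed above concrete, here is a minimal sketch
of how the selection could look if it is done once at init time rather
than on every operation. It is only an illustration under stated
assumptions, not a proposal for the actual patch: the names cas128_impl_t,
cpu_has_lse(), cas128_select() and the RING_FORCE_CASP build macro are all
hypothetical, and the real CASP and LDXP/STXP sequences are left out. The
only platform facts it relies on are getauxval(AT_HWCAP) and the
HWCAP_ATOMICS bit on arm64 Linux; on any other target it simply reports
that LSE is absent.

```c
#include <assert.h>
#include <stdbool.h>

#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>              /* getauxval(), AT_HWCAP */
#ifndef HWCAP_ATOMICS
#define HWCAP_ATOMICS (1UL << 8)   /* bit value from the arm64 Linux hwcap.h */
#endif
#endif

/* Hypothetical tag for the 128-bit compare-exchange flavour to use. */
typedef enum {
    CAS128_EXCLUSIVES,  /* LDXP/STXP loop, available on ARMv8.0 */
    CAS128_LSE          /* CASP, needs the ARMv8.1 LSE atomics */
} cas128_impl_t;

/* Run-time check: does the kernel report LSE atomics for this CPU? */
static bool cpu_has_lse(void)
{
#if defined(__aarch64__) && defined(__linux__)
    return (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;
#else
    return false;       /* not arm64 Linux: treat LSE as absent */
#endif
}

/*
 * Choose the implementation once. RING_FORCE_CASP is a hypothetical
 * compile-time option (it could be set from the build system for a
 * known microarchitecture); without it the generic build falls back
 * to the run-time capability passed in by the caller.
 */
static cas128_impl_t cas128_select(bool has_lse)
{
#ifdef RING_FORCE_CASP
    (void)has_lse;
    return CAS128_LSE;
#else
    return has_lse ? CAS128_LSE : CAS128_EXCLUSIVES;
#endif
}

/* Human-readable name, e.g. for a startup log line. */
static const char *cas128_impl_name(cas128_impl_t impl)
{
    switch (impl) {
    case CAS128_LSE:
        return "CASP (LSE)";
    case CAS128_EXCLUSIVES:
        return "LDXP/STXP (exclusives)";
    }
    assert(0 && "unknown cas128_impl_t value");
    return "unknown";
}
```

A caller would do something like cas128_select(cpu_has_lse()) during
initialisation and cache the result. Whether a cached flag plus a branch
(or an indirect call) is cheap enough on the fast path is exactly the open
question in this thread, so the sketch takes no position on that.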