> -----Original Message----- > From: Neil Horman [mailto:nhorman at tuxdriver.com] > Sent: Tuesday, August 05, 2014 7:21 PM > To: Ananyev, Konstantin > Cc: dev at dpdk.org; Thomas Monjalon; Richardson, Bruce > Subject: Re: [PATCH] acl: If build does not support sse4.2, emulate missing > instructions with C code > > On Tue, Aug 05, 2014 at 03:26:27PM +0000, Ananyev, Konstantin wrote: > > Hi Neil, > > > > > From: Neil Horman [mailto:nhorman at tuxdriver.com] > > > Sent: Monday, August 04, 2014 4:36 PM > > > To: dev at dpdk.org > > > Cc: Neil Horman; Thomas Monjalon; Ananyev, Konstantin; Richardson, Bruce > > > Subject: [PATCH] acl: If build does not support sse4.2, emulate missing > > > instructions with C code > > > > > > The ACL library makes extensive use of some SSE4.2 instructions, which > > > means the > > > default build can't compile this library. Work around the problem by > > > testing > > > the __SSE42__ definition in the acl_vects.h file and defining the macros > > > there > > > as intrinsics or c-level equivalants. Note this is a minimal patch, > > > adjusting > > > only the definitions that are currently used in the ACL library. > > > > > > > My comments about actual implementations of c-level equivalents below. > > None of them look correct to me. > > Of course it could be fixed. > > Though I am not sure that is a right way to proceed: > > At first I really doubt that these equivalents will provide similar > > performance. > > As you probably note - we do have a scalar version of rte_acl_classify(): > > rte_acl_classify_scalar(). > > So I think it might be faster than vector one with 'emulated' instrincts. > > Unfortunately it is all mixed right now into one file and 'scalar' version > > could use few sse4 instrincts through resolve_priority(). > > Another thing - we consider to add another version of rte_acl_classify() > > that will use avx2 instrincts. > > If we go the way you suggest - I am afraid will soon have to provide scalar > > equivalents for several AVX2 instrincts too. > > So in summary that way (providing our own scalar equivalents of SIMD > > instrincts) seems to me slow, hard to maintain and error > prone. > > > > What porbably can be done instead: rework acl_run.c a bit, so it provide: > > rte_acl_classify_scalar() - could be build and used on all systems. > > rte_acl_classify_sse() - could be build and used only on systems with > > sse4.2 and upper, return ENOTSUP on lower arch. > > In future: rte_acl_classify_avx2 - could be build and used only on systems > > with avx2 and upper, return ENOTSUP on lower arch. > > > > I am looking at rte_acl right now anyway. > > So will try to come up with something workable. > > > So, this is exactly the opposite of what Bruce and I just spent several days > and > a huge email thread that you clearly are aware of discussing run time versus > compile time selection of paths. At this point I'm done ping ponging between > your opposing viewpoints. If you want to implement something that does run > time > checking, I'm fine with it, but I'm not going back and forth until you two > come > to an agreement on this.
Right now, I am not talking about 'run time vs compile time selection'. Whatever way we choose, I think the implementation need to be: 1) correct 2) allow easily add(/modify) code path optimised for particular architecture. Without need to modify/re-test what you call 'least common denominator' code path. And visa-versa, if someone find a way to optimise common code path - no need to touch/retest architecture specific ones. > > > > Only compile tested so far, but I wanted to post it for early review so > > > that > > > others could aid in unit testing. > > > > Could you please stop submitting untested patches. > What part of "Compile tested only" was hard to understand? I posted it early > specficially because I expected comments like you've provided above. After I > posted my first idea, you naked it, and after I started fixing it up to make > run-time selections, Bruce came down in favor of common compile time > functionality. Given our previous conversation, I figured you would be > opposed > to this approach, so whats the point in testing this if you're just going to > kill the idea? > > > It is common (and good) practice to do at least some minimal testing before > > submitting your code changes. > Never intended it to be accepted just tested/reviewed. Hence my comment > "compile tested only". I admit I was unaware of the unit test below > > At the end, it will save your own and other people time. > > For ACL there is a simple UT, that could be run as: > > ./x86_64-native-linuxapp-gcc/app/test ... > > RTE>>acl_autotest > > It takes just few seconds to run. > > It doesn't seem to work properly, at least not on any of my systems. With a > system allocation 100 pages to hugepage memory: > > [root at hmsreliant app]# ./test -c 0x3 -n 2 > EAL: Detected lcore 0 as core 0 on socket 0 > EAL: Detected lcore 1 as core 1 on socket 0 > EAL: Detected lcore 2 as core 2 on socket 0 > EAL: Detected lcore 3 as core 3 on socket 0 > EAL: Detected lcore 4 as core 0 on socket 0 > EAL: Detected lcore 5 as core 1 on socket 0 > EAL: Detected lcore 6 as core 2 on socket 0 > EAL: Detected lcore 7 as core 3 on socket 0 > EAL: Support maximum 64 logical core(s) by configuration. > EAL: Detected 8 lcore(s) > EAL: cannot open VFIO container, error 2 (No such file or directory) > EAL: VFIO support could not be initialized > EAL: Setting up memory... > EAL: Ask a virtual area of 0x200000 bytes > EAL: Virtual area found at 0x7fef07000000 (size = 0x200000) > EAL: Ask a virtual area of 0x200000 bytes > EAL: Virtual area found at 0x7fef06c00000 (size = 0x200000) > EAL: Ask a virtual area of 0x400000 bytes > EAL: Virtual area found at 0x7fef06600000 (size = 0x400000) > EAL: Ask a virtual area of 0x200000 bytes > EAL: Virtual area found at 0x7fef06200000 (size = 0x200000) > EAL: Ask a virtual area of 0x200000 bytes > EAL: Virtual area found at 0x7fef05e00000 (size = 0x200000) > EAL: Ask a virtual area of 0x600000 bytes > EAL: Virtual area found at 0x7fef05600000 (size = 0x600000) > EAL: Ask a virtual area of 0x600000 bytes > EAL: Virtual area found at 0x7fef04e00000 (size = 0x600000) > EAL: Ask a virtual area of 0x800000 bytes > EAL: Virtual area found at 0x7fef04400000 (size = 0x800000) > EAL: Ask a virtual area of 0x400000 bytes > EAL: Virtual area found at 0x7fef03e00000 (size = 0x400000) > EAL: Ask a virtual area of 0xa00000 bytes > EAL: Virtual area found at 0x7fef03200000 (size = 0xa00000) > EAL: Ask a virtual area of 0x400000 bytes > EAL: Virtual area found at 0x7fef02c00000 (size = 0x400000) > EAL: Ask a virtual area of 0x400000 bytes > EAL: Virtual area found at 0x7fef02600000 (size = 0x400000) > EAL: Ask a virtual area of 0x1800000 bytes > EAL: Virtual area found at 0x7fef00c00000 (size = 0x1800000) > EAL: Ask a virtual area of 0x1200000 bytes > EAL: Virtual area found at 0x7feeff800000 (size = 0x1200000) > EAL: Ask a virtual area of 0x2600000 bytes > EAL: Virtual area found at 0x7feefd000000 (size = 0x2600000) > EAL: Ask a virtual area of 0xc00000 bytes > EAL: Virtual area found at 0x7feefc200000 (size = 0xc00000) > EAL: Ask a virtual area of 0x2200000 bytes > EAL: Virtual area found at 0x7feef9e00000 (size = 0x2200000) > EAL: Ask a virtual area of 0x200000 bytes > EAL: Virtual area found at 0x7feef9a00000 (size = 0x200000) > EAL: Ask a virtual area of 0x200000 bytes > EAL: Virtual area found at 0x7feef9600000 (size = 0x200000) > EAL: Ask a virtual area of 0x200000 bytes > EAL: Virtual area found at 0x7feef9200000 (size = 0x200000) > EAL: Ask a virtual area of 0x200000 bytes > EAL: Virtual area found at 0x7feef8e00000 (size = 0x200000) > EAL: Ask a virtual area of 0x200000 bytes > EAL: Virtual area found at 0x7feef8a00000 (size = 0x200000) > EAL: Ask a virtual area of 0x200000 bytes > EAL: Virtual area found at 0x7feef8600000 (size = 0x200000) > EAL: Ask a virtual area of 0x200000 bytes > EAL: Virtual area found at 0x7feef8200000 (size = 0x200000) > EAL: Ask a virtual area of 0x600000 bytes > EAL: Virtual area found at 0x7feef7a00000 (size = 0x600000) > EAL: Requesting 100 pages of size 2MB from socket 0 > EAL: TSC frequency is ~3392297 KHz > EAL: Master core 0 is ready (tid=73cf880) > EAL: Core 1 is ready (tid=f71fe700) > APP: HPET is not enabled, using TSC as default timer > RTE>>acl_autotest > ACL: allocation of 25166720 bytes on socket 9 for ACL_acl_ctx failed > > This hangs forever (well, at least 30 minutes, but thats sufficiently > close to forever for me to wait). > > > Also if you plan further development for ACL, we probably can provide you > > with extra test-cases that we use for functional testing. > That would be good please. > > Thanks > Neil >