On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote:
> On Thu, Jul 31, 2014 at 02:10:32PM -0400, Neil Horman wrote:
> > On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote:
> > > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote:
> > > > 2014-07-31 09:13, Neil Horman:
> > > > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote:
> > > > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote:
> > > > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote:
> > > > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote:
> > > > > > > > > Hey all-
> > > > > > > > > I've been trying to update the fedora dpdk package to support
> > > > > > > > > VFIO enabled drivers and ran into a problem in which ixgbe
> > > > > > > > > didn't compile because the rxtx_vec code uses sse4.2
> > > > > > > > > instruction intrinsics, which aren't supported in the default
> > > > > > > > > config I have. I tried to remedy this by replacing the
> > > > > > > > > intrinsics with the __builtin macros, but it was pointed out
> > > > > > > > > (correctly) that this doesn't work properly. So this is my
> > > > > > > > > second attempt, which I actually like a bit better. I noted
> > > > > > > > > that code that uses intrinsics (ixgbe and the acl library)
> > > > > > > > > doesn't need to have those instructions turned on build-wide.
> > > > > > > > > Rather, we can just enable the instructions in the specific
> > > > > > > > > code we want to build with support for that, and test for
> > > > > > > > > instruction support dynamically at run time. This allows me
> > > > > > > > > to build the dpdk for a generic platform, but in such a way
> > > > > > > > > that some optimizations can be used if the executing cpu
> > > > > > > > > supports them at run time.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Neil Horman <nhorman at tuxdriver.com>
> > > > > > > > > CC: Thomas Monjalon <thomas.monjalon at 6wind.com>
> > > > > > > > >
> > > > > > > > I'd prefer if a solution could be found based off your original
> > > > > > > > patch set, as it gives us more chance to deprecate the older
> > > > > > > > code paths in future. Looking at the Intel Intrinsics Guide site
> > > > > > > > online, it shows that the _mm_shuffle_epi8 intrinsic came in
> > > > > > > > with SSSE3, rather than SSE4.x, and so should be available on
> > > > > > > > all 64-bit systems, I believe. The popcount intrinsic is newer,
> > > > > > > > but it's a much more basic instruction so hopefully the
> > > > > > > > __builtin should work for that.
> > > > > > > >
> > > > > > > Yes, but as I look at it, that's somewhat counter to my goal,
> > > > > > > which is to offer accelerated code paths on systems that can make
> > > > > > > use of them at run time. If we use the __builtin compiler
> > > > > > > functions, we will either:
> > > > > > >
> > > > > > > 1) Build those code paths with advanced instructions that won't
> > > > > > > work on older systems (i.e. crash)
> > > > > > >
> > > > > > > 2) Build those code paths with less advanced instructions,
> > > > > > > meaning that we won't speed up execution on systems that are
> > > > > > > capable of using the more advanced instructions.
> > > > > > >
> > > > > > > Using this run time check, we can, at least in these situations,
> > > > > > > make use of the accelerated paths when the instructions are
> > > > > > > available, and ignore them when they're not, at run time.
> > > > > > >
> > > > > > > What would be ideal would be an alternative type macro, like the
> > > > > > > linux kernel employs, but implementing that would require some
> > > > > > > pretty significant work and testing. This seems like a much
> > > > > > > simpler approach.
> > > >
> > > > [...]
> > > >
> > > > > Now, a macro that selected an instruction optimized or generic path
> > > > > is fine, as long as it can happen at run time. The Linux kernel has
> > > > > such a feature, called alternatives. But it's a complex subsystem
> > > > > that does run time replacement of instructions based on cpu feature
> > > > > flags. It would be great to have in the DPDK, but it's a significant
> > > > > code base and difficult to maintain, which goes against your desire
> > > > > to reduce code.
> > > >
> > > > [...]
> > > >
> > > > > > Even though the code is written using intrinsics which correspond
> > > > > > to SSE operations, the compiler is free to use AVX instructions
> > > > > > where necessary
> > > > > Not if you use the default machine target.
> > > > >
> > > > > > to improve performance. Therefore, if we go down this road, we need
> > > > > > to look to compile up the code for all microarchitectures, rather
> > > > > > than just assuming that we will get equivalent performance to
> > > > > > "native" by turning on the instruction set indicated by the
> > > > > > primitives in the code. This is
> > > > > No, you compile for the least common denominator system, and enable
> > > > > more performant paths opportunistically as run time checks allow.
> > > > >
> > > > > > where having one codepath recompiled multiple times will work far
> > > > > > better than having multiple code paths.
> > > > > Only if your only concern is performance. As noted above, my goal
> > > > > is more than just performance, it's compatibility across systems.
> > > > > Multiple builds for multiple cpu flag availability is simply a
> > > > > non-starter for a generic distribution.
> > > >
> > > > Neil, we are mixing 2 different problems here.
> > > > 1) we have to fix default build (without SSE-4.2)
> > > That's nothing to fix, that's a configuration issue. Just build for a
> > > lesser machine. I've already done that in the fedora build, using the
> > > default machine target. What exactly is missing from that?
> > >
> > Re-reading this, I'm wondering if I missed what you were trying to say; if
> > so, I apologize. Were you trying to assert that the right thing to do here
> > is to adjust the ixgbe and acl code paths to not use the sse4.2 intrinsics
> > so that they are buildable on the default platform? If so, I agree, that's
> > a nice idea, and I'm supportive of it, though I don't think that fully
> > solves the problem. In the case of the ixgbe pmd, what we have is 2 code
> > paths, a generic code path, and an optimized code path using sse4.2
> > intrinsics. In this case, I don't think there's anything to fix, in that
> > I'm fine with the optimized path needing sse4.2 to execute. There I just
> > want to be able to do a run time check and use the optimized path if the
> > cpu supports it, and just use the default path otherwise. In effect we
> > already have exactly what you are looking for there.
> >
> > As far as the ACL library goes, yes, that's more complex. The use of
> > sse4.2 intrinsics there is done throughout the code, so there's no easy
> > way to select a path. We're just left with either using the code or
> > returning an error at run time, as my patch does.
> > Certainly we can build some macros that either use the intrinsics for
> > sse4.2 or code up some C-level variants of those instructions based on
> > generic code, and build for the least common denominator, or compile the
> > code twice (once without sse4.2 support, and once with), and do a runtime
> > selection between the two. Either way, that's going to be a useful,
> > though significant, effort.
>
> I think a good first step here that I can't see anyone objecting to is
> to enable the ixgbe driver to use the vector code path for a generic
> x86_64 build. I've run a quick test here, and changing "_mm_popcnt_u64"
> to "__builtin_popcountll" [and the include from nmmintrin to tmmintrin]
> allows a compile for machine type default, and testpmd can still forward
> packets at a good rate (roughly perf down about 10% vs native compile on
> SNB).
> The ACL is a tougher nut to crack, but does anyone see any issues with
> that two-line change to ixgbe_rxtx_vec.c? [Neil, since you started the
> patch set thread, do you want to submit an official patch here, or would
> you prefer I do so?]
>
I'm happy to do so, though a 10% performance degradation vs. using the sse4.2
instructions in that path seems significant, doesn't it? Given that
performance delta, it seems like it would still be preferable to have a path
that used the sse4.2 instructions when they're available. Or am I misreading
what you mean when you say down 10%?

Neil

> >
> > > > 2) we could try to have performance with default build
> > > >
> > > Yes, we can, that's what this patch does. It doesn't address every code
> > > path, no, but it addresses two paths that are low hanging fruit for
> > > doing so, and we can incrementally build on that.
> > >
> > > > Please, let's focus on the first item and we could discuss
> > > > performance later. Having some different code path chosen at runtime
> > > > is a big rework and implies changing the compilation model (RFC
> > > > welcome).
> > > >
> > Even if I misinterpreted your statement above, I'm still not sure why
> > you're asserting this. Fixing the build to work with the default target
> > machine is good, and should be undertaken, and I'll happily do so, but why
> > reject the solution in front of you to wait for it? Even if I write macros
> > to fix up the ACL library, I'd still like to be able to do a run time
> > check and select the optimized version or the generic version based on cpu
> > support. Just doing a compile time check to determine if sse4.2 is
> > available really isn't going to cut it for me, as I don't want the fedora
> > dpdk to have pessimal performance if it doesn't have to.
> >
> > Regards
> > Neil
> >
>
> With regards to the general approach for runtime detection of software
> functions, I wonder if something like this can be handled by the
> packaging system? Is it possible to ship out a set of shared libs
> compiled up for different instruction sets, and then at rpm install
> time, symlink the appropriate library?
> This would push the whole issue of detection of code paths outside of
> code, work across all our libraries and ensure each user got the best
> performance they could get from a binary?
> Has something like this been done before? The building of all the
> libraries could be scripted easily enough, just do multiple builds using
> different EXTRA_CFLAGS each time, and move and rename the .so's after
> each run.
>
> /Bruce
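
For reference, the per-function run time check discussed in this thread could
look roughly like the sketch below. It is only an illustration, not the patch
itself: it assumes GCC or Clang on x86_64, and every function name in it is
hypothetical (none of them are DPDK symbols). Only the one function carrying
the target attribute is built with POPCNT enabled; everything else uses the
build-wide machine flags.

```c
#include <stdint.h>

/* Hypothetical names throughout; none of these are DPDK symbols. */

/* Generic path: built with whatever the build-wide flags allow. */
static int popcount_generic(uint64_t x)
{
    return __builtin_popcountll(x);
}

/* Optimized path: POPCNT is enabled for this one function only, so the
 * compiler may emit the popcnt instruction here even in a generic build. */
__attribute__((target("popcnt")))
static int popcount_hw(uint64_t x)
{
    return __builtin_popcountll(x);
}

typedef int (*popcount_fn)(uint64_t);

/* Pick an implementation from the CPU features reported at run time. */
static popcount_fn select_popcount(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("popcnt") ? popcount_hw : popcount_generic;
}

/* Resolve on first use, then call through the cached pointer. */
static int count_bits(uint64_t x)
{
    static popcount_fn impl;

    if (!impl)
        impl = select_popcount();
    return impl(x);
}
```

A caller just uses count_bits(mask) and never sees which path was picked. In
a real driver the selection would more likely happen once at init time
rather than lazily on first call, which also avoids any question about
concurrent first calls.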