On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote:
> On Thu, Jul 31, 2014 at 02:10:32PM -0400, Neil Horman wrote:
> > On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote:
> > > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote:
> > > > 2014-07-31 09:13, Neil Horman:
> > > > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote:
> > > > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote:
> > > > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote:
> > > > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote:
> > > > > > > > > Hey all-
> > > > > > > > >         I've been trying to update the Fedora dpdk package to support
> > > > > > > > > VFIO-enabled drivers and ran into a problem in which ixgbe didn't
> > > > > > > > > compile because the rxtx_vec code uses sse4.2 instruction intrinsics,
> > > > > > > > > which aren't supported in the default config I have.  I tried to
> > > > > > > > > remedy this by replacing the intrinsics with the __builtin macros,
> > > > > > > > > but it was pointed out (correctly) that this doesn't work properly.
> > > > > > > > > So this is my second attempt, which I actually like a bit better.  I
> > > > > > > > > noted that code that uses intrinsics (ixgbe and the acl library)
> > > > > > > > > doesn't need to have those instructions turned on build-wide.
> > > > > > > > > Rather, we can just enable the instructions in the specific code we
> > > > > > > > > want to build with support for that, and test for instruction
> > > > > > > > > support dynamically at run time.  This allows me to build the dpdk
> > > > > > > > > for a generic platform, but in such a way that some optimizations
> > > > > > > > > can be used if the executing cpu supports them at run time.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Neil Horman <nhorman at tuxdriver.com>
> > > > > > > > > CC: Thomas Monjalon <thomas.monjalon at 6wind.com>
> > > > > > > > >
> > > > > > > > I'd prefer if a solution could be found based off your original patch
> > > > > > > > set, as it gives us more chance to deprecate the older code paths in
> > > > > > > > future. Looking at the Intel Intrinsics Guide site online, it shows
> > > > > > > > that the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than
> > > > > > > > SSE4.x, and so should be available on all 64-bit systems, I believe.
> > > > > > > > The popcount intrinsic is newer, but it's a much more basic
> > > > > > > > instruction so hopefully the __builtin should work for that.
> > > > > > > > 
> > > > > > > Yes, but as I look at it, that's somewhat counter to my goal, which is
> > > > > > > to offer accelerated code paths on systems that can make use of them at
> > > > > > > run time.  If we use the __builtin compiler functions, we will either:
> > > > > > >
> > > > > > > 1) Build those code paths with advanced instructions that won't work on
> > > > > > > older systems (i.e. crash)
> > > > > > >
> > > > > > > 2) Build those code paths with less advanced instructions, meaning that
> > > > > > > we won't speed up execution on systems that are capable of using the
> > > > > > > more advanced instructions.
> > > > > > >
> > > > > > > Using this run time check, we can, at least in these situations, make
> > > > > > > use of the accelerated paths when the instructions are available, and
> > > > > > > ignore them when they're not, at run time.
> > > > > > >
> > > > > > > What would be ideal would be an alternatives-type macro, like the Linux
> > > > > > > kernel employs, but implementing that would require some pretty
> > > > > > > significant work and testing.  This seems like a much simpler approach.
> > > > 
> > > > [...]
> > > > 
> > > > > Now, a macro that selected an instruction optimized or generic path is
> > > > > fine, as long as it can happen at run time.  The Linux kernel has such a
> > > > > feature, called alternatives.  But it's a complex subsystem that does run
> > > > > time replacement of instructions based on cpu feature flags.  It would be
> > > > > great to have in the DPDK, but it's a significant code base and difficult
> > > > > to maintain, which goes against your desire to reduce code.
> > > > 
> > > > [...]
> > > > 
> > > > > > Even though the code is written using intrinsics which correspond to SSE
> > > > > > operations, the compiler is free to use AVX instructions where necessary
> > > > > Not if you use the default machine target.
> > > > >
> > > > > > to improve performance. Therefore, if we go down this road, we need to
> > > > > > look to compile up the code for all microarchitectures, rather than just
> > > > > > assuming that we will get equivalent performance to "native" by turning
> > > > > > on the instruction set indicated by the primitives in the code. This is
> > > > > No, you compile for the least common denominator system, and enable more
> > > > > performant paths opportunistically as run time checks allow.
> > > > >
> > > > > > where having one codepath recompiled multiple times will work far better
> > > > > > than having multiple code paths.
> > > > > Only if your only concern is performance.  As noted above, my goal is more
> > > > > than just performance, it's compatibility across systems.  Multiple builds
> > > > > for multiple cpu flag availability is simply a non-starter for a generic
> > > > > distribution.
> > > > 
> > > > Neil, we are mixing 2 different problems here.
> > > > 1) we have to fix default build (without SSE-4.2)
> > > That's nothing to fix, that's a configuration issue.  Just build for a
> > > lesser machine.  I've already done that in the Fedora build, using the
> > > default machine target.  What exactly is missing from that?
> > > 
> > Re-reading this, I'm wondering if I missed what you were trying to say; if
> > so I apologize.  Were you trying to assert that the right thing to do here
> > is to adjust the ixgbe and acl code paths to not use the sse4.2 intrinsics
> > so that they are buildable on the default platform?  If so, I agree, that's
> > a nice idea, and am supportive of it, though I don't think that fully
> > solves the problem.  In the case of the ixgbe pmd, what we have is 2 code
> > paths, a generic code path, and an optimized code path using sse4.2
> > intrinsics.  In this case, I don't think there's anything to fix, in that
> > I'm fine with the optimized path needing sse4.2 to execute.  There I just
> > want to be able to do a run time check and use the optimized path if the
> > cpu supports it, and just use the default path otherwise.  In effect we
> > already have exactly what you are looking for there.
> >
> > As far as the ACL library goes, yes, that's more complex.  The use of
> > sse4.2 intrinsics there is done throughout the code, so there's no easy way
> > to select a path.  We're just left with either using the code or returning
> > an error at run time, as my patch does.  Certainly we can build some macros
> > that either use the intrinsics for sse4.2 or code up some C-level variants
> > of those instructions based on generic code, and build for the least common
> > denominator, or compile the code twice (once without sse4.2 support, and
> > once with), and do a runtime selection between the two.  Either way, that's
> > going to be a useful, though significant effort.
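
One way the macro approach mentioned above might look, as a sketch only.
CRC32C is used here as a stand-in because _mm_crc32_u8 is a genuine SSE4.2
intrinsic with a simple C equivalent; the intrinsics the ACL library
actually uses are not listed in this thread, and the names CRC32C_U8,
crc32c_u8_generic and crc32c are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef __SSE4_2__
#include <nmmintrin.h>
/* Built with SSE4.2 enabled: map straight onto the intrinsic. */
#define CRC32C_U8(crc, byte) _mm_crc32_u8((crc), (byte))
#else
/* Generic build: bitwise CRC32C step (reflected polynomial 0x82F63B78). */
static inline uint32_t
crc32c_u8_generic(uint32_t crc, uint8_t byte)
{
	int k;

	crc ^= byte;
	for (k = 0; k < 8; k++)
		crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	return crc;
}
#define CRC32C_U8(crc, byte) crc32c_u8_generic((crc), (byte))
#endif

/* Caller code is written once, against the macro only. */
static uint32_t
crc32c(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t crc = 0xFFFFFFFFu;

	assert(buf != NULL || len == 0);
	while (len--)
		crc = CRC32C_U8(crc, *p++);
	return ~crc;
}
```

Note that this selects the path at compile time, so on its own it only
makes the code buildable for the default machine target; the run time
selection would still have to be layered on top.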
> 
> I think a good first step here that I can't see anyone objecting to is
> to enable the ixgbe driver to use the vector code path for a generic
> x86_64 build. I've run a quick test here, and changing "_mm_popcnt_u64"
> to "__builtin_popcountll" [and the include from nmmintrin.h to tmmintrin.h]
> allows a compile for machine type default, and testpmd can still forward
> packets at a good rate (perf is down roughly 10% vs. a native compile on
> SNB).
> The ACL is a tougher nut to crack, but does anyone see any issues with that
> two-line change to ixgbe_rxtx_vec.c? [Neil, since you started the patch
> set thread, do you want to submit an official patch here, or would you
> prefer I do so?]
> 
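
As a minimal sketch of the substitution described above (the helper name
count_set_bits is hypothetical and not taken from ixgbe_rxtx_vec.c): the
GCC/Clang builtin needs no SSE4.2 header and no -msse4.2 flag, and the
compiler falls back to a generic bit-counting sequence when the build
target lacks the popcnt instruction.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of bits set in a 64-bit mask.
 * Previously this would have been _mm_popcnt_u64(mask) from <nmmintrin.h>,
 * which only compiles when SSE4.2/POPCNT is enabled for the build.
 */
static inline unsigned int
count_set_bits(uint64_t mask)
{
	unsigned int n = (unsigned int)__builtin_popcountll(mask);

	assert(n <= 64);	/* a 64-bit word holds at most 64 set bits */
	return n;
}
```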

I'm happy to do so, though a 10% performance degradation vs. using the sse4.2
instructions in that path seems significant, doesn't it?  Given that performance
delta, it seems like it would still be preferable to have a path that used the
sse4.2 instructions when they're available.  Or am I misreading what you mean
when you say down 10%?

Neil
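
For illustration, a run time check of the kind discussed here could be
sketched as below.  This assumes GCC or Clang (for __builtin_cpu_supports
and the target attribute); the function names are hypothetical, and it is
not the code from the patch.  Only popcount_hw is compiled with the popcnt
instruction enabled, so the rest of the build can stay on the default
machine target:

```c
#include <assert.h>
#include <stdint.h>

/* Generic path: plain C, runs on any CPU. */
static unsigned int
popcount_generic(uint64_t v)
{
	unsigned int n = 0;

	for (; v != 0; v &= v - 1)	/* clear the lowest set bit */
		n++;
	return n;
}

#if defined(__x86_64__)
#include <nmmintrin.h>	/* _mm_popcnt_u64 */

/* Optimized path: only this function is built with POPCNT enabled. */
static __attribute__((target("popcnt"))) unsigned int
popcount_hw(uint64_t v)
{
	return (unsigned int)_mm_popcnt_u64(v);
}
#endif

typedef unsigned int (*popcount_fn)(uint64_t);

/* Pick an implementation based on what the running CPU reports. */
static popcount_fn
select_popcount(void)
{
#if defined(__x86_64__)
	__builtin_cpu_init();
	if (__builtin_cpu_supports("popcnt")) {
		/* cheap sanity check that both paths agree */
		assert(popcount_hw(0xF0u) == popcount_generic(0xF0u));
		return popcount_hw;
	}
#endif
	return popcount_generic;
}
```

A caller would invoke select_popcount() once at init time and keep the
returned pointer, so the feature test stays out of the fast path.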

> > 
> > > > 2) we could try to have performance with default build
> > > > 
> > > Yes, we can, that's what this patch does.  It doesn't address every code
> > > path, no, but it addresses two paths that are low hanging fruit for doing
> > > so, and we can incrementally build on that.
> > > 
> > > > Please, let's focus on the first item and we could discuss performance
> > > > later. Having different code paths chosen at runtime is a big rework and
> > > > implies changing the compilation model (RFC welcome).
> > > > 
> > Even if I misinterpreted your statement above, I'm still not sure why you're
> > asserting this. Fixing the build to work with the default target machine is
> > good, and should be undertaken, and I'll happily do so, but why reject the
> > solution in front of you to wait for it?  Even if I write macros to fix up
> > the ACL library, I'd still like to be able to do a run time check and select
> > the optimized version or the generic version based on cpu support.  Just
> > doing a compile time check to determine if sse4.2 is available really isn't
> > going to cut it for me, as I don't want the Fedora dpdk to have pessimal
> > performance if it doesn't have to.
> > 
> > Regards
> > Neil
> > 
> 
> With regards to the general approach for runtime detection of software
> functions, I wonder if something like this can be handled by the
> packaging system? Is it possible to ship out a set of shared libs
> compiled up for different instruction sets, and then at rpm install
> time, symlink the appropriate library? This would push the whole issue
> of detection of code paths outside of code, work across all our
> libraries and ensure each user got the best performance they could get
> from a binary?
> Has something like this been done before? The building of all the
> libraries could be scripted easily enough, just do multiple builds using
> different EXTRA_CFLAGS each time, and move and rename the .so's after
> each run.
> 
> /Bruce
> 
