[dpdk-dev] [PATCH 0/3] add i40e RSS support in VF
> As RSS in i40e VF is supported by hardware, these patches enable
> it in i40e PMD, and also enable its testing in testpmd.
>
> Helin Zhang (3):
>   i40evf: add RSS support in VF
>   app/testpmd: enable RSS support for i40e
>   ethdev: improvements for some macro definition in head file

Tested-by: Zhaochen Zhan

I have tested this patch on a KVM virtual machine with a Fortville NIC. RSS in testpmd works well with rx_queue=4/16; the maximum number of queues can be configured with CONFIG_RTE_LIBRTE_I40E_QUEUE_NUM_PER_VF in the config file.
[dpdk-dev] [PATCH v2 0/6] Support configuring hash functions
> These patches mainly support configuring hash functions.
> In detail,
>  - It can select Toeplitz or simple XOR hash functions.
>  - It can configure symmetric hash functions.
>    * Get/set symmetric hash enable per port.
>    * Get/set symmetric hash enable per 'PCTYPE'.
>    * Get/set filter swap configurations.
>  - 'ethdev' level interfaces are added.
>    * 'is_command_supported', to check if a feature (command)
>      is supported on a port.
>    * 'rx_classification_filter_ctl', a common API to execute
>      specific command of each feature.
>  - Seven commands are implemented in testpmd to support
>    testing above.
> Note that 'PCTYPE' means 'Packet Classification Type'.
>
> Helin Zhang (6):
>   ethdev: rename macros of packet classification type
>   ethdev: add new ops of 'is_command_supported' and
>     'rx_classification_filter_ctl'
>   i40e: support of 'rx_classification_filter_ctl'
>   i40e: support of 'is_command_supported'
>   i40e: Initialize hash function during port initialization.
>   app/testpmd: add commands for configuring hash functions

Tested-by: Zhaochen Zhan

I have tested this patch on Fedora 20 with a Fortville NIC. The Toeplitz, simple XOR, and symmetric hash functions all work well for IP and UDP packets, over both IPv4 and IPv6, with RSS in testpmd.
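For readers unfamiliar with the Toeplitz hash named in the cover letter, here is a minimal software sketch in C. It is not taken from the patches, which configure the hash in i40e hardware; the function name toeplitz_hash and its signature are assumptions made for illustration only. For every set bit of the input, the hash XORs in the 32-bit window of the key that starts at that bit position.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Software Toeplitz hash (illustrative sketch, hypothetical name).
 * key_len must be at least data_len + 4 so that a full 32-bit key
 * window exists for every input bit; otherwise 0 is returned.
 */
static uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
                              const uint8_t *data, size_t data_len)
{
    uint32_t hash = 0;
    uint32_t window;
    size_t i;
    int bit;

    if (key_len < data_len + 4)
        return 0;

    /* Key bits 0..31, most significant bit first. */
    window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
             ((uint32_t)key[2] << 8) | (uint32_t)key[3];

    for (i = 0; i < data_len; i++) {
        for (bit = 7; bit >= 0; bit--) {
            /* A set input bit contributes the current key window. */
            if (data[i] & (1u << bit))
                hash ^= window;
            /* Slide the window one bit further along the key. */
            window <<= 1;
            if (key[i + 4] & (1u << bit))
                window |= 1u;
        }
    }
    return hash;
}
```

One software-only way to get a symmetric result (again, not what the patches do) is a key whose bytes repeat with a 16-bit period, for example 0x6d, 0x5a, 0x6d, 0x5a, ...: each input bit's contribution then depends only on its position modulo 16, so swapping source and destination addresses and ports leaves the hash unchanged.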
[dpdk-dev] Compile error in igb_uio.c on RHEL 6.5
Hi Stephen,

It seems that the recent patches for igb_uio.c introduce compile errors on RHEL 6.5. I looked up the commit and found your name on it, so could you help check this? The attached log file shows the kernel version, gcc version, and compile errors.

Thank you in advance!

Regards,
Helin
[dpdk-dev] half packets dropped using l2fwd in VM
Hello everyone,

I have a question regarding l2fwd in a VM. In my environment, about half of the received packets are sent and the remaining half are dropped, so at the traffic generator machine I see Tx: 14.88Mpps, Rx: 7.4Mpps for 64-byte packets. By contrast, I get Tx: 14.88Mpps, Rx: 13.47Mpps when running l2fwd on the host machine only.

Is something wrong with my settings, or is this performance expected for a VM? If any information is missing, please let me know. Any advice would be a great help!

My environment is below. Cases 1 and 2 give the same result: Tx: 14.88Mpps, Rx: 7.4Mpps.

==Host machine
OS: Ubuntu 12.04
CPU: Xeon E5-2670 (8C/2.60GHz/20M)
RAM: 32GB
NIC: X520-DA2 (dual port)
Kernel: 3.11.0
hugepages=1024, intel_iommu=on, iommu=pt, pci=assign-busses
qemu-kvm-0.14.0

[case1]
VF0 and VF2 belong to PF0
VF1 and VF3 belong to PF1

/usr/local/kvm/bin/qemu-system-x86_64 -hda ./vm1.img \
  -m 4096 -cpu host -smp 4 -boot c -k ja -name m2_vm1 \
  -monitor telnet::,server,nowait -vnc :1 -daemonize \
  -device pci-assign,host=02:10.0 \
  -device pci-assign,host=02:10.1 \
  -device pci-assign,host=02:10.2 \
  -device pci-assign,host=02:10.3

[case2]
VF0 belongs to PF0
VF1 belongs to PF1

/usr/local/kvm/bin/qemu-system-x86_64 -hda ./vm1.img \
  -m 4096 -cpu host -smp 4 -boot c -k ja -name m2_vm1 \
  -monitor telnet::,server,nowait -vnc :1 -daemonize \
  -device pci-assign,host=02:10.0 \
  -device pci-assign,host=02:10.1

==Virtual machine
OS: Ubuntu 12.04
Kernel: 3.11.0
DPDK: 1.6.0

[case1] l2fwd -c f -n 4 -- -q 1 -p 0x3
[case2] l2fwd -c 3 -n 4 -- -q 1 -p 0x3

==Traffic generator machine
Pktgen-DPDK
pktgen>set mac 0
pktgen>start 0

Regards,
nks
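As a sanity check on the numbers in this report, the 14.88Mpps Tx figure matches the theoretical 64-byte packet rate of a 10 Gb/s Ethernet port such as those on the X520-DA2. The sketch below works through that arithmetic; the helper name line_rate_pps is hypothetical, and it assumes the standard 20 bytes of per-frame wire overhead (7-byte preamble, 1-byte start-of-frame delimiter, 12-byte inter-frame gap).

```c
#include <assert.h>
#include <stdint.h>

/* Preamble (7) + start-of-frame delimiter (1) + inter-frame gap (12). */
#define ETH_WIRE_OVERHEAD_BYTES 20u

/*
 * Maximum packets per second on an Ethernet link (hypothetical helper).
 * frame_bytes is the frame size including the 4-byte CRC, e.g. 64.
 */
static uint64_t line_rate_pps(uint64_t link_bps, uint32_t frame_bytes)
{
    uint64_t bits_on_wire =
        (uint64_t)(frame_bytes + ETH_WIRE_OVERHEAD_BYTES) * 8u;

    return link_bps / bits_on_wire;
}
```

Calling line_rate_pps(10000000000ULL, 64) gives 10^10 / 672, roughly 14.88 million packets per second. The reported 7.4Mpps Rx rate is therefore close to half of line rate, which is consistent with the "about half are dropped" description.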
[dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
2014-07-30 14:09, Bruce Richardson: > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote: > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote: > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote: > > > > Hey all- > > > > I've been trying to update the fedora dpdk package to support > > > > VFIO > > > > enabled drivers and ran into a problem in which ixgbe didn't compile > > > > because the > > > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't > > > > supported in the > > > > default config I have. I tried to remedy this by replacing the > > > > intrinsics with > > > > the __builtin macros, but it was pointed out (correctly), that this > > > > doesn't work > > > > properly. So this is my second attempt, which I actually like a bit > > > > better. I > > > > noted that code that uses intrinsics (ixgbe and the acl library), don't > > > > need to > > > > have those instructions turned on build-wide. Rather, we can just > > > > enable the > > > > instructions in the specific code we want to build with support for > > > > that, and > > > > test for instruction support dynamically at run time. This allows me > > > > to build > > > > the dpdk for a generic platform, but in such a way that some > > > > optimizations can > > > > be used if the executing cpu supports them at run time. > > > > > > > > Signed-off-by: Neil Horman > > > > CC: Thomas Monjalon > > > > > > > I'd prefer if a solution could be found based off your original patch > > > set, as it gives us more chance to deprecate the older code paths in > > > future. Looking at the Intel Intrinsics Guide site online, it shows that > > > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x, > > > and so should be available on all 64-bit systems, I believe. The > > > popcount intrinsic is newer, but it's a much more basic instruction so > > > hopefully the __builtin should work for that. 
> > > > > Yes, but as I look at it, thats somewhat counter to my goal, which is to > > offer > > accelerated code paths on systems that can make use of it at run time. If > > We > > use the __builtin compiler functions, we will either: > > > > 1) Build those code paths with advanced instructions that won't work on > > older > > systems (i.e. crash) > > > > 2) Build those code paths with less advanced instructions, meaning that we > > won't > > speedup execution on systems that are capable of using the more advanced > > instructions. > > > > Using this run time check, we can, at least in these situations, make use > > of the > > accelerated paths when the instructions are available, and ignore them when > > they're not, at run time. > > > > What would be ideal, would be an alternative type macro, like the linux > > kernel > > employs, but implementing that would require some pretty significant work > > and > > testing. This seems like a much simpler approach. > > > > Ok, I understand where you are coming from indeed. However, within that, > I'd like to see us reduce the amount of code that's needed for > maintenance. > > What we should really aim for, is to have common code, with perhaps some > small ifdefs or __builtins, and then compile that code multiple times > for multiple different architectures. So in this case, it would be nice > to use the __builtin, and then compile that code up with and without SSE > and select at runtime the code path to be used. Ideally, this could be > done at the driver level. Yes, having a runtime fallback in the driver to make it work efficiently on all architectures seems a good idea. We should keep in mind that it's very convenient to functionally test a code path on a basic machine. > However, once you get down this path, you are dealing with more than > just SSE. 
If I compile up the PMD on my system, which has a chip based > on Sandy Bridge uarch, I find that there are multiple instructions > starting with "vp" which means that they are actually AVX instructions. > Even though the code is written using intrinsics which correspond to SSE > operations, the compiler is free to use AVX instructions where necessary > to improve performance. Therefore, if we go down this road, we need to > look to compile up the code for all microarchitectures, rather than just > assuming that we will get equivalent performance to "native" by turning > on the instruction set indicated by the primitives in the code. This is > where having one codepath recompiled multiple times will work far better > than having multiple code paths. Choosing the best instructions for a task is the work of the compiler. Making this choice at runtime is a big task and probably not desirable. For performance, compilation must be done in "native mode" to let compiler make the right decisions. For compatibility, compilation must be done in "default mode". Here the problem is that "default mode" compilation is broken for ixgbe and acl. So we must fix it with, a
[dpdk-dev] L2FWD with multiple RX/TX queues
Hi All,

I am trying to modify the L2FWD app to use multiple RX/TX queues. The problem I am facing is that ping fails because ARP replies are not received by the DPDK device. I can see that the ARP request is received by the receiving host, which even sends an ARP reply, but that reply is not received on the DPDK port. I made sure to set the correct RSS value (as in the L3FWD app), changed the initialization procedure, and removed any modification of packets.

My setup is as shown below:

HostA(172.16.16.101) <--> (Port0) DPDK Device (Port1) <--> HostB(172.16.16.102)

On this device, both the default L2FWD app (after removing packet modification) and testpmd with multiple RX/TX queues work successfully.

OS : Ubuntu 12.04
Processor : ATOM
Board : Intel I354
Command used : ./l2fwd -c 7f -n 4 -- -p 3 -config=(0,0,1),(0,1,2),(0,2,3),(1,0,4),(1,1,5),(1,2,6)
DPDK Version : 1.6.0 and 1.7.0 (I have tried both)

I have even enabled debug messages in the PMD and can see that no reply packets were received on Port1. Can anybody please help to fix this issue? Kindly let me know if you need any more information.

Thanks and Regards,
Tilak
[dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
Hi Bruce, > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson > Sent: Wednesday, July 30, 2014 10:09 PM > To: Neil Horman > Cc: dev at dpdk.org > Subject: Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of > some isolated features > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote: > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote: > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote: > > > > Hey all- > > > > I've been trying to update the fedora dpdk package to support > > > > VFIO > > > > enabled drivers and ran into a problem in which ixgbe didn't compile > > > > because the > > > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't > > > > supported in the > > > > default config I have. I tried to remedy this by replacing the > > > > intrinsics with > > > > the __builtin macros, but it was pointed out (correctly), that this > > > > doesn't work > > > > properly. So this is my second attempt, which I actually like a bit > > > > better. I > > > > noted that code that uses intrinsics (ixgbe and the acl library), don't > > > > need to > > > > have those instructions turned on build-wide. Rather, we can just > > > > enable the > > > > instructions in the specific code we want to build with support for > > > > that, and > > > > test for instruction support dynamically at run time. This allows me > > > > to build > > > > the dpdk for a generic platform, but in such a way that some > > > > optimizations can > > > > be used if the executing cpu supports them at run time. > > > > > > > > Signed-off-by: Neil Horman > > > > CC: Thomas Monjalon > > > > > > > I'd prefer if a solution could be found based off your original patch > > > set, as it gives us more chance to deprecate the older code paths in > > > future. 
Looking at the Intel Intrinsics Guide site online, it shows that > > > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x, > > > and so should be available on all 64-bit systems, I believe. The > > > popcount intrinsic is newer, but it's a much more basic instruction so > > > hopefully the __builtin should work for that. > > > > > Yes, but as I look at it, thats somewhat counter to my goal, which is to > > offer > > accelerated code paths on systems that can make use of it at run time. If > > We > > use the __builtin compiler functions, we will either: > > > > 1) Build those code paths with advanced instructions that won't work on > > older > > systems (i.e. crash) > > > > 2) Build those code paths with less advanced instructions, meaning that we > > won't > > speedup execution on systems that are capable of using the more advanced > > instructions. > > > > Using this run time check, we can, at least in these situations, make use > > of the > > accelerated paths when the instructions are available, and ignore them when > > they're not, at run time. > > > > What would be ideal, would be an alternative type macro, like the linux > > kernel > > employs, but implementing that would require some pretty significant work > > and > > testing. This seems like a much simpler approach. > > > > Ok, I understand where you are coming from indeed. However, within that, > I'd like to see us reduce the amount of code that's needed for > maintenance. > > What we should really aim for, is to have common code, with perhaps some > small ifdefs or __builtins, and then compile that code multiple times > for multiple different architectures. So in this case, it would be nice > to use the __builtin, and then compile that code up with and without SSE > and select at runtime the code path to be used. Ideally, this could be > done at the driver level. > > However, once you get down this path, you are dealing with more than > just SSE. 
If I compile up the PMD on my system, which has a chip based > on Sandy Bridge uarch, I find that there are multiple instructions > starting with "vp" which means that they are actually AVX instructions. > Even though the code is written using intrinsics which correspond to SSE > operations, the compiler is free to use AVX instructions where necessary > to improve performance. > Therefore, if we go down this road, we need to > look to compile up the code for all microarchitectures, rather than just > assuming that we will get equivalent performance to "native" by turning > on the instruction set indicated by the primitives in the code. This is > where having one codepath recompiled multiple times will work far better > than having multiple code paths.

Using your example: as long as we specify '-mavx', the compiler can (and does) use AVX instructions even for 'scalar' code (code without any SIMD intrinsics). And yes, that probably affects performance. So, as I understand your suggestion, we'll then need to divide our code into:
- generic one - compiled to run on all supported platforms
- performance critical that will be recompiled for each supporte
[dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote: > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote: > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote: > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote: > > > > Hey all- > > > > I've been trying to update the fedora dpdk package to support > > > > VFIO > > > > enabled drivers and ran into a problem in which ixgbe didn't compile > > > > because the > > > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't > > > > supported in the > > > > default config I have. I tried to remedy this by replacing the > > > > intrinsics with > > > > the __builtin macros, but it was pointed out (correctly), that this > > > > doesn't work > > > > properly. So this is my second attempt, which I actually like a bit > > > > better. I > > > > noted that code that uses intrinsics (ixgbe and the acl library), don't > > > > need to > > > > have those instructions turned on build-wide. Rather, we can just > > > > enable the > > > > instructions in the specific code we want to build with support for > > > > that, and > > > > test for instruction support dynamically at run time. This allows me > > > > to build > > > > the dpdk for a generic platform, but in such a way that some > > > > optimizations can > > > > be used if the executing cpu supports them at run time. > > > > > > > > Signed-off-by: Neil Horman > > > > CC: Thomas Monjalon > > > > > > > I'd prefer if a solution could be found based off your original patch > > > set, as it gives us more chance to deprecate the older code paths in > > > future. Looking at the Intel Intrinsics Guide site online, it shows that > > > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x, > > > and so should be available on all 64-bit systems, I believe. The > > > popcount intrinsic is newer, but it's a much more basic instruction so > > > hopefully the __builtin should work for that. 
> > > > > Yes, but as I look at it, thats somewhat counter to my goal, which is to > > offer > > accelerated code paths on systems that can make use of it at run time. If > > We > > use the __builtin compiler functions, we will either: > > > > 1) Build those code paths with advanced instructions that won't work on > > older > > systems (i.e. crash) > > > > 2) Build those code paths with less advanced instructions, meaning that we > > won't > > speedup execution on systems that are capable of using the more advanced > > instructions. > > > > Using this run time check, we can, at least in these situations, make use > > of the > > accelerated paths when the instructions are available, and ignore them when > > they're not, at run time. > > > > What would be ideal, would be an alternative type macro, like the linux > > kernel > > employs, but implementing that would require some pretty significant work > > and > > testing. This seems like a much simpler approach. > > > > Ok, I understand where you are coming from indeed. However, within that, > I'd like to see us reduce the amount of code that's needed for > maintenance. > Ok, but that seems orthogonal to what I've done here. I've added about 10 lines of easily understandable code, which seems reasonable to me. > What we should really aim for, is to have common code, with perhaps some > small ifdefs or __builtins, and then compile that code multiple times > for multiple different architectures. So in this case, it would be nice > to use the __builtin, and then compile that code up with and without SSE > and select at runtime the code path to be used. Ideally, this could be > done at the driver level. > No, that is in direct conflict with what I'm trying to do here. My goal is to enable a build for a least common denominator system, but include those code paths that allow some performance benefit on systems which support the feature, to be determined at run time. 
My goal is improving performance in the Fedora dpdk package, which we build once for all supported systems on an arch. Building multiple libraries for multiple system configurations is simply an unmaintainable solution. Now, a macro that selected an instruction-optimized or generic path is fine, as long as it can happen at run time. The Linux kernel has such a feature, called alternatives. But it's a complex subsystem that does run time replacement of instructions based on cpu feature flags. It would be great to have in the DPDK, but it's a significant code base and difficult to maintain, which goes against your desire to reduce code. > However, once you get down this path, you are dealing with more than > just SSE. If I compile up the PMD on my system, which has a chip based > on Sandy Bridge uarch, I find that there are multiple instructions > starting with "vp" which means that they are actually AVX instructions. You've done it wrong; you're building for the native machine target. For Fedora I build for the
[dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
2014-07-31 09:13, Neil Horman: > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote: > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote: > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote: > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote: > > > > > Hey all- > > > > > I've been trying to update the fedora dpdk package to support > > > > > VFIO > > > > > enabled drivers and ran into a problem in which ixgbe didn't compile > > > > > because the > > > > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't > > > > > supported in the > > > > > default config I have. I tried to remedy this by replacing the > > > > > intrinsics with > > > > > the __builtin macros, but it was pointed out (correctly), that this > > > > > doesn't work > > > > > properly. So this is my second attempt, which I actually like a bit > > > > > better. I > > > > > noted that code that uses intrinsics (ixgbe and the acl library), > > > > > don't need to > > > > > have those instructions turned on build-wide. Rather, we can just > > > > > enable the > > > > > instructions in the specific code we want to build with support for > > > > > that, and > > > > > test for instruction support dynamically at run time. This allows me > > > > > to build > > > > > the dpdk for a generic platform, but in such a way that some > > > > > optimizations can > > > > > be used if the executing cpu supports them at run time. > > > > > > > > > > Signed-off-by: Neil Horman > > > > > CC: Thomas Monjalon > > > > > > > > > I'd prefer if a solution could be found based off your original patch > > > > set, as it gives us more chance to deprecate the older code paths in > > > > future. Looking at the Intel Intrinsics Guide site online, it shows that > > > > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x, > > > > and so should be available on all 64-bit systems, I believe. 
The > > > > popcount intrinsic is newer, but it's a much more basic instruction so > > > > hopefully the __builtin should work for that. > > > > > > > Yes, but as I look at it, thats somewhat counter to my goal, which is to > > > offer > > > accelerated code paths on systems that can make use of it at run time. > > > If We > > > use the __builtin compiler functions, we will either: > > > > > > 1) Build those code paths with advanced instructions that won't work on > > > older > > > systems (i.e. crash) > > > > > > 2) Build those code paths with less advanced instructions, meaning that > > > we won't > > > speedup execution on systems that are capable of using the more advanced > > > instructions. > > > > > > Using this run time check, we can, at least in these situations, make use > > > of the > > > accelerated paths when the instructions are available, and ignore them > > > when > > > they're not, at run time. > > > > > > What would be ideal, would be an alternative type macro, like the linux > > > kernel > > > employs, but implementing that would require some pretty significant work > > > and > > > testing. This seems like a much simpler approach. [...] > Now, a macro that selected an instruction optimized or generic path is fine, > as > long as it can happen at run time. The Linux kernel has such a feature, > called > alternatives. But its a complex subsystem that does run time replacement of > instructions based on cpu feature flags. It would be great to have in the > DPDK, > but its a significant code base and difficult to maintain, which goes against > your desire to reduce code. [...] > > Even though the code is written using intrinsics which correspond to SSE > > operations, the compiler is free to use AVX instructions where necessary > Not if you use the default machine target. > > > to improve performance. 
Therefore, if we go down this road, we need to > > look to compile up the code for all microarchitectures, rather than just > > assuming that we will get equivalent performance to "native" by turning > > on the instruction set indicated by the primitives in the code. This is > No, you compile for the least common demonitor system, and enable more > performant paths opportunistically as run time checks allow. > > > where having one codepath recompiled multiple times will work far better > > than having multiple code paths. > Only if you're only concern is performance. As noted above, my goal is more > than just performance, its compatibility accross systems. Multiple builds for > multiple cpu flag availability is simply a non-starter for a generic > distribution.

Neil, we are mixing 2 different problems here.
1) we have to fix the default build (without SSE4.2)
2) we could try to get performance with the default build
Please, let's focus on the first item; we can discuss performance later.
Having different code paths chosen at runtime is a big rework and implies changing the compilation model (RFC welcome).
-- 
Thomas
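To make the run-time selection discussed in this thread concrete, here is a minimal sketch in C. It is not Neil's patch: the function names are hypothetical, and it assumes GCC or clang on x86, using the per-function target attribute and the __builtin_cpu_supports() builtin. The file as a whole is built for the default machine target; only one function is compiled with POPCNT enabled, and it is only called if the running CPU reports that feature.

```c
#include <assert.h>
#include <stdint.h>

/* Generic path: runs on any CPU the default build targets. */
static uint32_t popcount_generic(uint32_t v)
{
    uint32_t n = 0;

    while (v) {
        v &= v - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_POPCNT_PATH 1
/* Only this function is compiled with the POPCNT instruction enabled. */
__attribute__((target("popcnt")))
static uint32_t popcount_hw(uint32_t v)
{
    return (uint32_t)__builtin_popcount(v);
}
#endif

typedef uint32_t (*popcount_fn)(uint32_t);

/* Pick the implementation once, based on what the running CPU supports. */
static popcount_fn select_popcount(void)
{
#ifdef HAVE_POPCNT_PATH
    if (__builtin_cpu_supports("popcnt"))
        return popcount_hw;
#endif
    return popcount_generic;
}
```

A driver would call select_popcount() once at init time and keep the returned pointer, so the feature check stays off the fast path. The cost is an indirect call per use, which is one reason Bruce's alternative of compiling whole code paths several times may still perform better.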
[dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote: > 2014-07-31 09:13, Neil Horman: > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote: > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote: > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote: > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote: > > > > > > Hey all- > > > > > > I've been trying to update the fedora dpdk package to > > > > > > support VFIO > > > > > > enabled drivers and ran into a problem in which ixgbe didn't > > > > > > compile because the > > > > > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't > > > > > > supported in the > > > > > > default config I have. I tried to remedy this by replacing the > > > > > > intrinsics with > > > > > > the __builtin macros, but it was pointed out (correctly), that this > > > > > > doesn't work > > > > > > properly. So this is my second attempt, which I actually like a > > > > > > bit better. I > > > > > > noted that code that uses intrinsics (ixgbe and the acl library), > > > > > > don't need to > > > > > > have those instructions turned on build-wide. Rather, we can just > > > > > > enable the > > > > > > instructions in the specific code we want to build with support for > > > > > > that, and > > > > > > test for instruction support dynamically at run time. This allows > > > > > > me to build > > > > > > the dpdk for a generic platform, but in such a way that some > > > > > > optimizations can > > > > > > be used if the executing cpu supports them at run time. > > > > > > > > > > > > Signed-off-by: Neil Horman > > > > > > CC: Thomas Monjalon > > > > > > > > > > > I'd prefer if a solution could be found based off your original patch > > > > > set, as it gives us more chance to deprecate the older code paths in > > > > > future. 
Looking at the Intel Intrinsics Guide site online, it shows > > > > > that > > > > > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x, > > > > > and so should be available on all 64-bit systems, I believe. The > > > > > popcount intrinsic is newer, but it's a much more basic instruction so > > > > > hopefully the __builtin should work for that. > > > > > > > > > Yes, but as I look at it, thats somewhat counter to my goal, which is > > > > to offer > > > > accelerated code paths on systems that can make use of it at run time. > > > > If We > > > > use the __builtin compiler functions, we will either: > > > > > > > > 1) Build those code paths with advanced instructions that won't work on > > > > older > > > > systems (i.e. crash) > > > > > > > > 2) Build those code paths with less advanced instructions, meaning that > > > > we won't > > > > speedup execution on systems that are capable of using the more advanced > > > > instructions. > > > > > > > > Using this run time check, we can, at least in these situations, make > > > > use of the > > > > accelerated paths when the instructions are available, and ignore them > > > > when > > > > they're not, at run time. > > > > > > > > What would be ideal, would be an alternative type macro, like the linux > > > > kernel > > > > employs, but implementing that would require some pretty significant > > > > work and > > > > testing. This seems like a much simpler approach. > > [...] > > > Now, a macro that selected an instruction optimized or generic path is > > fine, as > > long as it can happen at run time. The Linux kernel has such a feature, > > called > > alternatives. But its a complex subsystem that does run time replacement of > > instructions based on cpu feature flags. It would be great to have in the > > DPDK, > > but its a significant code base and difficult to maintain, which goes > > against > > your desire to reduce code. > > [...] 
> > > > Even though the code is written using intrinsics which correspond to SSE > > > operations, the compiler is free to use AVX instructions where necessary > > Not if you use the default machine target. > > > > > to improve performance. Therefore, if we go down this road, we need to > > > look to compile up the code for all microarchitectures, rather than just > > > assuming that we will get equivalent performance to "native" by turning > > > on the instruction set indicated by the primitives in the code. This is > > No, you compile for the least common demonitor system, and enable more > > performant paths opportunistically as run time checks allow. > > > > > where having one codepath recompiled multiple times will work far better > > > than having multiple code paths. > > Only if you're only concern is performance. As noted above, my goal is more > > than just performance, its compatibility accross systems. Multiple builds > > for > > multiple cpu flag availability is simply a non-starter for a generic > > distribution. > > Neil, we are mixing 2 different problems here. > 1) we have to fix default build (w
[dpdk-dev] free a memzone
Hi Konstantin, Thank you for your response. Your Idea seems great! I will implement it this way. Best Regards, Mahdi. On Mon, Jul 28, 2014 at 9:39 PM, Ananyev, Konstantin < konstantin.ananyev at intel.com> wrote: > Hi Mahdi, > > >Hi Konstantin, > >Thank you very much. Your solution fixed my problem. > >Is there a solution like this for resetting the memory zone, which is > used by rte_malloc function? > >Because if I use rte_malloc instead of malloc, in the case of application > crash, the memory zone, which was used by rte_malloc in the previous run > would be >unusable for the next run of slave process. > > Without significant modification inside librte_eal and/or librte_malloc - > nothing comes on top of my head. > If that is such a big problem to you, might be it is possible to change > your whole process model a bit: > > In the parent process: > - allocate memory/init run-time strcutures > L1: > - fork(); > - wait till child terminates > - if it terminated abnormally, then free memory/re-init run-time > structures. > Goto L1 > > Do actual packet processing inside child process. > > Would that help somehow? > Konstantin > > > -Original Message- > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Mahdi Dashtbozorgi > > Sent: Thursday, July 24, 2014 6:20 AM > > To: dev at dpdk.org > > Subject: Re: [dpdk-dev] free a memzone > > > > Hi Bruce, > > > > Thank you for the response. That's a great Idea! > > But I do not understand the last four parameters of this function. > (vaddr, > > paddr, pg_num, pg_shift) > > I guess vaddr is the virtual address of the previously allocated mempool, > yes > > > paddr is calculated using function call rte_mem_virt2phy(vaddr), am I > > right? > yes > > >what about pg_num and pg_shift? how can I pass them correctly? > From rte_mempool.h: > "* @param pg_num > * Number of elements in the paddr array. > * @param pg_shift > * LOG2 of the physical pages size." 
>
> If you are using a memzone as externally allocated memory, it will
> already be physically contiguous.
> So in your case pg_num = MEMPOOL_PG_NUM_DEFAULT, pg_shift =
> MEMPOOL_PG_SHIFT_MAX.
>
> Though, I don't think rte_mempool_xmem_create() will help you in any way.
> Again from rte_mempool.h:
> "* Creates a new mempool named *name* in memory.
> *
> * This function uses ``memzone_reserve()`` to allocate memory. The
> * pool contains n elements of elt_size. Its size is set to n.
> * Depending on the input parameters, mempool elements can be either allocated
> * together with the mempool header, or an externally provided memory buffer
> * could be used to store mempool objects. In later case, that external
> * memory buffer can consist of set of disjoint physical pages."
>
> So xmem_create would still create a new ring, reserve a new memzone for the
> mempool's metadata, etc.
> The only difference is that it can use externally allocated memory to store
> mempool elements.
>
> As I understand it, what you need is a sort of mempool_reset(): a function
> that would re-init the mempool to its just-created state
> (all elements are free, lcore caches are empty, etc).
> Right now we don't have such a function, but I suppose something like this
> should do
> (note that I didn't run or even build it):
>
> if ((mp = rte_mempool_lookup(name)) != NULL) {
>
>     char ring_name[RTE_RING_NAMESIZE];
>
>     /* save mp ring name. */
>     memcpy(ring_name, mp->ring->name, sizeof(ring_name));
>
>     /* reset the ring. */
>     rte_ring_init(mp->ring, ring_name, rte_align32pow2(mp->size + 1),
>         mp->ring->flags);
>
>     /* repopulate mempool and reinit all its elements. */
>     mempool_populate(mp, mp->size, 1, rte_pktmbuf_init, NULL);
>
>     /* reset all lcore caches. */
>     memset(mp->local_cache, 0, sizeof(mp->local_cache));
>
>     /* reset statistics if needed. */
> } else {
>     /* create new mempool. */
> }
>
> Ideally such a function should be in librte_mempool of course, but if
> you are in a hurry you can probably give it a try.
> > Note that I assume that no other process, except failed/restarting > secondary are using this mempool. > If primary or some other secondary do, then first you need to stop them > using this mempool and wait till they finish all their packet processing > activity. > > Konstantin > > > Best Regards, > > Mahdi. > > > > > > On Thu, Jul 24, 2014 at 9:48 AM, Mahdi Dashtbozorgi > > wrote: > > > > > Hi Bruce, > > > > > > Thank you for the response. That's a great Idea! > > > But I do not understand the last four parameters of this function. > (vaddr, > > > paddr, pg_num, pg_shift) > > > I guess vaddr is the virtual address of the previously allocated > mempool, > > > paddr is calculated using function call rte_mem_virt2phy(vaddr), am I > > > right? what about pg_num and pg_shift? how can I pass them correctly? > > > > > > Best Regards, > > > Mahdi. > > > > > > > > > On Wed, Jul 23, 2014 at 11:09 PM, Richardson, Bruce < > > > bruce.richardson at intel.com> wrote: > > > > > >> Rather than freeing the previously allocated memzone,
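The parent/child process model suggested earlier in this thread (init in the parent, fork, wait, re-init on abnormal exit, loop) could be sketched roughly as below. This is only an illustration under assumptions: `supervise`, `demo_reinit` and `demo_worker` are hypothetical names, not DPDK APIs, and in a real application the re-init hook would be where the memzone/mempool state gets reset.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical hooks standing in for the application's own code. */
typedef void (*reinit_fn)(void);   /* parent: free memory / re-init structures */
typedef int (*worker_fn)(void);    /* child: packet processing, 0 == clean exit */

/*
 * Run `worker` in a child process. If the child terminates abnormally
 * (killed by a signal or non-zero exit status), call `reinit` in the
 * parent and fork again, at most `max_restarts` times.
 * Returns the number of restarts performed, or -1 if fork()/waitpid() failed.
 */
static int supervise(reinit_fn reinit, worker_fn worker, int max_restarts)
{
    int restarts = 0;

    for (;;) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0)
            _exit(worker());           /* child never returns to the loop */

        int status;
        if (waitpid(pid, &status, 0) < 0)
            return -1;

        /* Clean exit: nothing to repair, stop supervising. */
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            return restarts;

        if (restarts >= max_restarts)
            return restarts;

        reinit();                      /* the "goto L1" step of the outline */
        restarts++;
    }
}

/* Demo hooks: the worker fails until the parent has re-initialised twice. */
static int demo_generation;
static void demo_reinit(void) { demo_generation++; }
static int demo_worker(void) { return demo_generation < 2 ? 1 : 0; }
```

Calling `supervise(demo_reinit, demo_worker, 5)` forks three children in total: the first two exit with a failure status and the third exits cleanly. The child sees the parent's state as of the `fork()`, which is why the re-init has to happen in the parent before the next iteration.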
[dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote: > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote: > > 2014-07-31 09:13, Neil Horman: > > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote: > > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote: > > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote: > > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote: > > > > > > > Hey all- > > > > > > > I've been trying to update the fedora dpdk package to > > > > > > > support VFIO > > > > > > > enabled drivers and ran into a problem in which ixgbe didn't > > > > > > > compile because the > > > > > > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't > > > > > > > supported in the > > > > > > > default config I have. I tried to remedy this by replacing the > > > > > > > intrinsics with > > > > > > > the __builtin macros, but it was pointed out (correctly), that > > > > > > > this doesn't work > > > > > > > properly. So this is my second attempt, which I actually like a > > > > > > > bit better. I > > > > > > > noted that code that uses intrinsics (ixgbe and the acl library), > > > > > > > don't need to > > > > > > > have those instructions turned on build-wide. Rather, we can > > > > > > > just enable the > > > > > > > instructions in the specific code we want to build with support > > > > > > > for that, and > > > > > > > test for instruction support dynamically at run time. This > > > > > > > allows me to build > > > > > > > the dpdk for a generic platform, but in such a way that some > > > > > > > optimizations can > > > > > > > be used if the executing cpu supports them at run time. 
> > > > > > > > > > > > > > Signed-off-by: Neil Horman > > > > > > > CC: Thomas Monjalon > > > > > > > > > > > > > I'd prefer if a solution could be found based off your original > > > > > > patch > > > > > > set, as it gives us more chance to deprecate the older code paths in > > > > > > future. Looking at the Intel Intrinsics Guide site online, it shows > > > > > > that > > > > > > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than > > > > > > SSE4.x, > > > > > > and so should be available on all 64-bit systems, I believe. The > > > > > > popcount intrinsic is newer, but it's a much more basic instruction > > > > > > so > > > > > > hopefully the __builtin should work for that. > > > > > > > > > > > Yes, but as I look at it, thats somewhat counter to my goal, which is > > > > > to offer > > > > > accelerated code paths on systems that can make use of it at run > > > > > time. If We > > > > > use the __builtin compiler functions, we will either: > > > > > > > > > > 1) Build those code paths with advanced instructions that won't work > > > > > on older > > > > > systems (i.e. crash) > > > > > > > > > > 2) Build those code paths with less advanced instructions, meaning > > > > > that we won't > > > > > speedup execution on systems that are capable of using the more > > > > > advanced > > > > > instructions. > > > > > > > > > > Using this run time check, we can, at least in these situations, make > > > > > use of the > > > > > accelerated paths when the instructions are available, and ignore > > > > > them when > > > > > they're not, at run time. > > > > > > > > > > What would be ideal, would be an alternative type macro, like the > > > > > linux kernel > > > > > employs, but implementing that would require some pretty significant > > > > > work and > > > > > testing. This seems like a much simpler approach. > > > > [...] > > > > > Now, a macro that selected an instruction optimized or generic path is > > > fine, as > > > long as it can happen at run time. 
The Linux kernel has such a feature, > > > called > > > alternatives. But its a complex subsystem that does run time replacement > > > of > > > instructions based on cpu feature flags. It would be great to have in > > > the DPDK, > > > but its a significant code base and difficult to maintain, which goes > > > against > > > your desire to reduce code. > > > > [...] > > > > > > Even though the code is written using intrinsics which correspond to SSE > > > > operations, the compiler is free to use AVX instructions where necessary > > > Not if you use the default machine target. > > > > > > > to improve performance. Therefore, if we go down this road, we need to > > > > look to compile up the code for all microarchitectures, rather than just > > > > assuming that we will get equivalent performance to "native" by turning > > > > on the instruction set indicated by the primitives in the code. This is > > > No, you compile for the least common demonitor system, and enable more > > > performant paths opportunistically as run time checks allow. > > > > > > > where having one codepath recompiled multiple times will work far better > > > > than having multiple code paths. > > > Only if you're only concern
[dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote: > With regards to the general approach for runtime detection of software > functions, I wonder if something like this can be handled by the > packaging system? Is it possible to ship out a set of shared libs > compiled up for different instruction sets, and then at rpm install > time, symlink the appropriate library? This would push the whole issue > of detection of code paths outside of code, work across all our > libraries and ensure each user got the best performance they could get > from a binary? > Has something like this been done before? The building of all the > libraries could be scripted easy enough, just do multiple builds using > different EXTRA_CFLAGS each time, and move and rename the .so's after > each run. I'm not aware of a package that does anything like that. It probably is possible, but I imagine that it would provoke a lot of debate and consternation in FESCO... -- John W. Linville (linville at tuxdriver.com): Someday the world will need a hero, and you might be all we have. Be ready.
[dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote: > Thu, Jul 31, 2014 at 02:10:32PM -0400, Neil Horman wrote: > > On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote: > > > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote: > > > > 2014-07-31 09:13, Neil Horman: > > > > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote: > > > > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote: > > > > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote: > > > > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote: > > > > > > > > > Hey all- > > With regards to the general approach for runtime detection of software > functions, I wonder if something like this can be handled by the > packaging system? Is it possible to ship out a set of shared libs > compiled up for different instruction sets, and then at rpm install > time, symlink the appropriate library? This would push the whole issue > of detection of code paths outside of code, work across all our > libraries and ensure each user got the best performance they could get > form a binary? > Has something like this been done before? The building of all the > libraries could be scripted easy enough, just do multiple builds using > different EXTRA_CFLAGS each time, and move and rename the .so's after > each run. > Sorry, I missed this in my last reply. In answer to your question, the short version is that such a thing is roughly possible from a packaging standpoint, but completely unworkable from a distribution standpoint. 
We could certainly build the dpdk multiple times and rename all the shared objects to some variant name representative of the optimizations we build in for certain cpu flags, but then we would be shipping X versions of the dpdk, and any application (say OVS) that made use of the dpdk would need to provide a version linked against each variant to be useful when making a product, and each end user would need to manually select (or run a script to select) which variant is most optimized for the system at hand. It's just not a reasonable way to package a library. When packaging software, the only consideration given to code variance at package time is architecture (x86/x86_64/ppc/s390/etc). If you install a package for a given architecture, it's expected to run on that architecture. Optional code paths are just that, optional, and executed based on run time tests. It's a requirement that we build for the lowest common denominator system that is supported, and enable accelerative code paths optionally at run time when the cpu indicates support for them. Neil
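As an illustration of the "lowest common denominator build plus run-time checks" model described above, here is a minimal sketch in C. It is not taken from the patch set: the names `popcount64_generic`, `popcount64_hw` and `select_popcount64` are made up for this example, and it assumes a GCC-compatible compiler (for the `target` function attribute and `__builtin_cpu_supports`). Only one function is compiled with POPCNT enabled, and it is only reachable after the CPU check succeeds.

```c
#include <stdint.h>

/* Portable fallback: safe on any CPU the baseline build targets. */
static unsigned popcount64_generic(uint64_t v)
{
    unsigned n = 0;

    while (v) {
        v &= v - 1;    /* clear the lowest set bit */
        n++;
    }
    return n;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/* Only this function is built with POPCNT enabled; everything else in
 * the file keeps the baseline instruction set. */
static __attribute__((target("popcnt"))) unsigned popcount64_hw(uint64_t v)
{
    return (unsigned)__builtin_popcountll(v);
}
#endif

typedef unsigned (*popcount64_fn)(uint64_t);

/* Pick the best implementation the running CPU supports. Intended to be
 * called once at init time, with the result cached by the caller. */
static popcount64_fn select_popcount64(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("popcnt"))
        return popcount64_hw;
#endif
    return popcount64_generic;
}
```

A caller would store the pointer returned by `select_popcount64()` once and use it in the hot path. The cost of this design is an indirect call per use, so for per-packet code the dispatch is usually better done at a coarser granularity (for example, selecting a whole RX burst function).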
[dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
On Thu, Jul 31, 2014 at 03:01:17PM -0400, Neil Horman wrote: > On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote: > > > > I think a good first step here that I can't see anyone objecting to is > > to enable the ixgbe driver to use the vector code path for a generic > > x86_64 build. I've run a quick test here, and changing "_mm_popcnt_u64" > > to "__builtin_popcountll" [and the include from nmmintrin to tmmintrin] > > allows a compile for machine type default, and testpmd can still forward > > packets at a good rate (roughly perf down about 10% vs native compile on > > SNB). > > The ACL is a tougher nut to crack, but anyone see any issues with that > > two-line change to ixgbe_rxtx_vec.c? [Neil, since you started the patch > > set thread, do you want to submit an official patch here, or would you > > prefer I > > do so?] > > > > I'm happy to do so, Though 10% performance degradation vs. using the sse4.2 > instructions in that path seems significant, isn't it? Given that performance > delta, it seems like it would still be preferable to have a path that used the > sse4.2 instructions when they're available. Or am I misreading what you mean > when you say down 10% > > Neil > Ok, I did a little bit more testing here. Using the vector pmd compiled for generic x86_64 and using __builtin_popcountll is approx 35% faster for packet IO than the existing fast-path functions. It is also 7% (a bit lower than ~10% as I originally stated) slower than the existing native-compiled vpmd on a Sandy Bridge platform. I then ran an extra test, using EXTRA_CFLAGS='-msse4.2' to turn on the extra instructions. The ~7% performance drop went to ~3%, so we would gain a little more with using SSE4.2, but compared to the gain from having the vector driver at all, it's not that much. [I don't have a system handy with AVX2 support to see what boosts might come from compiling with that instruction set enabled.] 
Because of this, I'd take the ~35% speed boost for now, and try and find what would be the best general way to solve this problem across all libraries. Also, I think that anyone who needs that extra 4% performance probably wants the other 3% too, and so will compile up the code from source using the "native" compilation target. :-) /Bruce
[dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
On Thu, Jul 31, 2014 at 03:58:30PM -0400, John W. Linville wrote: > On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote: > > > With regards to the general approach for runtime detection of software > > functions, I wonder if something like this can be handled by the > > packaging system? Is it possible to ship out a set of shared libs > > compiled up for different instruction sets, and then at rpm install > > time, symlink the appropriate library? This would push the whole issue > > of detection of code paths outside of code, work across all our > > libraries and ensure each user got the best performance they could get > > form a binary? > > Has something like this been done before? The building of all the > > libraries could be scripted easy enough, just do multiple builds using > > different EXTRA_CFLAGS each time, and move and rename the .so's after > > each run. > > I'm not aware of a package that does anything like that. It probably > is possible, but I imagine that it would provoke a lot of debate > and consternation in FESCO... Nothing like a bit of consternation to get the adrenaline pumping, right :-) BTW: what is FESCO? /Bruce > > -- > John W. Linville Someday the world will need a hero, and you > linville at tuxdriver.com might be all we have. Be ready.
[dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
On Thu, Jul 31, 2014 at 04:10:18PM -0400, Neil Horman wrote: > On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote: > > Thu, Jul 31, 2014 at 02:10:32PM -0400, Neil Horman wrote: > > > On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote: > > > > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote: > > > > > 2014-07-31 09:13, Neil Horman: > > > > > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote: > > > > > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote: > > > > > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson > > > > > > > > wrote: > > > > > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote: > > > > > > > > > > Hey all- > > > > With regards to the general approach for runtime detection of software > > functions, I wonder if something like this can be handled by the > > packaging system? Is it possible to ship out a set of shared libs > > compiled up for different instruction sets, and then at rpm install > > time, symlink the appropriate library? This would push the whole issue > > of detection of code paths outside of code, work across all our > > libraries and ensure each user got the best performance they could get > > form a binary? > > Has something like this been done before? The building of all the > > libraries could be scripted easy enough, just do multiple builds using > > different EXTRA_CFLAGS each time, and move and rename the .so's after > > each run. > > > > Sorry, I missed this in my last reply. > > In answer to your question, the short version is that such a thing is roughly > possible from a packaging standpoint, but completely unworkable from a > distribution standpoint. 
We could certainly build the dpdk multiple times and > rename all the shared objects to some variant name representative of the > optimizations we build in for certain cpu flags, but then we would be shipping X > versions of the dpdk, and any application (say OVS) that made use of the dpdk > would need to provide a version linked against each variant to be useful when > making a product, and each end user would need to manually select (or run a > script to select) which variant is most optimized for the system at hand. It's > just not a reasonable way to package a library. Sorry, perhaps I was not clear: making the user select the appropriate library was not what I was suggesting. Instead, I was suggesting that the rpm install "librte_pmd_ixgbe.so.generic", "librte_pmd_ixgbe.so.sse42" and "librte_pmd_ixgbe.so.avx". Then the rpm post-install script would look at the cpuflags in cpuinfo and symlink librte_pmd_ixgbe.so to the best-match version. That way the user only has to link against "librte_pmd_ixgbe.so" and, depending on the system it's run on, the loader will automatically resolve the symbols from the appropriate instruction-set specific .so file. > > When packaging software, the only consideration given to code variance at package > time is architecture (x86/x86_64/ppc/s390/etc). If you install a package for > a given architecture, it's expected to run on that architecture. Optional > code paths are just that, optional, and executed based on run time tests. > It's a requirement that we build for the lowest common denominator system that is > supported, and enable accelerative code paths optionally at run time when the > cpu indicates support for them. > > Neil >
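The selection step of the post-install idea above can be sketched as follows. A real rpm scriptlet would normally be a shell script. This C version only illustrates the matching logic, and `has_cpu_flag` and `pick_pmd_variant` are hypothetical helpers, not part of DPDK or rpm. It assumes the input is the space-separated `flags` line from /proc/cpuinfo, where SSE4.2 is spelled `sse4_2`.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if `flag` (non-empty) appears as a whole whitespace-delimited
 * token in `flags`, so that "avx" does not match "avx2". */
static int has_cpu_flag(const char *flags, const char *flag)
{
    size_t len = strlen(flag);
    const char *p = flags;

    while ((p = strstr(p, flag)) != NULL) {
        int starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
        int ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

        if (starts && ends)
            return 1;
        p += len;
    }
    return 0;
}

/* Map the CPU flags to the best-match library name from the message above. */
static const char *pick_pmd_variant(const char *flags)
{
    if (has_cpu_flag(flags, "avx"))
        return "librte_pmd_ixgbe.so.avx";
    if (has_cpu_flag(flags, "sse4_2"))
        return "librte_pmd_ixgbe.so.sse42";
    return "librte_pmd_ixgbe.so.generic";
}
```

The install script would then symlink `librte_pmd_ixgbe.so` to whatever name is returned. Note that the choice is frozen at install time, so a disk image moved to a different CPU would keep the old link, which is part of the distribution concern raised earlier in the thread.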
[dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
On Thu, Jul 31, 2014 at 01:20:42PM -0700, Bruce Richardson wrote: > On Thu, Jul 31, 2014 at 03:58:30PM -0400, John W. Linville wrote: > > On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote: > > > > > With regards to the general approach for runtime detection of software > > > functions, I wonder if something like this can be handled by the > > > packaging system? Is it possible to ship out a set of shared libs > > > compiled up for different instruction sets, and then at rpm install > > > time, symlink the appropriate library? This would push the whole issue > > > of detection of code paths outside of code, work across all our > > > libraries and ensure each user got the best performance they could get > > > from a binary? > > > Has something like this been done before? The building of all the > > > libraries could be scripted easy enough, just do multiple builds using > > > different EXTRA_CFLAGS each time, and move and rename the .so's after > > > each run. > > > > I'm not aware of a package that does anything like that. It probably > > is possible, but I imagine that it would provoke a lot of debate > > and consternation in FESCO... > > Nothing like a bit of consternation to get the adrenaline pumping, right > :-) > BTW: what is FESCO? Fedora Engineering Steering Committee. Neil and I have already felt the hot breath of FESCO on our necks regarding the Fedora DPDK package... John -- John W. Linville (linville at tuxdriver.com): Someday the world will need a hero, and you might be all we have. Be ready.
[dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
2014-07-31 10:32, Neil Horman: > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote: > > Neil, we are mixing 2 different problems here. > > 1) we have to fix default build (without SSE-4.2) > > That's nothing to fix, that's a configuration issue. Just build for a lesser > machine. I've already done that in the fedora build, using the default machine > target. What exactly is missing from that? I mean that building with RTE_MACHINE=default is broken in ixgbe and acl. So the main priority is to fix it. But performance of the native build must not be degraded. And if we can have good performance with a default build, it's fine. -- Thomas
[dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
2014-07-31 14:10, Neil Horman: > On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote: > > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote: > > > Please, let's focus on the first item and we could discuss performance > > > later. Having some different code path chosen at runtime is a big rework and > > > implies changing the compilation model (RFC welcome). > > > > Even if I misinterpreted your statement above, I'm still not sure why you're > asserting this. Fixing the build to work with the default target machine is > good, and should be undertaken, and I'll happily do so, but why reject the > solution in front of you to wait for it? I'm not rejecting the solution. Let's try to improve performance of the default build with runtime checks. Seeing patches and benchmarks will be interesting. You're opening a door and I don't know if we'll see the ceiling of this room :) -- Thomas