[dpdk-dev] [PATCH] examples/vhost: increase MAX_QUEUE number
increase MAX_QUEUES from 256 to 512 In vhost example, MAX_QUEUES macro should be the maximum possible queue number of the port. Theoretically we should only set up the queues that are used, i.e., first rx queue of each pool, or at most queues from 0 to MAX_QUEUES. Before we revise the implementation and are certain all NICs support this well, add a remind message to user. Signed-off-by: Huawei Xie --- examples/vhost/main.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/examples/vhost/main.c b/examples/vhost/main.c index 0c76ece..d2c1852 100644 --- a/examples/vhost/main.c +++ b/examples/vhost/main.c @@ -53,7 +53,7 @@ #include "main.h" -#define MAX_QUEUES 256 +#define MAX_QUEUES 512 /* the maximum number of external ports supported */ #define MAX_SUP_PORTS 1 @@ -380,6 +380,12 @@ port_init(uint8_t port) /* The max pool number from dev_info will be used to validate the pool number specified in cmd line */ rte_eth_dev_info_get (port, &dev_info); + if (dev_info.max_rx_queues > MAX_QUEUES) { + rte_exit(EXIT_FAILURE, + "please define MAX_QUEUES no less than %u in %s\n", + dev_info.max_rx_queues, __FILE__); + } + rxconf = &dev_info.default_rxconf; txconf = &dev_info.default_txconf; rxconf->rx_drop_en = 1; -- 1.8.1.4
[dpdk-dev] [PATCH v4] VFIO: Avoid to enable vfio while the module not loaded
Hi Michael: > -Original Message- > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Qiu, Michael > Sent: Monday, December 08, 2014 8:28 AM > To: Burakov, Anatoly; dev at dpdk.org > Subject: Re: [dpdk-dev] [PATCH v4] VFIO: Avoid to enable vfio while the module > not loaded > > On 2014/12/8 20:19, Burakov, Anatoly wrote: > >> When vfio module is not loaded when kernel support vfio feature, the > >> routine still try to open the container to get file description. > >> > >> This action is not safe, and of cause got error messages: > >> > >> EAL: Detected 40 lcore(s) > >> EAL: unsupported IOMMU type! > >> EAL: VFIO support could not be initialized > >> EAL: Setting up memory... > >> > >> This may make user confuse, this patch make it reasonable and much more > >> soomth to user. > >> > >> Signed-off-by: Michael Qiu > >> --- > >> v4 --> v3: > >>1. Remove RTE_LOG for params check > >>2. Remove "vfio" module check as "vfio_iommu_type1" > >> loaded indecated "vfio" loaded > >> > >> v3 --> v2: > >> 1. Add error log in rte_eal_check_module() > >> 2. Some code clean up. > >> > >> v2 --> v1: > >> 1. Move check_module() from rte_common.h to eal_private.h > >>and rename to rte_eal_check_module(). > >>To make it linuxapp only. > >> 2. Some code clean up. > >> > >> lib/librte_eal/common/eal_private.h| 42 > >> ++ > >> lib/librte_eal/linuxapp/eal/eal_pci_vfio.c | 29 ++--- > >> 2 files changed, 68 insertions(+), 3 deletions(-) > >> > >> diff --git a/lib/librte_eal/common/eal_private.h > >> b/lib/librte_eal/common/eal_private.h > >> index 232fcec..e877a25 100644 > >> --- a/lib/librte_eal/common/eal_private.h > >> +++ b/lib/librte_eal/common/eal_private.h > >> @@ -35,6 +35,9 @@ > >> #define _EAL_PRIVATE_H_ > >> > >> #include > >> +#include > >> +#include > >> +#include > >> > >> /** > >> * Initialize the memzone subsystem (private to eal). > >> @@ -203,4 +206,43 @@ int rte_eal_alarm_init(void); > >> */ > >> int rte_eal_dev_init(void); > >> > >> +/** > >> + * Function is to check if the kernel module(like, vfio, > >> +vfio_iommu_type1, > >> + * etc.) loaded. > >> + * > >> + * @param module_name > >> + *The module's name which need to be checked > >> + * > >> + * @return > >> + *-1 means some error happens(NULL pointer or open failure) > >> + *0 means the module not loaded > >> + *1 means the module loaded > >> + */ > >> +static inline int > >> +rte_eal_check_module(const char *module_name) { > >> + char mod_name[30]; /* Any module names can be longer than 30 > >> bytes? */ > >> + int ret = 0; > >> + > >> + if (NULL == module_name) > >> + return -1; > >> + > >> + FILE * fd = fopen("/proc/modules", "r"); > >> + if (NULL == fd) { > >> + RTE_LOG(ERR, EAL, "Open /proc/modules failed!" > >> + " error %i (%s)\n", errno, strerror(errno)); > >> + return -1; > >> + } > >> + while(!feof(fd)) { > >> + fscanf(fd, "%s %*[^\n]", mod_name); I see it is already discussed in the other mail that we could limit the count to avoid overflow. I don't understand why you don't apply it here. There are already several existing modules that has 20+ length. > >> + if(!strcmp(mod_name, module_name)) { > >> + ret = 1; > >> + break; > >> + } > >> + } > >> + fclose(fd); > >> + > >> + return ret; > >> +} > >> + > > Apologies for not bringing this up before, but do we really want the > rte_eal_check_module inline in the header? I think it would be better to > declare > it in eal_private but move the definition into eal.c. > > No need, actually, I'm very appreciate that you can spend your time to > review my patch again and again. I really want to say thank you to you. > > For rte_eal_check_module inline in the header, it really no need stay in > header, so ugly. I will make new version of it, and re-post. > > > >> #endif /* _EAL_PRIVATE_H_ */ > >> diff --git a/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c > >> b/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c > >> index c1246e8..8c54d2a 100644 > >> --- a/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c > >> +++ b/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c > >> @@ -44,6 +44,7 @@ > >> #include > >> #include > >> #include > >> +#include > >> > >> #include "eal_filesystem.h" > >> #include "eal_pci_init.h" > >> @@ -339,10 +340,15 @@ pci_vfio_get_container_fd(void) > >>ret = ioctl(vfio_container_fd, VFIO_CHECK_EXTENSION, > >> VFIO_TYPE1_IOMMU); > >>if (ret != 1) { > >>if (ret < 0) > >> - RTE_LOG(ERR, EAL, " could not get IOMMU > >> type, " > >> - "error %i (%s)\n", errno, > >> strerror(errno)); > >> + RTE_LOG(ERR, EAL, " could not get IOMMU > >> type," > >> +
[dpdk-dev] [RFC PATCH 00/17] Single virtio implementation
Hi Michael, > -Original Message- > From: Qiu, Michael > Sent: Wednesday, December 10, 2014 12:03 AM > To: Ouyang, Changchun; Richardson, Bruce > Cc: dev at dpdk.org > Subject: Re: [dpdk-dev] [RFC PATCH 00/17] Single virtio implementation > > On 2014/12/9 22:19, Ouyang, Changchun wrote: > > Hi Bruce, > > > >> -Original Message- > >> From: Richardson, Bruce > >> Sent: Tuesday, December 9, 2014 5:47 PM > >> To: Ouyang, Changchun > >> Cc: Thomas Monjalon; dev at dpdk.org > >> Subject: Re: [dpdk-dev] [RFC PATCH 00/17] Single virtio > >> implementation > >> > >> On Tue, Dec 09, 2014 at 06:40:23AM +, Ouyang, Changchun wrote: > >>> > -Original Message- > From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com] > Sent: Tuesday, December 9, 2014 2:12 PM > To: Ouyang, Changchun > Cc: Qiu, Michael; Stephen Hemminger; dev at dpdk.org > Subject: Re: [dpdk-dev] [RFC PATCH 00/17] Single virtio > implementation > > 2014-12-09 05:41, Ouyang, Changchun: > > Hi > > > >> -Original Message- > >> From: Qiu, Michael > >> Sent: Tuesday, December 9, 2014 11:23 AM > >> To: Ouyang, Changchun; Thomas Monjalon; Stephen Hemminger > >> Cc: dev at dpdk.org > >> Subject: Re: [dpdk-dev] [RFC PATCH 00/17] Single virtio > >> implementation > >> > >> On 12/9/2014 9:11 AM, Ouyang, Changchun wrote: > >>> Hi Thomas, > >>> > -Original Message- > From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com] > Sent: Monday, December 8, 2014 5:31 PM > To: Ouyang, Changchun > Cc: dev at dpdk.org > Subject: Re: [dpdk-dev] [RFC PATCH 00/17] Single virtio > implementation > > Hi Changchun, > > 2014-12-08 14:21, Ouyang Changchun: > > This patch set bases on two original RFC patch sets from > > Stephen > Hemminger[stephen at networkplumber.org] > > Refer to > > [http://dpdk.org/ml/archives/dev/2014-August/004845.html ] > for > the original one. > > This patch set also resolves some conflict with latest codes > > and removed > duplicated codes. > > As you sent the patches, you appear as the author. > But I guess Stephen should be the author for some of them. > Please check who has contributed the most in each patch to > >> decide. > >>> You are right, most of patches originate from Stephen's > >>> patchset, except for the last one, To be honest, I am ok whoever > >>> is the author of this patch set, :-), We could co-own the > >>> feature of Single virtio if you all agree with it, and I think > >>> we couldn't finish Such a feature without collaboration among > >>> us, this is why I tried to communicate > >> with most of you to collect more feedback, suggestion and > >> comments for this feature. > >>> Very appreciate for all kinds of feedback, suggestion here, > >>> especially for > >> patch set from Stephen. > >>> According to your request, how could we make this patch set > >>> looks more > >> like Stephen as the author? > >>> Currently I add Stephen as Signed-off-by list in each patch(I > >>> got the > >> agreement from Stephen before doing this :-)). > >> > >> Hi Ouyang, > >> > >> "Signed-off-by" should be added by himself, because the one who > >> in the Signed-off-by list should take responsibility for it(like > >> potential > bugs/issues). > >> Although, lots of patches are originate from Stephen, we still > >> need himself add this line :) > > Hi Thomas, > > It that right? I can't add Stephen into Signed-off-by list even if > > I have gotten the agreement from Stephen, What 's the strict rule > here? > Stephen sent the patches with his Signed-off, then you added yours. > This is OK. > Using git am, author would have been Stephen. To change author now, > you can edit each commit with interactive rebase and "git commit > --amend -- author=Stephen". > No need to resend now. Please check it for next version of the > patchset. > > >>> So I understand correctly, Stephen need care for from patches from 1 > >>> to 16, I need care for the 17th patch from next version. > >>> What I mean "caring for" above is: debug and validate them and send > >>> out patches > >>> > >>> Thanks > >>> Changchun > >>> > >> Just to clarify Thomas point here about use of "git am". If you get a > >> patch from someone to test or work on, use "git am" to apply it, > >> rather than "git apply", since "git am" generates a commit in your > >> local repo and thereby maintains the original authorship of the > >> patch. If you do "git apply" and subsequently commit yourself, you - > >> rather than the original author - will appear as the "author" of the > >> patch, and you need to amend the
[dpdk-dev] [PATCH v3 0/3] enhance TX checksum command and csum forwarding engine
In the current codes, the "tx_checksum set (ip|udp|tcp|sctp|vxlan) (hw|sw) (port-id)" command is not easy to understand and extend, so the patch set enhances the tx_checksum command and reworks csum forwarding engine due to the change of tx_checksum command. The main changes of the tx_checksum command are listed below, 1> add "tx_checksum set tunnel (hw|sw|none) (port-id)" command The command is used to set/clear tunnel flag that is used to tell the NIC that the packetg is a tunneing packet and application want hardware TX checksum offload for outer layer, or inner layer, or both. The 'none' option means that user treat tunneling packet as ordinary packet when using the csum forward engine. for example, let say we have a tunnel packet: eth_hdr_out/ipv4_hdr_out/udp_hdr_out/vxlan_hdr/ehtr_hdr_in/ipv4_hdr_in/tcp_hdr_in. one of several scenarios: 1) User requests HW offload for ipv4_hdr_out checksum, and doesn't care is it a tunnelled packet or not. So he sets: tx_checksum set tunnel none 0 tx_checksum set ip hw 0 So for such case we should set tx_tunnel to 'none'. 2> add "tx_checksum set outer-ip (hw|sw) (port-id)" command The command is used to set/clear outer IP flag that is used to tell the NIC that application want hardware offload for outer layer. 3> remove the 'vxlan' option from the "tx_checksum set (ip|udp|tcp|sctp|vxlan) (hw|sw) (port-id)" command The command is used to set IP, UDP, TCP and SCTP TX checksum flag. In the case of tunneling packet, the IP, UDP, TCP and SCTP flags always concern inner layer. Moreover, replace the TESTPMD_TX_OFFLOAD_VXLAN_CKSUM flag with TESTPMD_TX_OFFLOAD_TUNNEL_CKSUM flag and add the TESTPMD_TX_OFFLOAD_OUTER_IP_CKSUM and TESTPMD_TX_OFFLOAD_NON_TUNNEL_CKSUM flag in test-pmd application. v2 change: redefine the 'none' behaviour for "tx_checksum set tunnel (hw|sw|none) (port-id)" command. v3 change: typo correction in cmdline help Jijiang Liu (3): add outer IP offload capability in librte_ether. add outer IP checksum capability in i40e PMD testpmd command lines of the tx_checksum and csum forwarding rework app/test-pmd/cmdline.c| 201 +++-- app/test-pmd/csumonly.c | 35 --- app/test-pmd/testpmd.h|6 +- lib/librte_ether/rte_ethdev.h |1 + lib/librte_pmd_i40e/i40e_ethdev.c |3 +- 5 files changed, 218 insertions(+), 28 deletions(-) -- 1.7.7.6
[dpdk-dev] [PATCH v3 1/3] librte_ether:add outer IP offload capability flag
If the flag is set in a PMD, which means the NIC(s) support TX checksum offload of tunneling packet. Signed-off-by: Jijiang Liu --- lib/librte_ether/rte_ethdev.h |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h index f66805d..bae59c3 100644 --- a/lib/librte_ether/rte_ethdev.h +++ b/lib/librte_ether/rte_ethdev.h @@ -916,6 +916,7 @@ struct rte_eth_conf { #define DEV_TX_OFFLOAD_SCTP_CKSUM 0x0010 #define DEV_TX_OFFLOAD_TCP_TSO 0x0020 #define DEV_TX_OFFLOAD_UDP_TSO 0x0040 +#define DEV_TX_OFFLOAD_OUTER_IPV4_CKSUM 0x0080 /**< Used for tunneling packet. */ struct rte_eth_dev_info { struct rte_pci_device *pci_dev; /**< Device PCI information. */ -- 1.7.7.6
[dpdk-dev] [PATCH v3 2/3] i40e:support outer IPv4 checksum capability
The DEV_TX_OFFLOAD_OUTER_IPV4_CKSUM flag is added in i40e capability set, which means the i40e supports TX checksum offload of tunneling packet. Signed-off-by: Jijiang Liu --- lib/librte_pmd_i40e/i40e_ethdev.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/lib/librte_pmd_i40e/i40e_ethdev.c b/lib/librte_pmd_i40e/i40e_ethdev.c index 87e750a..deddd0b 100644 --- a/lib/librte_pmd_i40e/i40e_ethdev.c +++ b/lib/librte_pmd_i40e/i40e_ethdev.c @@ -1491,7 +1491,8 @@ i40e_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) DEV_TX_OFFLOAD_IPV4_CKSUM | DEV_TX_OFFLOAD_UDP_CKSUM | DEV_TX_OFFLOAD_TCP_CKSUM | - DEV_TX_OFFLOAD_SCTP_CKSUM; + DEV_TX_OFFLOAD_SCTP_CKSUM | + DEV_TX_OFFLOAD_OUTER_IPV4_CKSUM; dev_info->reta_size = pf->hash_lut_size; dev_info->default_rxconf = (struct rte_eth_rxconf) { -- 1.7.7.6
[dpdk-dev] [PATCH v3 3/3] app/testpmd:change tx_checksum command and csum forwarding engine
The patch enhances the tx_checksum command and reworks csum forwarding engine due to the change of tx_checksum command. The main changes of the tx_checksum command are listed below, 1. add "tx_checksum set tunnel (hw|sw|none) (port-id)" command 2. add "tx_checksum set outer-ip (hw|sw) (port-id)" command 3. remove the "vxlan" option from the "tx_checksum set(ip|udp|tcp|sctp|vxlan) (hw|sw) (port-id)" command Moreover, replace the TESTPMD_TX_OFFLOAD_VXLAN_CKSUM flag with TESTPMD_TX_OFFLOAD_TUNNEL_CKSUM flag, and add the TESTPMD_TX_OFFLOAD_OUTER_IP_CKSUM and TESTPMD_TX_OFFLOAD_NON_TUNNEL_CKSUM flag. Signed-off-by: Jijiang Liu --- app/test-pmd/cmdline.c | 209 --- app/test-pmd/csumonly.c | 38 ++--- app/test-pmd/testpmd.h | 14 +++- 3 files changed, 234 insertions(+), 27 deletions(-) diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c index f79ea3e..9bfa9ef 100644 --- a/app/test-pmd/cmdline.c +++ b/app/test-pmd/cmdline.c @@ -316,16 +316,30 @@ static void cmd_help_long_parsed(void *parsed_result, "Disable hardware insertion of a VLAN header in" " packets sent on a port.\n\n" - "tx_cksum set (ip|udp|tcp|sctp|vxlan) (hw|sw) (port_id)\n" + "tx_cksum set (ip|udp|tcp|sctp) (hw|sw) (port_id)\n" "Select hardware or software calculation of the" " checksum with when transmitting a packet using the" " csum forward engine.\n" - "ip|udp|tcp|sctp always concern the inner layer.\n" - "vxlan concerns the outer IP and UDP layer (in" - " case the packet is recognized as a vxlan packet by" - " the forward engine)\n" + "In the case of tunneling packet, ip|udp|tcp|sctp" + " always concern the inner layer.\n\n" + + "tx_cksum set tunnel (hw|sw|none) (port_id)\n" + " Select hardware or software calculation of the" + " checksum with when transmitting a tunneling packet" + " using the csum forward engine.\n" + " The none option means treat tunneling packet as ordinary" + " packet when using the csum forward engine\n." + "Tunneling packet concerns the outer IP, inner IP" + " and inner L4\n" "Please check the NIC datasheet for HW limits.\n\n" + "tx_cksum set (outer-ip) (hw|sw) (port_id)\n" + "Select hardware or software calculation of the" + " checksum with when transmitting a packet using the" + " csum forward engine.\n" + "outer-ip always concern the outer layer of" + " tunneling packet.\n\n" + "tx_checksum show (port_id)\n" "Display tx checksum offload configuration\n\n" @@ -2861,6 +2875,181 @@ cmdline_parse_inst_t cmd_tx_vlan_reset = { }, }; +/* ENABLE HARDWARE INSERTION OF CHECKSUM IN TX PACKETS FOR TUNNELING */ +struct cmd_tx_cksum_tunnel_result { + cmdline_fixed_string_t tx_cksum; + cmdline_fixed_string_t mode; + cmdline_fixed_string_t type; + cmdline_fixed_string_t hwsw; + uint8_t port_id; +}; + +static void +cmd_tx_cksum_tunnel_parsed(void *parsed_result, + __attribute__((unused)) struct cmdline *cl, + __attribute__((unused)) void *data) +{ + struct cmd_tx_cksum_tunnel_result *res = parsed_result; + int hw = 0; + uint16_t ol_flags, mask = 0; + + if (port_id_is_invalid(res->port_id)) { + printf("invalid port %d\n", res->port_id); + return; + } + + if (!strcmp(res->mode, "set")) { + + if (!strcmp(res->hwsw, "hw")) + hw = 1; + else if (!strcmp(res->hwsw, "none")) { + ports[res->port_id].tx_ol_flags &= + ~(TESTPMD_TX_OFFLOAD_TUNNEL_CKSUM + | TESTPMD_TX_OFFLOAD_OUTER_IP_CKSUM); + ports[res->port_id].tx_ol_flags |= + TESTPMD_TX_OFFLOAD_NON_TUNNEL_CKSUM; + return; + } + + ports[res->port_id].tx_ol_flags &= + ~TESTPMD_TX_OFFLOAD_NON_TUNNEL_CKSUM; + mask = TESTPMD_TX_OFFLOAD_TUNNEL_CKSUM; + + if (hw) + ports[res->port_id].tx_ol_flags |= mask; + else + ports[res->port_id].tx_ol_flags &= (~mask); + } + + ol_flags = ports[res->port_id].tx_ol_flags; + prin
[dpdk-dev] [PATCH v4] VFIO: Avoid to enable vfio while the module not loaded
On 12/10/2014 8:17 AM, Xie, Huawei wrote: > Hi Michael: > >> -Original Message- >> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Qiu, Michael >> Sent: Monday, December 08, 2014 8:28 AM >> To: Burakov, Anatoly; dev at dpdk.org >> Subject: Re: [dpdk-dev] [PATCH v4] VFIO: Avoid to enable vfio while the >> module >> not loaded >> >> On 2014/12/8 20:19, Burakov, Anatoly wrote: When vfio module is not loaded when kernel support vfio feature, the routine still try to open the container to get file description. This action is not safe, and of cause got error messages: EAL: Detected 40 lcore(s) EAL: unsupported IOMMU type! EAL: VFIO support could not be initialized EAL: Setting up memory... This may make user confuse, this patch make it reasonable and much more soomth to user. Signed-off-by: Michael Qiu --- v4 --> v3: 1. Remove RTE_LOG for params check 2. Remove "vfio" module check as "vfio_iommu_type1" loaded indecated "vfio" loaded v3 --> v2: 1. Add error log in rte_eal_check_module() 2. Some code clean up. v2 --> v1: 1. Move check_module() from rte_common.h to eal_private.h and rename to rte_eal_check_module(). To make it linuxapp only. 2. Some code clean up. lib/librte_eal/common/eal_private.h| 42 ++ lib/librte_eal/linuxapp/eal/eal_pci_vfio.c | 29 ++--- 2 files changed, 68 insertions(+), 3 deletions(-) diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h index 232fcec..e877a25 100644 --- a/lib/librte_eal/common/eal_private.h +++ b/lib/librte_eal/common/eal_private.h @@ -35,6 +35,9 @@ #define _EAL_PRIVATE_H_ #include +#include +#include +#include /** * Initialize the memzone subsystem (private to eal). @@ -203,4 +206,43 @@ int rte_eal_alarm_init(void); */ int rte_eal_dev_init(void); +/** + * Function is to check if the kernel module(like, vfio, +vfio_iommu_type1, + * etc.) loaded. + * + * @param module_name + *The module's name which need to be checked + * + * @return + *-1 means some error happens(NULL pointer or open failure) + *0 means the module not loaded + *1 means the module loaded + */ +static inline int +rte_eal_check_module(const char *module_name) { + char mod_name[30]; /* Any module names can be longer than 30 bytes? */ + int ret = 0; + + if (NULL == module_name) + return -1; + + FILE * fd = fopen("/proc/modules", "r"); + if (NULL == fd) { + RTE_LOG(ERR, EAL, "Open /proc/modules failed!" + " error %i (%s)\n", errno, strerror(errno)); + return -1; + } + while(!feof(fd)) { + fscanf(fd, "%s %*[^\n]", mod_name); > I see it is already discussed in the other mail that we could limit the count > to avoid overflow. > I don't understand why you don't apply it here. Yes, I have reply in this thread with below: fscanf(fd, "%30s %*[^\n]", mod_name); > There are already several existing modules that has 20+ length. Yes, would you think 30 length is enough? Otherwise I need to increase the length Thanks, Michael + if(!strcmp(mod_name, module_name)) { + ret = 1; + break; + } + } + fclose(fd); + + return ret; +} + >>> Apologies for not bringing this up before, but do we really want the >> rte_eal_check_module inline in the header? I think it would be better to >> declare >> it in eal_private but move the definition into eal.c. >> >> No need, actually, I'm very appreciate that you can spend your time to >> review my patch again and again. I really want to say thank you to you. >> >> For rte_eal_check_module inline in the header, it really no need stay in >> header, so ugly. I will make new version of it, and re-post. >> >> #endif /* _EAL_PRIVATE_H_ */ diff --git a/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c b/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c index c1246e8..8c54d2a 100644 --- a/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c +++ b/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c @@ -44,6 +44,7 @@ #include #include #include +#include #include "eal_filesystem.h" #include "eal_pci_init.h" @@ -339,10 +340,15 @@ pci_vfio_get_container_fd(void) ret = ioctl(vfio_container_fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU); if (ret != 1) { if (ret < 0) - RTE_L
[dpdk-dev] [PATCH v5] VFIO: Avoid to enable vfio while the module not loaded
When vfio module is not loaded when kernel support vfio feature, the routine still try to open the container to get file description. This action is not safe, and of cause got error messages: EAL: Detected 40 lcore(s) EAL: unsupported IOMMU type! EAL: VFIO support could not be initialized EAL: Setting up memory... This may make user confuse, this patch make it reasonable and much more soomth to user. Signed-off-by: Michael Qiu --- v5 --> v4 1. Move rte_eal_check_module() body to eal.c 2. Clean up "unsupported IOMMU type" log v4 --> v3: 1. Remove RTE_LOG for params check 2. Remove "vfio" module check as "vfio_iommu_type1" loaded indecated "vfio" loaded v3 --> v2: 1. Add error log in rte_eal_check_module() 2. Some code clean up. v2 --> v1: 1. Move check_module() from rte_common.h to eal_private.h and rename to rte_eal_check_module(). To make it linuxapp only. 2. Some code clean up. lib/librte_eal/common/eal_private.h| 15 +++ lib/librte_eal/linuxapp/eal/eal.c | 27 +++ lib/librte_eal/linuxapp/eal/eal_pci_vfio.c | 26 +++--- 3 files changed, 65 insertions(+), 3 deletions(-) diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h index 232fcec..4183b54 100644 --- a/lib/librte_eal/common/eal_private.h +++ b/lib/librte_eal/common/eal_private.h @@ -203,4 +203,19 @@ int rte_eal_alarm_init(void); */ int rte_eal_dev_init(void); +/** + * Function is to check if the kernel module(like, vfio, vfio_iommu_type1, + * etc.) loaded. + * + * @param module_name + * The module's name which need to be checked + * + * @return + * -1 means some error happens(NULL pointer or open failure) + * 0 means the module not loaded + * 1 means the module loaded + */ +inline int +rte_eal_check_module(const char *module_name); + #endif /* _EAL_PRIVATE_H_ */ diff --git a/lib/librte_eal/linuxapp/eal/eal.c b/lib/librte_eal/linuxapp/eal/eal.c index 89f3b5e..40b462e 100644 --- a/lib/librte_eal/linuxapp/eal/eal.c +++ b/lib/librte_eal/linuxapp/eal/eal.c @@ -859,3 +859,30 @@ int rte_eal_has_hugepages(void) { return ! internal_config.no_hugetlbfs; } + +inline int +rte_eal_check_module(const char *module_name) +{ + char mod_name[30]; /* Any module names can be longer than 30 bytes? */ + int ret = 0; + + if (NULL == module_name) + return -1; + + FILE * fd = fopen("/proc/modules", "r"); + if (NULL == fd) { + RTE_LOG(ERR, EAL, "Open /proc/modules failed!" + " error %i (%s)\n", errno, strerror(errno)); + return -1; + } + while(!feof(fd)) { + fscanf(fd, "%30s %*[^\n]", mod_name); + if(!strcmp(mod_name, module_name)) { + ret = 1; + break; + } + } + fclose(fd); + + return ret; +} diff --git a/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c b/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c index c1246e8..16fe10f 100644 --- a/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c +++ b/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c @@ -44,6 +44,7 @@ #include #include #include +#include #include "eal_filesystem.h" #include "eal_pci_init.h" @@ -339,10 +340,12 @@ pci_vfio_get_container_fd(void) ret = ioctl(vfio_container_fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU); if (ret != 1) { if (ret < 0) - RTE_LOG(ERR, EAL, " could not get IOMMU type, " - "error %i (%s)\n", errno, strerror(errno)); + RTE_LOG(ERR, EAL, " could not get IOMMU type," + " error %i (%s)\n", errno, + strerror(errno)); else - RTE_LOG(ERR, EAL, " unsupported IOMMU type!\n"); + RTE_LOG(ERR, EAL, " unsupported IOMMU type" + " detected in VFIO\n"); close(vfio_container_fd); return -1; } @@ -783,11 +786,28 @@ pci_vfio_enable(void) { /* initialize group list */ int i; + int module_vfio_type1; for (i = 0; i < VFIO_MAX_GROUPS; i++) { vfio_cfg.vfio_groups[i].fd = -1; vfio_cfg.vfio_groups[i].group_no = -1; } + + module_vfio_type1 = rte_eal_check_module("vfio_iommu_type1"); + + /* return error directly */ + if (module_vfio_type1 == -1) { + RTE_LOG(INFO, EAL, "Could not get loaded module details!\n"); + return -1; + } + + /* return 0 if VFIO modules not loaded */ + if (module_vfio_type1 == 0) { + RTE_LOG(
[dpdk-dev] [PATCH] Avoid possible memory cpoy when sort hugepages
When the first address is the compared address in the loop, it will also do memory copy, which is meaningless, worse more, when hugepg_tbl is mostly in order. This should be a big deal in large hugepage memory systerm(like hunderd or thousand GB). Meanwhile smallest_idx never be a value of -1,so remove this check. This patch also includes some coding style fix. Signed-off-by: Michael Qiu --- lib/librte_eal/linuxapp/eal/eal_memory.c | 13 + 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c index e6cb919..700aba2 100644 --- a/lib/librte_eal/linuxapp/eal/eal_memory.c +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c @@ -678,14 +678,13 @@ error: static int sort_by_physaddr(struct hugepage_file *hugepg_tbl, struct hugepage_info *hpi) { - unsigned i, j; - int compare_idx; + unsigned i, j, compare_idx; uint64_t compare_addr; struct hugepage_file tmp; for (i = 0; i < hpi->num_pages[0]; i++) { compare_addr = 0; - compare_idx = -1; + compare_idx = i; /* * browse all entries starting at 'i', and find the @@ -704,11 +703,9 @@ sort_by_physaddr(struct hugepage_file *hugepg_tbl, struct hugepage_info *hpi) } } - /* should not happen */ - if (compare_idx == -1) { - RTE_LOG(ERR, EAL, "%s(): error in physaddr sorting\n", __func__); - return -1; - } + /* avoid memory copy when the first entry is the compared */ + if (compare_idx == i) + continue; /* swap the 2 entries in the table */ memcpy(&tmp, &hugepg_tbl[compare_idx], -- 1.9.3
[dpdk-dev] [PATCH] igb_uio: kernel version check for using kstrtoul or strict_strtoul
strict_strtoul() was just a redefinition of kstrtoul() for a long time. From kernel version of 3.18, strict_strtoul() will not be defined at all. A compile time kernel version check is needed to decide which function or macro can be used for a specific version of kernel. Signed-off-by: Helin Zhang --- lib/librte_eal/linuxapp/igb_uio/igb_uio.c | 4 1 file changed, 4 insertions(+) diff --git a/lib/librte_eal/linuxapp/igb_uio/igb_uio.c b/lib/librte_eal/linuxapp/igb_uio/igb_uio.c index d1ca26e..2fcc5f4 100644 --- a/lib/librte_eal/linuxapp/igb_uio/igb_uio.c +++ b/lib/librte_eal/linuxapp/igb_uio/igb_uio.c @@ -83,7 +83,11 @@ store_max_vfs(struct device *dev, struct device_attribute *attr, unsigned long max_vfs; struct pci_dev *pdev = container_of(dev, struct pci_dev, dev); +#if LINUX_VERSION_CODE < KERNEL_VERSION(3, 2, 0) if (0 != strict_strtoul(buf, 0, &max_vfs)) +#else + if (0 != kstrtoul(buf, 0, &max_vfs)) +#endif return -EINVAL; if (0 == max_vfs) -- 1.9.3
[dpdk-dev] [PATCH] igb_uio: kernel version check for using kstrtoul or strict_strtoul
Here is my patch for it, and it also resolves issue of pci_num_vf() definition. And I will send V3 for a while. On 12/10/2014 10:38 AM, Helin Zhang wrote: > strict_strtoul() was just a redefinition of kstrtoul() for a long > time. From kernel version of 3.18, strict_strtoul() will not be > defined at all. A compile time kernel version check is needed to > decide which function or macro can be used for a specific version > of kernel. > > Signed-off-by: Helin Zhang > --- > lib/librte_eal/linuxapp/igb_uio/igb_uio.c | 4 > 1 file changed, 4 insertions(+) > > diff --git a/lib/librte_eal/linuxapp/igb_uio/igb_uio.c > b/lib/librte_eal/linuxapp/igb_uio/igb_uio.c > index d1ca26e..2fcc5f4 100644 > --- a/lib/librte_eal/linuxapp/igb_uio/igb_uio.c > +++ b/lib/librte_eal/linuxapp/igb_uio/igb_uio.c > @@ -83,7 +83,11 @@ store_max_vfs(struct device *dev, struct device_attribute > *attr, > unsigned long max_vfs; > struct pci_dev *pdev = container_of(dev, struct pci_dev, dev); > > +#if LINUX_VERSION_CODE < KERNEL_VERSION(3, 2, 0) > if (0 != strict_strtoul(buf, 0, &max_vfs)) > +#else > + if (0 != kstrtoul(buf, 0, &max_vfs)) > +#endif > return -EINVAL; > > if (0 == max_vfs)
[dpdk-dev] [PATCH] igb_uio: kernel version check for using kstrtoul or strict_strtoul
Hi Jincheng Did you attach anything? I can see the text only. Could you forward your patch mail to me directly? Thanks a lot! Regards, Helin > -Original Message- > From: Jincheng Miao [mailto:jmiao at redhat.com] > Sent: Wednesday, December 10, 2014 10:55 AM > To: Zhang, Helin; dev at dpdk.org > Subject: Re: [dpdk-dev] [PATCH] igb_uio: kernel version check for using > kstrtoul > or strict_strtoul > > Here is my patch for it, and it also resolves issue of pci_num_vf() > definition. > > And I will send V3 for a while. > > > On 12/10/2014 10:38 AM, Helin Zhang wrote: > > strict_strtoul() was just a redefinition of kstrtoul() for a long > > time. From kernel version of 3.18, strict_strtoul() will not be > > defined at all. A compile time kernel version check is needed to > > decide which function or macro can be used for a specific version of > > kernel. > > > > Signed-off-by: Helin Zhang > > --- > > lib/librte_eal/linuxapp/igb_uio/igb_uio.c | 4 > > 1 file changed, 4 insertions(+) > > > > diff --git a/lib/librte_eal/linuxapp/igb_uio/igb_uio.c > > b/lib/librte_eal/linuxapp/igb_uio/igb_uio.c > > index d1ca26e..2fcc5f4 100644 > > --- a/lib/librte_eal/linuxapp/igb_uio/igb_uio.c > > +++ b/lib/librte_eal/linuxapp/igb_uio/igb_uio.c > > @@ -83,7 +83,11 @@ store_max_vfs(struct device *dev, struct > device_attribute *attr, > > unsigned long max_vfs; > > struct pci_dev *pdev = container_of(dev, struct pci_dev, dev); > > > > +#if LINUX_VERSION_CODE < KERNEL_VERSION(3, 2, 0) > > if (0 != strict_strtoul(buf, 0, &max_vfs)) > > +#else > > + if (0 != kstrtoul(buf, 0, &max_vfs)) #endif > > return -EINVAL; > > > > if (0 == max_vfs)
[dpdk-dev] [PATCH v2] i40e: bug fix of querying reta
There is a bug in querying reta, of storing the data to the correct entry. Code changes is straightforward for this bug. Signed-off-by: Helin Zhang --- lib/librte_pmd_i40e/i40e_ethdev.c| 2 +- lib/librte_pmd_i40e/i40e_ethdev_vf.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) v2 changes: * Added the fix for the same issue in i40e VF driver. diff --git a/lib/librte_pmd_i40e/i40e_ethdev.c b/lib/librte_pmd_i40e/i40e_ethdev.c index 899cb42..008d62c 100644 --- a/lib/librte_pmd_i40e/i40e_ethdev.c +++ b/lib/librte_pmd_i40e/i40e_ethdev.c @@ -1924,7 +1924,7 @@ i40e_dev_rss_reta_query(struct rte_eth_dev *dev, lut = I40E_READ_REG(hw, I40E_PFQF_HLUT(i >> 2)); for (j = 0; j < I40E_4_BIT_WIDTH; j++) { if (mask & (0x1 << j)) - reta_conf[idx].reta[shift] = ((lut >> + reta_conf[idx].reta[shift + j] = ((lut >> (CHAR_BIT * j)) & I40E_8_BIT_MASK); } } diff --git a/lib/librte_pmd_i40e/i40e_ethdev_vf.c b/lib/librte_pmd_i40e/i40e_ethdev_vf.c index 7ef7d58..fe46cf1 100644 --- a/lib/librte_pmd_i40e/i40e_ethdev_vf.c +++ b/lib/librte_pmd_i40e/i40e_ethdev_vf.c @@ -1760,7 +1760,7 @@ i40evf_dev_rss_reta_query(struct rte_eth_dev *dev, lut = I40E_READ_REG(hw, I40E_VFQF_HLUT(i >> 2)); for (j = 0; j < I40E_4_BIT_WIDTH; j++) { if (mask & (0x1 << j)) - reta_conf[idx].reta[shift] = + reta_conf[idx].reta[shift + j] = ((lut >> (CHAR_BIT * j)) & I40E_8_BIT_MASK); } -- 1.9.3
[dpdk-dev] [PATCH 2/4] igb_uio: replace strict_strtoul with kstrtoul
>From upstream kernel commit 3db2e9cd, strict_strto* serial functions are removed. So that we should directly used kstrtoul instead. Kstrtoul exists from RHEL6.4, so for compatible with old kernel and RHEL, add some logic to igb_uio/compat.h, same as what we do for pci_num_vf(). Signed-off-by: Jincheng Miao --- lib/librte_eal/linuxapp/igb_uio/compat.h |8 lib/librte_eal/linuxapp/igb_uio/igb_uio.c |4 ++-- 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/lib/librte_eal/linuxapp/igb_uio/compat.h b/lib/librte_eal/linuxapp/igb_uio/compat.h index a36f034..214d5de 100644 --- a/lib/librte_eal/linuxapp/igb_uio/compat.h +++ b/lib/librte_eal/linuxapp/igb_uio/compat.h @@ -44,6 +44,14 @@ static int pci_num_vf(struct pci_dev *dev) #endif /* < 2.6.34 */ +#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 39) && \ +(!(defined(RHEL_RELEASE_CODE) && \ + RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(6, 4))) + +#define kstrtoul strict_strtoul + +#endif /* < 2.6.39 */ + #if LINUX_VERSION_CODE < KERNEL_VERSION(3, 3, 0) && \ (!(defined(RHEL_RELEASE_CODE) && \ RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(6, 3))) diff --git a/lib/librte_eal/linuxapp/igb_uio/igb_uio.c b/lib/librte_eal/linuxapp/igb_uio/igb_uio.c index d1ca26e..47ff2f3 100644 --- a/lib/librte_eal/linuxapp/igb_uio/igb_uio.c +++ b/lib/librte_eal/linuxapp/igb_uio/igb_uio.c @@ -83,7 +83,7 @@ store_max_vfs(struct device *dev, struct device_attribute *attr, unsigned long max_vfs; struct pci_dev *pdev = container_of(dev, struct pci_dev, dev); - if (0 != strict_strtoul(buf, 0, &max_vfs)) + if (0 != kstrtoul(buf, 0, &max_vfs)) return -EINVAL; if (0 == max_vfs) @@ -174,7 +174,7 @@ store_max_read_request_size(struct device *dev, unsigned long size = 0; int ret; - if (strict_strtoul(buf, 0, &size) != 0) + if (0 != kstrtoul(buf, 0, &size)) return -EINVAL; ret = pcie_set_readrq(pci_dev, (int)size); -- 1.7.1
[dpdk-dev] [PATCH 3/4] kni: replace strict_strtoul with kstrtoul
>From upstream kernel commit 3db2e9cd, strict_strto* serial functions are removed. So that we should directly used kstrtoul instead. And add kni/compat.h for be compatible with older kernel. Signed-off-by: Jincheng Miao --- lib/librte_eal/linuxapp/kni/compat.h| 16 lib/librte_eal/linuxapp/kni/kni_vhost.c |2 +- 2 files changed, 17 insertions(+), 1 deletions(-) create mode 100644 lib/librte_eal/linuxapp/kni/compat.h diff --git a/lib/librte_eal/linuxapp/kni/compat.h b/lib/librte_eal/linuxapp/kni/compat.h new file mode 100644 index 000..c8c662c --- /dev/null +++ b/lib/librte_eal/linuxapp/kni/compat.h @@ -0,0 +1,16 @@ +/* + * Minimal wrappers to allow compiling kni on older kernels. + */ + +#ifndef RHEL_RELEASE_VERSION +#define RHEL_RELEASE_VERSION(a, b) (((a) << 8) + (b)) +#endif + +#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 39) && \ +(!(defined(RHEL_RELEASE_CODE) && \ + RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(6, 4))) + +#define kstrtoul strict_strtoul + +#endif /* < 2.6.39 */ + diff --git a/lib/librte_eal/linuxapp/kni/kni_vhost.c b/lib/librte_eal/linuxapp/kni/kni_vhost.c index 7bcc985..c05c868 100644 --- a/lib/librte_eal/linuxapp/kni/kni_vhost.c +++ b/lib/librte_eal/linuxapp/kni/kni_vhost.c @@ -740,7 +740,7 @@ set_sock_en(struct device *dev, struct device_attribute *attr, unsigned long en; int err = 0; - if (0 != strict_strtoul(buf, 0, &en)) + if (0 != kstrtoul(buf, 0, &en)) return -EINVAL; if (en) -- 1.7.1
[dpdk-dev] [PATCH 4/4] xen_dom0: replace strict_strtoul with kstrtoul
>From upstream kernel commit 3db2e9cd, strict_strto* serial functions are removed. So that we should directly used kstrtoul instead. And add xen_dom0/compat.h for be compatible with older kernel. Signed-off-by: Jincheng Miao --- lib/librte_eal/linuxapp/xen_dom0/compat.h | 16 lib/librte_eal/linuxapp/xen_dom0/dom0_mm_misc.c |2 +- 2 files changed, 17 insertions(+), 1 deletions(-) create mode 100644 lib/librte_eal/linuxapp/xen_dom0/compat.h diff --git a/lib/librte_eal/linuxapp/xen_dom0/compat.h b/lib/librte_eal/linuxapp/xen_dom0/compat.h new file mode 100644 index 000..ca6e160 --- /dev/null +++ b/lib/librte_eal/linuxapp/xen_dom0/compat.h @@ -0,0 +1,16 @@ +/* + * Minimal wrappers to allow compiling xen_dom0 on older kernels. + */ + +#ifndef RHEL_RELEASE_VERSION +#define RHEL_RELEASE_VERSION(a, b) (((a) << 8) + (b)) +#endif + +#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 39) && \ +(!(defined(RHEL_RELEASE_CODE) && \ + RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(6, 4))) + +#define kstrtoul strict_strtoul + +#endif /* < 2.6.39 */ + diff --git a/lib/librte_eal/linuxapp/xen_dom0/dom0_mm_misc.c b/lib/librte_eal/linuxapp/xen_dom0/dom0_mm_misc.c index dfb271d..8a3727d 100644 --- a/lib/librte_eal/linuxapp/xen_dom0/dom0_mm_misc.c +++ b/lib/librte_eal/linuxapp/xen_dom0/dom0_mm_misc.c @@ -123,7 +123,7 @@ store_memsize(struct device *dev, struct device_attribute *attr, int err = 0; unsigned long mem_size; - if (0 != strict_strtoul(buf, 0, &mem_size)) + if (0 != kstrtoul(buf, 0, &mem_size)) return -EINVAL; mutex_lock(&dom0_dev.data_lock); -- 1.7.1
[dpdk-dev] [PATCH] igb_uio: kernel version check for using kstrtoul or strict_strtoul
On 12/10/2014 11:02 AM, Zhang, Helin wrote: > Hi Jincheng > > Did you attach anything? I can see the text only. > Could you forward your patch mail to me directly? Thanks a lot! Hi Helin, I indeed ever saw one patch that was similar this one :) but not sure if it is post by Jincheng. You can go through patchwork to have a look. Thanks, Michael > Regards, > Helin > >> -Original Message- >> From: Jincheng Miao [mailto:jmiao at redhat.com] >> Sent: Wednesday, December 10, 2014 10:55 AM >> To: Zhang, Helin; dev at dpdk.org >> Subject: Re: [dpdk-dev] [PATCH] igb_uio: kernel version check for using >> kstrtoul >> or strict_strtoul >> >> Here is my patch for it, and it also resolves issue of pci_num_vf() >> definition. >> >> And I will send V3 for a while. >> >> >> On 12/10/2014 10:38 AM, Helin Zhang wrote: >>> strict_strtoul() was just a redefinition of kstrtoul() for a long >>> time. From kernel version of 3.18, strict_strtoul() will not be >>> defined at all. A compile time kernel version check is needed to >>> decide which function or macro can be used for a specific version of >>> kernel. >>> >>> Signed-off-by: Helin Zhang >>> --- >>> lib/librte_eal/linuxapp/igb_uio/igb_uio.c | 4 >>> 1 file changed, 4 insertions(+) >>> >>> diff --git a/lib/librte_eal/linuxapp/igb_uio/igb_uio.c >>> b/lib/librte_eal/linuxapp/igb_uio/igb_uio.c >>> index d1ca26e..2fcc5f4 100644 >>> --- a/lib/librte_eal/linuxapp/igb_uio/igb_uio.c >>> +++ b/lib/librte_eal/linuxapp/igb_uio/igb_uio.c >>> @@ -83,7 +83,11 @@ store_max_vfs(struct device *dev, struct >> device_attribute *attr, >>> unsigned long max_vfs; >>> struct pci_dev *pdev = container_of(dev, struct pci_dev, dev); >>> >>> +#if LINUX_VERSION_CODE < KERNEL_VERSION(3, 2, 0) >>> if (0 != strict_strtoul(buf, 0, &max_vfs)) >>> +#else >>> + if (0 != kstrtoul(buf, 0, &max_vfs)) #endif >>> return -EINVAL; >>> >>> if (0 == max_vfs) >
[dpdk-dev] [PATCH v3] test-pmd: Fix pointer aliasing error
Hi Thomas, What's going on with this patch? I really do not have other better solution. Thanks, Michael On 12/8/2014 9:30 AM, Qiu, Michael wrote: > On 12/4/2014 9:35 PM, Michael Qiu wrote: >> app/test-pmd/csumonly.c: In function ?get_psd_sum?: >> build/include/rte_ip.h:161: error: dereferencing pointer ?u16? >> does break strict-aliasing rules >> build/include/rte_ip.h:157: note: initialized from here >> ... >> >> The root cause is that, compile enable strict aliasing by default, >> while in function rte_raw_cksum() try to convert 'const char *' >> to 'const uint16_t *'. >> >> This patch is one workaround fix. >> >> Signed-off-by: Michael Qiu >> --- >> v3 --> v2: >> use uintptr_t instead of unsigned long to >> save pointer. >> >> v2 --> v1: >> Workaround solution instead of shut off the >> gcc params. >> >> lib/librte_net/rte_ip.h | 3 ++- >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/lib/librte_net/rte_ip.h b/lib/librte_net/rte_ip.h >> index 61e4457..cda3436 100644 >> --- a/lib/librte_net/rte_ip.h >> +++ b/lib/librte_net/rte_ip.h >> @@ -154,7 +154,8 @@ struct ipv4_hdr { >> static inline uint16_t >> rte_raw_cksum(const char *buf, size_t len) >> { >> -const uint16_t *u16 = (const uint16_t *)buf; >> +uintptr_t ptr = (uintptr_t)buf; >> +const uint16_t *u16 = (const uint16_t *)ptr; >> uint32_t sum = 0; >> >> while (len >= (sizeof(*u16) * 4)) { > This workaround is to solve the compile issue of GCC strict-aliasing(Two > different type pointers should not be point to the same memory address). > > For GCC 4.4.7 it will definitely occurs if flags "-fstrict-aliasing" > and "-Wall" used. > > Thanks, > Michael >
[dpdk-dev] [PATCH 1/4] igb_uio: compatible with upstream longterm kernel and RHEL
Function pci_num_vf() is introduced from upstream linux-2.6.34. So this patch make compatible with longterm kernel linux-2.6.32.63. For RHEL, function pci_num_vf() begins from RHEL5 update9. And it is stub-defined when CONFIG_PCI_IOV is not enabled. So dropped the CONFIG_PCI_IOV checking of commit 11ba0426. For other distro like RHEL behaved to pci_num_vf(), we could simply append following condition macro: (!(defined(OTHER_RELEASE_CODE) && \ OTHER_RELEASE_CODE >= OTHER_RELEASE_VERSION(X, Y))) Signed-off-by: Jincheng Miao --- lib/librte_eal/linuxapp/igb_uio/compat.h |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/lib/librte_eal/linuxapp/igb_uio/compat.h b/lib/librte_eal/linuxapp/igb_uio/compat.h index 676fa1b..a36f034 100644 --- a/lib/librte_eal/linuxapp/igb_uio/compat.h +++ b/lib/librte_eal/linuxapp/igb_uio/compat.h @@ -21,7 +21,8 @@ #endif #if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 34) && \ - !defined(CONFIG_PCI_IOV) + (!(defined(RHEL_RELEASE_CODE) && \ +RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(5, 9))) static int pci_num_vf(struct pci_dev *dev) { -- 1.7.1
[dpdk-dev] [PATCH 0/4] compatibility fallback and replacement of kernel function invoking
The related kernel function is: - pci_num_vf, it is introduced from upstream linux-2.6.34. For RHEL-based kernel, it is defined from RHEL5.9. - kstrtoul, this function is united kernel API to replace strict_strtoul in the furture. It is introduced from linux-2.6.39. For RHEL6, it is defined from RHEL6.4. This patchset do some compatiblity work for these two functions, and replace strict_strtoul which is depleted from linux-3.18. v3: Adjust pci_num_vf() introduced RHEL version number. Seperate "replace strict_strtoul with kstrtoul" into 3 patches for igb_uio, kni, and xen_dom0. Add compat.h in kni and xen_dom0 for compatible with older kernels. v2: Merge these two patch in one patchset. Compatible with old kernel for kstrtoul. Compatible with RHEL6 for pci_num_vf. Jincheng Miao (4): igb_uio: compatible with upstream longterm kernel and RHEL igb_uio: replace strict_strtoul with kstrtoul kni: replace strict_strtoul with kstrtoul xen_dom0: replace strict_strtoul with kstrtoul lib/librte_eal/linuxapp/igb_uio/compat.h| 11 ++- lib/librte_eal/linuxapp/igb_uio/igb_uio.c |4 ++-- lib/librte_eal/linuxapp/kni/compat.h| 16 lib/librte_eal/linuxapp/kni/kni_vhost.c |2 +- lib/librte_eal/linuxapp/xen_dom0/compat.h | 16 lib/librte_eal/linuxapp/xen_dom0/dom0_mm_misc.c |2 +- 6 files changed, 46 insertions(+), 5 deletions(-) create mode 100644 lib/librte_eal/linuxapp/kni/compat.h create mode 100644 lib/librte_eal/linuxapp/xen_dom0/compat.h
[dpdk-dev] [PATCH 0/4] compatibility fallback and replacement of kernel function invoking
Acked-by: Helin Zhang Note: Don't forget to add the version number to your patch name next time. E.g. [PATCH v3 0/4] > -Original Message- > From: Jincheng Miao [mailto:jmiao at redhat.com] > Sent: Wednesday, December 10, 2014 11:33 AM > To: dev at dpdk.org > Cc: thomas.monjalon at 6wind.com; Zhang, Helin; Jincheng Miao > Subject: [PATCH 0/4] compatibility fallback and replacement of kernel function > invoking > > The related kernel function is: > - pci_num_vf, it is introduced from upstream linux-2.6.34. For RHEL-based > kernel, it is defined from RHEL5.9. > > - kstrtoul, this function is united kernel API to replace strict_strtoul in > the > furture. It is introduced from linux-2.6.39. For RHEL6, it is defined from > RHEL6.4. > > This patchset do some compatiblity work for these two functions, and replace > strict_strtoul which is depleted from linux-3.18. > > v3: > Adjust pci_num_vf() introduced RHEL version number. > Seperate "replace strict_strtoul with kstrtoul" into 3 patches for igb_uio, > kni, > and xen_dom0. Add compat.h in kni and xen_dom0 for compatible with older > kernels. > > v2: > Merge these two patch in one patchset. > Compatible with old kernel for kstrtoul. > Compatible with RHEL6 for pci_num_vf. > > Jincheng Miao (4): > igb_uio: compatible with upstream longterm kernel and RHEL > igb_uio: replace strict_strtoul with kstrtoul > kni: replace strict_strtoul with kstrtoul > xen_dom0: replace strict_strtoul with kstrtoul > > lib/librte_eal/linuxapp/igb_uio/compat.h| 11 ++- > lib/librte_eal/linuxapp/igb_uio/igb_uio.c |4 ++-- > lib/librte_eal/linuxapp/kni/compat.h| 16 > > lib/librte_eal/linuxapp/kni/kni_vhost.c |2 +- > lib/librte_eal/linuxapp/xen_dom0/compat.h | 16 > > lib/librte_eal/linuxapp/xen_dom0/dom0_mm_misc.c |2 +- > 6 files changed, 46 insertions(+), 5 deletions(-) create mode 100644 > lib/librte_eal/linuxapp/kni/compat.h > create mode 100644 lib/librte_eal/linuxapp/xen_dom0/compat.h
[dpdk-dev] [PATCH] igb_uio: kernel version check for using kstrtoul or strict_strtoul
Hi Jincheng I have seen your v3 patch, and ack-ed it. I just dropped my patch for the same issue. Thank you so much! Regards, Helin > -Original Message- > From: Jincheng Miao [mailto:jmiao at redhat.com] > Sent: Wednesday, December 10, 2014 10:55 AM > To: Zhang, Helin; dev at dpdk.org > Subject: Re: [dpdk-dev] [PATCH] igb_uio: kernel version check for using > kstrtoul > or strict_strtoul > > Here is my patch for it, and it also resolves issue of pci_num_vf() > definition. > > And I will send V3 for a while. > > > On 12/10/2014 10:38 AM, Helin Zhang wrote: > > strict_strtoul() was just a redefinition of kstrtoul() for a long > > time. From kernel version of 3.18, strict_strtoul() will not be > > defined at all. A compile time kernel version check is needed to > > decide which function or macro can be used for a specific version of > > kernel. > > > > Signed-off-by: Helin Zhang > > --- > > lib/librte_eal/linuxapp/igb_uio/igb_uio.c | 4 > > 1 file changed, 4 insertions(+) > > > > diff --git a/lib/librte_eal/linuxapp/igb_uio/igb_uio.c > > b/lib/librte_eal/linuxapp/igb_uio/igb_uio.c > > index d1ca26e..2fcc5f4 100644 > > --- a/lib/librte_eal/linuxapp/igb_uio/igb_uio.c > > +++ b/lib/librte_eal/linuxapp/igb_uio/igb_uio.c > > @@ -83,7 +83,11 @@ store_max_vfs(struct device *dev, struct > device_attribute *attr, > > unsigned long max_vfs; > > struct pci_dev *pdev = container_of(dev, struct pci_dev, dev); > > > > +#if LINUX_VERSION_CODE < KERNEL_VERSION(3, 2, 0) > > if (0 != strict_strtoul(buf, 0, &max_vfs)) > > +#else > > + if (0 != kstrtoul(buf, 0, &max_vfs)) #endif > > return -EINVAL; > > > > if (0 == max_vfs)
[dpdk-dev] [PATCH 0/4] compatibility fallback and replacement of kernel function invoking
Forget to add 'v3' in header. @helin, thanks for your review. On 12/10/2014 11:32 AM, Jincheng Miao wrote: > The related kernel function is: > - pci_num_vf, it is introduced from upstream linux-2.6.34. For RHEL-based > kernel, it is defined from RHEL5.9. > > - kstrtoul, this function is united kernel API to replace strict_strtoul in > the furture. It is introduced from linux-2.6.39. For RHEL6, it is defined > from RHEL6.4. > > This patchset do some compatiblity work for these two functions, and > replace strict_strtoul which is depleted from linux-3.18. > > v3: >Adjust pci_num_vf() introduced RHEL version number. >Seperate "replace strict_strtoul with kstrtoul" into 3 patches for igb_uio, > kni, and xen_dom0. Add compat.h in kni and xen_dom0 for compatible with > older kernels. > > v2: >Merge these two patch in one patchset. >Compatible with old kernel for kstrtoul. >Compatible with RHEL6 for pci_num_vf. > > Jincheng Miao (4): >igb_uio: compatible with upstream longterm kernel and RHEL >igb_uio: replace strict_strtoul with kstrtoul >kni: replace strict_strtoul with kstrtoul >xen_dom0: replace strict_strtoul with kstrtoul > > lib/librte_eal/linuxapp/igb_uio/compat.h| 11 ++- > lib/librte_eal/linuxapp/igb_uio/igb_uio.c |4 ++-- > lib/librte_eal/linuxapp/kni/compat.h| 16 > lib/librte_eal/linuxapp/kni/kni_vhost.c |2 +- > lib/librte_eal/linuxapp/xen_dom0/compat.h | 16 > lib/librte_eal/linuxapp/xen_dom0/dom0_mm_misc.c |2 +- > 6 files changed, 46 insertions(+), 5 deletions(-) > create mode 100644 lib/librte_eal/linuxapp/kni/compat.h > create mode 100644 lib/librte_eal/linuxapp/xen_dom0/compat.h >
[dpdk-dev] Accessing a Virtual Function Driver in Guest Machine via VFIO
Thanks for your reply. Sorry, that was a typo, It was meant to be this only. 00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01) On Wed, Dec 10, 2014 at 7:45 AM, Kiran KN wrote: > Looks like virtual functions are not created. > For 82599, use ixgbe driver as modprobe ixgbe max_vfs=1,1 > > Lspci should say "virtual". Something like this - > 03:10.1 Ethernet controller: Intel Corporation 82599 Ethernet Controller > Virtual Function (rev 01) > > Thanks, > Kiran > > -Original Message- > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Aashima Arora > Sent: Friday, December 05, 2014 11:00 AM > To: dev at dpdk.org > Subject: [dpdk-dev] Accessing a Virtual Function Driver in Guest Machine > via VFIO > > Hi, > I was trying to access the pci config space(BAR) of the virtual function > device visible in the virtual machine, similar to what DPDK does on host > via both UIO and VFIO. Did the following steps. > > 1. Bound PF Drivers to ixgbe and spawned virtual function drivers , bound > them to vfio-pci and set their mac addresses via ip link. Ran Qemu and > assigned the VF Device using vfio-pci device assignment and initialized the > virtual machine. > insmod igb max_vfs=2 > > ./x86_64-softmmu/qemu-system-x86_64 -cpu host -boot c -hda > /home/vm-images/vm2.imgsnapshot -m 2048M -smp 2 --enable-kvm -name 'vm2' > -vnc :2 -pidfile /tmp/vm2.pid -driile=fat:rw:/tmp/share,snapshot=off > -device vfio-pci,host=01:10.1,id=net1 > > > 2. The VF Device was visible with another pci address. > > 00:04.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ > Network Connection (rev 01) > > Further ran DPDK testpmd on top of VM but bound the virtual function > driver to igb_uio instead of vfio-pci. It ran successfully. The ixgbevf > pmd driver is able to access the BAR registers in pci_uio_map_resource via > mmaping somewhere close to hugepages. I was not able to bind the virtual > function driver in VM to vfio-pci and hence DPDK would not be able to run > with VFIO enabled as it complains of no IOMMU support. I also believe that > there is little logic in binding the vf device to vfio-pci again since qemu > has already taken care of it and hardware support is involved. > > So my questions are > a. vfio is meant to be a replacement for both uio and device assignment > for qemu. This doesnt seem simultaneous. Comment? > b. Is there any way to access VF device using VFIO in guest userspace? > Have you run DPDK in guest machine with VFIO support enabled and all > dependent modules inserted and did it work? > c. Is igb_uio or uio_pci_generic in future the only way to access the > device in guest userspace? > > *Regards* > -- *Regards,* *Aashima Arora* *+91 9899019317* *arora.aa91 at gmail.com *
[dpdk-dev] Accessing a Virtual Function Driver in Guest Machine via VFIO
On 12/6/2014 3:01 AM, Aashima Arora wrote: > Hi, > I was trying to access the pci config space(BAR) of the virtual function > device visible in the virtual machine, similar to what DPDK does on host > via both UIO and VFIO. Did the following steps. > > 1. Bound PF Drivers to ixgbe and spawned virtual function drivers , bound > them to vfio-pci and set their mac addresses via ip link. Ran Qemu and > assigned the VF Device using vfio-pci device assignment and initialized the > virtual machine. > insmod igb max_vfs=2 > > ./x86_64-softmmu/qemu-system-x86_64 -cpu host -boot c -hda > /home/vm-images/vm2.imgsnapshot -m 2048M -smp 2 --enable-kvm -name 'vm2' > -vnc :2 -pidfile /tmp/vm2.pid -driile=fat:rw:/tmp/share,snapshot=off > -device vfio-pci,host=01:10.1,id=net1 > > > 2. The VF Device was visible with another pci address. > > 00:04.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ > Network > Connection (rev 01) > > Further ran DPDK testpmd on top of VM but bound the virtual function driver > to igb_uio instead of vfio-pci. It ran successfully. The ixgbevf pmd > driver is able to access the BAR registers in pci_uio_map_resource via > mmaping somewhere close to hugepages. I was not able to bind the virtual > function driver in VM to vfio-pci and hence DPDK would not be able to run > with VFIO enabled as it complains of no IOMMU support. I also believe that > there is little logic in binding the vf device to vfio-pci again since qemu > has already taken care of it and hardware support is involved. > > So my questions are > a. vfio is meant to be a replacement for both uio and device assignment > for qemu. This doesnt seem simultaneous. Comment? What do you mean "device assignment" ? You mean pass-through? > b. Is there any way to access VF device using VFIO in guest userspace? > Have you run DPDK in guest machine with VFIO support enabled and all > dependent modules inserted and did it work? As I knows, the answer is *NO*, because the iommu is not support in guest. > c. Is igb_uio or uio_pci_generic in future the only way to access the > device in guest userspace? It depends, who knows what will happen in future :) Thanks, Michael > *Regards* >
[dpdk-dev] Accessing a Virtual Function Driver in Guest Machine via VFIO
Thanks for your reply. What do you mean "device assignment" ? You mean pass-through? > Yes, the same thing. Pass through of PCI device to VM via qemu (pci-assign, vfio-pci, ivshmem etc) As I knows, the answer is *NO*, because the iommu is not support in guest. > NO was essentially what I was looking for, aware of it now :) On Wed, Dec 10, 2014 at 12:42 PM, Qiu, Michael wrote: > On 12/6/2014 3:01 AM, Aashima Arora wrote: > > Hi, > > I was trying to access the pci config space(BAR) of the virtual function > > device visible in the virtual machine, similar to what DPDK does on host > > via both UIO and VFIO. Did the following steps. > > > > 1. Bound PF Drivers to ixgbe and spawned virtual function drivers , bound > > them to vfio-pci and set their mac addresses via ip link. Ran Qemu and > > assigned the VF Device using vfio-pci device assignment and initialized > the > > virtual machine. > > insmod igb max_vfs=2 > > > > ./x86_64-softmmu/qemu-system-x86_64 -cpu host -boot c -hda > > /home/vm-images/vm2.imgsnapshot -m 2048M -smp 2 --enable-kvm -name 'vm2' > > -vnc :2 -pidfile /tmp/vm2.pid -driile=fat:rw:/tmp/share,snapshot=off > > -device vfio-pci,host=01:10.1,id=net1 > > > > > > 2. The VF Device was visible with another pci address. > > > > 00:04.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit > SFI/SFP+ > > Network > > Connection (rev 01) > > > > Further ran DPDK testpmd on top of VM but bound the virtual function > driver > > to igb_uio instead of vfio-pci. It ran successfully. The ixgbevf pmd > > driver is able to access the BAR registers in pci_uio_map_resource via > > mmaping somewhere close to hugepages. I was not able to bind the virtual > > function driver in VM to vfio-pci and hence DPDK would not be able to run > > with VFIO enabled as it complains of no IOMMU support. I also believe > that > > there is little logic in binding the vf device to vfio-pci again since > qemu > > has already taken care of it and hardware support is involved. > > > > So my questions are > > a. vfio is meant to be a replacement for both uio and device assignment > > for qemu. This doesnt seem simultaneous. Comment? > > What do you mean "device assignment" ? You mean pass-through? > > b. Is there any way to access VF device using VFIO in guest userspace? > > Have you run DPDK in guest machine with VFIO support enabled and all > > dependent modules inserted and did it work? > > As I knows, the answer is *NO*, because the iommu is not support in guest. > > c. Is igb_uio or uio_pci_generic in future the only way to access the > > device in guest userspace? > > It depends, who knows what will happen in future :) > > Thanks, > Michael > > *Regards* > > > > -- *Regards,* *Aashima Arora* *arora.aa91 at gmail.com *
[dpdk-dev] Added Spinlock to l3fwd-vf example to prevent race conditioning
> Though, that said, doesn't it seem to anyone else like serialization of > enqueue > to a port should be the responsibility of the library, not the application? > > Neil >From my knowledge it is an application responsibility to serialize access to queue on particular port. Pawel
[dpdk-dev] [PATCH 3/4] kni: replace strict_strtoul with kstrtoul
Hi Jincheng, 2014-12-10 11:33, Jincheng Miao: > From upstream kernel commit 3db2e9cd, strict_strto* serial functions > are removed. So that we should directly used kstrtoul instead. > > And add kni/compat.h for be compatible with older kernel. > > Signed-off-by: Jincheng Miao [...] > new file mode 100644 > index 000..c8c662c > --- /dev/null > +++ b/lib/librte_eal/linuxapp/kni/compat.h > @@ -0,0 +1,16 @@ > +/* > + * Minimal wrappers to allow compiling kni on older kernels. > + */ > + > +#ifndef RHEL_RELEASE_VERSION > +#define RHEL_RELEASE_VERSION(a, b) (((a) << 8) + (b)) > +#endif > + > +#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 39) && \ > +(!(defined(RHEL_RELEASE_CODE) && \ > + RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(6, 4))) The first indent character must be a tab (highlighted by checkpatch). [...] > --- a/lib/librte_eal/linuxapp/kni/kni_vhost.c > +++ b/lib/librte_eal/linuxapp/kni/kni_vhost.c > @@ -740,7 +740,7 @@ set_sock_en(struct device *dev, struct device_attribute > *attr, > unsigned long en; > int err = 0; > > - if (0 != strict_strtoul(buf, 0, &en)) > + if (0 != kstrtoul(buf, 0, &en)) > return -EINVAL; > > if (en) It seems you forgot to include the new compat.h. Did you do some tests with different Fedora/RHEL versions? Thanks -- Thomas
[dpdk-dev] [PATCH 4/4] xen_dom0: replace strict_strtoul with kstrtoul
2014-12-10 11:33, Jincheng Miao: > From upstream kernel commit 3db2e9cd, strict_strto* serial functions > are removed. So that we should directly used kstrtoul instead. > > And add xen_dom0/compat.h for be compatible with older kernel. > > Signed-off-by: Jincheng Miao Same comments as patch 3/4 apply here. -- Thomas
[dpdk-dev] [PATCH 3/4] kni: replace strict_strtoul with kstrtoul
- Original Message - > Hi Jincheng, > > 2014-12-10 11:33, Jincheng Miao: > > From upstream kernel commit 3db2e9cd, strict_strto* serial functions > > are removed. So that we should directly used kstrtoul instead. > > > > And add kni/compat.h for be compatible with older kernel. > > > > Signed-off-by: Jincheng Miao > [...] > > new file mode 100644 > > index 000..c8c662c > > --- /dev/null > > +++ b/lib/librte_eal/linuxapp/kni/compat.h > > @@ -0,0 +1,16 @@ > > +/* > > + * Minimal wrappers to allow compiling kni on older kernels. > > + */ > > + > > +#ifndef RHEL_RELEASE_VERSION > > +#define RHEL_RELEASE_VERSION(a, b) (((a) << 8) + (b)) > > +#endif > > + > > +#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 39) && \ > > +(!(defined(RHEL_RELEASE_CODE) && \ > > + RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(6, 4))) > > The first indent character must be a tab (highlighted by checkpatch). Yes, I think the TAB is replaced by my vim :( > > [...] > > --- a/lib/librte_eal/linuxapp/kni/kni_vhost.c > > +++ b/lib/librte_eal/linuxapp/kni/kni_vhost.c > > @@ -740,7 +740,7 @@ set_sock_en(struct device *dev, struct device_attribute > > *attr, > > unsigned long en; > > int err = 0; > > > > - if (0 != strict_strtoul(buf, 0, &en)) > > + if (0 != kstrtoul(buf, 0, &en)) > > return -EINVAL; > > > > if (en) > > It seems you forgot to include the new compat.h. > > Did you do some tests with different Fedora/RHEL versions? Yes, missing compat.h in kni_vhost.c. And, I want to get your opinion about adding compat.h to kni and xen_dom0. The pros: easy to implement and minimal wrapper for older kernel. The cons: there is so many compat.h, and the file kcompat.h also makes user confuse. > > Thanks > -- > Thomas >
[dpdk-dev] [PATCH v4] mbuf: fix of enabling all newly added RX error flags
Before redefining mbuf structure, there was lack of free bits in 'ol_flags' (32 bits in total) for new RX or TX flags. So it tried to reuse existant bits as most as possible, or even assigning 0 to some of bit flags. After new mbuf structure defined, there are quite a lot of free bits. So those newly added bit flags should be assigned with correct and valid bit values, and getting their names should be enabled as well. Note that 'RECIP' should be removed, as nowhere uses it. 'PKT_RX_ERR' is defined to replace all other error bit flags, e.g. MAC error, Oversize error, header buffer overflow error. Signed-off-by: Helin Zhang --- lib/librte_mbuf/rte_mbuf.c | 7 ++- lib/librte_mbuf/rte_mbuf.h | 9 +++-- lib/librte_pmd_i40e/i40e_rxtx.c | 37 - 3 files changed, 25 insertions(+), 28 deletions(-) v2 changes: * Removed error flag of 'ECIPE' processing only in both i40e PMD and mbuf. All other error flags were added back. * Assigned error flags with correct and valid values, as their previous values were invalid. * Enabled getting all error flag names. v3 changes: * 'PKT_RX_ERR' is defined to replace error bit flags of MAC error, Oversize error, header buffer overflow error. * Removed assigning values to PKT_TX_* bit flags, as it has already been done in another packet set. v4 changes: * Renamed 'PKT_RX_EIP_CKSUM_BAD' to 'PKT_RX_OUTER_IP_CKSUM_BAD'. * Fixed the bug of checking error bits of 'Descriptor oversize' and 'Header buffer oversize'. * Added debug logs for each RX error. diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c index 1b14e02..7611a38 100644 --- a/lib/librte_mbuf/rte_mbuf.c +++ b/lib/librte_mbuf/rte_mbuf.c @@ -210,11 +210,8 @@ const char *rte_get_rx_ol_flag_name(uint64_t mask) case PKT_RX_FDIR: return "PKT_RX_FDIR"; case PKT_RX_L4_CKSUM_BAD: return "PKT_RX_L4_CKSUM_BAD"; case PKT_RX_IP_CKSUM_BAD: return "PKT_RX_IP_CKSUM_BAD"; - /* case PKT_RX_EIP_CKSUM_BAD: return "PKT_RX_EIP_CKSUM_BAD"; */ - /* case PKT_RX_OVERSIZE: return "PKT_RX_OVERSIZE"; */ - /* case PKT_RX_HBUF_OVERFLOW: return "PKT_RX_HBUF_OVERFLOW"; */ - /* case PKT_RX_RECIP_ERR: return "PKT_RX_RECIP_ERR"; */ - /* case PKT_RX_MAC_ERR: return "PKT_RX_MAC_ERR"; */ + case PKT_RX_OUTER_IP_CKSUM_BAD: return "PKT_RX_OUTER_IP_CKSUM_BAD"; + case PKT_RX_ERR: return "PKT_RX_ERR"; case PKT_RX_IPV4_HDR: return "PKT_RX_IPV4_HDR"; case PKT_RX_IPV4_HDR_EXT: return "PKT_RX_IPV4_HDR_EXT"; case PKT_RX_IPV6_HDR: return "PKT_RX_IPV6_HDR"; diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h index efdefc4..eefe8a6 100644 --- a/lib/librte_mbuf/rte_mbuf.h +++ b/lib/librte_mbuf/rte_mbuf.h @@ -83,12 +83,7 @@ extern "C" { #define PKT_RX_RSS_HASH (1ULL << 1) /**< RX packet with RSS hash result. */ #define PKT_RX_FDIR (1ULL << 2) /**< RX packet with FDIR match indicate. */ #define PKT_RX_L4_CKSUM_BAD (1ULL << 3) /**< L4 cksum of RX pkt. is not OK. */ -#define PKT_RX_IP_CKSUM_BAD (1ULL << 4) /**< IP cksum of RX pkt. is not OK. */ -#define PKT_RX_EIP_CKSUM_BAD (0ULL << 0) /**< External IP header checksum error. */ -#define PKT_RX_OVERSIZE (0ULL << 0) /**< Num of desc of an RX pkt oversize. */ -#define PKT_RX_HBUF_OVERFLOW (0ULL << 0) /**< Header buffer overflow. */ -#define PKT_RX_RECIP_ERR (0ULL << 0) /**< Hardware processing error. */ -#define PKT_RX_MAC_ERR (0ULL << 0) /**< MAC error. */ +#define PKT_RX_IP_CKSUM_BAD (1ULL << 4) /**< IP (or inner IP) header checksum error. */ #define PKT_RX_IPV4_HDR (1ULL << 5) /**< RX packet with IPv4 header. */ #define PKT_RX_IPV4_HDR_EXT (1ULL << 6) /**< RX packet with extended IPv4 header. */ #define PKT_RX_IPV6_HDR (1ULL << 7) /**< RX packet with IPv6 header. */ @@ -99,6 +94,8 @@ extern "C" { #define PKT_RX_TUNNEL_IPV6_HDR (1ULL << 12) /**< RX tunnel packet with IPv6 header. */ #define PKT_RX_FDIR_ID (1ULL << 13) /**< FD id reported if FDIR match. */ #define PKT_RX_FDIR_FLX (1ULL << 14) /**< Flexible bytes reported if FDIR match. */ +#define PKT_RX_OUTER_IP_CKSUM_BAD (1ULL << 15) /**< Outer IP header checksum error. */ +#define PKT_RX_ERR (1ULL << 16) /**< Other errors, e.g. MAC error. */ /* add new RX flags here */ /* add new TX flags here */ diff --git a/lib/librte_pmd_i40e/i40e_rxtx.c b/lib/librte_pmd_i40e/i40e_rxtx.c index 2beae3c..43f67c3 100644 --- a/lib/librte_pmd_i40e/i40e_rxtx.c +++ b/lib/librte_pmd_i40e/i40e_rxtx.c @@ -68,6 +68,7 @@ #define I40E_TX_MAX_BURST 32 #define I40E_DMA_MEM_ALIGN 4096 +#define I40E_RX_ERR_MASK 0x3F #define I40E_SIMPLE_FLAGS ((uint32_t)ETH_TXQ_FLAGS_NOMULTSEGS | \ ETH_TXQ_FLAGS_NOOFFLOADS) @@ -118,30 +119,32 @@ i40e_rxd_error_to_pkt_flags(uint64_t qword) uint64_t flags = 0; uint64_t error_bits = (qword >> I40E_RXD_QW1_ERROR_SHIFT); -#defin
[dpdk-dev] [PATCH 3/4] kni: replace strict_strtoul with kstrtoul
2014-12-10 03:48, Jincheng Miao: > > It seems you forgot to include the new compat.h. > > > > Did you do some tests with different Fedora/RHEL versions? > > Yes, missing compat.h in kni_vhost.c. > > > And, I want to get your opinion about adding compat.h to kni and xen_dom0. > The pros: easy to implement and minimal wrapper for older kernel. Yes I think it's the good approach. > The cons: there is so many compat.h, and the file kcompat.h also makes user > confuse. Why kcompat makes user confuse? Do you think you could send a new version quickly to integrate it in the next RC (probably today)? Please test it with RHEL. Thanks -- Thomas
[dpdk-dev] [PATCH v5] VFIO: Avoid to enable vfio while the module not loaded
> > When vfio module is not loaded when kernel support vfio feature, the > routine still try to open the container to get file description. > > This action is not safe, and of cause got error messages: > > EAL: Detected 40 lcore(s) > EAL: unsupported IOMMU type! > EAL: VFIO support could not be initialized > EAL: Setting up memory... > > This may make user confuse, this patch make it reasonable and much more > soomth to user. > > Signed-off-by: Michael Qiu > --- > v5 --> v4 > 1. Move rte_eal_check_module() body to eal.c > 2. Clean up "unsupported IOMMU type" log > > v4 --> v3: > 1. Remove RTE_LOG for params check > 2. Remove "vfio" module check as "vfio_iommu_type1" > loaded indecated "vfio" loaded > > v3 --> v2: > 1. Add error log in rte_eal_check_module() > 2. Some code clean up. > > v2 --> v1: > 1. Move check_module() from rte_common.h to eal_private.h >and rename to rte_eal_check_module(). >To make it linuxapp only. > 2. Some code clean up. > > lib/librte_eal/common/eal_private.h| 15 +++ > lib/librte_eal/linuxapp/eal/eal.c | 27 +++ > lib/librte_eal/linuxapp/eal/eal_pci_vfio.c | 26 > +++--- > 3 files changed, 65 insertions(+), 3 deletions(-) > > diff --git a/lib/librte_eal/common/eal_private.h > b/lib/librte_eal/common/eal_private.h > index 232fcec..4183b54 100644 > --- a/lib/librte_eal/common/eal_private.h > +++ b/lib/librte_eal/common/eal_private.h > @@ -203,4 +203,19 @@ int rte_eal_alarm_init(void); > */ > int rte_eal_dev_init(void); > > +/** > + * Function is to check if the kernel module(like, vfio, > +vfio_iommu_type1, > + * etc.) loaded. > + * > + * @param module_name > + * The module's name which need to be checked > + * > + * @return > + * -1 means some error happens(NULL pointer or open failure) > + * 0 means the module not loaded > + * 1 means the module loaded > + */ > +inline int > +rte_eal_check_module(const char *module_name); Just curious - why inline? > + > #endif /* _EAL_PRIVATE_H_ */ > diff --git a/lib/librte_eal/linuxapp/eal/eal.c > b/lib/librte_eal/linuxapp/eal/eal.c > index 89f3b5e..40b462e 100644 > --- a/lib/librte_eal/linuxapp/eal/eal.c > +++ b/lib/librte_eal/linuxapp/eal/eal.c > @@ -859,3 +859,30 @@ int rte_eal_has_hugepages(void) { > return ! internal_config.no_hugetlbfs; } > + > +inline int > +rte_eal_check_module(const char *module_name) { > + char mod_name[30]; /* Any module names can be longer than 30 > bytes? */ > + int ret = 0; > + > + if (NULL == module_name) > + return -1; > + > + FILE * fd = fopen("/proc/modules", "r"); > + if (NULL == fd) { > + RTE_LOG(ERR, EAL, "Open /proc/modules failed!" > + " error %i (%s)\n", errno, strerror(errno)); > + return -1; > + } > + while(!feof(fd)) { > + fscanf(fd, "%30s %*[^\n]", mod_name); As far as I know, in fscanf terms, %30s will result in at most 30 character string, i.e. 31 bytes (30 characters + null terminator). So it probably should be %29s. > + if(!strcmp(mod_name, module_name)) { > + ret = 1; > + break; > + } > + } > + fclose(fd); > + > + return ret; > +} > diff --git a/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c > b/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c > index c1246e8..16fe10f 100644 > --- a/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c > +++ b/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c > @@ -44,6 +44,7 @@ > #include > #include > #include > +#include > > #include "eal_filesystem.h" > #include "eal_pci_init.h" > @@ -339,10 +340,12 @@ pci_vfio_get_container_fd(void) > ret = ioctl(vfio_container_fd, VFIO_CHECK_EXTENSION, > VFIO_TYPE1_IOMMU); > if (ret != 1) { > if (ret < 0) > - RTE_LOG(ERR, EAL, " could not get IOMMU > type, " > - "error %i (%s)\n", errno, > strerror(errno)); > + RTE_LOG(ERR, EAL, " could not get IOMMU > type," > + " error %i (%s)\n", errno, > + strerror(errno)); > else > - RTE_LOG(ERR, EAL, " unsupported IOMMU > type!\n"); > + RTE_LOG(ERR, EAL, " unsupported IOMMU > type" > + " detected in VFIO\n"); > close(vfio_container_fd); > return -1; > } > @@ -783,11 +786,28 @@ pci_vfio_enable(void) { > /* initialize group list */ > int i; > + int module_vfio_type1; > > for (i = 0; i < VFIO_MAX_GROUPS; i++) { > vfio_cfg.vfio_groups[i].fd = -1; > vfio_cfg.vfio_groups[i].group_no = -1; >
[dpdk-dev] error: value computed is not used
Hi all, Any idea? Leave this issue in DPDK? Thanks, Michael On 12/8/2014 5:07 PM, Qiu, Michael wrote: > Hi all, > My platform is: > > uname -a > Linux suse-11-sp3 3.0.77-0.11-xen #1 SMP Tue Mar 11 16:48:56 CST 2014 > x86_64 x86_64 x86_64 GNU/Linux > > gcc -v > Using built-in specs. > COLLECT_GCC=gcc > COLLECT_LTO_WRAPPER=/usr/lib64/gcc/x86_64-suse-linux/4.5/lto-wrapper > Target: x86_64-suse-linux > Configured with: ../configure --prefix=/usr --infodir=/usr/share/info > --mandir=/usr/share/man --libdir=/usr/lib64 --libexecdir=/usr/lib64 > --enable-languages=c,c++,objc,fortran,obj-c++,java,ada > --enable-checking=release --with-gxx-include-dir=/usr/include/c++/4.5 > --enable-ssp --disable-libssp --disable-plugin > --with-bugurl=http://bugs.opensuse.org/ --with-pkgversion='SUSE Linux' > --disable-libgcj --disable-libmudflap --with-slibdir=/lib64 > --with-system-zlib --enable-__cxa_atexit > --enable-libstdcxx-allocator=new --disable-libstdcxx-pch > --enable-version-specific-runtime-libs --program-suffix=-4.5 > --enable-linux-futex --without-system-libunwind --enable-gold > --with-plugin-ld=/usr/bin/gold --with-arch-32=i586 --with-tune=generic > --build=x86_64-suse-linux > Thread model: posix > gcc version 4.5.1 20101208 [gcc-4_5-branch revision 167585] (SUSE Linux) > > When I try to compile the source code to x86_64 linuxapp, I got this > error message: > > lib/librte_pmd_enic/enic_main.c: In function ?enic_set_rsskey?: > lib/librte_pmd_enic/enic_main.c:862:2: error: value computed is not used > > I dig out that, it was ome issue of the macros rte_memcpy() > #define rte_memcpy(dst, src, n) \ > ((__builtin_constant_p(n)) ? \ > memcpy((dst), (src), (n)) : \ > rte_memcpy_func((dst), (src), (n))) > > When I use only (n) instead of (__builtin_constant_p(n), it will pass( I > know that it was incorrect, just a experiment). > > But I try to use inline function instead of macros: > static inline void * rte_memcpy(void *dst, const void *src, size_t n) > { > return __builtin_constant_p(n) ? memcpy(dst, src, n) : > rte_memcpy_func(dst, src, n); > } > > It will pass:), and works, this could be one potential workaround fix. > > Who knows why? The root cause is what? > > I've no idea about this. > > Thanks, > Michael >
[dpdk-dev] [PATCH v4] mbuf: fix of enabling all newly added RX error flags
2014-12-10 16:55, Helin Zhang: > Before redefining mbuf structure, there was lack of free bits in 'ol_flags' > (32 bits in total) for new RX or TX flags. So it tried to reuse existant > bits as most as possible, or even assigning 0 to some of bit flags. After > new mbuf structure defined, there are quite a lot of free bits. So those > newly added bit flags should be assigned with correct and valid bit values, > and getting their names should be enabled as well. Note that 'RECIP' should > be removed, as nowhere uses it. 'PKT_RX_ERR' is defined to replace all other > error bit flags, e.g. MAC error, Oversize error, header buffer overflow error. > > Signed-off-by: Helin Zhang > --- > lib/librte_mbuf/rte_mbuf.c | 7 ++- > lib/librte_mbuf/rte_mbuf.h | 9 +++-- > lib/librte_pmd_i40e/i40e_rxtx.c | 37 - > 3 files changed, 25 insertions(+), 28 deletions(-) > > v2 changes: > * Removed error flag of 'ECIPE' processing only in both i40e PMD and mbuf. All > other error flags were added back. > * Assigned error flags with correct and valid values, as their previous values > were invalid. > * Enabled getting all error flag names. > > v3 changes: > * 'PKT_RX_ERR' is defined to replace error bit flags of MAC error, Oversize > error, header buffer overflow error. > * Removed assigning values to PKT_TX_* bit flags, as it has already been done > in another packet set. > > v4 changes: > * Renamed 'PKT_RX_EIP_CKSUM_BAD' to 'PKT_RX_OUTER_IP_CKSUM_BAD'. > * Fixed the bug of checking error bits of 'Descriptor oversize' and > 'Header buffer oversize'. > * Added debug logs for each RX error. [...] > --- a/lib/librte_mbuf/rte_mbuf.h > +++ b/lib/librte_mbuf/rte_mbuf.h > @@ -83,12 +83,7 @@ extern "C" { > #define PKT_RX_RSS_HASH (1ULL << 1) /**< RX packet with RSS hash > result. */ > #define PKT_RX_FDIR (1ULL << 2) /**< RX packet with FDIR match > indicate. */ > #define PKT_RX_L4_CKSUM_BAD (1ULL << 3) /**< L4 cksum of RX pkt. is not > OK. */ > -#define PKT_RX_IP_CKSUM_BAD (1ULL << 4) /**< IP cksum of RX pkt. is not > OK. */ > -#define PKT_RX_EIP_CKSUM_BAD (0ULL << 0) /**< External IP header checksum > error. */ > -#define PKT_RX_OVERSIZE (0ULL << 0) /**< Num of desc of an RX pkt > oversize. */ > -#define PKT_RX_HBUF_OVERFLOW (0ULL << 0) /**< Header buffer overflow. */ > -#define PKT_RX_RECIP_ERR (0ULL << 0) /**< Hardware processing error. */ > -#define PKT_RX_MAC_ERR (0ULL << 0) /**< MAC error. */ > +#define PKT_RX_IP_CKSUM_BAD (1ULL << 4) /**< IP (or inner IP) header > checksum error. */ It can be also an outer IP header in case the device don't support tunneling or is not configured to detect it. -- Thomas > #define PKT_RX_IPV4_HDR (1ULL << 5) /**< RX packet with IPv4 header. */ > #define PKT_RX_IPV4_HDR_EXT (1ULL << 6) /**< RX packet with extended IPv4 > header. */ > #define PKT_RX_IPV6_HDR (1ULL << 7) /**< RX packet with IPv6 header. */ > @@ -99,6 +94,8 @@ extern "C" { > #define PKT_RX_TUNNEL_IPV6_HDR (1ULL << 12) /**< RX tunnel packet with IPv6 > header. */ > #define PKT_RX_FDIR_ID (1ULL << 13) /**< FD id reported if FDIR match. > */ > #define PKT_RX_FDIR_FLX (1ULL << 14) /**< Flexible bytes reported if > FDIR match. */ > +#define PKT_RX_OUTER_IP_CKSUM_BAD (1ULL << 15) /**< Outer IP header > checksum error. */ > +#define PKT_RX_ERR (1ULL << 16) /**< Other errors, e.g. MAC > error. */ > /* add new RX flags here */ > > /* add new TX flags here */
[dpdk-dev] [PATCH 3/4] kni: replace strict_strtoul with kstrtoul
- Original Message - > 2014-12-10 03:48, Jincheng Miao: > > > It seems you forgot to include the new compat.h. > > > > > > Did you do some tests with different Fedora/RHEL versions? > > > > Yes, missing compat.h in kni_vhost.c. > > > > > > And, I want to get your opinion about adding compat.h to kni and xen_dom0. > > The pros: easy to implement and minimal wrapper for older kernel. > > Yes I think it's the good approach. > > > The cons: there is so many compat.h, and the file kcompat.h also makes user > > confuse. > > Why kcompat makes user confuse? Because a lot of compat fallback in kcompat too. For example, for kni, there is compat.h and kcompat.h. I just confuse about it before. But if you agree it, I will also agree it. > > Do you think you could send a new version quickly to integrate it in the > next RC (probably today)? > > Please test it with RHEL. Yes, I am working on it. I will report my test result latter. Jincheng Miao > > Thanks > -- > Thomas >
[dpdk-dev] [PATCH] doc: add license header to link bonding diagrams
> -Original Message- > From: Doherty, Declan > Sent: Monday, December 8, 2014 11:22 AM > To: dev at dpdk.org > Cc: Iremonger, Bernard; Doherty, Declan > Subject: [PATCH] doc: add license header to link bonding diagrams > > > Signed-off-by: Declan Doherty Acked-by: Bernard Iremonger I have applied the patch to my tree next/dpdk-doc.
[dpdk-dev] A question about hugepage initialization time
On Tue, Dec 09, 2014 at 02:10:32PM -0800, Stephen Hemminger wrote: > On Tue, 9 Dec 2014 11:45:07 -0800 > &rew wrote: > > > > Hey Folks, > > > > > > Our DPDK application deals with very large in memory data structures, and > > > can potentially use tens or even hundreds of gigabytes of hugepage memory. > > > During the course of development, we've noticed that as the number of huge > > > pages increases, the memory initialization time during EAL init gets to be > > > quite long, lasting several minutes at present. The growth in init time > > > doesn't appear to be linear, which is concerning. > > > > > > This is a minor inconvenience for us and our customers, as memory > > > initialization makes our boot times a lot longer than it would otherwise > > > be. Also, my experience has been that really long operations often are > > > hiding errors - what you think is merely a slow operation is actually a > > > timeout of some sort, often due to misconfiguration. This leads to two > > > questions: > > > > > > 1. Does the long initialization time suggest that there's an error > > > happening under the covers? > > > 2. If not, is there any simple way that we can shorten memory > > > initialization time? > > > > > > Thanks in advance for your insights. > > > > > > -- > > > Matt Laswell > > > laswell at infiniteio.com > > > infinite io, inc. > > > > > > > Hello, > > > > please find some quick comments on the questions: > > 1.) By our experience long initialization time is normal in case of > > large amount of memory. However this time depends on some things: > > - number of hugepages (pagefault handled by kernel is pretty expensive) > > - size of hugepages (memset at initialization) > > > > 2.) Using 1G pages instead of 2M will reduce the initialization time > > significantly. Using wmemset instead of memset adds an additional 20-30% > > boost by our measurements. Or, just by touching the pages but not cleaning > > them you can have still some more speedup. But in this case your layer or > > the applications above need to do the cleanup at allocation time > > (e.g. by using rte_zmalloc). > > > > Cheers, > > &rew > > I wonder if the whole rte_malloc code is even worth it with a modern kernel > with transparent huge pages? rte_malloc adds very little value and is less > safe > and slower than glibc or other allocators. Plus you lose the ablilty to get > all the benefit out of valgrind or electric fence. While I'd dearly love to not have our own custom malloc lib to maintain, for DPDK multiprocess, rte_malloc will be hard to replace as we would need a replacement solution that similarly guarantees that memory mapped in process A is also available at the same address in process B. :-( /Bruce
[dpdk-dev] [PATCH] Avoid possible memory cpoy when sort hugepages
On Wed, Dec 10, 2014 at 10:25:41AM +0800, Michael Qiu wrote: > When the first address is the compared address in the loop, > it will also do memory copy, which is meaningless, > worse more, when hugepg_tbl is mostly in order. This should > be a big deal in large hugepage memory systerm(like hunderd > or thousand GB). I actually doubt the time taken by this sorting is a significant part of the initialization time taken, even for hundreds of GBs of memory. Do you have any measurements to confirm this? [However, that's not to say that we can't merge in this patch] > > Meanwhile smallest_idx never be a value of -1,so remove this check. > > This patch also includes some coding style fix. > > Signed-off-by: Michael Qiu > --- > lib/librte_eal/linuxapp/eal/eal_memory.c | 13 + > 1 file changed, 5 insertions(+), 8 deletions(-) > > diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c > b/lib/librte_eal/linuxapp/eal/eal_memory.c > index e6cb919..700aba2 100644 > --- a/lib/librte_eal/linuxapp/eal/eal_memory.c > +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c > @@ -678,14 +678,13 @@ error: > static int > sort_by_physaddr(struct hugepage_file *hugepg_tbl, struct hugepage_info *hpi) > { > - unsigned i, j; > - int compare_idx; > + unsigned i, j, compare_idx; > uint64_t compare_addr; > struct hugepage_file tmp; > > for (i = 0; i < hpi->num_pages[0]; i++) { > compare_addr = 0; > - compare_idx = -1; > + compare_idx = i; > > /* >* browse all entries starting at 'i', and find the > @@ -704,11 +703,9 @@ sort_by_physaddr(struct hugepage_file *hugepg_tbl, > struct hugepage_info *hpi) > } > } > > - /* should not happen */ > - if (compare_idx == -1) { > - RTE_LOG(ERR, EAL, "%s(): error in physaddr sorting\n", > __func__); > - return -1; > - } > + /* avoid memory copy when the first entry is the compared */ > + if (compare_idx == i) > + continue; > > /* swap the 2 entries in the table */ > memcpy(&tmp, &hugepg_tbl[compare_idx], > -- > 1.9.3 >
[dpdk-dev] [PATCH 1/2 v4] Fix compile issue with hugepage_sz in 32-bit system
lib/librte_eal/linuxapp/eal/eal_memory.c:324:4: error: comparison is always false due to limited range of data type [-Werror=type-limits] || (hugepage_sz == RTE_PGSIZE_16G)) { ^ cc1: all warnings being treated as errors This was introuduced by commit b77b5639: mem: add huge page sizes for IBM Power The root cause is that size_t is 32-bit in i686 platform, but RTE_PGSIZE_16M and RTE_PGSIZE_16G are always 64-bit. Force hugepage_sz to always 64-bit to avoid this issue. Signed-off-by: Michael Qiu --- v4 ---> v3 Change hugepage_sz from size_t to uint64_t split second bugfix to another patch v3 ---> v2 Change RTE_PGSIZE_16G from ULL to UL to keep all entries consistent V2 ---> v1 Change two type entries to one, and leave RTE_PGSIZE_16G only valid for 64-bit platform lib/librte_eal/common/eal_common_memory.c | 2 +- lib/librte_eal/common/eal_internal_cfg.h| 2 +- lib/librte_eal/common/include/rte_memory.h | 2 +- lib/librte_eal/common/include/rte_memzone.h | 2 +- lib/librte_eal/linuxapp/eal/eal_memory.c| 2 +- 5 files changed, 5 insertions(+), 5 deletions(-) diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c index 412b432..77830f8 100644 --- a/lib/librte_eal/common/eal_common_memory.c +++ b/lib/librte_eal/common/eal_common_memory.c @@ -96,7 +96,7 @@ rte_dump_physmem_layout(FILE *f) fprintf(f, "Segment %u: phys:0x%"PRIx64", len:%zu, " "virt:%p, socket_id:%"PRId32", " - "hugepage_sz:%zu, nchannel:%"PRIx32", " + "hugepage_sz:%"PRIu64", nchannel:%"PRIx32", " "nrank:%"PRIx32"\n", i, mcfg->memseg[i].phys_addr, mcfg->memseg[i].len, diff --git a/lib/librte_eal/common/eal_internal_cfg.h b/lib/librte_eal/common/eal_internal_cfg.h index aac6abf..e2ecb0d 100644 --- a/lib/librte_eal/common/eal_internal_cfg.h +++ b/lib/librte_eal/common/eal_internal_cfg.h @@ -49,7 +49,7 @@ * mount points of hugepages */ struct hugepage_info { - size_t hugepage_sz; /**< size of a huge page */ + uint64_t hugepage_sz; /**< size of a huge page */ const char *hugedir;/**< dir where hugetlbfs is mounted */ uint32_t num_pages[RTE_MAX_NUMA_NODES]; /**< number of hugepages of that size on each socket */ diff --git a/lib/librte_eal/common/include/rte_memory.h b/lib/librte_eal/common/include/rte_memory.h index 1990833..7f8103f 100644 --- a/lib/librte_eal/common/include/rte_memory.h +++ b/lib/librte_eal/common/include/rte_memory.h @@ -92,7 +92,7 @@ struct rte_memseg { phys_addr_t ioremap_addr; /**< Real physical address inside the VM */ #endif size_t len; /**< Length of the segment. */ - size_t hugepage_sz; /**< The pagesize of underlying memory */ + uint64_t hugepage_sz; /**< The pagesize of underlying memory */ int32_t socket_id; /**< NUMA socket ID. */ uint32_t nchannel; /**< Number of channels. */ uint32_t nrank; /**< Number of ranks. */ diff --git a/lib/librte_eal/common/include/rte_memzone.h b/lib/librte_eal/common/include/rte_memzone.h index 7d47bff..3006e81 100644 --- a/lib/librte_eal/common/include/rte_memzone.h +++ b/lib/librte_eal/common/include/rte_memzone.h @@ -83,7 +83,7 @@ struct rte_memzone { #endif size_t len; /**< Length of the memzone. */ - size_t hugepage_sz; /**< The page size of underlying memory */ + uint64_t hugepage_sz; /**< The page size of underlying memory */ int32_t socket_id;/**< NUMA socket ID. */ diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c index 700aba2..566a052 100644 --- a/lib/librte_eal/linuxapp/eal/eal_memory.c +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c @@ -300,7 +300,7 @@ map_all_hugepages(struct hugepage_file *hugepg_tbl, #endif for (i = 0; i < hpi->num_pages[0]; i++) { - size_t hugepage_sz = hpi->hugepage_sz; + uint64_t hugepage_sz = hpi->hugepage_sz; if (orig) { hugepg_tbl[i].file_id = i; -- 1.9.3
[dpdk-dev] [PATCH 2/2] Fix compile issue of eal with icc compile
lib/librte_eal/linuxapp/eal/eal.c(461): error #2259: non-pointer conversion from "long long" to "void *" may lose significant bits RTE_PTR_ALIGN_CEIL((uintptr_t)addr, RTE_PGSIZE_16M); The root cause is that "RTE_PGSIZE_16M" is defined as unsigned long long. But in i686 platform "void *" is 32-bit. It is safe to cast to size_t and make it works in both 32 & 64-bit platform. Signed-off-by: Michael Qiu --- lib/librte_eal/linuxapp/eal/eal.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/librte_eal/linuxapp/eal/eal.c b/lib/librte_eal/linuxapp/eal/eal.c index 40b462e..37e4419 100644 --- a/lib/librte_eal/linuxapp/eal/eal.c +++ b/lib/librte_eal/linuxapp/eal/eal.c @@ -458,7 +458,7 @@ eal_parse_base_virtaddr(const char *arg) * it can align to 2MB for x86. So this alignment can also be used * on x86 */ internal_config.base_virtaddr = - RTE_PTR_ALIGN_CEIL((uintptr_t)addr, RTE_PGSIZE_16M); + RTE_PTR_ALIGN_CEIL((uintptr_t)addr, (size_t)RTE_PGSIZE_16M); return 0; } -- 1.9.3
[dpdk-dev] [PATCH 0/2 v4] Fix two compile issues with i686 platform
These two issues are both introuduced by commit b77b5639: mem: add huge page sizes for IBM Power Michael Qiu (2): Fix compile issue with hugepage_sz in 32-bit system Fix compile issue of eal with icc compile lib/librte_eal/common/eal_common_memory.c | 2 +- lib/librte_eal/common/eal_internal_cfg.h| 2 +- lib/librte_eal/common/include/rte_memory.h | 2 +- lib/librte_eal/common/include/rte_memzone.h | 2 +- lib/librte_eal/linuxapp/eal/eal.c | 2 +- lib/librte_eal/linuxapp/eal/eal_memory.c| 2 +- 6 files changed, 6 insertions(+), 6 deletions(-) -- 1.9.3
[dpdk-dev] Added Spinlock to l3fwd-vf example to prevent race conditioning
2014-12-08 09:45, Neil Horman: > On Tue, Jul 08, 2014 at 12:16:24PM +0100, Daniel Mrzyglod wrote: > > Signed-off-by: Daniel Mrzyglod > Acked-by: Neil Horman Someone to provide an explanation for commit log? Thanks -- Thomas
[dpdk-dev] [PATCH] Avoid possible memory cpoy when sort hugepages
On 12/10/2014 6:41 PM, Richardson, Bruce wrote: > On Wed, Dec 10, 2014 at 10:25:41AM +0800, Michael Qiu wrote: >> When the first address is the compared address in the loop, >> it will also do memory copy, which is meaningless, >> worse more, when hugepg_tbl is mostly in order. This should >> be a big deal in large hugepage memory systerm(like hunderd >> or thousand GB). > I actually doubt the time taken by this sorting is a significant part of the > initialization time taken, even for hundreds of GBs of memory. Do you have > any measurements to confirm this? > [However, that's not to say that we can't merge in this patch] I've no hardware env of so huge memory, so I haven't do measurements on this. This is not a big improvement, but indeed it may do lots of useless memory copy in initialize stat. It could, at least could save time :) Thanks, Michael > > >> Meanwhile smallest_idx never be a value of -1,so remove this check. >> >> This patch also includes some coding style fix. >> >> Signed-off-by: Michael Qiu >> --- >> lib/librte_eal/linuxapp/eal/eal_memory.c | 13 + >> 1 file changed, 5 insertions(+), 8 deletions(-) >> >> diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c >> b/lib/librte_eal/linuxapp/eal/eal_memory.c >> index e6cb919..700aba2 100644 >> --- a/lib/librte_eal/linuxapp/eal/eal_memory.c >> +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c >> @@ -678,14 +678,13 @@ error: >> static int >> sort_by_physaddr(struct hugepage_file *hugepg_tbl, struct hugepage_info >> *hpi) >> { >> -unsigned i, j; >> -int compare_idx; >> +unsigned i, j, compare_idx; >> uint64_t compare_addr; >> struct hugepage_file tmp; >> >> for (i = 0; i < hpi->num_pages[0]; i++) { >> compare_addr = 0; >> -compare_idx = -1; >> +compare_idx = i; >> >> /* >> * browse all entries starting at 'i', and find the >> @@ -704,11 +703,9 @@ sort_by_physaddr(struct hugepage_file *hugepg_tbl, >> struct hugepage_info *hpi) >> } >> } >> >> -/* should not happen */ >> -if (compare_idx == -1) { >> -RTE_LOG(ERR, EAL, "%s(): error in physaddr sorting\n", >> __func__); >> -return -1; >> -} >> +/* avoid memory copy when the first entry is the compared */ >> +if (compare_idx == i) >> +continue; >> >> /* swap the 2 entries in the table */ >> memcpy(&tmp, &hugepg_tbl[compare_idx], >> -- >> 1.9.3 >>
[dpdk-dev] transmit functions of dpdk
Dear all, In my algorithm, I am interested to perform two activities - (1) transmitting packets to a tx_queue and (2) transmitting packets from tx_queue to a wire - separately. I have gone through the code by putting logs in the dpdk code and found that there is a function rte_eth_tx_burst which transmits packets to a specific queue. However, when I debugged more then I found that this function just calls eth_igb_xmit_pkts from librte_pmd_e1000, and this function just directly write the packets to the wire by writing all packets into registers. Could you please suggest how to implement these two functions if these are not implemented already in dpdk? Thanks & Regards, Sachin.
[dpdk-dev] [PATCH v4 0/4] compatibility fallback and replacement of kernel function invoking
The related kernel function is: - pci_num_vf, it is introduced from upstream linux-2.6.34. For RHEL-based kernel, it is defined from RHEL5.9. - kstrtoul, this function is united kernel API to replace strict_strtoul in the furture. It is introduced from linux-2.6.39. For RHEL6, it is defined from RHEL6.4. This patchset do some compatiblity work for these two functions, and replace strict_strtoul which is depleted from linux-3.18. Some test results for dpdk-1.8.0-rc3 with this patchset: 1. In RHEL7 GA: Build success after specify -mssse3 to CFLAGS, ``` CC ixgbe_rxtx_vec.o In file included from /root/dpdk-source/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c:41:0: /usr/lib/gcc/x86_64-redhat-linux/4.8.2/include/tmmintrin.h:31:3: error: #error "SSSE3 instruction set not enabled" # error "SSSE3 instruction set not enabled" ``` - for igb_uio, pass - for kni, pass - for xen_dom0, pass 2. In RHEL6.5 : Build fails for kni, and xen_dom0: - for igb_uio: pass - for kni: fail, the log is: ``` CC [M] /root/dpdk-source/build/build/lib/librte_eal/linuxapp/kni/kni_vhost.o /root/dpdk-source/build/build/lib/librte_eal/linuxapp/kni/kni_vhost.c: In function ?kni_sock_poll?: /root/dpdk-source/build/build/lib/librte_eal/linuxapp/kni/kni_vhost.c:222: error: ?struct socket? has no member named ?wq? /root/dpdk-source/build/build/lib/librte_eal/linuxapp/kni/kni_vhost.c: In function ?kni_chk_vhost_rx?: /root/dpdk-source/build/build/lib/librte_eal/linuxapp/kni/kni_vhost.c:313: error: implicit declaration of function ?sk_sleep? cc1: warnings being treated as errors /root/dpdk-source/build/build/lib/librte_eal/linuxapp/kni/kni_vhost.c:313: error: passing argument 1 of ?__wake_up? makes pointer from integer without a cast include/linux/wait.h:146: note: expected ?struct wait_queue_head_t *? but argument is of type ?int? /root/dpdk-source/build/build/lib/librte_eal/linuxapp/kni/kni_vhost.c: In function ?kni_sk_write_space?: /root/dpdk-source/build/build/lib/librte_eal/linuxapp/kni/kni_vhost.c:580: error: assignment makes pointer from integer without a cast make[8]: *** [/root/dpdk-source/build/build/lib/librte_eal/linuxapp/kni/kni_vhost.o] Error 1 ``` - for xen_dom0: fail, seems xen is not offical supported in RHEL6. v4: Indent for patches. Include compat.h for kni_vhost.c v3: Adjust pci_num_vf() introduced RHEL version number. Seperate "replace strict_strtoul with kstrtoul" into 3 patches for igb_uio, kni, and xen_dom0. Add compat.h in kni and xen_dom0 for compatible with older kernels. v2: Merge these two patch in one patchset. Compatible with old kernel for kstrtoul. Compatible with RHEL6 for pci_num_vf. Jincheng Miao (4): igb_uio: compatible with upstream longterm kernel and RHEL igb_uio: replace strict_strtoul with kstrtoul kni: replace strict_strtoul with kstrtoul xen_dom0: replace strict_strtoul with kstrtoul lib/librte_eal/linuxapp/igb_uio/compat.h| 11 ++- lib/librte_eal/linuxapp/igb_uio/igb_uio.c | 4 ++-- lib/librte_eal/linuxapp/kni/compat.h| 16 lib/librte_eal/linuxapp/kni/kni_vhost.c | 3 ++- lib/librte_eal/linuxapp/xen_dom0/compat.h | 16 lib/librte_eal/linuxapp/xen_dom0/dom0_mm_misc.c | 2 +- 6 files changed, 47 insertions(+), 5 deletions(-) create mode 100644 lib/librte_eal/linuxapp/kni/compat.h create mode 100644 lib/librte_eal/linuxapp/xen_dom0/compat.h -- 1.8.3.1
[dpdk-dev] [PATCH v4 2/4] igb_uio: replace strict_strtoul with kstrtoul
>From upstream kernel commit 3db2e9cd, strict_strto* serial functions are removed. So that we should directly used kstrtoul instead. Kstrtoul exists from RHEL6.4, so for compatible with old kernel and RHEL, add some logic to igb_uio/compat.h, same as what we do for pci_num_vf(). Signed-off-by: Jincheng Miao --- lib/librte_eal/linuxapp/igb_uio/compat.h | 8 lib/librte_eal/linuxapp/igb_uio/igb_uio.c | 4 ++-- 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/lib/librte_eal/linuxapp/igb_uio/compat.h b/lib/librte_eal/linuxapp/igb_uio/compat.h index a36f034..455e8b4 100644 --- a/lib/librte_eal/linuxapp/igb_uio/compat.h +++ b/lib/librte_eal/linuxapp/igb_uio/compat.h @@ -44,6 +44,14 @@ static int pci_num_vf(struct pci_dev *dev) #endif /* < 2.6.34 */ +#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 39) && \ + (!(defined(RHEL_RELEASE_CODE) && \ +RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(6, 4))) + +#define kstrtoul strict_strtoul + +#endif /* < 2.6.39 */ + #if LINUX_VERSION_CODE < KERNEL_VERSION(3, 3, 0) && \ (!(defined(RHEL_RELEASE_CODE) && \ RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(6, 3))) diff --git a/lib/librte_eal/linuxapp/igb_uio/igb_uio.c b/lib/librte_eal/linuxapp/igb_uio/igb_uio.c index d1ca26e..47ff2f3 100644 --- a/lib/librte_eal/linuxapp/igb_uio/igb_uio.c +++ b/lib/librte_eal/linuxapp/igb_uio/igb_uio.c @@ -83,7 +83,7 @@ store_max_vfs(struct device *dev, struct device_attribute *attr, unsigned long max_vfs; struct pci_dev *pdev = container_of(dev, struct pci_dev, dev); - if (0 != strict_strtoul(buf, 0, &max_vfs)) + if (0 != kstrtoul(buf, 0, &max_vfs)) return -EINVAL; if (0 == max_vfs) @@ -174,7 +174,7 @@ store_max_read_request_size(struct device *dev, unsigned long size = 0; int ret; - if (strict_strtoul(buf, 0, &size) != 0) + if (0 != kstrtoul(buf, 0, &size)) return -EINVAL; ret = pcie_set_readrq(pci_dev, (int)size); -- 1.8.3.1
[dpdk-dev] [PATCH v4 3/4] kni: replace strict_strtoul with kstrtoul
Add kni/compat.h for be compatible with older kernel. Signed-off-by: Jincheng Miao --- lib/librte_eal/linuxapp/kni/compat.h| 16 lib/librte_eal/linuxapp/kni/kni_vhost.c | 3 ++- 2 files changed, 18 insertions(+), 1 deletion(-) create mode 100644 lib/librte_eal/linuxapp/kni/compat.h diff --git a/lib/librte_eal/linuxapp/kni/compat.h b/lib/librte_eal/linuxapp/kni/compat.h new file mode 100644 index 000..39e07ef --- /dev/null +++ b/lib/librte_eal/linuxapp/kni/compat.h @@ -0,0 +1,16 @@ +/* + * Minimal wrappers to allow compiling kni on older kernels. + */ + +#ifndef RHEL_RELEASE_VERSION +#define RHEL_RELEASE_VERSION(a, b) (((a) << 8) + (b)) +#endif + +#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 39) && \ + (!(defined(RHEL_RELEASE_CODE) && \ +RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(6, 4))) + +#define kstrtoul strict_strtoul + +#endif /* < 2.6.39 */ + diff --git a/lib/librte_eal/linuxapp/kni/kni_vhost.c b/lib/librte_eal/linuxapp/kni/kni_vhost.c index 7bcc985..35263f2 100644 --- a/lib/librte_eal/linuxapp/kni/kni_vhost.c +++ b/lib/librte_eal/linuxapp/kni/kni_vhost.c @@ -35,6 +35,7 @@ #include "kni_dev.h" #include "kni_fifo.h" +#include "compat.h" #define RX_BURST_SZ 4 @@ -740,7 +741,7 @@ set_sock_en(struct device *dev, struct device_attribute *attr, unsigned long en; int err = 0; - if (0 != strict_strtoul(buf, 0, &en)) + if (0 != kstrtoul(buf, 0, &en)) return -EINVAL; if (en) -- 1.8.3.1
[dpdk-dev] [PATCH v4 4/4] xen_dom0: replace strict_strtoul with kstrtoul
And add xen_dom0/compat.h for be compatible with older kernel. Signed-off-by: Jincheng Miao --- lib/librte_eal/linuxapp/xen_dom0/compat.h | 16 lib/librte_eal/linuxapp/xen_dom0/dom0_mm_misc.c | 2 +- 2 files changed, 17 insertions(+), 1 deletion(-) create mode 100644 lib/librte_eal/linuxapp/xen_dom0/compat.h diff --git a/lib/librte_eal/linuxapp/xen_dom0/compat.h b/lib/librte_eal/linuxapp/xen_dom0/compat.h new file mode 100644 index 000..89dab27 --- /dev/null +++ b/lib/librte_eal/linuxapp/xen_dom0/compat.h @@ -0,0 +1,16 @@ +/* + * Minimal wrappers to allow compiling xen_dom0 on older kernels. + */ + +#ifndef RHEL_RELEASE_VERSION +#define RHEL_RELEASE_VERSION(a, b) (((a) << 8) + (b)) +#endif + +#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 39) && \ + (!(defined(RHEL_RELEASE_CODE) && \ +RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(6, 4))) + +#define kstrtoul strict_strtoul + +#endif /* < 2.6.39 */ + diff --git a/lib/librte_eal/linuxapp/xen_dom0/dom0_mm_misc.c b/lib/librte_eal/linuxapp/xen_dom0/dom0_mm_misc.c index dfb271d..8a3727d 100644 --- a/lib/librte_eal/linuxapp/xen_dom0/dom0_mm_misc.c +++ b/lib/librte_eal/linuxapp/xen_dom0/dom0_mm_misc.c @@ -123,7 +123,7 @@ store_memsize(struct device *dev, struct device_attribute *attr, int err = 0; unsigned long mem_size; - if (0 != strict_strtoul(buf, 0, &mem_size)) + if (0 != kstrtoul(buf, 0, &mem_size)) return -EINVAL; mutex_lock(&dom0_dev.data_lock); -- 1.8.3.1
[dpdk-dev] [PATCH] Avoid possible memory cpoy when sort hugepages
> -Original Message- > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Qiu, Michael > Sent: Wednesday, December 10, 2014 10:59 AM > To: Richardson, Bruce > Cc: dev at dpdk.org > Subject: Re: [dpdk-dev] [PATCH] Avoid possible memory cpoy when sort hugepages > > On 12/10/2014 6:41 PM, Richardson, Bruce wrote: > > On Wed, Dec 10, 2014 at 10:25:41AM +0800, Michael Qiu wrote: > >> When the first address is the compared address in the loop, > >> it will also do memory copy, which is meaningless, > >> worse more, when hugepg_tbl is mostly in order. This should > >> be a big deal in large hugepage memory systerm(like hunderd > >> or thousand GB). > > I actually doubt the time taken by this sorting is a significant part of the > > initialization time taken, even for hundreds of GBs of memory. Do you have > > any measurements to confirm this? > > [However, that's not to say that we can't merge in this patch] > > I've no hardware env of so huge memory, so I haven't do measurements on > this. > > This is not a big improvement, but indeed it may do lots of useless > memory copy in initialize stat. > > It could, at least could save time :) I wonder why we do need to write our own bubble sort procedure? Why we can't use standard qsort() here? Konstantin > > Thanks, > Michael > > > > > >> Meanwhile smallest_idx never be a value of -1,so remove this check. > >> > >> This patch also includes some coding style fix. > >> > >> Signed-off-by: Michael Qiu > >> --- > >> lib/librte_eal/linuxapp/eal/eal_memory.c | 13 + > >> 1 file changed, 5 insertions(+), 8 deletions(-) > >> > >> diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c > >> b/lib/librte_eal/linuxapp/eal/eal_memory.c > >> index e6cb919..700aba2 100644 > >> --- a/lib/librte_eal/linuxapp/eal/eal_memory.c > >> +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c > >> @@ -678,14 +678,13 @@ error: > >> static int > >> sort_by_physaddr(struct hugepage_file *hugepg_tbl, struct hugepage_info > >> *hpi) > >> { > >> - unsigned i, j; > >> - int compare_idx; > >> + unsigned i, j, compare_idx; > >>uint64_t compare_addr; > >>struct hugepage_file tmp; > >> > >>for (i = 0; i < hpi->num_pages[0]; i++) { > >>compare_addr = 0; > >> - compare_idx = -1; > >> + compare_idx = i; > >> > >>/* > >> * browse all entries starting at 'i', and find the > >> @@ -704,11 +703,9 @@ sort_by_physaddr(struct hugepage_file *hugepg_tbl, > >> struct hugepage_info *hpi) > >>} > >>} > >> > >> - /* should not happen */ > >> - if (compare_idx == -1) { > >> - RTE_LOG(ERR, EAL, "%s(): error in physaddr sorting\n", > >> __func__); > >> - return -1; > >> - } > >> + /* avoid memory copy when the first entry is the compared */ > >> + if (compare_idx == i) > >> + continue; > >> > >>/* swap the 2 entries in the table */ > >>memcpy(&tmp, &hugepg_tbl[compare_idx], > >> -- > >> 1.9.3 > >>
[dpdk-dev] transmit functions of dpdk
On Wed, Dec 10, 2014 at 12:03:41PM +0100, Sachin Sharma wrote: > Dear all, > > In my algorithm, I am interested to perform two activities - (1) > transmitting packets to a tx_queue and (2) transmitting packets from > tx_queue to a wire - separately. I have gone through the code by putting > logs in the dpdk code and found that there is a function rte_eth_tx_burst > which transmits packets to a specific queue. However, when I debugged more > then I found that this function just calls eth_igb_xmit_pkts > from librte_pmd_e1000, and this function just directly write the packets to > the wire by writing all packets into registers. Could you please suggest > how to implement these two functions if these are not implemented already > in dpdk? > > > > Thanks & Regards, > Sachin. Hi Sachin, anything written to the NIC TX queue is automatically put onto the wire unless the NIC port is down or the wire is unplugged etc. What is your use-case that you need to do this? I would suggest doing internal buffering in your application, as many DPDK example applications do, and then call tx_burst to put your packets on the wire when you want this capability. Regards, /Bruce
[dpdk-dev] [PATCH v4 1/4] igb_uio: compatible with upstream longterm kernel and RHEL
Function pci_num_vf() is introduced from upstream linux-2.6.34. So this patch make compatible with longterm kernel linux-2.6.32.63. For RHEL, function pci_num_vf() begins from RHEL5 update9. And it is stub-defined when CONFIG_PCI_IOV is not enabled. So dropped the CONFIG_PCI_IOV checking of commit 11ba0426. For other distro like RHEL behaved to pci_num_vf(), we could simply append following condition macro: (!(defined(OTHER_RELEASE_CODE) && \ OTHER_RELEASE_CODE >= OTHER_RELEASE_VERSION(X, Y))) Signed-off-by: Jincheng Miao --- lib/librte_eal/linuxapp/igb_uio/compat.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/lib/librte_eal/linuxapp/igb_uio/compat.h b/lib/librte_eal/linuxapp/igb_uio/compat.h index 676fa1b..a36f034 100644 --- a/lib/librte_eal/linuxapp/igb_uio/compat.h +++ b/lib/librte_eal/linuxapp/igb_uio/compat.h @@ -21,7 +21,8 @@ #endif #if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 34) && \ - !defined(CONFIG_PCI_IOV) + (!(defined(RHEL_RELEASE_CODE) && \ +RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(5, 9))) static int pci_num_vf(struct pci_dev *dev) { -- 1.8.3.1
[dpdk-dev] [PATCH v5] VFIO: Avoid to enable vfio while the module not loaded
On 12/10/2014 5:21 PM, Burakov, Anatoly wrote: >> When vfio module is not loaded when kernel support vfio feature, the >> routine still try to open the container to get file description. >> >> This action is not safe, and of cause got error messages: >> >> EAL: Detected 40 lcore(s) >> EAL: unsupported IOMMU type! >> EAL: VFIO support could not be initialized >> EAL: Setting up memory... >> >> This may make user confuse, this patch make it reasonable and much more >> soomth to user. >> >> Signed-off-by: Michael Qiu >> --- >> v5 --> v4 >> 1. Move rte_eal_check_module() body to eal.c >> 2. Clean up "unsupported IOMMU type" log >> >> v4 --> v3: >> 1. Remove RTE_LOG for params check >> 2. Remove "vfio" module check as "vfio_iommu_type1" >> loaded indecated "vfio" loaded >> >> v3 --> v2: >> 1. Add error log in rte_eal_check_module() >> 2. Some code clean up. >> >> v2 --> v1: >> 1. Move check_module() from rte_common.h to eal_private.h >>and rename to rte_eal_check_module(). >>To make it linuxapp only. >> 2. Some code clean up. >> >> lib/librte_eal/common/eal_private.h| 15 +++ >> lib/librte_eal/linuxapp/eal/eal.c | 27 +++ >> lib/librte_eal/linuxapp/eal/eal_pci_vfio.c | 26 >> +++--- >> 3 files changed, 65 insertions(+), 3 deletions(-) >> >> diff --git a/lib/librte_eal/common/eal_private.h >> b/lib/librte_eal/common/eal_private.h >> index 232fcec..4183b54 100644 >> --- a/lib/librte_eal/common/eal_private.h >> +++ b/lib/librte_eal/common/eal_private.h >> @@ -203,4 +203,19 @@ int rte_eal_alarm_init(void); >> */ >> int rte_eal_dev_init(void); >> >> +/** >> + * Function is to check if the kernel module(like, vfio, >> +vfio_iommu_type1, >> + * etc.) loaded. >> + * >> + * @param module_name >> + * The module's name which need to be checked >> + * >> + * @return >> + * -1 means some error happens(NULL pointer or open failure) >> + * 0 means the module not loaded >> + * 1 means the module loaded >> + */ >> +inline int >> +rte_eal_check_module(const char *module_name); > Just curious - why inline? Just want to make it inline, no strong reason. If you do not agree I will remove. > >> + >> #endif /* _EAL_PRIVATE_H_ */ >> diff --git a/lib/librte_eal/linuxapp/eal/eal.c >> b/lib/librte_eal/linuxapp/eal/eal.c >> index 89f3b5e..40b462e 100644 >> --- a/lib/librte_eal/linuxapp/eal/eal.c >> +++ b/lib/librte_eal/linuxapp/eal/eal.c >> @@ -859,3 +859,30 @@ int rte_eal_has_hugepages(void) { >> return ! internal_config.no_hugetlbfs; } >> + >> +inline int >> +rte_eal_check_module(const char *module_name) { >> +char mod_name[30]; /* Any module names can be longer than 30 >> bytes? */ >> +int ret = 0; >> + >> +if (NULL == module_name) >> +return -1; >> + >> +FILE * fd = fopen("/proc/modules", "r"); >> +if (NULL == fd) { >> +RTE_LOG(ERR, EAL, "Open /proc/modules failed!" >> +" error %i (%s)\n", errno, strerror(errno)); >> +return -1; >> +} >> +while(!feof(fd)) { >> +fscanf(fd, "%30s %*[^\n]", mod_name); > As far as I know, in fscanf terms, %30s will result in at most 30 character > string, i.e. 31 bytes (30 characters + null terminator). So it probably > should be %29s. Good catch, you are right. Thanks, Michael > >> +if(!strcmp(mod_name, module_name)) { >> +ret = 1; >> +break; >> +} >> +} >> +fclose(fd); >> + >> +return ret; >> +} >> diff --git a/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c >> b/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c >> index c1246e8..16fe10f 100644 >> --- a/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c >> +++ b/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c >> @@ -44,6 +44,7 @@ >> #include >> #include >> #include >> +#include >> >> #include "eal_filesystem.h" >> #include "eal_pci_init.h" >> @@ -339,10 +340,12 @@ pci_vfio_get_container_fd(void) >> ret = ioctl(vfio_container_fd, VFIO_CHECK_EXTENSION, >> VFIO_TYPE1_IOMMU); >> if (ret != 1) { >> if (ret < 0) >> -RTE_LOG(ERR, EAL, " could not get IOMMU >> type, " >> -"error %i (%s)\n", errno, >> strerror(errno)); >> +RTE_LOG(ERR, EAL, " could not get IOMMU >> type," >> +" error %i (%s)\n", errno, >> +strerror(errno)); >> else >> -RTE_LOG(ERR, EAL, " unsupported IOMMU >> type!\n"); >> +RTE_LOG(ERR, EAL, " unsupported IOMMU >> type" >> +" detected in VFIO\n"); >> close(vfio_container_fd); >> return -1; >> } >> @@ -783,11 +786,28 @@ pci_vfio
[dpdk-dev] transmit functions of dpdk
Hi Bruce, In my use case, I want to have three NIC TX queues per port, and want to fill one NIC TX queue and want to empty the other queue. Is it possible this through tx_burst or do I need to implement these queues in applications as you suggested before. However, in this case, I would have then one NIC TX queues and three queues in an application which actually transmits packets to this NIC TX queue. Am I right? Thanks, Sachin. On Wed, Dec 10, 2014 at 12:22 PM, Bruce Richardson < bruce.richardson at intel.com> wrote: > On Wed, Dec 10, 2014 at 12:03:41PM +0100, Sachin Sharma wrote: > > Dear all, > > > > In my algorithm, I am interested to perform two activities - (1) > > transmitting packets to a tx_queue and (2) transmitting packets from > > tx_queue to a wire - separately. I have gone through the code by putting > > logs in the dpdk code and found that there is a function rte_eth_tx_burst > > which transmits packets to a specific queue. However, when I debugged > more > > then I found that this function just calls eth_igb_xmit_pkts > > from librte_pmd_e1000, and this function just directly write the packets > to > > the wire by writing all packets into registers. Could you please suggest > > how to implement these two functions if these are not implemented already > > in dpdk? > > > > > > > > Thanks & Regards, > > Sachin. > > Hi Sachin, > > anything written to the NIC TX queue is automatically put onto the wire > unless > the NIC port is down or the wire is unplugged etc. What is your use-case > that you > need to do this? I would suggest doing internal buffering in your > application, > as many DPDK example applications do, and then call tx_burst to put your > packets > on the wire when you want this capability. > > Regards, > /Bruce >
[dpdk-dev] transmit functions of dpdk
On Wed, Dec 10, 2014 at 12:31:27PM +0100, Sachin Sharma wrote: > Hi Bruce, > > In my use case, I want to have three NIC TX queues per port, and want to > fill one NIC TX queue and want to empty the other queue. Is it possible > this through tx_burst or do I need to implement these queues in > applications as you suggested before. However, in this case, I would have > then one NIC TX queues and three queues in an application which actually > transmits packets to this NIC TX queue. Am I right? > For the suggestion I made, yes, you would have three software queues in your application, and a single TX on the NIC - though you could also have a 1:1 mapping of software to HW queues if you wanted. However, I'm not entirely clear on what you mean by filling one queue and emptying another. Is this just a form of buffering you are trying to implement? > Thanks, > Sachin. > > On Wed, Dec 10, 2014 at 12:22 PM, Bruce Richardson < > bruce.richardson at intel.com> wrote: > > > On Wed, Dec 10, 2014 at 12:03:41PM +0100, Sachin Sharma wrote: > > > Dear all, > > > > > > In my algorithm, I am interested to perform two activities - (1) > > > transmitting packets to a tx_queue and (2) transmitting packets from > > > tx_queue to a wire - separately. I have gone through the code by putting > > > logs in the dpdk code and found that there is a function rte_eth_tx_burst > > > which transmits packets to a specific queue. However, when I debugged > > more > > > then I found that this function just calls eth_igb_xmit_pkts > > > from librte_pmd_e1000, and this function just directly write the packets > > to > > > the wire by writing all packets into registers. Could you please suggest > > > how to implement these two functions if these are not implemented already > > > in dpdk? > > > > > > > > > > > > Thanks & Regards, > > > Sachin. > > > > Hi Sachin, > > > > anything written to the NIC TX queue is automatically put onto the wire > > unless > > the NIC port is down or the wire is unplugged etc. What is your use-case > > that you > > need to do this? I would suggest doing internal buffering in your > > application, > > as many DPDK example applications do, and then call tx_burst to put your > > packets > > on the wire when you want this capability. > > > > Regards, > > /Bruce > >
[dpdk-dev] [PATCH v6] VFIO: Avoid to enable vfio while the module not loaded
When vfio module is not loaded when kernel support vfio feature, the routine still try to open the container to get file description. This action is not safe, and of cause got error messages: EAL: Detected 40 lcore(s) EAL: unsupported IOMMU type! EAL: VFIO support could not be initialized EAL: Setting up memory... This may make user confuse, this patch make it reasonable and much more soomth to user. Signed-off-by: Michael Qiu --- v6 --> v5 1. Change rte_eal_check_module() to normal function instead of inline 2. limit fscanf to get 29 charactors not include '\0' lib/librte_eal/common/eal_private.h| 14 ++ lib/librte_eal/linuxapp/eal/eal.c | 27 +++ lib/librte_eal/linuxapp/eal/eal_pci_vfio.c | 26 +++--- 3 files changed, 64 insertions(+), 3 deletions(-) diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h index 232fcec..2c751c6 100644 --- a/lib/librte_eal/common/eal_private.h +++ b/lib/librte_eal/common/eal_private.h @@ -203,4 +203,18 @@ int rte_eal_alarm_init(void); */ int rte_eal_dev_init(void); +/** + * Function is to check if the kernel module(like, vfio, vfio_iommu_type1, + * etc.) loaded. + * + * @param module_name + * The module's name which need to be checked + * + * @return + * -1 means some error happens(NULL pointer or open failure) + * 0 means the module not loaded + * 1 means the module loaded + */ +int rte_eal_check_module(const char *module_name); + #endif /* _EAL_PRIVATE_H_ */ diff --git a/lib/librte_eal/linuxapp/eal/eal.c b/lib/librte_eal/linuxapp/eal/eal.c index 89f3b5e..9c1a1cc 100644 --- a/lib/librte_eal/linuxapp/eal/eal.c +++ b/lib/librte_eal/linuxapp/eal/eal.c @@ -859,3 +859,30 @@ int rte_eal_has_hugepages(void) { return ! internal_config.no_hugetlbfs; } + +int +rte_eal_check_module(const char *module_name) +{ + char mod_name[30]; /* Any module names can be longer than 30 bytes? */ + int ret = 0; + + if (NULL == module_name) + return -1; + + FILE * fd = fopen("/proc/modules", "r"); + if (NULL == fd) { + RTE_LOG(ERR, EAL, "Open /proc/modules failed!" + " error %i (%s)\n", errno, strerror(errno)); + return -1; + } + while(!feof(fd)) { + fscanf(fd, "%29s %*[^\n]", mod_name); + if(!strcmp(mod_name, module_name)) { + ret = 1; + break; + } + } + fclose(fd); + + return ret; +} diff --git a/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c b/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c index c1246e8..16fe10f 100644 --- a/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c +++ b/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c @@ -44,6 +44,7 @@ #include #include #include +#include #include "eal_filesystem.h" #include "eal_pci_init.h" @@ -339,10 +340,12 @@ pci_vfio_get_container_fd(void) ret = ioctl(vfio_container_fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU); if (ret != 1) { if (ret < 0) - RTE_LOG(ERR, EAL, " could not get IOMMU type, " - "error %i (%s)\n", errno, strerror(errno)); + RTE_LOG(ERR, EAL, " could not get IOMMU type," + " error %i (%s)\n", errno, + strerror(errno)); else - RTE_LOG(ERR, EAL, " unsupported IOMMU type!\n"); + RTE_LOG(ERR, EAL, " unsupported IOMMU type" + " detected in VFIO\n"); close(vfio_container_fd); return -1; } @@ -783,11 +786,28 @@ pci_vfio_enable(void) { /* initialize group list */ int i; + int module_vfio_type1; for (i = 0; i < VFIO_MAX_GROUPS; i++) { vfio_cfg.vfio_groups[i].fd = -1; vfio_cfg.vfio_groups[i].group_no = -1; } + + module_vfio_type1 = rte_eal_check_module("vfio_iommu_type1"); + + /* return error directly */ + if (module_vfio_type1 == -1) { + RTE_LOG(INFO, EAL, "Could not get loaded module details!\n"); + return -1; + } + + /* return 0 if VFIO modules not loaded */ + if (module_vfio_type1 == 0) { + RTE_LOG(INFO, EAL, "VFIO modules not all loaded," + " skip VFIO support ...\n"); + return 0; + } + vfio_cfg.vfio_container_fd = pci_vfio_get_container_fd(); /* check if we have VFIO driver enabled */ -- 1.9.3
[dpdk-dev] [PATCH v6] VFIO: Avoid to enable vfio while the module not loaded
> When vfio module is not loaded when kernel support vfio feature, the > routine still try to open the container to get file description. > > This action is not safe, and of cause got error messages: > > EAL: Detected 40 lcore(s) > EAL: unsupported IOMMU type! > EAL: VFIO support could not be initialized > EAL: Setting up memory... > > This may make user confuse, this patch make it reasonable and much more > soomth to user. > > Signed-off-by: Michael Qiu > --- > v6 --> v5 > 1. Change rte_eal_check_module() to normal > function instead of inline > 2. limit fscanf to get 29 charactors not include '\0' > > lib/librte_eal/common/eal_private.h| 14 ++ > lib/librte_eal/linuxapp/eal/eal.c | 27 +++ > lib/librte_eal/linuxapp/eal/eal_pci_vfio.c | 26 > +++--- > 3 files changed, 64 insertions(+), 3 deletions(-) > > diff --git a/lib/librte_eal/common/eal_private.h > b/lib/librte_eal/common/eal_private.h > index 232fcec..2c751c6 100644 > --- a/lib/librte_eal/common/eal_private.h > +++ b/lib/librte_eal/common/eal_private.h > @@ -203,4 +203,18 @@ int rte_eal_alarm_init(void); > */ > int rte_eal_dev_init(void); > > +/** > + * Function is to check if the kernel module(like, vfio, > +vfio_iommu_type1, > + * etc.) loaded. > + * > + * @param module_name > + * The module's name which need to be checked > + * > + * @return > + * -1 means some error happens(NULL pointer or open failure) > + * 0 means the module not loaded > + * 1 means the module loaded > + */ > +int rte_eal_check_module(const char *module_name); > + > #endif /* _EAL_PRIVATE_H_ */ > diff --git a/lib/librte_eal/linuxapp/eal/eal.c > b/lib/librte_eal/linuxapp/eal/eal.c > index 89f3b5e..9c1a1cc 100644 > --- a/lib/librte_eal/linuxapp/eal/eal.c > +++ b/lib/librte_eal/linuxapp/eal/eal.c > @@ -859,3 +859,30 @@ int rte_eal_has_hugepages(void) { > return ! internal_config.no_hugetlbfs; } > + > +int > +rte_eal_check_module(const char *module_name) { > + char mod_name[30]; /* Any module names can be longer than 30 > bytes? */ > + int ret = 0; > + > + if (NULL == module_name) > + return -1; > + > + FILE * fd = fopen("/proc/modules", "r"); > + if (NULL == fd) { > + RTE_LOG(ERR, EAL, "Open /proc/modules failed!" > + " error %i (%s)\n", errno, strerror(errno)); > + return -1; > + } > + while(!feof(fd)) { > + fscanf(fd, "%29s %*[^\n]", mod_name); > + if(!strcmp(mod_name, module_name)) { > + ret = 1; > + break; > + } > + } > + fclose(fd); > + > + return ret; > +} > diff --git a/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c > b/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c > index c1246e8..16fe10f 100644 > --- a/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c > +++ b/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c > @@ -44,6 +44,7 @@ > #include > #include > #include > +#include > > #include "eal_filesystem.h" > #include "eal_pci_init.h" > @@ -339,10 +340,12 @@ pci_vfio_get_container_fd(void) > ret = ioctl(vfio_container_fd, VFIO_CHECK_EXTENSION, > VFIO_TYPE1_IOMMU); > if (ret != 1) { > if (ret < 0) > - RTE_LOG(ERR, EAL, " could not get IOMMU > type, " > - "error %i (%s)\n", errno, > strerror(errno)); > + RTE_LOG(ERR, EAL, " could not get IOMMU > type," > + " error %i (%s)\n", errno, > + strerror(errno)); > else > - RTE_LOG(ERR, EAL, " unsupported IOMMU > type!\n"); > + RTE_LOG(ERR, EAL, " unsupported IOMMU > type" > + " detected in VFIO\n"); > close(vfio_container_fd); > return -1; > } > @@ -783,11 +786,28 @@ pci_vfio_enable(void) { > /* initialize group list */ > int i; > + int module_vfio_type1; > > for (i = 0; i < VFIO_MAX_GROUPS; i++) { > vfio_cfg.vfio_groups[i].fd = -1; > vfio_cfg.vfio_groups[i].group_no = -1; > } > + > + module_vfio_type1 = rte_eal_check_module("vfio_iommu_type1"); > + > + /* return error directly */ > + if (module_vfio_type1 == -1) { > + RTE_LOG(INFO, EAL, "Could not get loaded module > details!\n"); > + return -1; > + } > + > + /* return 0 if VFIO modules not loaded */ > + if (module_vfio_type1 == 0) { > + RTE_LOG(INFO, EAL, "VFIO modules not all loaded," > + " skip VFIO support ...\n"); > + return 0; > + } > + > vfio_cfg.vfio_container_fd = pci_vfio_get_container_fd(); > > /* check if we have VFIO driver enable
[dpdk-dev] transmit functions of dpdk
Hi Bruce, >>>I'm not entirely clear on what you mean by filling one queue and emptying another. Is this just a form of buffering you are trying to implement? Yes, you are right! I am implementing a buffering mechanism in which a node will have three queues and it fills one queue with packets and when the queue is full then transmit the packets from the queue to the wire and while filling one queue, it can transmit packets to a wire through another queue that is already full. Thanks & Regards, Sachin. On Wed, Dec 10, 2014 at 12:45 PM, Bruce Richardson < bruce.richardson at intel.com> wrote: > On Wed, Dec 10, 2014 at 12:31:27PM +0100, Sachin Sharma wrote: > > Hi Bruce, > > > > In my use case, I want to have three NIC TX queues per port, and want to > > fill one NIC TX queue and want to empty the other queue. Is it possible > > this through tx_burst or do I need to implement these queues in > > applications as you suggested before. However, in this case, I would have > > then one NIC TX queues and three queues in an application which actually > > transmits packets to this NIC TX queue. Am I right? > > > > For the suggestion I made, yes, you would have three software queues in > your > application, and a single TX on the NIC - though you could also have a 1:1 > mapping > of software to HW queues if you wanted. > However, I'm not entirely clear on what you mean by filling one queue and > emptying > another. Is this just a form of buffering you are trying to implement? > > > Thanks, > > Sachin. > > > > On Wed, Dec 10, 2014 at 12:22 PM, Bruce Richardson < > > bruce.richardson at intel.com> wrote: > > > > > On Wed, Dec 10, 2014 at 12:03:41PM +0100, Sachin Sharma wrote: > > > > Dear all, > > > > > > > > In my algorithm, I am interested to perform two activities - (1) > > > > transmitting packets to a tx_queue and (2) transmitting packets from > > > > tx_queue to a wire - separately. I have gone through the code by > putting > > > > logs in the dpdk code and found that there is a function > rte_eth_tx_burst > > > > which transmits packets to a specific queue. However, when I debugged > > > more > > > > then I found that this function just calls eth_igb_xmit_pkts > > > > from librte_pmd_e1000, and this function just directly write the > packets > > > to > > > > the wire by writing all packets into registers. Could you please > suggest > > > > how to implement these two functions if these are not implemented > already > > > > in dpdk? > > > > > > > > > > > > > > > > Thanks & Regards, > > > > Sachin. > > > > > > Hi Sachin, > > > > > > anything written to the NIC TX queue is automatically put onto the wire > > > unless > > > the NIC port is down or the wire is unplugged etc. What is your > use-case > > > that you > > > need to do this? I would suggest doing internal buffering in your > > > application, > > > as many DPDK example applications do, and then call tx_burst to put > your > > > packets > > > on the wire when you want this capability. > > > > > > Regards, > > > /Bruce > > > >
[dpdk-dev] [PATCH] examples/vhost: Fix data len issue
Search the right segment to increase its data length, rather than wrongly early return and exit the tx function, which leads to drop all jumbo frame packets when vm2vm is in hard forward mode. Signed-off-by: Changchun Ouyang --- examples/vhost/main.c | 25 - 1 file changed, 20 insertions(+), 5 deletions(-) diff --git a/examples/vhost/main.c b/examples/vhost/main.c index 1f1edbe..12fa4ce 100644 --- a/examples/vhost/main.c +++ b/examples/vhost/main.c @@ -1127,9 +1127,8 @@ virtio_tx_route(struct vhost_dev *vdev, struct rte_mbuf *m, uint16_t vlan_tag) return; } - if (vm2vm_mode == VM2VM_HARDWARE) { - if (find_local_dest(dev, m, &offset, &vlan_tag) != 0 || - offset > rte_pktmbuf_tailroom(m)) { + if (unlikely(vm2vm_mode == VM2VM_HARDWARE)) { + if (unlikely(find_local_dest(dev, m, &offset, &vlan_tag) != 0)) { rte_pktmbuf_free(m); return; } @@ -1143,8 +1142,24 @@ virtio_tx_route(struct vhost_dev *vdev, struct rte_mbuf *m, uint16_t vlan_tag) m->ol_flags = PKT_TX_VLAN_PKT; - m->data_len += offset; - m->pkt_len += offset; + /* +* Find the right seg to adjust the data len when offset is +* bigger than tail room size. +*/ + if (unlikely(vm2vm_mode == VM2VM_HARDWARE)) { + if (likely(offset <= rte_pktmbuf_tailroom(m))) + m->data_len += offset; + else { + struct rte_mbuf *seg = m; + + while ((seg->next != NULL) && + (offset > rte_pktmbuf_tailroom(seg))) + seg = seg->next; + + seg->data_len += offset; + } + m->pkt_len += offset; + } m->vlan_tci = vlan_tag; -- 1.8.4.2
[dpdk-dev] [PATCH v4 4/4] xen_dom0: replace strict_strtoul with kstrtoul
2014-12-10 19:04, Jincheng Miao: > And add xen_dom0/compat.h for be compatible with older kernel. > > Signed-off-by: Jincheng Miao [...] > new file mode 100644 > index 000..89dab27 > --- /dev/null > +++ b/lib/librte_eal/linuxapp/xen_dom0/compat.h [...] > --- a/lib/librte_eal/linuxapp/xen_dom0/dom0_mm_misc.c > +++ b/lib/librte_eal/linuxapp/xen_dom0/dom0_mm_misc.c > @@ -123,7 +123,7 @@ store_memsize(struct device *dev, struct device_attribute > *attr, > int err = 0; > unsigned long mem_size; > > - if (0 != strict_strtoul(buf, 0, &mem_size)) > + if (0 != kstrtoul(buf, 0, &mem_size)) > return -EINVAL; > > mutex_lock(&dom0_dev.data_lock); > Still no include of compat.h -- Thomas
[dpdk-dev] [PATCH] Fix power_of_2 macro. Avoid branching when calculating RTE_MIN and RTE_MAX.
2014-12-09 13:05, r k: > Subject: [PATCH] Fix power_of_2 macro. Avoid branching when > calculating RTE_MIN and RTE_MAX. Please could you add more explanations about the problem you are solving? You should also add a Signed-off like explained in this page: http://dpdk.org/dev#send Thanks -- Thomas
[dpdk-dev] [PATCH v4 4/4] xen_dom0: replace strict_strtoul with kstrtoul
- Original Message - > 2014-12-10 19:04, Jincheng Miao: > > And add xen_dom0/compat.h for be compatible with older kernel. > > > > Signed-off-by: Jincheng Miao > [...] > > new file mode 100644 > > index 000..89dab27 > > --- /dev/null > > +++ b/lib/librte_eal/linuxapp/xen_dom0/compat.h > [...] > > --- a/lib/librte_eal/linuxapp/xen_dom0/dom0_mm_misc.c > > +++ b/lib/librte_eal/linuxapp/xen_dom0/dom0_mm_misc.c > > @@ -123,7 +123,7 @@ store_memsize(struct device *dev, struct > > device_attribute *attr, > > int err = 0; > > unsigned long mem_size; > > > > - if (0 != strict_strtoul(buf, 0, &mem_size)) > > + if (0 != kstrtoul(buf, 0, &mem_size)) > > return -EINVAL; > > > > mutex_lock(&dom0_dev.data_lock); > > > > Still no include of compat.h Sorry for that, resend PATCH v4 4/4 again. > > -- > Thomas >
[dpdk-dev] [PATCH v4 4/4] xen_dom0: replace strict_strtoul with kstrtoul
And add xen_dom0/compat.h for be compatible with older kernel. Signed-off-by: Jincheng Miao --- lib/librte_eal/linuxapp/xen_dom0/compat.h | 16 lib/librte_eal/linuxapp/xen_dom0/dom0_mm_misc.c | 3 ++- 2 files changed, 18 insertions(+), 1 deletion(-) create mode 100644 lib/librte_eal/linuxapp/xen_dom0/compat.h diff --git a/lib/librte_eal/linuxapp/xen_dom0/compat.h b/lib/librte_eal/linuxapp/xen_dom0/compat.h new file mode 100644 index 000..89dab27 --- /dev/null +++ b/lib/librte_eal/linuxapp/xen_dom0/compat.h @@ -0,0 +1,16 @@ +/* + * Minimal wrappers to allow compiling xen_dom0 on older kernels. + */ + +#ifndef RHEL_RELEASE_VERSION +#define RHEL_RELEASE_VERSION(a, b) (((a) << 8) + (b)) +#endif + +#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 39) && \ + (!(defined(RHEL_RELEASE_CODE) && \ +RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(6, 4))) + +#define kstrtoul strict_strtoul + +#endif /* < 2.6.39 */ + diff --git a/lib/librte_eal/linuxapp/xen_dom0/dom0_mm_misc.c b/lib/librte_eal/linuxapp/xen_dom0/dom0_mm_misc.c index dfb271d..543bf57 100644 --- a/lib/librte_eal/linuxapp/xen_dom0/dom0_mm_misc.c +++ b/lib/librte_eal/linuxapp/xen_dom0/dom0_mm_misc.c @@ -74,6 +74,7 @@ #include #include +#include "compat.h" #include "dom0_mm_dev.h" MODULE_LICENSE("Dual BSD/GPL"); @@ -123,7 +124,7 @@ store_memsize(struct device *dev, struct device_attribute *attr, int err = 0; unsigned long mem_size; - if (0 != strict_strtoul(buf, 0, &mem_size)) + if (0 != kstrtoul(buf, 0, &mem_size)) return -EINVAL; mutex_lock(&dom0_dev.data_lock); -- 1.8.3.1
[dpdk-dev] transmit functions of dpdk
On Wed, Dec 10, 2014 at 01:09:32PM +0100, Sachin Sharma wrote: > Hi Bruce, > > >>>I'm not entirely clear on what you mean by filling one queue and > emptying another. Is this just a form of buffering you are trying to > implement? > > Yes, you are right! I am implementing a buffering mechanism in which a node > will have three queues and it fills one queue with packets and when the > queue is full then transmit the packets from the queue to the wire and > while filling one queue, it can transmit packets to a wire through another > queue that is already full. > > Thanks & Regards, > Sachin. > The standard sample applications with DPDK use a simple buffering scheme, where we buffer the packets until a full burst of 32 are ready for sending. Once we have a full burst of packets - or a timeout occurs - we then send that burst of packets using tx_burst function. Would such a scheme not work for you? NOTE: the tx_burst function returns as soon as the packets are written to the NIC's TX ring and the NIC's tail pointer is updated. It does not actually wait until all the packets are transmitted onto the wire. This means that your core does not need to see about trying to do other tasks during the actual packet transmission time, so you don't need a buffer for new packets arriving at the core while the NIC is physically transmitting data. Regards, /Bruce > > > On Wed, Dec 10, 2014 at 12:45 PM, Bruce Richardson < > bruce.richardson at intel.com> wrote: > > > On Wed, Dec 10, 2014 at 12:31:27PM +0100, Sachin Sharma wrote: > > > Hi Bruce, > > > > > > In my use case, I want to have three NIC TX queues per port, and want to > > > fill one NIC TX queue and want to empty the other queue. Is it possible > > > this through tx_burst or do I need to implement these queues in > > > applications as you suggested before. However, in this case, I would have > > > then one NIC TX queues and three queues in an application which actually > > > transmits packets to this NIC TX queue. Am I right? > > > > > > > For the suggestion I made, yes, you would have three software queues in > > your > > application, and a single TX on the NIC - though you could also have a 1:1 > > mapping > > of software to HW queues if you wanted. > > However, I'm not entirely clear on what you mean by filling one queue and > > emptying > > another. Is this just a form of buffering you are trying to implement? > > > > > Thanks, > > > Sachin. > > > > > > On Wed, Dec 10, 2014 at 12:22 PM, Bruce Richardson < > > > bruce.richardson at intel.com> wrote: > > > > > > > On Wed, Dec 10, 2014 at 12:03:41PM +0100, Sachin Sharma wrote: > > > > > Dear all, > > > > > > > > > > In my algorithm, I am interested to perform two activities - (1) > > > > > transmitting packets to a tx_queue and (2) transmitting packets from > > > > > tx_queue to a wire - separately. I have gone through the code by > > putting > > > > > logs in the dpdk code and found that there is a function > > rte_eth_tx_burst > > > > > which transmits packets to a specific queue. However, when I debugged > > > > more > > > > > then I found that this function just calls eth_igb_xmit_pkts > > > > > from librte_pmd_e1000, and this function just directly write the > > packets > > > > to > > > > > the wire by writing all packets into registers. Could you please > > suggest > > > > > how to implement these two functions if these are not implemented > > already > > > > > in dpdk? > > > > > > > > > > > > > > > > > > > > Thanks & Regards, > > > > > Sachin. > > > > > > > > Hi Sachin, > > > > > > > > anything written to the NIC TX queue is automatically put onto the wire > > > > unless > > > > the NIC port is down or the wire is unplugged etc. What is your > > use-case > > > > that you > > > > need to do this? I would suggest doing internal buffering in your > > > > application, > > > > as many DPDK example applications do, and then call tx_burst to put > > your > > > > packets > > > > on the wire when you want this capability. > > > > > > > > Regards, > > > > /Bruce > > > > > >
[dpdk-dev] [PATCH v4] mbuf: fix of enabling all newly added RX error flags
> -Original Message- > From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com] > Sent: Wednesday, December 10, 2014 5:35 PM > To: Zhang, Helin > Cc: dev at dpdk.org > Subject: Re: [dpdk-dev] [PATCH v4] mbuf: fix of enabling all newly added RX > error flags > > 2014-12-10 16:55, Helin Zhang: > > Before redefining mbuf structure, there was lack of free bits in 'ol_flags' > > (32 bits in total) for new RX or TX flags. So it tried to reuse > > existant bits as most as possible, or even assigning 0 to some of bit > > flags. After new mbuf structure defined, there are quite a lot of free > > bits. So those newly added bit flags should be assigned with correct > > and valid bit values, and getting their names should be enabled as > > well. Note that 'RECIP' should be removed, as nowhere uses it. > > 'PKT_RX_ERR' is defined to replace all other error bit flags, e.g. MAC > > error, > Oversize error, header buffer overflow error. > > > > Signed-off-by: Helin Zhang > > --- > > lib/librte_mbuf/rte_mbuf.c | 7 ++- > > lib/librte_mbuf/rte_mbuf.h | 9 +++-- > > lib/librte_pmd_i40e/i40e_rxtx.c | 37 > > - > > 3 files changed, 25 insertions(+), 28 deletions(-) > > > > v2 changes: > > * Removed error flag of 'ECIPE' processing only in both i40e PMD and mbuf. > All > > other error flags were added back. > > * Assigned error flags with correct and valid values, as their previous > > values > > were invalid. > > * Enabled getting all error flag names. > > > > v3 changes: > > * 'PKT_RX_ERR' is defined to replace error bit flags of MAC error, Oversize > > error, header buffer overflow error. > > * Removed assigning values to PKT_TX_* bit flags, as it has already been > done > > in another packet set. > > > > v4 changes: > > * Renamed 'PKT_RX_EIP_CKSUM_BAD' to 'PKT_RX_OUTER_IP_CKSUM_BAD'. > > * Fixed the bug of checking error bits of 'Descriptor oversize' and > > 'Header buffer oversize'. > > * Added debug logs for each RX error. > [...] > > --- a/lib/librte_mbuf/rte_mbuf.h > > +++ b/lib/librte_mbuf/rte_mbuf.h > > @@ -83,12 +83,7 @@ extern "C" { > > #define PKT_RX_RSS_HASH (1ULL << 1) /**< RX packet with RSS > hash result. */ > > #define PKT_RX_FDIR (1ULL << 2) /**< RX packet with FDIR > match indicate. */ > > #define PKT_RX_L4_CKSUM_BAD (1ULL << 3) /**< L4 cksum of RX pkt. > is > > not OK. */ -#define PKT_RX_IP_CKSUM_BAD (1ULL << 4) /**< IP cksum of > > RX pkt. is not OK. */ -#define PKT_RX_EIP_CKSUM_BAD (0ULL << 0) /**< > External IP header checksum error. */ > > -#define PKT_RX_OVERSIZE (0ULL << 0) /**< Num of desc of an RX > pkt oversize. */ > > -#define PKT_RX_HBUF_OVERFLOW (0ULL << 0) /**< Header buffer > overflow. */ > > -#define PKT_RX_RECIP_ERR (0ULL << 0) /**< Hardware processing > error. */ > > -#define PKT_RX_MAC_ERR (0ULL << 0) /**< MAC error. */ > > +#define PKT_RX_IP_CKSUM_BAD (1ULL << 4) /**< IP (or inner IP) > > +header checksum error. */ > > It can be also an outer IP header in case the device don't support tunneling > or is > not configured to detect it. For non-tunneling case, no outer/inner at all, it just has the IP header. The bit flag indicates the IP header checksum error. For tunneling case, this bit flag indicates the inner IP header checksum error, another one for outer IP header checksum error. So I don't think this bit can be treated as outer. Regards, Helin > -- > Thomas > > > #define PKT_RX_IPV4_HDR (1ULL << 5) /**< RX packet with IPv4 > header. */ > > #define PKT_RX_IPV4_HDR_EXT (1ULL << 6) /**< RX packet with > extended IPv4 header. */ > > #define PKT_RX_IPV6_HDR (1ULL << 7) /**< RX packet with IPv6 > header. */ > > @@ -99,6 +94,8 @@ extern "C" { > > #define PKT_RX_TUNNEL_IPV6_HDR (1ULL << 12) /**< RX tunnel packet > with IPv6 header. */ > > #define PKT_RX_FDIR_ID (1ULL << 13) /**< FD id reported if FDIR > match. */ > > #define PKT_RX_FDIR_FLX (1ULL << 14) /**< Flexible bytes reported > if FDIR match. */ > > +#define PKT_RX_OUTER_IP_CKSUM_BAD (1ULL << 15) /**< Outer IP > header checksum error. */ > > +#define PKT_RX_ERR (1ULL << 16) /**< Other errors, e.g. > MAC error. */ > > /* add new RX flags here */ > > > > /* add new TX flags here */
[dpdk-dev] [PATCH v2] ixgbe: fix multi-process support
2014-12-05 13:46, Bruce Richardson: > When using multiple processes, the TX function used in all processes > should be the same, otherwise the secondary processes cannot transmit > more than tx-ring-size - 1 packets. > To achieve this, we extract out the code to select the ixgbe TX function > to be used into a separate function inside the ixgbe driver, and call > that from a secondary process when it is attaching to an > already-configured NIC. > > Testing with symmetric MP app shows that we are able to RX and TX from > both primary and secondary processes once this patch is applied. > > Signed-off-by: Bruce Richardson > > > V2 Changes: > * Moved check for primary/secondary process into set_tx_function instead > of ixgbe_txq_vec_setup, which reduces scope of diff. > * Added/cleaned up some code comments for this change Anyone to review this patch? Thanks -- Thomas
[dpdk-dev] transmit functions of dpdk
Hi Bruce, >>>The standard sample applications with DPDK use a simple buffering scheme, where >>>we buffer the packets until a full burst of 32 are ready for sending. Once we >>>have a full burst of packets - or a timeout occurs - we then send that burst >>>of packets using tx_burst function. Would such a scheme not work for you? Perhaps, this scheme will give some timing issues to me. For me, time is very critical issue. In your example, NIC TX ring also does some kind of buffering in uncontrolled manner (i.e., packets are transmitted to it without knowing that it will buffer or will directly forward on wire). However, in my case, I want to control the transmission onto wire and buffering into NIC TX ring through my software. Therefore, I think that I need to create three NIC TX ring per port rather than creating three software queues and need to transmit on wire from those ring through my software code. So, is there any way that I can control transmission of packets through these NIC TX rings i.e, buffering and transmission on the wire? or this is not possible? Thanks & kind regards, Sachin. On Wed, Dec 10, 2014 at 2:37 PM, Bruce Richardson < bruce.richardson at intel.com> wrote: > On Wed, Dec 10, 2014 at 01:09:32PM +0100, Sachin Sharma wrote: > > Hi Bruce, > > > > >>>I'm not entirely clear on what you mean by filling one queue and > > emptying another. Is this just a form of buffering you are trying to > > implement? > > > > Yes, you are right! I am implementing a buffering mechanism in which a > node > > will have three queues and it fills one queue with packets and when the > > queue is full then transmit the packets from the queue to the wire and > > while filling one queue, it can transmit packets to a wire through > another > > queue that is already full. > > > > Thanks & Regards, > > Sachin. > > > The standard sample applications with DPDK use a simple buffering scheme, > where > we buffer the packets until a full burst of 32 are ready for sending. Once > we > have a full burst of packets - or a timeout occurs - we then send that > burst > of packets using tx_burst function. Would such a scheme not work for you? > > NOTE: the tx_burst function returns as soon as the packets are written to > the > NIC's TX ring and the NIC's tail pointer is updated. It does not actually > wait > until all the packets are transmitted onto the wire. This means that your > core > does not need to see about trying to do other tasks during the actual > packet > transmission time, so you don't need a buffer for new packets arriving at > the > core while the NIC is physically transmitting data. > > Regards, > /Bruce > > > > > > On Wed, Dec 10, 2014 at 12:45 PM, Bruce Richardson < > > bruce.richardson at intel.com> wrote: > > > > > On Wed, Dec 10, 2014 at 12:31:27PM +0100, Sachin Sharma wrote: > > > > Hi Bruce, > > > > > > > > In my use case, I want to have three NIC TX queues per port, and > want to > > > > fill one NIC TX queue and want to empty the other queue. Is it > possible > > > > this through tx_burst or do I need to implement these queues in > > > > applications as you suggested before. However, in this case, I would > have > > > > then one NIC TX queues and three queues in an application which > actually > > > > transmits packets to this NIC TX queue. Am I right? > > > > > > > > > > For the suggestion I made, yes, you would have three software queues in > > > your > > > application, and a single TX on the NIC - though you could also have a > 1:1 > > > mapping > > > of software to HW queues if you wanted. > > > However, I'm not entirely clear on what you mean by filling one queue > and > > > emptying > > > another. Is this just a form of buffering you are trying to implement? > > > > > > > Thanks, > > > > Sachin. > > > > > > > > On Wed, Dec 10, 2014 at 12:22 PM, Bruce Richardson < > > > > bruce.richardson at intel.com> wrote: > > > > > > > > > On Wed, Dec 10, 2014 at 12:03:41PM +0100, Sachin Sharma wrote: > > > > > > Dear all, > > > > > > > > > > > > In my algorithm, I am interested to perform two activities - > (1) > > > > > > transmitting packets to a tx_queue and (2) transmitting packets > from > > > > > > tx_queue to a wire - separately. I have gone through the code by > > > putting > > > > > > logs in the dpdk code and found that there is a function > > > rte_eth_tx_burst > > > > > > which transmits packets to a specific queue. However, when I > debugged > > > > > more > > > > > > then I found that this function just calls eth_igb_xmit_pkts > > > > > > from librte_pmd_e1000, and this function just directly write the > > > packets > > > > > to > > > > > > the wire by writing all packets into registers. Could you please > > > suggest > > > > > > how to implement these two functions if these are not implemented > > > already > > > > > > in dpdk? > > > > > > > > > > > > > > > > > > > > > > > > Thanks & Regards, > > > > > > Sachin. > > > > > > > > > > Hi Sachin, > > > > > > > > > > anything written to the
[dpdk-dev] A question about hugepage initialization time
On Wed, Dec 10, 2014 at 10:32:25AM +, Bruce Richardson wrote: > On Tue, Dec 09, 2014 at 02:10:32PM -0800, Stephen Hemminger wrote: > > On Tue, 9 Dec 2014 11:45:07 -0800 > > &rew wrote: > > > > > > Hey Folks, > > > > > > > > Our DPDK application deals with very large in memory data structures, > > > > and > > > > can potentially use tens or even hundreds of gigabytes of hugepage > > > > memory. > > > > During the course of development, we've noticed that as the number of > > > > huge > > > > pages increases, the memory initialization time during EAL init gets to > > > > be > > > > quite long, lasting several minutes at present. The growth in init time > > > > doesn't appear to be linear, which is concerning. > > > > > > > > This is a minor inconvenience for us and our customers, as memory > > > > initialization makes our boot times a lot longer than it would otherwise > > > > be. Also, my experience has been that really long operations often are > > > > hiding errors - what you think is merely a slow operation is actually a > > > > timeout of some sort, often due to misconfiguration. This leads to two > > > > questions: > > > > > > > > 1. Does the long initialization time suggest that there's an error > > > > happening under the covers? > > > > 2. If not, is there any simple way that we can shorten memory > > > > initialization time? > > > > > > > > Thanks in advance for your insights. > > > > > > > > -- > > > > Matt Laswell > > > > laswell at infiniteio.com > > > > infinite io, inc. > > > > > > > > > > Hello, > > > > > > please find some quick comments on the questions: > > > 1.) By our experience long initialization time is normal in case of > > > large amount of memory. However this time depends on some things: > > > - number of hugepages (pagefault handled by kernel is pretty expensive) > > > - size of hugepages (memset at initialization) > > > > > > 2.) Using 1G pages instead of 2M will reduce the initialization time > > > significantly. Using wmemset instead of memset adds an additional 20-30% > > > boost by our measurements. Or, just by touching the pages but not > > > cleaning > > > them you can have still some more speedup. But in this case your layer or > > > the applications above need to do the cleanup at allocation time > > > (e.g. by using rte_zmalloc). > > > > > > Cheers, > > > &rew > > > > I wonder if the whole rte_malloc code is even worth it with a modern kernel > > with transparent huge pages? rte_malloc adds very little value and is less > > safe > > and slower than glibc or other allocators. Plus you lose the ablilty to get > > all the benefit out of valgrind or electric fence. > > While I'd dearly love to not have our own custom malloc lib to maintain, for > DPDK > multiprocess, rte_malloc will be hard to replace as we would need a > replacement > solution that similarly guarantees that memory mapped in process A is also > available at the same address in process B. :-( > Just out of curiosity, why even bother with multiprocess support? What you're talking about above is a multithread model, and your shoehorning multiple processes into it. Neil > /Bruce >
[dpdk-dev] A question about hugepage initialization time
On Wed, Dec 10, 2014 at 09:29:26AM -0500, Neil Horman wrote: > On Wed, Dec 10, 2014 at 10:32:25AM +, Bruce Richardson wrote: > > On Tue, Dec 09, 2014 at 02:10:32PM -0800, Stephen Hemminger wrote: > > > On Tue, 9 Dec 2014 11:45:07 -0800 > > > &rew wrote: > > > > > > > > Hey Folks, > > > > > > > > > > Our DPDK application deals with very large in memory data structures, > > > > > and > > > > > can potentially use tens or even hundreds of gigabytes of hugepage > > > > > memory. > > > > > During the course of development, we've noticed that as the number of > > > > > huge > > > > > pages increases, the memory initialization time during EAL init gets > > > > > to be > > > > > quite long, lasting several minutes at present. The growth in init > > > > > time > > > > > doesn't appear to be linear, which is concerning. > > > > > > > > > > This is a minor inconvenience for us and our customers, as memory > > > > > initialization makes our boot times a lot longer than it would > > > > > otherwise > > > > > be. Also, my experience has been that really long operations often > > > > > are > > > > > hiding errors - what you think is merely a slow operation is actually > > > > > a > > > > > timeout of some sort, often due to misconfiguration. This leads to two > > > > > questions: > > > > > > > > > > 1. Does the long initialization time suggest that there's an error > > > > > happening under the covers? > > > > > 2. If not, is there any simple way that we can shorten memory > > > > > initialization time? > > > > > > > > > > Thanks in advance for your insights. > > > > > > > > > > -- > > > > > Matt Laswell > > > > > laswell at infiniteio.com > > > > > infinite io, inc. > > > > > > > > > > > > > Hello, > > > > > > > > please find some quick comments on the questions: > > > > 1.) By our experience long initialization time is normal in case of > > > > large amount of memory. However this time depends on some things: > > > > - number of hugepages (pagefault handled by kernel is pretty expensive) > > > > - size of hugepages (memset at initialization) > > > > > > > > 2.) Using 1G pages instead of 2M will reduce the initialization time > > > > significantly. Using wmemset instead of memset adds an additional > > > > 20-30% > > > > boost by our measurements. Or, just by touching the pages but not > > > > cleaning > > > > them you can have still some more speedup. But in this case your layer > > > > or > > > > the applications above need to do the cleanup at allocation time > > > > (e.g. by using rte_zmalloc). > > > > > > > > Cheers, > > > > &rew > > > > > > I wonder if the whole rte_malloc code is even worth it with a modern > > > kernel > > > with transparent huge pages? rte_malloc adds very little value and is > > > less safe > > > and slower than glibc or other allocators. Plus you lose the ablilty to > > > get > > > all the benefit out of valgrind or electric fence. > > > > While I'd dearly love to not have our own custom malloc lib to maintain, > > for DPDK > > multiprocess, rte_malloc will be hard to replace as we would need a > > replacement > > solution that similarly guarantees that memory mapped in process A is also > > available at the same address in process B. :-( > > > Just out of curiosity, why even bother with multiprocess support? What you're > talking about above is a multithread model, and your shoehorning multiple > processes into it. > Neil > Yep, that's pretty much what it is alright. However, this multiprocess support is very widely used by our customers in building their applications, and has been in place and supported since some of the earliest DPDK releases. If it is to be removed, it needs to be replaced by something that provides equivalent capabilities to application writers (perhaps something with more fine-grained sharing etc.) /Bruce
[dpdk-dev] [PATCH v2] app/test: fix memory needs after RTE_MAX_LCORE was increased to 128
2014-12-09 10:11, Pablo de Lara: > Since commit b91c67e5a693211862aa7dc3b78630b4e856c2af, > maximum number of cores is 128, which has increase > the total memory necessary for a rte_mempool structure, > as the per-lcore local cache has been doubled in size. > Therefore, eal_flags unit test was broken since it needed > to use more hugepages. > > Changes in v2: Increased memory to 18MB, as that is the > actual minimum memory necessary (depending on the physical memory segments, > DPDK may need less memory) > > Signed-off-by: Pablo de Lara Independently of RTE_MAX_LCORE being increased to 128, I still have some problems with test_invalid_vdev_flag(). First, DEFAULT_MEM_SIZE is not used in this function. Also, even with your patch, I get this error in my VM: Cannot get hugepage information But if this patch solves an issue for you, I guess it should be applied. That said, I would appreciate a complete description of the commands you use to launch this test in a VM (starting with the Qemu command). Thanks -- Thomas
[dpdk-dev] Added Spinlock to l3fwd-vf example to prevent race conditioning
On Wed, Dec 10, 2014 at 08:18:36AM +, Wodkowski, PawelX wrote: > > Though, that said, doesn't it seem to anyone else like serialization of > > enqueue > > to a port should be the responsibility of the library, not the application? > > > > Neil > > From my knowledge it is an application responsibility to serialize access to > queue on particular port. > I understand thats the way it currently is, I'm advocating for the fact that it should not be. Neil > Pawel >
[dpdk-dev] Added Spinlock to l3fwd-vf example to prevent race conditioning
On Wed, Dec 10, 2014 at 09:47:45AM -0500, Neil Horman wrote: > On Wed, Dec 10, 2014 at 08:18:36AM +, Wodkowski, PawelX wrote: > > > Though, that said, doesn't it seem to anyone else like serialization of > > > enqueue > > > to a port should be the responsibility of the library, not the > > > application? > > > > > > Neil > > > > From my knowledge it is an application responsibility to serialize access > > to > > queue on particular port. > > > I understand thats the way it currently is, I'm advocating for the fact that > it > should not be. > Neil > It could be done, but I think we'd need to add a new API (or new parameter to existing API) to do so, as the cost of adding the locks would be severe, even in the uncontented case. This is why it hasn't been done up till now, obviously enough. In general, where we don't provide performant multi-thread safe APIs, we generally don't try and provide versions with locks, we just document the limitation and then leave it up to the app to determine how best to handle things. /Bruce
[dpdk-dev] [PATCH v4] mbuf: fix of enabling all newly added RX error flags
2014-12-10 13:50, Zhang, Helin: > > > -Original Message- > > From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com] > > Sent: Wednesday, December 10, 2014 5:35 PM > > To: Zhang, Helin > > Cc: dev at dpdk.org > > Subject: Re: [dpdk-dev] [PATCH v4] mbuf: fix of enabling all newly added RX > > error flags > > > > 2014-12-10 16:55, Helin Zhang: > > > Before redefining mbuf structure, there was lack of free bits in > > > 'ol_flags' > > > (32 bits in total) for new RX or TX flags. So it tried to reuse > > > existant bits as most as possible, or even assigning 0 to some of bit > > > flags. After new mbuf structure defined, there are quite a lot of free > > > bits. So those newly added bit flags should be assigned with correct > > > and valid bit values, and getting their names should be enabled as > > > well. Note that 'RECIP' should be removed, as nowhere uses it. > > > 'PKT_RX_ERR' is defined to replace all other error bit flags, e.g. MAC > > > error, > > Oversize error, header buffer overflow error. > > > > > > Signed-off-by: Helin Zhang > > > --- > > > lib/librte_mbuf/rte_mbuf.c | 7 ++- > > > lib/librte_mbuf/rte_mbuf.h | 9 +++-- > > > lib/librte_pmd_i40e/i40e_rxtx.c | 37 > > > - > > > 3 files changed, 25 insertions(+), 28 deletions(-) > > > > > > v2 changes: > > > * Removed error flag of 'ECIPE' processing only in both i40e PMD and mbuf. > > All > > > other error flags were added back. > > > * Assigned error flags with correct and valid values, as their previous > > > values > > > were invalid. > > > * Enabled getting all error flag names. > > > > > > v3 changes: > > > * 'PKT_RX_ERR' is defined to replace error bit flags of MAC error, > > > Oversize > > > error, header buffer overflow error. > > > * Removed assigning values to PKT_TX_* bit flags, as it has already been > > done > > > in another packet set. > > > > > > v4 changes: > > > * Renamed 'PKT_RX_EIP_CKSUM_BAD' to 'PKT_RX_OUTER_IP_CKSUM_BAD'. > > > * Fixed the bug of checking error bits of 'Descriptor oversize' and > > > 'Header buffer oversize'. > > > * Added debug logs for each RX error. > > [...] > > > --- a/lib/librte_mbuf/rte_mbuf.h > > > +++ b/lib/librte_mbuf/rte_mbuf.h > > > @@ -83,12 +83,7 @@ extern "C" { > > > #define PKT_RX_RSS_HASH (1ULL << 1) /**< RX packet with RSS > > hash result. */ > > > #define PKT_RX_FDIR (1ULL << 2) /**< RX packet with FDIR > > match indicate. */ > > > #define PKT_RX_L4_CKSUM_BAD (1ULL << 3) /**< L4 cksum of RX pkt. > > is > > > not OK. */ -#define PKT_RX_IP_CKSUM_BAD (1ULL << 4) /**< IP cksum of > > > RX pkt. is not OK. */ -#define PKT_RX_EIP_CKSUM_BAD (0ULL << 0) /**< > > External IP header checksum error. */ > > > -#define PKT_RX_OVERSIZE (0ULL << 0) /**< Num of desc of an RX > > pkt oversize. */ > > > -#define PKT_RX_HBUF_OVERFLOW (0ULL << 0) /**< Header buffer > > overflow. */ > > > -#define PKT_RX_RECIP_ERR (0ULL << 0) /**< Hardware processing > > error. */ > > > -#define PKT_RX_MAC_ERR (0ULL << 0) /**< MAC error. */ > > > +#define PKT_RX_IP_CKSUM_BAD (1ULL << 4) /**< IP (or inner IP) > > > +header checksum error. */ > > > > It can be also an outer IP header in case the device don't support > > tunneling or is > > not configured to detect it. > > For non-tunneling case, no outer/inner at all, it just has the IP header. The > bit flag > indicates the IP header checksum error. > For tunneling case, this bit flag indicates the inner IP header checksum > error, another > one for outer IP header checksum error. > So I don't think this bit can be treated as outer. I think you didn't understand my comment. I talk about NICs which don't have tunneling support. In this case, the outer IP header is seen as a simple IP header. So, depending on which port is receiving a tunneled packet, this flag or the dedicated one can be used for outer IP checksum. I just suggest to remove the part "(or inner IP)" of the comment. Do you agree? -- Thomas
[dpdk-dev] Added Spinlock to l3fwd-vf example to prevent race conditioning
> -Original Message- > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson > Sent: Wednesday, December 10, 2014 3:55 PM > To: Neil Horman > Cc: dev at dpdk.org > Subject: Re: [dpdk-dev] Added Spinlock to l3fwd-vf example to prevent race > conditioning > > On Wed, Dec 10, 2014 at 09:47:45AM -0500, Neil Horman wrote: > > On Wed, Dec 10, 2014 at 08:18:36AM +, Wodkowski, PawelX wrote: > > > > Though, that said, doesn't it seem to anyone else like serialization of > enqueue > > > > to a port should be the responsibility of the library, not the > > > > application? > > > > > > > > Neil > > > > > > From my knowledge it is an application responsibility to serialize > > > access to > > > queue on particular port. > > > > > I understand thats the way it currently is, I'm advocating for the fact > > that it > > should not be. > > Neil > > > It could be done, but I think we'd need to add a new API (or new parameter to > existing API) to do so, as the cost of adding the locks would be severe, even > in > the uncontented case. > This is why it hasn't been done up till now, obviously enough. In general, > where > we don't provide performant multi-thread safe APIs, we generally don't try and > provide versions with locks, we just document the limitation and then leave it > up to the app to determine how best to handle things. > > /Bruce the problem is when the routing is through the same queue the app crashed. example: traffic to 1.1.1.1 from port 0 and 1.1.1.1 from port 1. You all are right :) So the only solution are spinlocks, or we must modify intel-dpdk-sample-applications-user-guide.pdf to inform users about limitations.
[dpdk-dev] Added Spinlock to l3fwd-vf example to prevent race conditioning
On Wed, Dec 10, 2014 at 02:54:56PM +, Bruce Richardson wrote: > On Wed, Dec 10, 2014 at 09:47:45AM -0500, Neil Horman wrote: > > On Wed, Dec 10, 2014 at 08:18:36AM +, Wodkowski, PawelX wrote: > > > > Though, that said, doesn't it seem to anyone else like serialization of > > > > enqueue > > > > to a port should be the responsibility of the library, not the > > > > application? > > > > > > > > Neil > > > > > > From my knowledge it is an application responsibility to serialize > > > access to > > > queue on particular port. > > > > > I understand thats the way it currently is, I'm advocating for the fact > > that it > > should not be. > > Neil > > > It could be done, but I think we'd need to add a new API (or new parameter to > existing API) to do so, as the cost of adding the locks would be severe, even > in > the uncontented case. > This is why it hasn't been done up till now, obviously enough. In general, > where > we don't provide performant multi-thread safe APIs, we generally don't try and > provide versions with locks, we just document the limitation and then leave > it > up to the app to determine how best to handle things. > This really seems like a false savings to me. If an application intends to use multiple processes (which by all rights it seems like the use case that the dpdk is mostly designed for) then you need locking one way or another, and you've just made application coding harder, because the application now needs to know which functions might have internal critical sections that they need to provide locking for. I agree that, in the single process case, there might be a slight performance loss (though I contend it wouldn't be greatly significant). That said, I would argue that the right approach is to do the locking internally to the DPDK, then provide a configuration point which toggles the spinlock defintions to either do proper locking, or just reduce to empty definitions, the same way the Linux and BSD kernels do in the uniprocessor case. That way applications never have to worry about internal locking, and you can still build for the optimal case when you need to. Neil > /Bruce >
[dpdk-dev] [PATCH] Simple fix in rte_is_power_of_2 and RTE_MIN/RTE_MAX
Subject: [PATCH] Simple fix in rte_is_power_of_2 and RTE_MIN/RTE_MAX Number 0 is not a power_of_2, fix it in rte_is_power_of_2 function. Avoid code which generates branching when calculating min and max. Ravi Kerur (1): Fix rte_is_power_of_2() function since it returns true for zero and zero is not a power of 2. Avoid branching in RTE_MIN and RTE_MAX when calculating minimum and maximum of two numbers. lib/librte_eal/common/include/rte_common.h | 6 +++--- lib/librte_pmd_e1000/igb_pf.c | 4 ++-- lib/librte_pmd_ixgbe/ixgbe_pf.c| 4 ++-- 3 files changed, 7 insertions(+), 7 deletions(-) -- 1.9.1
[dpdk-dev] [PATCH] Fix rte_is_power_of_2() function since it returns true for zero and zero is not a power of 2. Avoid branching in RTE_MIN and RTE_MAX when calculating minimum and maximum of two numb
Subject: [PATCH] Fix rte_is_power_of_2() function since it returns true for zero and zero is not a power of 2. Avoid branching in RTE_MIN and RTE_MAX when calculating minimum and maximum of two numbers. Signed-off-by: Ravi Kerur --- lib/librte_eal/common/include/rte_common.h | 6 +++--- lib/librte_pmd_e1000/igb_pf.c | 4 ++-- lib/librte_pmd_ixgbe/ixgbe_pf.c| 4 ++-- 3 files changed, 7 insertions(+), 7 deletions(-) diff --git a/lib/librte_eal/common/include/rte_common.h b/lib/librte_eal/common/include/rte_common.h index 921b91f..21b17ad 100644 --- a/lib/librte_eal/common/include/rte_common.h +++ b/lib/librte_eal/common/include/rte_common.h @@ -203,7 +203,7 @@ extern int RTE_BUILD_BUG_ON_detected_error; static inline int rte_is_power_of_2(uint32_t n) { - return ((n-1) & n) == 0; + return n && !(n & (n - 1)); } /** @@ -259,7 +259,7 @@ rte_align64pow2(uint64_t v) #define RTE_MIN(a, b) ({ \ typeof (a) _a = (a); \ typeof (b) _b = (b); \ - _a < _b ? _a : _b; \ +_b ^ ((_a ^ _b) & -(_a < _b)); \ }) /** @@ -268,7 +268,7 @@ rte_align64pow2(uint64_t v) #define RTE_MAX(a, b) ({ \ typeof (a) _a = (a); \ typeof (b) _b = (b); \ - _a > _b ? _a : _b; \ +_a ^ ((_a ^ _b) & -(_a < _b)); \ }) /*** Other general functions / macros / diff --git a/lib/librte_pmd_e1000/igb_pf.c b/lib/librte_pmd_e1000/igb_pf.c index bc3816a..546499c 100644 --- a/lib/librte_pmd_e1000/igb_pf.c +++ b/lib/librte_pmd_e1000/igb_pf.c @@ -321,11 +321,11 @@ igb_vf_set_mac_addr(struct rte_eth_dev *dev, uint32_t vf, uint32_t *msgbuf) static int igb_vf_set_multicast(struct rte_eth_dev *dev, __rte_unused uint32_t vf, uint32_t *msgbuf) { - int i; + int16_t i; uint32_t vector_bit; uint32_t vector_reg; uint32_t mta_reg; - int entries = (msgbuf[0] & E1000_VT_MSGINFO_MASK) >> + int32_t entries = (msgbuf[0] & E1000_VT_MSGINFO_MASK) >> E1000_VT_MSGINFO_SHIFT; uint16_t *hash_list = (uint16_t *)&msgbuf[1]; struct e1000_hw *hw = E1000_DEV_PRIVATE_TO_HW(dev->data->dev_private); diff --git a/lib/librte_pmd_ixgbe/ixgbe_pf.c b/lib/librte_pmd_ixgbe/ixgbe_pf.c index 51da1fd..426caf9 100644 --- a/lib/librte_pmd_ixgbe/ixgbe_pf.c +++ b/lib/librte_pmd_ixgbe/ixgbe_pf.c @@ -390,7 +390,7 @@ ixgbe_vf_set_multicast(struct rte_eth_dev *dev, __rte_unused uint32_t vf, uint32 struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private); struct ixgbe_vf_info *vfinfo = *(IXGBE_DEV_PRIVATE_TO_P_VFDATA(dev->data->dev_private)); - int nb_entries = (msgbuf[0] & IXGBE_VT_MSGINFO_MASK) >> + int32_t nb_entries = (msgbuf[0] & IXGBE_VT_MSGINFO_MASK) >> IXGBE_VT_MSGINFO_SHIFT; uint16_t *hash_list = (uint16_t *)&msgbuf[1]; uint32_t mta_idx; @@ -399,7 +399,7 @@ ixgbe_vf_set_multicast(struct rte_eth_dev *dev, __rte_unused uint32_t vf, uint32 const uint32_t IXGBE_MTA_BIT_SHIFT = 5; const uint32_t IXGBE_MTA_BIT_MASK = (0x1 << IXGBE_MTA_BIT_SHIFT) - 1; uint32_t reg_val; - int i; + int16_t i; /* only so many hash values supported */ nb_entries = RTE_MIN(nb_entries, IXGBE_MAX_VF_MC_ENTRIES); -- 1.9.1
[dpdk-dev] [PATCH] Fix power_of_2 macro. Avoid branching when calculating RTE_MIN and RTE_MAX.
Just sent another patch explaining the fix and signed-off. Thanks, Ravi On Wed, Dec 10, 2014 at 4:58 AM, Thomas Monjalon wrote: > 2014-12-09 13:05, r k: >> Subject: [PATCH] Fix power_of_2 macro. Avoid branching when >> calculating RTE_MIN and RTE_MAX. > > Please could you add more explanations about the problem you are solving? > > You should also add a Signed-off like explained in this page: > http://dpdk.org/dev#send > > Thanks > -- > Thomas
[dpdk-dev] [PATCH] Avoid possible memory cpoy when sort hugepages
On Wed, Dec 10, 2014 at 5:08 AM, Ananyev, Konstantin < konstantin.ananyev at intel.com> wrote: > I wonder why we do need to write our own bubble sort procedure? > Why we can't use standard qsort() here? > Sadly, even bubble sort would be better than the selection sort being used here. It's guaranteed to be O(n^2) in all cases. I just got through replacing that entire function in my repo with a call to qsort() from the standard library last night myself. Faster (although probably not material to most deployments) and less code. Jay
[dpdk-dev] [PATCH] Avoid possible memory cpoy when sort hugepages
> From: Jay Rolette [mailto:rolette at infiniteio.com] > Sent: Wednesday, December 10, 2014 5:58 PM > To: Ananyev, Konstantin > Cc: Qiu, Michael; Richardson, Bruce; dev at dpdk.org > Subject: Re: [dpdk-dev] [PATCH] Avoid possible memory cpoy when sort hugepages > > > On Wed, Dec 10, 2014 at 5:08 AM, Ananyev, Konstantin intel.com> wrote: > I wonder why we do need to write our own bubble sort procedure? > Why we can't use standard qsort() here? > > Sadly, even bubble sort would be better than the selection sort being used > here. It's guaranteed to be O(n^2) in all cases. Ah yes, your right it is a selection sort here. > > I just got through replacing that entire function in my repo with a call to > qsort() from the standard library last night myself. Faster > (although probably not material to most deployments) and less code. If you feel like it is worth it, why not to submit a patch? :) Thanks Konstantin > > Jay
[dpdk-dev] lib: include rte_memory.h for __rte_cache_aligned
Hi Neil, Moving __rte_cache_aligned right after struct keyword will help. On the other hand, enforcing this rule for existing (100+) and future definitions will be difficult. It?s clearer and a good practice to include header file explicitly. Thanks, Jia On 12/9/14, 7:22 AM, "Neil Horman" wrote: >On Tue, Dec 09, 2014 at 09:53:18AM +0100, Olivier MATZ wrote: >> Hi Neil, >> >> On 12/08/2014 04:04 PM, Neil Horman wrote: >> >On Fri, Nov 07, 2014 at 09:28:09AM -0800, Jia Yu wrote: >> >>Include rte_memory.h for lib files that use __rte_cache_aligned >> >>attribute. >> >> >> >>Signed-off-by: Jia Yu >> >> >> >Why? I presume there was a build break or something. Please repost >>with a >> >changelog that details what this patch is for. >> >Neil >> >> I don't know if Yu's issue was the same, but I had a very "fun" issue >> with __rte_cache_aligned in my application. Consider the following code: >> >> struct per_core_foo { >> ... >> } __rte_cache_aligned; >> >> struct global_foo { >> struct per_core_foo foo[RTE_MAX_CORE]; >> }; >> >> If __rte_cache_aligned is not defined (rte_memory.h is not included), >> the code compiles but the structure is not aligned... it defines the >> structure and creates a global variable called __rte_cache_aligned. >> And this can lead to really bad things if this code is in a .h that >> is included by files that may or may not include rte_memory.h >> >> I have no idea about how we could prevent this issue, except using >> __attribute__((aligned(CACHE_LINE))) instead of __rte_cache_aligned. >> >> Anyway this could probably explain the willing to include rte_memory.h >> everywhere. >> >> Regards, >> Olivier >> >> > >So, that is a great explination, and would be good to have in the >changelog. > >Also, to avoid the problem that you describe, while its preferred to have >it at >the end of a struct, you can also put the alignment attribute right after >the >struct keyword in gcc: >https://urldefense.proofpoint.com/v2/url?u=https-3A__gcc.gnu.org_onlinedoc >s_gcc_Attribute-2DSyntax.html-23Attribute-2DSyntax&d=AAIBAg&c=Sqcl0Ez6M0X8 >aeM67LKIiDJAXVeAw-YihVMNtXt-uEs&r=q34pQj5yiMxs5OseYCxXLQ&m=mIyHF3ASZxRmbPs >acyMyIABQlSafUdV9PqknKAtfOuI&s=pKoAAkIYRX31K-gR5oSwpcA5mLj4nG7uEzyh0z_uwxU >&e= > >That seems like it would solve the problem going forward. > >Neil >
[dpdk-dev] A question about hugepage initialization time
na ez :) On Wed, 10 Dec 2014, Bruce Richardson wrote: > On Wed, Dec 10, 2014 at 09:29:26AM -0500, Neil Horman wrote: >> On Wed, Dec 10, 2014 at 10:32:25AM +, Bruce Richardson wrote: >>> On Tue, Dec 09, 2014 at 02:10:32PM -0800, Stephen Hemminger wrote: On Tue, 9 Dec 2014 11:45:07 -0800 &rew wrote: >> Hey Folks, >> >> Our DPDK application deals with very large in memory data >> structures, and can potentially use tens or even hundreds of gigabytes >> of hugepage memory. >> During the course of development, we've noticed that as the >> number of huge pages increases, the memory initialization time >> during EAL init gets to be quite long, lasting several minutes at >> present. The growth in init time doesn't appear to be linear, which is >> concerning. >> >> This is a minor inconvenience for us and our customers, as memory >> initialization makes our boot times a lot longer than it would >> otherwise be. Also, my experience has been that really long >> operations often are hiding errors - what you think is merely a >> slow operation is actually a timeout of some sort, often due to >> misconfiguration. This leads to two >> questions: >> >> 1. Does the long initialization time suggest that there's an >> error happening under the covers? >> 2. If not, is there any simple way that we can shorten memory >> initialization time? >> >> Thanks in advance for your insights. >> >> -- >> Matt Laswell >> laswell at infiniteio.com >> infinite io, inc. >> > > Hello, > > please find some quick comments on the questions: > 1.) By our experience long initialization time is normal in case > of large amount of memory. However this time depends on some things: > - number of hugepages (pagefault handled by kernel is pretty > expensive) > - size of hugepages (memset at initialization) > > 2.) Using 1G pages instead of 2M will reduce the initialization > time significantly. Using wmemset instead of memset adds an > additional 20-30% boost by our measurements. Or, just by touching > the pages but not cleaning them you can have still some more > speedup. But in this case your layer or the applications above > need to do the cleanup at allocation time (e.g. by using rte_zmalloc). > > Cheers, > &rew I wonder if the whole rte_malloc code is even worth it with a modern kernel with transparent huge pages? rte_malloc adds very little value and is less safe and slower than glibc or other allocators. Plus you lose the ablilty to get all the benefit out of valgrind or electric fence. >>> >>> While I'd dearly love to not have our own custom malloc lib to >>> maintain, for DPDK multiprocess, rte_malloc will be hard to replace >>> as we would need a replacement solution that similarly guarantees >>> that memory mapped in process A is also available at the same >>> address in process B. :-( >>> >> Just out of curiosity, why even bother with multiprocess support? >> What you're talking about above is a multithread model, and your >> shoehorning multiple processes into it. >> Neil >> > > Yep, that's pretty much what it is alright. However, this multiprocess > support is very widely used by our customers in building their > applications, and has been in place and supported since some of the > earliest DPDK releases. If it is to be removed, it needs to be > replaced by something that provides equivalent capabilities to > application writers (perhaps something with more fine-grained sharing > etc.) > > /Bruce > It is probably time to start discussing how to pull in our multi process and memory management improvements we were talking about in our DPDK Summit presentation: https://www.youtube.com/watch?v=907VShi799k#t=647 Multi-process model could have several benefits mostly in the high availability area (telco requirement) due to better separation, controlling permissions (per process RO or RW page mappings), single process restartability, improved startup and core dumping time etc. As a summary of our memory management additions, it allows an application to describe their memory model in a configuration (or via an API), e.g. a simplified config would say that every instance will need 4GB private memory and 2GB shared memory. In a multi process model this will result mapping only 6GB memory in each process instead of the current DPDK model where the 4GB per process private memory is mapped into all other processes resulting in unnecessary mappings, e.g. 16x4GB + 2GB in every processes. What we've chosen is to use DPDK's NUMA aware allocator for this purpose, e.g. the above example for 16 instances will result allocating 17 DPDK NUMA sockets (1 default shared + 16 private) and we can selectively map a given "NUMA socket" (set of memsegs) into a process. This also
[dpdk-dev] [PATCH] Avoid possible memory cpoy when sort hugepages
On Wed, Dec 10, 2014 at 12:39 PM, Ananyev, Konstantin < konstantin.ananyev at intel.com> wrote: > > I just got through replacing that entire function in my repo with a call > to qsort() from the standard library last night myself. Faster > > (although probably not material to most deployments) and less code. > > If you feel like it is worth it, why not to submit a patch? :) On Haswell and IvyBridge Xeons, with 128 1G huge pages, it doesn't make a user-noticeable difference in the time required for rte_eal_hugepage_init(). The reason I went ahead and checked it in my repo is because: a) it eats at my soul to see an O(n^2) case for something where qsort() is trivial to use b) we will increase that up to ~232 1G huge pages soon. Likely doesn't matter at that point either, but since it was already written... What *does* chew up a lot of time in init is where the huge pages are being explicitly zeroed in map_all_hugepages(). Removing that memset() makes find_numasocket() blow up, but I was able to do a quick test where I only memset 1 byte on each page. That cut init time by 30% (~20 seconds in my test). Significant, but since I'm not entirely sure it is safe, I'm not making that change right now. On Linux, shared memory that isn't file-backed is automatically zeroed before the app gets it. However, I haven't had a chance to chase down whether that applies to huge pages or not, much less how hugetlbfs factors into the equation. Back to the question about the patch, if you guys are interested in it, I'll have to figure out your patch submission process. Shouldn't be a huge deal other than the fact that we are on DPDK 1.6 (r2). Cheers, Jay
[dpdk-dev] [PATCH RFC v2 00/12] lib/librte_vhost: vhost-user support
This patch set is based on latest vhost. It fixes vhost-user memory map/unmap alignment issue. It uses VHOST_USER_GET_VRING_BASE as the message for vhost device stop in vhost-user. It uses VHOST_SET_VRING_KICK as the message that tells us vhost device is ready in vhost-user. Tetsuya: Your abstraction layer hasn't been integrated due to time issue, so now we support both vhost-cuse and vhost-user, but only one driver could be registered. This is not the final patch, and there might be resource leak issue(which is critical for vhost enabled switch, as vSwitch needs to run endlessly, and virtio in guest VM and VM itself could be restarted repeatedly. Will take effort to check if all the memory region has been unmapped, and if all fds are closed properly in all possible path.) I have tried to restart virtio-PMD repeatedly and switch between virtio-PMD and virtio-pci several times, it runs smoothly and no resource leak so far. There are other features like multiple socket file, client mode to be supported. What if the vhost is terminated due to program error? What if vhost needs to be upgraded dynamically? How could we reconnect to virtio frontend? That is very very important feature if you are working on it. :). Please help review. Thanks! > -Original Message- > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Huawei Xie > Sent: Wednesday, December 10, 2014 2:38 PM > To: dev at dpdk.org > Cc: haifeng.lin at intel.com > Subject: [dpdk-dev] [PATCH RFC v2 00/12] lib/librte_vhost: vhost-user support > > This patchset refines vhost library to support both vhost-cuse and vhost-user. > > > Huawei Xie (12): > create vhost_cuse directory and move vhost-net-cdev.c to vhost_cuse > directory > rename vhost-net-cdev.h as vhost-net.h > move eventfd_copy logic out from virtio-net.c to vhost-net-cdev.c > exact copy of host_memory_map from virtio-net.c to new file > virtio-net-cdev.c > host_memory_map refine: map partial memory of target process into current > process > cuse_set_memory_table is the VHOST_SET_MEMORY_TABLE message handler > for cuse > fd management for vhost user > vhost-user support > minor fix > vhost-user memory region map/unmap > kick/callfd fix > cleanup when vhost user connection is closed > > lib/librte_vhost/Makefile | 5 +- > lib/librte_vhost/rte_virtio_net.h | 2 + > lib/librte_vhost/vhost-net-cdev.c | 389 -- > lib/librte_vhost/vhost-net-cdev.h | 113 --- > lib/librte_vhost/vhost-net.h | 117 +++ > lib/librte_vhost/vhost_cuse/vhost-net-cdev.c | 452 > ++ > lib/librte_vhost/vhost_cuse/virtio-net-cdev.c | 349 > lib/librte_vhost/vhost_cuse/virtio-net-cdev.h | 45 +++ > lib/librte_vhost/vhost_rxtx.c | 2 +- > lib/librte_vhost/vhost_user/fd_man.c | 205 > lib/librte_vhost/vhost_user/fd_man.h | 64 > lib/librte_vhost/vhost_user/vhost-net-user.c | 423 > > lib/librte_vhost/vhost_user/vhost-net-user.h | 107 ++ > lib/librte_vhost/vhost_user/virtio-net-user.c | 313 ++ > lib/librte_vhost/vhost_user/virtio-net-user.h | 49 +++ > lib/librte_vhost/virtio-net.c | 394 ++ > lib/librte_vhost/virtio-net.h | 43 +++ > 17 files changed, 2199 insertions(+), 873 deletions(-) > delete mode 100644 lib/librte_vhost/vhost-net-cdev.c > delete mode 100644 lib/librte_vhost/vhost-net-cdev.h > create mode 100644 lib/librte_vhost/vhost-net.h > create mode 100644 lib/librte_vhost/vhost_cuse/vhost-net-cdev.c > create mode 100644 lib/librte_vhost/vhost_cuse/virtio-net-cdev.c > create mode 100644 lib/librte_vhost/vhost_cuse/virtio-net-cdev.h > create mode 100644 lib/librte_vhost/vhost_user/fd_man.c > create mode 100644 lib/librte_vhost/vhost_user/fd_man.h > create mode 100644 lib/librte_vhost/vhost_user/vhost-net-user.c > create mode 100644 lib/librte_vhost/vhost_user/vhost-net-user.h > create mode 100644 lib/librte_vhost/vhost_user/virtio-net-user.c > create mode 100644 lib/librte_vhost/vhost_user/virtio-net-user.h > create mode 100644 lib/librte_vhost/virtio-net.h > > -- > 1.8.1.4
[dpdk-dev] [PATCH v4] mbuf: fix of enabling all newly added RX error flags
> -Original Message- > From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com] > Sent: Wednesday, December 10, 2014 11:26 PM > To: Zhang, Helin > Cc: dev at dpdk.org > Subject: Re: [dpdk-dev] [PATCH v4] mbuf: fix of enabling all newly added RX > error flags > > 2014-12-10 13:50, Zhang, Helin: > > > > > -Original Message- > > > From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com] > > > Sent: Wednesday, December 10, 2014 5:35 PM > > > To: Zhang, Helin > > > Cc: dev at dpdk.org > > > Subject: Re: [dpdk-dev] [PATCH v4] mbuf: fix of enabling all newly > > > added RX error flags > > > > > > 2014-12-10 16:55, Helin Zhang: > > > > Before redefining mbuf structure, there was lack of free bits in > > > > 'ol_flags' > > > > (32 bits in total) for new RX or TX flags. So it tried to reuse > > > > existant bits as most as possible, or even assigning 0 to some of > > > > bit flags. After new mbuf structure defined, there are quite a lot > > > > of free bits. So those newly added bit flags should be assigned > > > > with correct and valid bit values, and getting their names should > > > > be enabled as well. Note that 'RECIP' should be removed, as nowhere > uses it. > > > > 'PKT_RX_ERR' is defined to replace all other error bit flags, e.g. > > > > MAC error, > > > Oversize error, header buffer overflow error. > > > > > > > > Signed-off-by: Helin Zhang > > > > --- > > > > lib/librte_mbuf/rte_mbuf.c | 7 ++- > > > > lib/librte_mbuf/rte_mbuf.h | 9 +++-- > > > > lib/librte_pmd_i40e/i40e_rxtx.c | 37 > > > > - > > > > 3 files changed, 25 insertions(+), 28 deletions(-) > > > > > > > > v2 changes: > > > > * Removed error flag of 'ECIPE' processing only in both i40e PMD and > mbuf. > > > All > > > > other error flags were added back. > > > > * Assigned error flags with correct and valid values, as their previous > values > > > > were invalid. > > > > * Enabled getting all error flag names. > > > > > > > > v3 changes: > > > > * 'PKT_RX_ERR' is defined to replace error bit flags of MAC error, > > > > Oversize > > > > error, header buffer overflow error. > > > > * Removed assigning values to PKT_TX_* bit flags, as it has > > > > already been > > > done > > > > in another packet set. > > > > > > > > v4 changes: > > > > * Renamed 'PKT_RX_EIP_CKSUM_BAD' to > 'PKT_RX_OUTER_IP_CKSUM_BAD'. > > > > * Fixed the bug of checking error bits of 'Descriptor oversize' and > > > > 'Header buffer oversize'. > > > > * Added debug logs for each RX error. > > > [...] > > > > --- a/lib/librte_mbuf/rte_mbuf.h > > > > +++ b/lib/librte_mbuf/rte_mbuf.h > > > > @@ -83,12 +83,7 @@ extern "C" { > > > > #define PKT_RX_RSS_HASH (1ULL << 1) /**< RX packet with > RSS > > > hash result. */ > > > > #define PKT_RX_FDIR (1ULL << 2) /**< RX packet with > FDIR > > > match indicate. */ > > > > #define PKT_RX_L4_CKSUM_BAD (1ULL << 3) /**< L4 cksum of RX > pkt. > > > is > > > > not OK. */ -#define PKT_RX_IP_CKSUM_BAD (1ULL << 4) /**< IP > > > > cksum of RX pkt. is not OK. */ -#define PKT_RX_EIP_CKSUM_BAD (0ULL > > > > << 0) /**< > > > External IP header checksum error. */ > > > > -#define PKT_RX_OVERSIZE (0ULL << 0) /**< Num of desc of an > RX > > > pkt oversize. */ > > > > -#define PKT_RX_HBUF_OVERFLOW (0ULL << 0) /**< Header buffer > > > overflow. */ > > > > -#define PKT_RX_RECIP_ERR (0ULL << 0) /**< Hardware > processing > > > error. */ > > > > -#define PKT_RX_MAC_ERR (0ULL << 0) /**< MAC error. */ > > > > +#define PKT_RX_IP_CKSUM_BAD (1ULL << 4) /**< IP (or inner IP) > > > > +header checksum error. */ > > > > > > It can be also an outer IP header in case the device don't support > > > tunneling or is not configured to detect it. > > > > For non-tunneling case, no outer/inner at all, it just has the IP > > header. The bit flag indicates the IP header checksum error. > > For tunneling case, this bit flag indicates the inner IP header > > checksum error, another one for outer IP header checksum error. > > So I don't think this bit can be treated as outer. > > I think you didn't understand my comment. > I talk about NICs which don't have tunneling support. > In this case, the outer IP header is seen as a simple IP header. > So, depending on which port is receiving a tunneled packet, this flag or the > dedicated one can be used for outer IP checksum. I think I did understand your point. For those port which does not support tunneling, if a 'tunneling' packet received, it never knows that's tunneling packet, it always treats it as a general IP packet. The "inner" IP is just part of its data. For this case, no outer or inner at all, just an IP header. > > I just suggest to remove the part "(or inner IP)" of the comment. > Do you agree? I got it, actually I wanted to describe it as (or inner IP for tunneling), as the macro name does not tell it could be inner IP header checksum error for tunneling case. Regards, Helin > > -
[dpdk-dev] [PATCH] Fix top level 'make clean'
When 'make clean' is executed, following message is thrown on console due to missing 'comma' definition. tr: missing operand after '.-' Two strings must be given when translating. Ravi Kerur (1): When "make clean" is executed following msg is thrown ==Clean lib/librte_eal/linuxapp/kni tr: missing operand after '.-' Two strings must be given when translating. lib/librte_eal/linuxapp/kni/Makefile | 1 + 1 file changed, 1 insertion(+) -- 1.9.1
[dpdk-dev] [PATCH] When "make clean" is executed following error msg is thrown == Clean lib/librte_eal/linuxapp/kni tr: missing operand after '.-' Two strings must be given when translating.
Subject: [PATCH] When "make clean" is executed following error msg is thrown == Clean lib/librte_eal/linuxapp/kni tr: missing operand after '.-' Two strings must be given when translating. due to missing "comma" definition in Makefile. Signed-off-by: Ravi Kerur --- lib/librte_eal/linuxapp/kni/Makefile | 1 + 1 file changed, 1 insertion(+) diff --git a/lib/librte_eal/linuxapp/kni/Makefile b/lib/librte_eal/linuxapp/kni/Makefile index fb673d9..02ed5da 100644 --- a/lib/librte_eal/linuxapp/kni/Makefile +++ b/lib/librte_eal/linuxapp/kni/Makefile @@ -29,6 +29,7 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +include $(RTE_SDK)/mk/internal/rte.build-pre.mk include $(RTE_SDK)/mk/rte.vars.mk # -- 1.9.1
[dpdk-dev] Added Spinlock to l3fwd-vf example to prevent race conditioning
On Wed, 10 Dec 2014 11:16:46 -0500 Neil Horman wrote: > This really seems like a false savings to me. If an application intends to > use > multiple processes (which by all rights it seems like the use case that the > dpdk > is mostly designed for) then you need locking one way or another, and you've > just made application coding harder, because the application now needs to know > which functions might have internal critical sections that they need to > provide > locking for. The DPDK is not Linux. See the examples of how to route without using locks by doing asymmetric multiprocessing. I.e queues are only serviced by one CPU. The cost of a locked operation (even uncontended) is often enough to drop packet performance by several million PPS.
[dpdk-dev] [PATCH] ixgbe: support X540 VF
On Thu, 11 Dec 2014 00:23:24 +0100 Thomas Monjalon wrote: > 2014-12-09 08:37, Stephen Hemminger: > > Add missing setup for X540 MAC type when setting up VF. > > Additional check exists in Linux driver but not in DPDK. > > > > Signed-off-yb: Bill Hong > > Signed-off-by: Stephen Hemminger > > > > --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c 2014-12-08 09:26:18.150170081 > > -0800 > > +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c 2014-12-08 09:26:18.150170081 > > -0800 > > @@ -1911,7 +1911,8 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_ > > /* > > * Modification to set VFTDT for virtual function if vf is detected > > */ > > - if (hw->mac.type == ixgbe_mac_82599_vf) > > + if (hw->mac.type == ixgbe_mac_82599_vf || > > + hw->mac.type == ixgbe_mac_X540_vf) > > What about X550? Should it be listed there? > > > txq->tdt_reg_addr = IXGBE_PCI_REG_ADDR(hw, > > IXGBE_VFTDT(queue_idx)); > > else > > txq->tdt_reg_addr = IXGBE_PCI_REG_ADDR(hw, > > IXGBE_TDT(txq->reg_idx)); > > @@ -2198,7 +2199,8 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_ > > /* > > * Modified to setup VFRDT for Virtual Function > > */ > > - if (hw->mac.type == ixgbe_mac_82599_vf) { > > + if (hw->mac.type == ixgbe_mac_82599_vf || > > + hw->mac.type == ixgbe_mac_X540_vf) { > > rxq->rdt_reg_addr = > > IXGBE_PCI_REG_ADDR(hw, IXGBE_VFRDT(queue_idx)); > > rxq->rdh_reg_addr = > > It appears to be a fix which is candidate for 1.8.0. > > Thanks Not sure, we didnt have X550 to test, and code for X550 is not in upstream Linux kernel driver.
[dpdk-dev] lib: include rte_memory.h for __rte_cache_aligned
On Wed, Dec 10, 2014 at 07:09:03PM +, Jia Yu wrote: > Hi Neil, > > Moving __rte_cache_aligned right after struct keyword will help. On the > other hand, enforcing this rule for existing (100+) and future definitions > will be difficult. It?s clearer and a good practice to include header file > explicitly. > You need to include the header file regardless of what you do. The advantage to placing the macro right after the struct keyword is that failure to include the header will result in a compiler error, rather than silent behavioral changes and run time breakage. You don't have to enforce putting the attribute after the struct keyword, I'd say just move them now to protect the current code. Subsequent patch authors will see the existing style and follow suit. Or they won't, and we'll point out the issue during review. Regards Neil > Thanks, > Jia > > > On 12/9/14, 7:22 AM, "Neil Horman" wrote: > > >On Tue, Dec 09, 2014 at 09:53:18AM +0100, Olivier MATZ wrote: > >> Hi Neil, > >> > >> On 12/08/2014 04:04 PM, Neil Horman wrote: > >> >On Fri, Nov 07, 2014 at 09:28:09AM -0800, Jia Yu wrote: > >> >>Include rte_memory.h for lib files that use __rte_cache_aligned > >> >>attribute. > >> >> > >> >>Signed-off-by: Jia Yu > >> >> > >> >Why? I presume there was a build break or something. Please repost > >>with a > >> >changelog that details what this patch is for. > >> >Neil > >> > >> I don't know if Yu's issue was the same, but I had a very "fun" issue > >> with __rte_cache_aligned in my application. Consider the following code: > >> > >>struct per_core_foo { > >>... > >>} __rte_cache_aligned; > >> > >>struct global_foo { > >>struct per_core_foo foo[RTE_MAX_CORE]; > >>}; > >> > >> If __rte_cache_aligned is not defined (rte_memory.h is not included), > >> the code compiles but the structure is not aligned... it defines the > >> structure and creates a global variable called __rte_cache_aligned. > >> And this can lead to really bad things if this code is in a .h that > >> is included by files that may or may not include rte_memory.h > >> > >> I have no idea about how we could prevent this issue, except using > >> __attribute__((aligned(CACHE_LINE))) instead of __rte_cache_aligned. > >> > >> Anyway this could probably explain the willing to include rte_memory.h > >> everywhere. > >> > >> Regards, > >> Olivier > >> > >> > > > >So, that is a great explination, and would be good to have in the > >changelog. > > > >Also, to avoid the problem that you describe, while its preferred to have > >it at > >the end of a struct, you can also put the alignment attribute right after > >the > >struct keyword in gcc: > >https://urldefense.proofpoint.com/v2/url?u=https-3A__gcc.gnu.org_onlinedoc > >s_gcc_Attribute-2DSyntax.html-23Attribute-2DSyntax&d=AAIBAg&c=Sqcl0Ez6M0X8 > >aeM67LKIiDJAXVeAw-YihVMNtXt-uEs&r=q34pQj5yiMxs5OseYCxXLQ&m=mIyHF3ASZxRmbPs > >acyMyIABQlSafUdV9PqknKAtfOuI&s=pKoAAkIYRX31K-gR5oSwpcA5mLj4nG7uEzyh0z_uwxU > >&e= > > > >That seems like it would solve the problem going forward. > > > >Neil > > > >
[dpdk-dev] Added Spinlock to l3fwd-vf example to prevent race conditioning
On Wed, Dec 10, 2014 at 03:38:37PM -0800, Stephen Hemminger wrote: > On Wed, 10 Dec 2014 11:16:46 -0500 > Neil Horman wrote: > > > This really seems like a false savings to me. If an application intends to > > use > > multiple processes (which by all rights it seems like the use case that the > > dpdk > > is mostly designed for) then you need locking one way or another, and you've > > just made application coding harder, because the application now needs to > > know > > which functions might have internal critical sections that they need to > > provide > > locking for. > > The DPDK is not Linux. I never indicated that it was. > See the examples of how to route without using locks by doing asymmetric > multiprocessing. > I.e queues are only serviced by one CPU. > Yes, I've seen it. > The cost of a locked operation (even uncontended) is often enough to drop > packet performance by several million PPS. Please re-read my note, I clearly stated that a single process use case was a valid one, but that didn't preclude the need to provide mutual exclusion internally to the api. Theres no reason that this locking can't be moved into the api, and the spinlock api itself either be defined to do locking at compile time, or defined out as empty macros based on a build variable (CONFIG_SINGLE_ACCESSOR or some such). That way you save the application the headache of having to guess which api calls need locking around them, and you still get maximal performance if the application being written can guarantee single accessor status to the dpdk library. Neil