Re: [ovs-dev] RFC: OVN database options
On 03/10/2016 06:50 PM, Ben Pfaff wrote: I've been a fan of Postgres since I used it in the 1990s for a web-based application. It didn't occur to me that it was appropriate here. Julien, thanks so much for joining the discussion.

So yes, it has everything OVN needs. It can push notifications to clients via the NOTIFY¹ command (which you can use in any procedure/trigger). For example, you could imagine creating a trigger that sends a JSON payload for each new update/insert in the database. That's literally 10 lines of PL/SQL.

That's good to know. I hadn't figured out how to do this kind of thing with SQL-based systems.

¹ http://www.postgresql.org/docs/9.5/static/sql-notify.html

I think that PostgreSQL would be the safer bet in this move, as:
- building something on top of etcd would seem weak w.r.t. your schema/table requirements
- investing in OVSDB (though keep in mind I don't know it :-) would probably end up in redoing a job the PostgreSQL people have already done better than you would ;-)

The only question that this raises for me is whether PostgreSQL is too large/complex to deploy for OVN. Seeing the list of candidates that were evaluated, I wouldn't think so, but there can be a lot of different opinions on that based on different perceptions of PostgreSQL. And since you're targeting a network DB, you definitely need a daemon configured and set up, so I'm only partially worried here. :)

Hi there, Russell Bryant invited me to this list to chime in on this discussion. If it were me, I *might* not build out based on NOTIFY as the core system for notifying clients; I'd likely stick with a tool that's designed for cluster communication, and in this case the custom service that's already there seems like it might be the best bet. I'd actually build out the service and use RAFT to keep it in sync with itself.

The reason is that PostgreSQL does not supply you with an easy out-of-the-box HA component in any case (Galera does, but then you don't get NOTIFY), so you're going to have to build out something like RAFT or similar on the PG side anyway in order to handle failover. PostgreSQL's HA story is not very good right now; it's very much roll-your-own, and it is nowhere near the sophistication of Galera's multi-master approach, which would be an enormous multi-year undertaking to recreate on PostgreSQL. IMO building out the HA part from scratch is the difficult part; being able to send events to clients is pretty easy from any kind of custom service. Since to do HA in PG you'd have to build your own event-dispatch system anyway (e.g. to determine that a node is down and send out the call to pick a new master node, as well as some method to get all the clients to send data updates to this node), you might as well just build your custom service to do exactly the thing you need.

___ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev
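[Editor's note] As an illustration of the NOTIFY mechanism discussed above: the trigger side is the "10 lines of PL/SQL" Julien mentions, and the consumer side is similarly small in C (the language OVN itself is written in). The sketch below is purely illustrative and assumes a hypothetical channel name "ovn_updates" and connection string; it is not part of any proposed design, only a demonstration that libpq's LISTEN plumbing is straightforward.

/* Minimal libpq LISTEN consumer sketch (illustrative only).  Assumes a
 * hypothetical channel "ovn_updates" that some trigger NOTIFYs with a JSON
 * payload; the channel name and connection string are placeholders. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/select.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn *conn = PQconnectdb("dbname=ovn_nb");
    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return EXIT_FAILURE;
    }

    PGresult *res = PQexec(conn, "LISTEN ovn_updates");
    if (PQresultStatus(res) != PGRES_COMMAND_OK) {
        fprintf(stderr, "LISTEN failed: %s", PQerrorMessage(conn));
        PQclear(res);
        PQfinish(conn);
        return EXIT_FAILURE;
    }
    PQclear(res);

    for (;;) {
        /* Sleep until the server socket becomes readable. */
        int sock = PQsocket(conn);
        fd_set readable;
        FD_ZERO(&readable);
        FD_SET(sock, &readable);
        if (select(sock + 1, &readable, NULL, NULL, NULL) < 0) {
            break;
        }

        PQconsumeInput(conn);
        PGnotify *note;
        while ((note = PQnotifies(conn)) != NULL) {
            /* note->extra carries the payload, e.g. a JSON document. */
            printf("channel=%s payload=%s\n", note->relname, note->extra);
            PQfreemem(note);
        }
    }

    PQfinish(conn);
    return EXIT_SUCCESS;
}

Note that this does not address the HA concerns raised later in the thread: LISTEN/NOTIFY only covers event delivery from a single live server, not failover.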
Re: [ovs-dev] [PATCH] netdev-linux: Don't restrict policing to IPv4 and don't call "tc".
I'm certainly very happy with a re-write: this seems like a much nicer way of doing things. -Mike. > -Original Message- > From: Justin Pettit [mailto:jpet...@nicira.com] > Sent: 05 December 2011 00:57 > To: dev@openvswitch.org > Cc: Mike Bursell; Jamal Hadi Salim > Subject: [PATCH] netdev-linux: Don't restrict policing to IPv4 and don't call > "tc". > > Mike Bursell pointed out that our policer only works on IPv4 traffic--and > specifically not IPv6. By using the "basic" filter, we can enforce policing > on all > traffic for a particular interface. > > Jamal Hadi Salim pointed out that calling "tc" directly with system() is > pretty > ugly. This commit switches our remaining "tc" calls to directly sending the > appropriate netlink messages. > > Suggested-by: Mike Bursell > Suggested-by: Jamal Hadi Salim > --- > AUTHORS|2 + > INSTALL.Linux |6 +- > lib/netdev-linux.c | 191 +++--- > - > 3 files changed, 136 insertions(+), 63 deletions(-) > > diff --git a/AUTHORS b/AUTHORS > index 6cf99da..964e32d 100644 > --- a/AUTHORS > +++ b/AUTHORS > @@ -78,6 +78,7 @@ Hassan Khan hassan.k...@seecs.edu.pk > Hector Oron hector.o...@gmail.com > Henrik Amrenhen...@nicira.com > Jad Naous jna...@gmail.com > +Jamal Hadi Salimh...@cyberus.ca > Jan Medved jmed...@juniper.net > Janis Hamme janis.ha...@student.kit.edu > Jari Sundellsundell.softw...@gmail.com > @@ -90,6 +91,7 @@ Krishna Miriyalakris...@nicira.com > Luiz Henrique Ozaki luiz.oz...@gmail.com > Michael Hu m...@nicira.com > Michael Mao m...@nicira.com > +Mike Bursellmike.burs...@citrix.com > Murphy McCauley murphy.mccau...@gmail.com > Mikael Doverhag mdover...@nicira.com > Niklas Anderssonnanders...@nicira.com > diff --git a/INSTALL.Linux b/INSTALL.Linux index 4477a60..7a55ccd 100644 > --- a/INSTALL.Linux > +++ b/INSTALL.Linux > @@ -46,9 +46,9 @@ INSTALL.userspace for more information. >bridge") before starting the datapath. > >For optional support of ingress policing, you must enable kernel > - configuration options NET_CLS_ACT, NET_CLS_U32, NET_SCH_INGRESS, > - and NET_ACT_POLICE, either built-in or as modules. > - (NET_CLS_POLICE is obsolete and not needed.) > + configuration options NET_CLS_BASIC, NET_SCH_INGRESS, and > + NET_ACT_POLICE, either built-in or as modules. (NET_CLS_POLICE is > + obsolete and not needed.) > >If GRE tunneling is being used it is recommended that the kernel >be compiled with IPv6 support (CONFIG_IPV6). 
This allows for diff > --git > a/lib/netdev-linux.c b/lib/netdev-linux.c index 90e88c7..8293bb1 100644 > --- a/lib/netdev-linux.c > +++ b/lib/netdev-linux.c > @@ -30,6 +30,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -326,6 +327,9 @@ static unsigned int tc_buffer_per_jiffy(unsigned int > rate); static struct tcmsg *tc_make_request(const struct netdev *, int type, > unsigned int flags, struct ofpbuf *); > static int > tc_transact(struct ofpbuf *request, struct ofpbuf **replyp); > +static int tc_add_del_ingress_qdisc(struct netdev *netdev, bool add); > +static int tc_add_policer(struct netdev *netdev, int kbits_rate, > + int kbits_burst); > > static int tc_parse_qdisc(const struct ofpbuf *, const char **kind, >struct nlattr **options); @@ -1564,50 +1568,8 @@ > netdev_linux_set_advertisements(struct netdev *netdev, uint32_t > advertise) > ETHTOOL_SSET, "ETHTOOL_SSET"); } > > -#define POLICE_ADD_CMD "/sbin/tc qdisc add dev %s handle : ingress" > -#define POLICE_CONFIG_CMD "/sbin/tc filter add dev %s parent : > protocol ip prio 50 u32 match ip src 0.0.0.0/0 police rate %dkbit burst %dk > mtu > 65535 drop flowid :1" > - > -/* Remove ingress policing from 'netdev'. Returns 0 if successful, otherwise > a > - * positive errno value. > - * > - * This function is equivalent to running > - * /sbin/tc qdisc del dev %s handle : ingress > - * but it is much, much faster. > - */ > -static int > -netdev_linux_remove_policing(struct netdev *netdev) -{ > -struct netdev_dev_linux *netdev_dev = > -netdev_dev_linux_cast(netdev_get_dev(netdev)); > -const char *netdev_name = netdev_get_n
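[Editor's note] The bodies of the new netlink-based helpers are truncated in the quoted patch above. As a rough, illustrative sketch only (not the actual patch code), the ingress-qdisc helper declared in the diff might take roughly this shape, reusing the tc_make_request() and tc_transact() helpers whose prototypes also appear above; the constants come from <linux/rtnetlink.h> and <linux/pkt_sched.h>, and error handling/cleanup is elided.

/* Illustrative sketch, not the patch code: add or delete the ingress qdisc
 * ("tc qdisc add dev <dev> handle ffff: ingress") with a direct rtnetlink
 * request instead of exec'ing /sbin/tc. */
static int
tc_add_del_ingress_qdisc(struct netdev *netdev, bool add)
{
    struct ofpbuf request;
    struct tcmsg *tcmsg;
    int type = add ? RTM_NEWQDISC : RTM_DELQDISC;
    int flags = add ? NLM_F_EXCL | NLM_F_CREATE : 0;

    tcmsg = tc_make_request(netdev, type, flags, &request);
    if (!tcmsg) {
        return ENODEV;
    }
    tcmsg->tcm_handle = 0xffff0000;     /* "handle ffff:" */
    tcmsg->tcm_parent = TC_H_INGRESS;   /* attach as the ingress qdisc */
    nl_msg_put_string(&request, TCA_KIND, "ingress");

    /* A deletion may legitimately find no qdisc present; a real
     * implementation would likely ignore that case. */
    return tc_transact(&request, NULL);
}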
[ovs-dev] [Patch] Fixes DPDK Queue size for IVSHMEM VM communications
Separates loop list process size from the endpoint DPDK queue size. Corrected DPDK queue size to be a power of 2 which allows dpdkr interface to be created. Increased queue size to improve zero loss data rate. Changed NIC queue size comment to make NIC queue size formula more clear. Signed-off-by: Mike A. Polehn diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c old mode 100644 new mode 100755 index 6ee9803..3a19db0 --- a/lib/netdev-dpdk.c +++ b/lib/netdev-dpdk.c @@ -71,8 +71,11 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20); #define NON_PMD_THREAD_TX_QUEUE 0 -#define NIC_PORT_RX_Q_SIZE 2048 /* Size of Physical NIC RX Queue, Max (n+32<=4096)*/ -#define NIC_PORT_TX_Q_SIZE 2048 /* Size of Physical NIC TX Queue, Max (n+32<=4096)*/ +#define NIC_PORT_RX_Q_SIZE 2048 /* Size of Physical NIC RX Queue, Size: (x*32<=4064)*/ +#define NIC_PORT_TX_Q_SIZE 2048 /* Size of Physical NIC TX Queue, Size: (x*32<=4064)*/ + +#define DPDK_RX_Q_SIZE 2048 /* Size of DPDK RX Client Queue, Size: (x**2)*/ +#define DPDK_TX_Q_SIZE 2048 /* Size of DPDK TX Client Queue, Size: (x**2)*/ /* XXX: Needs per NIC value for these constants. */ #define RX_PTHRESH 32 /* Default values of RX prefetch threshold reg. */ @@ -1236,7 +1239,7 @@ dpdk_ring_create(const char dev_name[], unsigned int port_no, return -err; } -ivshmem->cring_tx = rte_ring_create(ring_name, MAX_RX_QUEUE_LEN, SOCKET0, 0); +ivshmem->cring_tx = rte_ring_create(ring_name, DPDK_TX_Q_SIZE, SOCKET0, 0); if (ivshmem->cring_tx == NULL) { rte_free(ivshmem); return ENOMEM; @@ -1247,7 +1250,7 @@ dpdk_ring_create(const char dev_name[], unsigned int port_no, return -err; } -ivshmem->cring_rx = rte_ring_create(ring_name, MAX_RX_QUEUE_LEN, SOCKET0, 0); +ivshmem->cring_rx = rte_ring_create(ring_name, DPDK_RX_Q_SIZE, SOCKET0, 0); if (ivshmem->cring_rx == NULL) { rte_free(ivshmem); return ENOMEM; ___ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev
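[Editor's note] One small, optional hardening suggestion related to the power-of-two requirement stated in the commit message (a reviewer-style suggestion, not part of the patch): since rte_ring_create() rejects ring sizes that are not a power of two, the constraint on the new defines could be enforced at compile time with the BUILD_ASSERT_DECL()/IS_POW2() helpers that OVS already uses elsewhere (e.g. for MAX_QUEUE_LEN in dpif-netdev.c).

/* Hypothetical compile-time guard for the new DPDK client ring sizes;
 * BUILD_ASSERT_DECL() and IS_POW2() come from OVS's lib/util.h, and the
 * defines below simply mirror the ones added by the patch. */
#include "util.h"

#define DPDK_RX_Q_SIZE 2048 /* Size of DPDK RX Client Queue, power of 2. */
#define DPDK_TX_Q_SIZE 2048 /* Size of DPDK TX Client Queue, power of 2. */

BUILD_ASSERT_DECL(IS_POW2(DPDK_RX_Q_SIZE));
BUILD_ASSERT_DECL(IS_POW2(DPDK_TX_Q_SIZE));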
[ovs-dev] [Patch] Documentation for DPDK IVSHMEM VM Communications
Adds documentation on how to run IVSHMEM communication through VM. Signed-off-by: Mike A. Polehn diff --git a/INSTALL.DPDK b/INSTALL.DPDK index 4551f4c..8d866e9 100644 --- a/INSTALL.DPDK +++ b/INSTALL.DPDK @@ -19,10 +19,14 @@ Recommended to use DPDK 1.6. DPDK: Set dir i.g.: export DPDK_DIR=/usr/src/dpdk-1.6.0r2 cd $DPDK_DIR -update config/defconfig_x86_64-default-linuxapp-gcc so that dpdk generate single lib file. +update config/defconfig_x86_64-default-linuxapp-gcc so that dpdk generate +single lib file (modification also required for IVSHMEM build). CONFIG_RTE_BUILD_COMBINE_LIBS=y -make install T=x86_64-default-linuxapp-gcc +For default install without IVSHMEM (old): + make install T=x86_64-default-linuxapp-gcc +To include IVSHMEM (shared memory): + make install T=x86_64-ivshmem-linuxapp-gcc For details refer to http://dpdk.org/ Linux kernel: @@ -32,7 +36,10 @@ DPDK kernel requirement. OVS: cd $(OVS_DIR)/openvswitch ./boot.sh -export DPDK_BUILD=/usr/src/dpdk-1.6.0r2/x86_64-default-linuxapp-gcc +Without IVSHMEM + export DPDK_BUILD=/usr/src/dpdk-1.6.0r2/x86_64-default-linuxapp-gcc +With IVSHMEM: + export DPDK_BUILD=/usr/src/dpdk-1.6.0r2/x86_64-ivshmem-linuxapp-gcc ./configure --with-dpdk=$DPDK_BUILD make @@ -44,12 +51,18 @@ Using the DPDK with ovs-vswitchd: Setup system boot: kernel bootline, add: default_hugepagesz=1GB hugepagesz=1G hugepages=1 +To include 3 GB memory for VM (2 socket system, half on each NUMA node) + kernel bootline, add: default_hugepagesz=1GB hugepagesz=1G hugepages=8 First setup DPDK devices: - insert uio.ko e.g. modprobe uio - - insert igb_uio.ko + + - insert igb_uio.ko (non-IVSHMEM case) e.g. insmod DPDK/x86_64-default-linuxapp-gcc/kmod/igb_uio.ko + - insert igb_uio.ko (IVSHMEM case) +e.g. insmod DPDK/x86_64-ivshmem-linuxapp-gcc/kmod/igb_uio.ko + - Bind network device to ibg_uio. e.g. DPDK/tools/pci_unbind.py --bind=igb_uio eth1 Alternate binding method: @@ -73,7 +86,7 @@ First setup DPDK devices: Prepare system: - mount hugetlbfs -e.g. mount -t hugetlbfs -o pagesize=1G none /mnt/huge/ +e.g. mount -t hugetlbfs -o pagesize=1G none /dev/hugepages Ref to http://www.dpdk.org/doc/quick-start for verifying DPDK setup. @@ -91,7 +104,7 @@ Start ovsdb-server as discussed in INSTALL doc: ./ovsdb/ovsdb-server --remote=punix:/usr/local/var/run/openvswitch/db.sock \ --remote=db:Open_vSwitch,Open_vSwitch,manager_options \ --private-key=db:Open_vSwitch,SSL,private_key \ - --certificate=dbitch,SSL,certificate \ + --certificate=Open_vSwitch,SSL,certificate \ --bootstrap-ca-cert=db:Open_vSwitch,SSL,ca_cert --pidfile --detach First time after db creation, initialize: cd $OVS_DIR @@ -105,12 +118,13 @@ for dpdk initialization. e.g. export DB_SOCK=/usr/local/var/run/openvswitch/db.sock - ./vswitchd/ovs-vswitchd --dpdk -c 0x1 -n 4 -- unix:$DB_SOCK --pidfile --detach + ./vswitchd/ovs-vswitchd --dpdk -c 0x1 -n 4 -- unix:$DB_SOCK --pidfile --detach -If allocated more than 1 GB huge pages, set amount and use NUMA node 0 memory: +If allocated more than one 1 GB hugepage (as for IVSHMEM), set amount and use NUMA +node 0 memory: ./vswitchd/ovs-vswitchd --dpdk -c 0x1 -n 4 --socket-mem 1024,0 \ - -- unix:$DB_SOCK --pidfile --detach + -- unix:$DB_SOCK --pidfile --detach To use ovs-vswitchd with DPDK, create a bridge with datapath_type "netdev" in the configuration database. For example: @@ -136,9 +150,7 @@ Test flow script across NICs (assuming ovs in /usr/src/ovs): # Script: #! /bin/sh - # Move to command directory - cd /usr/src/ovs/utilities/ # Clear current flows @@ -158,7 +170,8 @@ help. 
At this time all ovs-vswitchd tasks end up being affinitized to cpu core 0 but this may change. Lets pick a target core for 100% task to run on, i.e. core 7. -Also assume a dual 8 core sandy bridge system with hyperthreading enabled. +Also assume a dual 8 core sandy bridge system with hyperthreading enabled +where CPU1 has cores 0,...,7 and 16,...,23 & CPU2 cores 8,...,15 & 24,...,31. (A different cpu configuration will have different core mask requirements). To give better ownership of 100%, isolation maybe useful. @@ -178,11 +191,11 @@ taskset -p 080 1762 pid 1762's new affinity mask: 80 Assume that all other ovs-vswitchd threads to be on other socket 0 cores. -Affinitize the rest of the ovs-vswitchd thread ids to 0x0FF007F +Affinitize the rest of the ovs-vswitchd thread ids to 0x07F007F -taskset -p 0x0FF007F {thread pid, e.g 1738} +taskset -p 0x07F007F {thread pid, e.g 1738} pid 1738's current affinity mask: 1 - pid 1738's new affinity mask: ff007f + pid 1738's new affinity mask: 7f007f . . . The core 23 is left idle, which allows core 7 to run at full rate. @@ -207,8 +220,8 @@ with the ring naming used within ovs. l
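[Editor's note] The affinity masks used above (0x080 for the 100% polling core, 0x07F007F for the remaining ovs-vswitchd threads) are easy to get wrong for a different core layout. A tiny standalone helper such as the following hypothetical snippet (not part of the patch) can be used to double-check which cores a taskset-style hex mask actually selects.

/* Hypothetical helper: print the CPU cores selected by a hex affinity mask.
 * For example, "./maskcheck 07F007F" should print cores 0-6 and 16-22,
 * leaving cores 7 and 23 free as described in the text above. */
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char *argv[])
{
    unsigned long long mask = argc > 1 ? strtoull(argv[1], NULL, 16)
                                       : 0x07F007FULL;
    for (int cpu = 0; cpu < 64; cpu++) {
        if (mask & (1ULL << cpu)) {
            printf("core %d\n", cpu);
        }
    }
    return 0;
}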
Re: [ovs-dev] [Patch] Documentation for DPDK IVSHMEM VM Communications
The setup for packet transfer between the switch and a VM over shared memory (IVSHMEM) is moderately complex, and most of the details are not easily found. It is also a different transfer method than user-side vhost, which copies between the separate memory spaces at the cost of a slower packet rate or higher CPU core load(s). Shared-memory transfer is a much more efficient method since it only copies packet pointers, not packet data. However, it lacks security, since the VM can see the entire packet buffer memory space at all times. Efficiency versus security is something only the user of the system can determine, since they know whether the system is a closed environment or not.

I put in enough detail to allow someone who has not been intimately involved in IVSHMEM packet processing work to set up and get shared-memory transfer working in a short time, given today's build state. Including the system packages needed for a proper build (qemu in particular) may be overkill, but figuring out those required packages can be very time consuming. Unless it is working, you cannot experiment with or test IVSHMEM shared-memory operation, or even move the method forward as an alternative setup; the correct information is just not readily available, and weeks can easily be spent (and only if you are very determined to get it to work). The INSTALL.DPDK also needs to be updated for DPDK 1.7 ...

Would you like to have this put in as a separate doc, INSTALL.DPDK.IVSHMEM? Mike

-Original Message- From: Pravin Shelar [mailto:pshe...@nicira.com] Sent: Friday, August 29, 2014 3:54 PM To: Polehn, Mike A Cc: d...@openvswitch.com Subject: Re: [ovs-dev] [Patch] Documentation for DPDK IVSHMEM VM Communications

On Fri, Aug 15, 2014 at 7:07 AM, Polehn, Mike A wrote: > Adds documentation on how to run IVSHMEM communication through VM. >

I think INSTALL.DPDK is getting rather large and hard to understand with all the details, so I dropped the "Alternative method to get QEMU, download and build from OVDK" section. We can add this documentation to a separate file once vhost support is added. Thanks.

___ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev
[ovs-dev] Patch [ 1/1] User space dpdk setup documentation addition
Added details to dpdk poll mode setup and use to make it easier for some not familiar to get it operating. Signed-off-by: Mike A. Polehn diff --git a/INSTALL.DPDK b/INSTALL.DPDK index 3e0247a..f55ae8b 100644 --- a/INSTALL.DPDK +++ b/INSTALL.DPDK @@ -17,7 +17,8 @@ Building and Installing: Recommended to use DPDK 1.6. DPDK: -cd DPDK +Set dir i.g.: export DPDK_DIR=/usr/src/dpdk-1.6.0r2 +cd $DPDK_DIR update config/defconfig_x86_64-default-linuxapp-gcc so that dpdk generate single lib file. CONFIG_RTE_BUILD_COMBINE_LIBS=y @@ -31,7 +32,8 @@ DPDK kernel requirement. OVS: cd $(OVS_DIR)/openvswitch ./boot.sh -./configure --with-dpdk=$(DPDK_BUILD) +export DPDK_BUILD=/usr/src/dpdk-1.6.0r2/x86_64-default-linuxapp-gcc +./configure --with-dpdk=$DPDK_BUILD make Refer to INSTALL.userspace for general requirements of building @@ -40,25 +42,77 @@ userspace OVS. Using the DPDK with ovs-vswitchd: - +Setup system boot: + kernel bootline, add: default_hugepagesz=1GB hugepagesz=1G hugepages=1 + First setup DPDK devices: - insert uio.ko +e.g. modprobe uio - insert igb_uio.ko e.g. insmod DPDK/x86_64-default-linuxapp-gcc/kmod/igb_uio.ko - - mount hugetlbfs -e.g. mount -t hugetlbfs -o pagesize=1G none /mnt/huge/ - Bind network device to ibg_uio. e.g. DPDK/tools/pci_unbind.py --bind=igb_uio eth1 +Alternate binding method: + Find target Ethernet devices + lspci -nn|grep Ethernet + Bring Down (e.g. eth2, eth3) + ifconfig eth2 down + ifconfig eth3 down + Look at current devices (e.g ixgbe devices) + ls /sys/bus/pci/drivers/ixgbe/ + :02:00.0 :02:00.1 bind module new_id remove_id uevent unbind + Unbind target pci devices from current driver (e.g. 02:00.0 ...) + echo :02:00.0 > /sys/bus/pci/drivers/ixgbe/unbind + echo :02:00.1 > /sys/bus/pci/drivers/ixgbe/unbind + Bind to target driver (e.g. igb_uio) + echo :02:00.0 > /sys/bus/pci/drivers/igb_uio/bind + echo :02:00.1 > /sys/bus/pci/drivers/igb_uio/bind + Check binding for listed devices + ls /sys/bus/pci/drivers/igb_uio + :02:00.0 :02:00.1 bind module new_id remove_id uevent unbind + +Prepare system: + - load ovs kernel module +e.g modprobe openvswitch + - mount hugetlbfs +e.g. mount -t hugetlbfs -o pagesize=1G none /mnt/huge/ Ref to http://www.dpdk.org/doc/quick-start for verifying DPDK setup. +Start vsdb-server as discussed in INSTALL doc: + Summary e.g.: +First time only db creation (or clearing): + mkdir -p /usr/local/etc/openvswitch + mkdir -p /usr/local/var/run/openvswitch + rm /usr/local/etc/openvswitch/conf.db + cd $OVS_DIR + ./ovsdb/ovsdb-tool create /usr/local/etc/openvswitch/conf.db \ +./vswitchd/vswitch.ovsschema +start vsdb-server + cd $OVS_DIR + ./ovsdb/ovsdb-server --remote=punix:/usr/local/var/run/openvswitch/db.sock \ + --remote=db:OpenOpen_vSwitch,manager_options \ + --private-key=db:Open_vSwitch,SSL,private_key \ + --certificate=dbitch,SSL,certificate \ + --bootstrap-ca-cert=db:Open_vSwitch,SSL,ca_cert --pidfile --detach +First time after db creation, initialize: + cd $OVS_DIR + ./utilities/ovs-vsctl --no-wait init + Start vswitchd: DPDK configuration arguments can be passed to vswitchd via `--dpdk` -argument. dpdk arg -c is ignored by ovs-dpdk, but it is required parameter +argument. dpdk arg -c is ignored by ovs-dpdk, but it is a required parameter for dpdk initialization. e.g. 
+ export DB_SOCK=/usr/local/var/run/openvswitch/db.sock ./vswitchd/ovs-vswitchd --dpdk -c 0x1 -n 4 -- unix:$DB_SOCK --pidfile --detach +If allocated more than 1 GB huge pages, set amount and use NUMA node 0 memory: + + ./vswitchd/ovs-vswitchd --dpdk -c 0x1 -n 4 --socket-mem 1024,0 \ + -- unix:$DB_SOCK --pidfile --detach + To use ovs-vswitchd with DPDK, create a bridge with datapath_type "netdev" in the configuration database. For example: @@ -69,11 +123,72 @@ Now you can add dpdk devices. OVS expect DPDK device name start with dpdk and end with portid. vswitchd should print number of dpdk devices found. ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk +ovs-vsctl add-port br0 dpdki -- set Interface dpdk1 type=dpdk -Once first DPDK port is added vswitchd, it creates Polling thread and +Once first DPDK port is added to vswitchd, it creates a Polling thread and polls dpdk device in continuous loop. Therefore CPU utilization for that thread is always 100%. +Test flow script across NICs (assuming ovs in /usr/src/ovs): + Assume 1.1.1.1 on NIC port 1 (dpdk0) + Assume 1.1.1.2 on NIC port 2 (dpdk1) + Execute script: + +# Script: + +#! /bin/sh + +# Move to command directory + +cd /usr/src/ovs/utilities/ + +# Clear current flows +./ovs-ofctl del-flows br0
[ovs-dev] Patch [ 1/1] User space dpdk setup documentation addition, rev 1
Added details to dpdk poll mode setup and use to make it easier for some not familiar to get it operating. Signed-off-by: Mike A. Polehn diff --git a/INSTALL.DPDK b/INSTALL.DPDK index 3e0247a..689d95d 100644 --- a/INSTALL.DPDK +++ b/INSTALL.DPDK @@ -17,7 +17,8 @@ Building and Installing: Recommended to use DPDK 1.6. DPDK: -cd DPDK +Set dir i.g.: export DPDK_DIR=/usr/src/dpdk-1.6.0r2 +cd $DPDK_DIR update config/defconfig_x86_64-default-linuxapp-gcc so that dpdk generate single lib file. CONFIG_RTE_BUILD_COMBINE_LIBS=y @@ -31,7 +32,8 @@ DPDK kernel requirement. OVS: cd $(OVS_DIR)/openvswitch ./boot.sh -./configure --with-dpdk=$(DPDK_BUILD) +export DPDK_BUILD=/usr/src/dpdk-1.6.0r2/x86_64-default-linuxapp-gcc +./configure --with-dpdk=$DPDK_BUILD make Refer to INSTALL.userspace for general requirements of building @@ -40,25 +42,77 @@ userspace OVS. Using the DPDK with ovs-vswitchd: - +Setup system boot: + kernel bootline, add: default_hugepagesz=1GB hugepagesz=1G hugepages=1 + First setup DPDK devices: - insert uio.ko +e.g. modprobe uio - insert igb_uio.ko e.g. insmod DPDK/x86_64-default-linuxapp-gcc/kmod/igb_uio.ko - - mount hugetlbfs -e.g. mount -t hugetlbfs -o pagesize=1G none /mnt/huge/ - Bind network device to ibg_uio. e.g. DPDK/tools/pci_unbind.py --bind=igb_uio eth1 +Alternate binding method: + Find target Ethernet devices + lspci -nn|grep Ethernet + Bring Down (e.g. eth2, eth3) + ifconfig eth2 down + ifconfig eth3 down + Look at current devices (e.g ixgbe devices) + ls /sys/bus/pci/drivers/ixgbe/ + :02:00.0 :02:00.1 bind module new_id remove_id uevent unbind + Unbind target pci devices from current driver (e.g. 02:00.0 ...) + echo :02:00.0 > /sys/bus/pci/drivers/ixgbe/unbind + echo :02:00.1 > /sys/bus/pci/drivers/ixgbe/unbind + Bind to target driver (e.g. igb_uio) + echo :02:00.0 > /sys/bus/pci/drivers/igb_uio/bind + echo :02:00.1 > /sys/bus/pci/drivers/igb_uio/bind + Check binding for listed devices + ls /sys/bus/pci/drivers/igb_uio + :02:00.0 :02:00.1 bind module new_id remove_id uevent unbind + +Prepare system: + - load ovs kernel module +e.g modprobe openvswitch + - mount hugetlbfs +e.g. mount -t hugetlbfs -o pagesize=1G none /mnt/huge/ Ref to http://www.dpdk.org/doc/quick-start for verifying DPDK setup. +Start vsdb-server as discussed in INSTALL doc: + Summary e.g.: +First time only db creation (or clearing): + mkdir -p /usr/local/etc/openvswitch + mkdir -p /usr/local/var/run/openvswitch + rm /usr/local/etc/openvswitch/conf.db + cd $OVS_DIR + ./ovsdb/ovsdb-tool create /usr/local/etc/openvswitch/conf.db \ +./vswitchd/vswitch.ovsschema +start vsdb-server + cd $OVS_DIR + ./ovsdb/ovsdb-server --remote=punix:/usr/local/var/run/openvswitch/db.sock \ + --remote=db:OpenOpen_vSwitch,manager_options \ + --private-key=db:Open_vSwitch,SSL,private_key \ + --certificate=dbitch,SSL,certificate \ + --bootstrap-ca-cert=db:Open_vSwitch,SSL,ca_cert --pidfile --detach +First time after db creation, initialize: + cd $OVS_DIR + ./utilities/ovs-vsctl --no-wait init + Start vswitchd: DPDK configuration arguments can be passed to vswitchd via `--dpdk` -argument. dpdk arg -c is ignored by ovs-dpdk, but it is required parameter +argument. dpdk arg -c is ignored by ovs-dpdk, but it is a required parameter for dpdk initialization. e.g. 
+ export DB_SOCK=/usr/local/var/run/openvswitch/db.sock ./vswitchd/ovs-vswitchd --dpdk -c 0x1 -n 4 -- unix:$DB_SOCK --pidfile --detach +If allocated more than 1 GB huge pages, set amount and use NUMA node 0 memory: + + ./vswitchd/ovs-vswitchd --dpdk -c 0x1 -n 4 --socket-mem 1024,0 \ + -- unix:$DB_SOCK --pidfile --detach + To use ovs-vswitchd with DPDK, create a bridge with datapath_type "netdev" in the configuration database. For example: @@ -69,11 +123,72 @@ Now you can add dpdk devices. OVS expect DPDK device name start with dpdk and end with portid. vswitchd should print number of dpdk devices found. ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk +ovs-vsctl add-port br0 dpdki -- set Interface dpdk1 type=dpdk -Once first DPDK port is added vswitchd, it creates Polling thread and +Once first DPDK port is added to vswitchd, it creates a Polling thread and polls dpdk device in continuous loop. Therefore CPU utilization for that thread is always 100%. +Test flow script across NICs (assuming ovs in /usr/src/ovs): + Assume 1.1.1.1 on NIC port 1 (dpdk0) + Assume 1.1.1.2 on NIC port 2 (dpdk1) + Execute script: + +# Script: + +#! /bin/sh + +# Move to command directory + +cd /usr/src/ovs/utilities/ + +# Clear current flows +./ovs-ofctl del-flows br0
Re: [ovs-dev] Patch [ 1/1] User space dpdk setup documentation addition
Yes, good catch! Mike Polehn

-Original Message- From: John W. Linville [mailto:linvi...@tuxdriver.com] Sent: Monday, June 09, 2014 8:16 AM To: Polehn, Mike A Cc: dev@openvswitch.org Subject: Re: [ovs-dev] Patch [ 1/1] User space dpdk setup documentation addition

On Mon, Jun 09, 2014 at 02:47:05PM +, Polehn, Mike A wrote: > @@ -69,11 +123,72 @@ Now you can add dpdk devices. OVS expect DPDK > device name start with dpdk and end with portid. vswitchd should print > number of dpdk devices found. > > ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk > +ovs-vsctl add-port br0 dpdki -- set Interface dpdk1 type=dpdk

Drive-by comment -- is that supposed to be "dpdki"? Or should it be "dpdk1"?

-- John W. Linville <linvi...@tuxdriver.com> -- Someday the world will need a hero, and you might be all we have. Be ready.

___ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev
[ovs-dev] Patch [ 1/1] User space dpdk setup documentation addition, rev 2
Added details to dpdk poll mode setup and use to make it easier for some not familiar to get it operating. Signed-off-by: Mike A. Polehn diff --git a/INSTALL.DPDK b/INSTALL.DPDK index 3e0247a..df497fb 100644 --- a/INSTALL.DPDK +++ b/INSTALL.DPDK @@ -17,7 +17,8 @@ Building and Installing: Recommended to use DPDK 1.6. DPDK: -cd DPDK +Set dir i.g.: export DPDK_DIR=/usr/src/dpdk-1.6.0r2 +cd $DPDK_DIR update config/defconfig_x86_64-default-linuxapp-gcc so that dpdk generate single lib file. CONFIG_RTE_BUILD_COMBINE_LIBS=y @@ -31,7 +32,8 @@ DPDK kernel requirement. OVS: cd $(OVS_DIR)/openvswitch ./boot.sh -./configure --with-dpdk=$(DPDK_BUILD) +export DPDK_BUILD=/usr/src/dpdk-1.6.0r2/x86_64-default-linuxapp-gcc +./configure --with-dpdk=$DPDK_BUILD make Refer to INSTALL.userspace for general requirements of building @@ -40,25 +42,77 @@ userspace OVS. Using the DPDK with ovs-vswitchd: - +Setup system boot: + kernel bootline, add: default_hugepagesz=1GB hugepagesz=1G hugepages=1 + First setup DPDK devices: - insert uio.ko +e.g. modprobe uio - insert igb_uio.ko e.g. insmod DPDK/x86_64-default-linuxapp-gcc/kmod/igb_uio.ko - - mount hugetlbfs -e.g. mount -t hugetlbfs -o pagesize=1G none /mnt/huge/ - Bind network device to ibg_uio. e.g. DPDK/tools/pci_unbind.py --bind=igb_uio eth1 +Alternate binding method: + Find target Ethernet devices + lspci -nn|grep Ethernet + Bring Down (e.g. eth2, eth3) + ifconfig eth2 down + ifconfig eth3 down + Look at current devices (e.g ixgbe devices) + ls /sys/bus/pci/drivers/ixgbe/ + :02:00.0 :02:00.1 bind module new_id remove_id uevent unbind + Unbind target pci devices from current driver (e.g. 02:00.0 ...) + echo :02:00.0 > /sys/bus/pci/drivers/ixgbe/unbind + echo :02:00.1 > /sys/bus/pci/drivers/ixgbe/unbind + Bind to target driver (e.g. igb_uio) + echo :02:00.0 > /sys/bus/pci/drivers/igb_uio/bind + echo :02:00.1 > /sys/bus/pci/drivers/igb_uio/bind + Check binding for listed devices + ls /sys/bus/pci/drivers/igb_uio + :02:00.0 :02:00.1 bind module new_id remove_id uevent unbind + +Prepare system: + - load ovs kernel module +e.g modprobe openvswitch + - mount hugetlbfs +e.g. mount -t hugetlbfs -o pagesize=1G none /mnt/huge/ Ref to http://www.dpdk.org/doc/quick-start for verifying DPDK setup. +Start vsdb-server as discussed in INSTALL doc: + Summary e.g.: +First time only db creation (or clearing): + mkdir -p /usr/local/etc/openvswitch + mkdir -p /usr/local/var/run/openvswitch + rm /usr/local/etc/openvswitch/conf.db + cd $OVS_DIR + ./ovsdb/ovsdb-tool create /usr/local/etc/openvswitch/conf.db \ +./vswitchd/vswitch.ovsschema +start vsdb-server + cd $OVS_DIR + ./ovsdb/ovsdb-server --remote=punix:/usr/local/var/run/openvswitch/db.sock \ + --remote=db:OpenOpen_vSwitch,manager_options \ + --private-key=db:Open_vSwitch,SSL,private_key \ + --certificate=dbitch,SSL,certificate \ + --bootstrap-ca-cert=db:Open_vSwitch,SSL,ca_cert --pidfile --detach +First time after db creation, initialize: + cd $OVS_DIR + ./utilities/ovs-vsctl --no-wait init + Start vswitchd: DPDK configuration arguments can be passed to vswitchd via `--dpdk` -argument. dpdk arg -c is ignored by ovs-dpdk, but it is required parameter +argument. dpdk arg -c is ignored by ovs-dpdk, but it is a required parameter for dpdk initialization. e.g. 
+ export DB_SOCK=/usr/local/var/run/openvswitch/db.sock ./vswitchd/ovs-vswitchd --dpdk -c 0x1 -n 4 -- unix:$DB_SOCK --pidfile --detach +If allocated more than 1 GB huge pages, set amount and use NUMA node 0 memory: + + ./vswitchd/ovs-vswitchd --dpdk -c 0x1 -n 4 --socket-mem 1024,0 \ + -- unix:$DB_SOCK --pidfile --detach + To use ovs-vswitchd with DPDK, create a bridge with datapath_type "netdev" in the configuration database. For example: @@ -69,11 +123,72 @@ Now you can add dpdk devices. OVS expect DPDK device name start with dpdk and end with portid. vswitchd should print number of dpdk devices found. ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk +ovs-vsctl add-port br0 dpdk1 -- set Interface dpdk1 type=dpdk -Once first DPDK port is added vswitchd, it creates Polling thread and +Once first DPDK port is added to vswitchd, it creates a Polling thread and polls dpdk device in continuous loop. Therefore CPU utilization for that thread is always 100%. +Test flow script across NICs (assuming ovs in /usr/src/ovs): + Assume 1.1.1.1 on NIC port 1 (dpdk0) + Assume 1.1.1.2 on NIC port 2 (dpdk1) + Execute script: + +# Script: + +#! /bin/sh + +# Move to command directory + +cd /usr/src/ovs/utilities/ + +# Clear current flows +./ovs-ofctl del-flows br0
Re: [ovs-dev] [PATCH v2] dpif-netdev: Polling threads directly call ofproto upcall functions.
A good reason to offload an ofproto upcall function in polling mode is to allow a different CPU to do the time-consuming inexact rule matches while the polling thread maintains fast packet switching. At low data and packet rates, or on a low-rate Ethernet interface (1 GbE and lower), this does not matter; however, once higher packet rates are reached it becomes critical, since the input queue is easily overrun at 10 GbE rates by even moderate delays, especially with smaller packet sizes (at 10 GbE line rate with 64-byte packets, roughly 14.88 million packets per second, a 2048-entry RX queue absorbs only about 140 microseconds of stall).

At this time, for polling-mode DPDK, all threads have a default affinitization to only one core. Since the ofproto upcalls are already run on that same core, a change to call the ofproto upcall functions directly will show little or no performance difference, and may even show a packet rate gain since no Linux process scheduling overhead will be present. Currently, changing the affinitization to allow the polling thread to execute on different CPU cores than the rest of ovs-vswitchd results in occasional polling halts/hangs, which gives very unpredictable zero-loss performance, resulting in poorer zero-loss operation than with all ovs-vswitchd threads affinitized to one core. However, this is an SMP synchronization issue (or issues) that will hopefully eventually get solved.

Calling some ofproto upcall functions directly has potential benefits. It is very desirable to set up exact-match flow entries while sequentially processing packets, provided this can be done in just a few microseconds at most, the NIC RX queue has room to absorb and average this time across multiple loops, and the loop processes packets faster than the average input rate. This would require very optimized inexact rule lookup code. A trip to an OpenFlow controller would still need to be offloaded to a non-realtime thread. Directly calling ofproto upcall functions before the inexact rule lookup code is highly optimized for lookup speed with large numbers of rules would make it more difficult to get the DPDK packet processing rate up, and also to test and verify fast packet processing rates.

Mike Polehn

-Original Message- From: dev [mailto:dev-boun...@openvswitch.org] On Behalf Of Ryan Wilson Sent: Monday, June 16, 2014 11:45 PM To: dev@openvswitch.org Subject: [ovs-dev] [PATCH v2] dpif-netdev: Polling threads directly call ofproto upcall functions.

Typically, kernel datapath threads send upcalls to userspace where handler threads process the upcalls. For TAP and DPDK devices, the datapath threads operate in userspace, so there is no need for separate handler threads. This patch allows userspace datapath threads to directly call the ofproto upcall functions, eliminating the need for handler threads for datapaths of type 'netdev'. Signed-off-by: Ryan Wilson --- v2: Fix race condition found during perf test --- lib/dpif-netdev.c | 327 +++-- lib/dpif-netdev.h |7 + lib/dpif.c| 68 ++--- lib/dpif.h|1 + ofproto/ofproto-dpif-upcall.c | 227 +--- ofproto/ofproto-dpif-upcall.h |4 + 6 files changed, 258 insertions(+), 376 deletions(-) diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 6c281fe..626b3e6 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -15,8 +15,6 @@ */ #include -#include "dpif.h" - #include #include #include @@ -35,6 +33,7 @@ #include "cmap.h" #include "csum.h" #include "dpif.h" +#include "dpif-netdev.h" #include "dpif-provider.h" #include "dummy.h" #include "dynamic-string.h" @@ -76,10 +75,7 @@ DEFINE_STATIC_PER_THREAD_DATA(uint32_t, recirc_depth, 0) /* Configuration parameters. */ enum { MAX_FLOWS = 65536 }; /* Maximum number of flows in flow table.
*/ -/* Queues. */ -enum { MAX_QUEUE_LEN = 128 }; /* Maximum number of packets per queue. */ -enum { QUEUE_MASK = MAX_QUEUE_LEN - 1 }; -BUILD_ASSERT_DECL(IS_POW2(MAX_QUEUE_LEN)); +static exec_upcall_func *exec_upcall_cb = NULL; /* Protects against changes to 'dp_netdevs'. */ static struct ovs_mutex dp_netdev_mutex = OVS_MUTEX_INITIALIZER; @@ -88,27 +84,6 @@ static struct ovs_mutex dp_netdev_mutex = OVS_MUTEX_INITIALIZER; static struct shash dp_netdevs OVS_GUARDED_BY(dp_netdev_mutex) = SHASH_INITIALIZER(&dp_netdevs); -struct dp_netdev_upcall { -struct dpif_upcall upcall; /* Queued upcall information. */ -struct ofpbuf buf; /* ofpbuf instance for upcall.packet. */ -}; - -/* A queue passing packets from a struct dp_netdev to its clients (handlers). - * - * - * Thread-safety - * = - * - * Any access at all requires the owning 'dp_netdev''s queue_rwlock and - * its own mutex. */ -struct dp_netdev_queue { -struct ovs_mutex mutex; -struct seq *seq; /* Incremented whenever a packet is queued. */ -struct dp_netdev_upcall upcalls
[ovs-dev] PATCH [1/1] High speed PMD physical NIC queue size
Large TX and RX queues are needed for high speed 10 GbE physical NICS. Signed-off-by: Mike A. Polehn diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index fbdb6b3..d1bcc73 100644 --- a/lib/netdev-dpdk.c +++ b/lib/netdev-dpdk.c @@ -70,6 +70,9 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20); #define NON_PMD_THREAD_TX_QUEUE 0 +#define NIC_PORT_RX_Q_SIZE 2048 /* Size of Physical NIC RX Queue (n*32<4096)*/ +#define NIC_PORT_TX_Q_SIZE 2048 /* Size of Physical NIC TX Queue (n*32<4096)*/ + /* TODO: Needs per NIC value for these constants. */ #define RX_PTHRESH 32 /* Default values of RX prefetch threshold reg. */ #define RX_HTHRESH 32 /* Default values of RX host threshold reg. */ @@ -369,7 +372,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev) OVS_REQUIRES(dpdk_mutex) } for (i = 0; i < NR_QUEUE; i++) { -diag = rte_eth_tx_queue_setup(dev->port_id, i, MAX_TX_QUEUE_LEN, +diag = rte_eth_tx_queue_setup(dev->port_id, i, NIC_PORT_TX_Q_SIZE, dev->socket_id, &tx_conf); if (diag) { VLOG_ERR("eth dev tx queue setup error %d",diag); @@ -378,7 +381,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev) OVS_REQUIRES(dpdk_mutex) } for (i = 0; i < NR_QUEUE; i++) { -diag = rte_eth_rx_queue_setup(dev->port_id, i, MAX_RX_QUEUE_LEN, +diag = rte_eth_rx_queue_setup(dev->port_id, i, NIC_PORT_RX_Q_SIZE, dev->socket_id, &rx_conf, dev->dpdk_mp->mp); if (diag) { ___ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev
Re: [ovs-dev] PATCH [1/1] High speed PMD physical NIC queue size
I was coming from an earlier version of the patch in which the value was first set up as a number and then used in several places, including the TX cache size, and I didn't catch that the new third definition was being used as I moved the patch forward onto the latest git updates before sending. There is also a queue sizing formula in the comment that is not obvious.

Mike Polehn

-Original Message- From: Ethan Jackson [mailto:et...@nicira.com] Sent: Thursday, June 19, 2014 10:21 AM To: Polehn, Mike A Cc: dev@openvswitch.org Subject: Re: [ovs-dev] PATCH [1/1] High speed PMD physical NIC queue size

One question: why not just increase MAX_RX_QUEUE_LEN and MAX_TX_QUEUE_LEN instead of creating new #defines? Just a thought. I'd like Pravin to review this as I don't know this code as well as him. Ethan

On Thu, Jun 19, 2014 at 9:59 AM, Polehn, Mike A wrote: > Large TX and RX queues are needed for high speed 10 GbE physical NICS. > > Signed-off-by: Mike A. Polehn > > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index > fbdb6b3..d1bcc73 100644 > --- a/lib/netdev-dpdk.c > +++ b/lib/netdev-dpdk.c > @@ -70,6 +70,9 @@ static struct vlog_rate_limit rl = > VLOG_RATE_LIMIT_INIT(5, 20); > > #define NON_PMD_THREAD_TX_QUEUE 0 > > +#define NIC_PORT_RX_Q_SIZE 2048 /* Size of Physical NIC RX Queue > +(n*32<4096)*/ #define NIC_PORT_TX_Q_SIZE 2048 /* Size of Physical > +NIC TX Queue (n*32<4096)*/ > + > /* TODO: Needs per NIC value for these constants. */ #define > RX_PTHRESH 32 /* Default values of RX prefetch threshold reg. */ > #define RX_HTHRESH 32 /* Default values of RX host threshold reg. */ > @@ -369,7 +372,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev) > OVS_REQUIRES(dpdk_mutex) > } > > for (i = 0; i < NR_QUEUE; i++) { > -diag = rte_eth_tx_queue_setup(dev->port_id, i, MAX_TX_QUEUE_LEN, > +diag = rte_eth_tx_queue_setup(dev->port_id, i, > + NIC_PORT_TX_Q_SIZE, >dev->socket_id, &tx_conf); > if (diag) { > VLOG_ERR("eth dev tx queue setup error %d",diag); @@ > -378,7 +381,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev) > OVS_REQUIRES(dpdk_mutex) > } > > for (i = 0; i < NR_QUEUE; i++) { > -diag = rte_eth_rx_queue_setup(dev->port_id, i, MAX_RX_QUEUE_LEN, > +diag = rte_eth_rx_queue_setup(dev->port_id, i, > + NIC_PORT_RX_Q_SIZE, >dev->socket_id, >&rx_conf, dev->dpdk_mp->mp); > if (diag) { > ___ > dev mailing list > dev@openvswitch.org > http://openvswitch.org/mailman/listinfo/dev ___ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev
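[Editor's note] Regarding the "queue sizing formula in the comment that is not obvious": reading the comment variants in this archive together ("(n*32<4096)" here, and "Max (n+32<=4096)" / "Size: (x*32<=4064)" in the later IVSHMEM patch), the apparent intent is that the descriptor count must be a multiple of 32 and must leave 32 descriptors of headroom under the 4096 hardware maximum, i.e. size % 32 == 0 and size + 32 <= 4096. That reading is an interpretation, not something the patches state explicitly; if it is correct, it could be spelled out with compile-time checks (using OVS's BUILD_ASSERT_DECL() from lib/util.h) rather than a terse comment.

/* Hypothetical restatement of the NIC queue sizing rule; the size defines
 * mirror the ones in the patch. */
#define NIC_PORT_Q_GRANULARITY 32    /* Descriptor count granularity. */
#define NIC_PORT_Q_HW_MAX      4096  /* Hardware maximum descriptor count. */

#define NIC_PORT_RX_Q_SIZE 2048
#define NIC_PORT_TX_Q_SIZE 2048

BUILD_ASSERT_DECL(NIC_PORT_RX_Q_SIZE % NIC_PORT_Q_GRANULARITY == 0);
BUILD_ASSERT_DECL(NIC_PORT_RX_Q_SIZE + NIC_PORT_Q_GRANULARITY <= NIC_PORT_Q_HW_MAX);
BUILD_ASSERT_DECL(NIC_PORT_TX_Q_SIZE % NIC_PORT_Q_GRANULARITY == 0);
BUILD_ASSERT_DECL(NIC_PORT_TX_Q_SIZE + NIC_PORT_Q_GRANULARITY <= NIC_PORT_Q_HW_MAX);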
Re: [ovs-dev] PATCH [1/1] High speed PMD physical NIC queue size
There is an improvement in 2544 zero loss measurements, but it takes another patch to actually be able to get a reasonable measurement with standard test equipment. Should I redo it with the new enum change. I am not sure of using an enum for a single constant. Mike Polehn -Original Message- From: Ethan Jackson [mailto:et...@nicira.com] Sent: Thursday, June 19, 2014 2:54 PM To: Polehn, Mike A Cc: dev@openvswitch.org Subject: Re: [ovs-dev] PATCH [1/1] High speed PMD physical NIC queue size Also another question. Does this patch result in a measurable improvement in any benchmarks? If so, would you please note it in the commit message? If not, I'm not sure we should merge this yet. Ethan On Thu, Jun 19, 2014 at 2:45 PM, Polehn, Mike A wrote: > I coming from an earlier version that had the arguments first setup > was as a number, then used in several places including the tx cache > size and didn't catch that new 3rd definition were used as I moved the patch > forward to try on the latest git updates before sending. > > There is also a queue sizing formula in the comment that is not obvious. > > Mike Polehn > > -Original Message- > From: Ethan Jackson [mailto:et...@nicira.com] > Sent: Thursday, June 19, 2014 10:21 AM > To: Polehn, Mike A > Cc: dev@openvswitch.org > Subject: Re: [ovs-dev] PATCH [1/1] High speed PMD physical NIC queue > size > > One question: why not just increase MAX_RX_QUEUE_LEN and MAX_TX_QUEUE_LEN > instead of creating new #defines? > > Just a thought. I'd like Pravin to review this as I don't know this code as > well as him. > > Ethan > > On Thu, Jun 19, 2014 at 9:59 AM, Polehn, Mike A > wrote: >> Large TX and RX queues are needed for high speed 10 GbE physical NICS. >> >> Signed-off-by: Mike A. Polehn >> >> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index >> fbdb6b3..d1bcc73 100644 >> --- a/lib/netdev-dpdk.c >> +++ b/lib/netdev-dpdk.c >> @@ -70,6 +70,9 @@ static struct vlog_rate_limit rl = >> VLOG_RATE_LIMIT_INIT(5, 20); >> >> #define NON_PMD_THREAD_TX_QUEUE 0 >> >> +#define NIC_PORT_RX_Q_SIZE 2048 /* Size of Physical NIC RX Queue >> +(n*32<4096)*/ #define NIC_PORT_TX_Q_SIZE 2048 /* Size of Physical >> +NIC TX Queue (n*32<4096)*/ >> + >> /* TODO: Needs per NIC value for these constants. */ #define >> RX_PTHRESH 32 /* Default values of RX prefetch threshold reg. */ >> #define RX_HTHRESH 32 /* Default values of RX host threshold reg. */ >> @@ -369,7 +372,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev) >> OVS_REQUIRES(dpdk_mutex) >> } >> >> for (i = 0; i < NR_QUEUE; i++) { >> -diag = rte_eth_tx_queue_setup(dev->port_id, i, MAX_TX_QUEUE_LEN, >> +diag = rte_eth_tx_queue_setup(dev->port_id, i, >> + NIC_PORT_TX_Q_SIZE, >>dev->socket_id, &tx_conf); >> if (diag) { >> VLOG_ERR("eth dev tx queue setup error %d",diag); @@ >> -378,7 +381,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev) >> OVS_REQUIRES(dpdk_mutex) >> } >> >> for (i = 0; i < NR_QUEUE; i++) { >> -diag = rte_eth_rx_queue_setup(dev->port_id, i, MAX_RX_QUEUE_LEN, >> +diag = rte_eth_rx_queue_setup(dev->port_id, i, >> + NIC_PORT_RX_Q_SIZE, >>dev->socket_id, >>&rx_conf, dev->dpdk_mp->mp); >> if (diag) { >> ___ >> dev mailing list >> dev@openvswitch.org >> http://openvswitch.org/mailman/listinfo/dev ___ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev
Re: [ovs-dev] PATCH [1/1] High speed PMD physical NIC queue size
Checked out the code that was modified by the patch and found that both MAX_RX_QUEUE_LEN and MAX_TX_QUEUE_LEN definitions are dually used for different meaning. Also the name implies something different then a set NIC queue size. Resubmitting patch with zero loss gain in comment following this. -Original Message- From: dev [mailto:dev-boun...@openvswitch.org] On Behalf Of Polehn, Mike A Sent: Thursday, June 19, 2014 2:45 PM To: Ethan Jackson Cc: dev@openvswitch.org Subject: Re: [ovs-dev] PATCH [1/1] High speed PMD physical NIC queue size I coming from an earlier version that had the arguments first setup was as a number, then used in several places including the tx cache size and didn't catch that new 3rd definition were used as I moved the patch forward to try on the latest git updates before sending. There is also a queue sizing formula in the comment that is not obvious. Mike Polehn -Original Message- From: Ethan Jackson [mailto:et...@nicira.com] Sent: Thursday, June 19, 2014 10:21 AM To: Polehn, Mike A Cc: dev@openvswitch.org Subject: Re: [ovs-dev] PATCH [1/1] High speed PMD physical NIC queue size One question: why not just increase MAX_RX_QUEUE_LEN and MAX_TX_QUEUE_LEN instead of creating new #defines? Just a thought. I'd like Pravin to review this as I don't know this code as well as him. Ethan On Thu, Jun 19, 2014 at 9:59 AM, Polehn, Mike A wrote: > Large TX and RX queues are needed for high speed 10 GbE physical NICS. > > Signed-off-by: Mike A. Polehn > > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index > fbdb6b3..d1bcc73 100644 > --- a/lib/netdev-dpdk.c > +++ b/lib/netdev-dpdk.c > @@ -70,6 +70,9 @@ static struct vlog_rate_limit rl = > VLOG_RATE_LIMIT_INIT(5, 20); > > #define NON_PMD_THREAD_TX_QUEUE 0 > > +#define NIC_PORT_RX_Q_SIZE 2048 /* Size of Physical NIC RX Queue > +(n*32<4096)*/ #define NIC_PORT_TX_Q_SIZE 2048 /* Size of Physical > +NIC TX Queue (n*32<4096)*/ > + > /* TODO: Needs per NIC value for these constants. */ #define > RX_PTHRESH 32 /* Default values of RX prefetch threshold reg. */ > #define RX_HTHRESH 32 /* Default values of RX host threshold reg. */ > @@ -369,7 +372,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev) > OVS_REQUIRES(dpdk_mutex) > } > > for (i = 0; i < NR_QUEUE; i++) { > -diag = rte_eth_tx_queue_setup(dev->port_id, i, MAX_TX_QUEUE_LEN, > +diag = rte_eth_tx_queue_setup(dev->port_id, i, > + NIC_PORT_TX_Q_SIZE, >dev->socket_id, &tx_conf); > if (diag) { > VLOG_ERR("eth dev tx queue setup error %d",diag); @@ > -378,7 +381,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev) > OVS_REQUIRES(dpdk_mutex) > } > > for (i = 0; i < NR_QUEUE; i++) { > -diag = rte_eth_rx_queue_setup(dev->port_id, i, MAX_RX_QUEUE_LEN, > +diag = rte_eth_rx_queue_setup(dev->port_id, i, > + NIC_PORT_RX_Q_SIZE, >dev->socket_id, >&rx_conf, dev->dpdk_mp->mp); > if (diag) { > ___ > dev mailing list > dev@openvswitch.org > http://openvswitch.org/mailman/listinfo/dev ___ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev ___ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev
[ovs-dev] PATCH [1/1] High speed PMD physical NIC queue size, resubmit
Large TX and RX queues are needed for high speed 10 GbE physical NICS. Observed a 250% zero loss improvement over small NIC queue test for A port to port flow test. Signed-off-by: Mike A. Polehn diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index fbdb6b3..d1bcc73 100644 --- a/lib/netdev-dpdk.c +++ b/lib/netdev-dpdk.c @@ -70,6 +70,9 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20); #define NON_PMD_THREAD_TX_QUEUE 0 +#define NIC_PORT_RX_Q_SIZE 2048 /* Size of Physical NIC RX Queue (n*32<4096)*/ +#define NIC_PORT_TX_Q_SIZE 2048 /* Size of Physical NIC TX Queue (n*32<4096)*/ + /* TODO: Needs per NIC value for these constants. */ #define RX_PTHRESH 32 /* Default values of RX prefetch threshold reg. */ #define RX_HTHRESH 32 /* Default values of RX host threshold reg. */ @@ -369,7 +372,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev) OVS_REQUIRES(dpdk_mutex) } for (i = 0; i < NR_QUEUE; i++) { -diag = rte_eth_tx_queue_setup(dev->port_id, i, MAX_TX_QUEUE_LEN, +diag = rte_eth_tx_queue_setup(dev->port_id, i, NIC_PORT_TX_Q_SIZE, dev->socket_id, &tx_conf); if (diag) { VLOG_ERR("eth dev tx queue setup error %d",diag); @@ -378,7 +381,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev) OVS_REQUIRES(dpdk_mutex) } for (i = 0; i < NR_QUEUE; i++) { -diag = rte_eth_rx_queue_setup(dev->port_id, i, MAX_RX_QUEUE_LEN, +diag = rte_eth_rx_queue_setup(dev->port_id, i, NIC_PORT_RX_Q_SIZE, dev->socket_id, &rx_conf, dev->dpdk_mp->mp); if (diag) { ___ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev
Re: [ovs-dev] [Backport] Backport max-idle to branch-1.10 - branch-2.1.
The idle timeout of 1.5 seconds for exact-match flows creates big problems for testing. I suspect that, other than for internally generated flows within the system, 1.5 seconds is not a reasonable timeout. I worry about thrashing of the flows in a congested setting where the load is so high that packets are discarded because there is not enough time to set up flows, and the discarded packets cause flows to be deleted since the packets that would maintain those flows are never seen. This type of issue, combined with other similar issues, can greatly reduce the overall network performance. One way to approach this is to get the network performance working well first and then retune the parameters that could have an impact.

Imagine someone using an ssh terminal: there would be new flows constantly being created and destroyed for each small user hesitation. Imagine that a lot of the communications through the system happen to be through ssh terminals and the system is a gateway for these communications to a lot of systems. The network performance for this case would probably degrade down to the flow setup performance. This is just one example of many that could exist, while there are of course scenarios where 1.5 seconds would be ideal.

I think the setting of 1.5 seconds is due to inexperience and needs to be drastically changed. If a flow timeout is specified on the OpenFlow command, the exact-match flow timeout should use the OpenFlow-set timeout and not an arbitrary value, since there was probably a reason for setting that particular value.

Currently I use a patch for testing that sets the idle timeout to 15 seconds, and it solves a big issue of OVS not being able to set up flows fast enough for high-speed flows: my equipment (which is industry-standard network test equipment that I cannot modify) can be set to send a small set of packets at a low packet rate before sending the high-speed flows, but there is a moderate time gap of 11 seconds between the two. It is not obvious to me how to set the OVS idle timeout from an external interface. Can an example or a documentation update be provided to indicate exactly how to set this new idle timeout parameter? Hopefully this is a global setting and not a flow-specific setting.

Mike Polehn

-Original Message- From: dev [mailto:dev-boun...@openvswitch.org] On Behalf Of Alex Wang Sent: Thursday, June 19, 2014 9:20 PM To: dev@openvswitch.org Subject: [ovs-dev] [Backport] Backport max-idle to branch-1.10 - branch-2.1.

This series backports the commit 72310b04 (upcall: Configure datapath max-idle through ovs-vsctl.) to branch-1.10 - branch-2.1, for testing purpose. -- 1.7.9.5

___ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev
Re: [ovs-dev] [Backport] Backport max-idle to branch-1.10 - branch-2.1.
I had done an internal patch on OVS 2.0 code, does not seem like years ago, but the default timeout was 5 seconds for flow counts less than flow_eviction_threshold. The histogram as written had some algorithm issues, so had the potential to thrash the system for excessive flow removal counts per loop that exceeded flow_eviction_threshold. It might have been this discard of 99% (total / 100) of the flows (in reality all flows over flow_eviction_threshold) each revalidator loop could have been causing the thrashing problem that was observed with the earlier OVS versions. Maybe making a more efficient revalidator might be another way to help this issue. Mike diff --git a/openvswitch/ofproto/ofproto-dpif.c b/openvswitch/ofproto/ofproto-dpif.c index 92f3262..3dd297f 100644 --- a/openvswitch/ofproto/ofproto-dpif.c +++ b/openvswitch/ofproto/ofproto-dpif.c @@ -76,6 +76,19 @@ COVERAGE_DEFINE(subfacet_install_fail); COVERAGE_DEFINE(packet_in_overflow); COVERAGE_DEFINE(flow_mod_overflow); +/* Flow IDLE Timeout definitions */ + +/* Millseconds to timeout flow, original 5000 */ +#define IDLE_FLOW_TIMEOUT 3 +/* Millseconds to timeout flow minimum, original 100 */ +#define IDLE_FLOW_TIMEOUT_MIN 5000 +/* Millseconds to timeout special flow, original 1 */ +#define IDLE_FLOW_TIMEOUT_SPECIAL 4 +/* Idle histogram bucket width (to keep same number of buckets), original 100 */ +#define IDLE_HIST_TIME_WIDTH 500 +/* Idle upper amount to keep each discard cycle, original 0.01 */ +#define IDLE_EVICTION_KEEP_RATE0.9 + /* Number of implemented OpenFlow tables. */ enum { N_TABLES = 255 }; enum { TBL_INTERNAL = N_TABLES - 1 };/* Used for internal hidden rules. */ @@ -3786,17 +3799,17 @@ subfacet_max_idle(const struct dpif_backer *backer) * pass made by update_stats(), because the former function never looks at * uninstallable subfacets. */ -enum { BUCKET_WIDTH = ROUND_UP(100, TIME_UPDATE_INTERVAL) }; -enum { N_BUCKETS = 5000 / BUCKET_WIDTH }; +enum { BUCKET_WIDTH = ROUND_UP(IDLE_HIST_TIME_WIDTH, TIME_UPDATE_INTERVAL) }; +enum { N_BUCKETS = IDLE_FLOW_TIMEOUT / BUCKET_WIDTH }; int buckets[N_BUCKETS] = { 0 }; -int total, subtotal, bucket; +int total, subtotal, bucket, keep, idle_timeout; struct subfacet *subfacet; long long int now; int i; total = hmap_count(&backer->subfacets); if (total <= flow_eviction_threshold) { -return N_BUCKETS * BUCKET_WIDTH; +return IDLE_FLOW_TIMEOUT; } /* Build histogram. */ @@ -3810,11 +3823,13 @@ subfacet_max_idle(const struct dpif_backer *backer) } /* Find the first bucket whose flows should be expired. */ -subtotal = bucket = 0; +keep = MAX(flow_eviction_threshold, (int)(total * IDLE_EVICTION_KEEP_RATE)); +subtotal = bucket = idle_timeout = 0; do { subtotal += buckets[bucket++]; + idle_timeout += BUCKET_WIDTH; } while (bucket < N_BUCKETS && - subtotal < MAX(flow_eviction_threshold, total / 100)); + (subtotal < keep || idle_timeout < IDLE_FLOW_TIMEOUT_MIN)); if (VLOG_IS_DBG_ENABLED()) { struct ds s; @@ -3833,7 +3848,7 @@ subfacet_max_idle(const struct dpif_backer *backer) ds_destroy(&s); } -return bucket * BUCKET_WIDTH; +return idle_timeout; } static void @@ -3844,7 +3859,7 @@ expire_subfacets(struct dpif_backer *backer, int dp_max_idle) /* We really want to keep flows for special protocols around, so use a more * conservative cutoff. 
*/ -long long int special_cutoff = time_msec() - 1; +long long int special_cutoff = time_msec() - IDLE_FLOW_TIMEOUT_SPECIAL; struct subfacet *subfacet, *next_subfacet; struct subfacet *batch[SUBFACET_DESTROY_MAX_BATCH]; -Original Message- From: Ethan Jackson [mailto:et...@nicira.com] Sent: Friday, June 20, 2014 11:11 AM To: Polehn, Mike A Cc: Alex Wang; dev@openvswitch.org Subject: Re: [ovs-dev] [Backport] Backport max-idle to branch-1.10 - branch-2.1. > I think the setting of 1.5 seconds is due to inexperience and needs to be > drastically changed. If flow timeout is specified on the OpenFlow command, > the exact match flow timeout should use the OpenFlow set timeout and not an > arbitrary value since there was probably a reason for setting the particular > value. The 1.5 second number does not come from inexperience, in fact exactly the opposite. Over years of running Open vSwitch in multiple production deployments, we've found that a key factor in maintaining reasonable performance is management of the datapath flow cache. If the idle timeout is too large, then the datapath fills up with unused flows which stress the revalidators and take up space that newer more useful flows could occupy. I can see that when doing performance testing a larger nu
[ovs-dev] [PATCH 1/1] PMD dpdk TX output SMP dpdk queue use and packet surge absorption.
Put in a DPDK queue to receive from multiple core SMP input from vSwitch for NIC TX output. Eliminated the inside polling loop SMP TX output lock (DPDK queue handles SMP). Added a SMP lock for non-polling operation to allow TX output by the non-polling thread when interface not being polled. Lock accessed only when polling is not enabled. Added new netdev subroutine to control polling lock and enable and disable flag. Packets do not get discarded between TX pre-queue and NIC queue to handle surges. Measured improved average PMD port to port 2544 zero loss packet rate of 268,000 for packets 512 bytes and smaller. Predict double that when using 1 cpu core/interface. Observed better persistence of obtaining 100% 10 GbE for larger packets with the added DPDK queue, consistent with other tests outside of OVS where large surges from fast path interfaces transferring larger sized packets from VMs were being absorbed in the NIC TX pre-queue and TX queue and packet loss was suppressed. Signed-off-by: Mike A. Polehn diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 6c281fe..478a0d9 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -1873,6 +1873,10 @@ reload: poll_cnt = pmd_load_queues(f, &poll_list, poll_cnt); atomic_read(&f->change_seq, &port_seq); +/* get poll ownership */ +for (i = 0; i < poll_cnt; i++) + netdev_rxq_do_polling(poll_list[i].rx, true); + for (;;) { unsigned int c_port_seq; int i; @@ -1895,6 +1899,10 @@ reload: } } +/* release poll ownership */ +for (i = 0; i < poll_cnt; i++) + netdev_rxq_do_polling(poll_list[i].rx, false); + if (!latch_is_set(&f->dp->exit_latch)){ goto reload; } diff --git a/lib/netdev-bsd.c b/lib/netdev-bsd.c index 35a8da4..0f61777 100644 --- a/lib/netdev-bsd.c +++ b/lib/netdev-bsd.c @@ -1596,6 +1596,7 @@ netdev_bsd_update_flags(struct netdev *netdev_, enum netdev_flags off, netdev_bsd_rxq_recv, \ netdev_bsd_rxq_wait, \ netdev_bsd_rxq_drain,\ +NULL, /* rxq_do_polling */ \ } const struct netdev_class netdev_bsd_class = diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index d1bcc73..78f0329 100644 --- a/lib/netdev-dpdk.c +++ b/lib/netdev-dpdk.c @@ -73,6 +73,9 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20); #define NIC_PORT_RX_Q_SIZE 2048 /* Size of Physical NIC RX Queue */ #define NIC_PORT_TX_Q_SIZE 2048 /* Size of Physical NIC TX Queue */ +#define NIC_TX_PRE_Q_SIZE 4096 /* Size of Physical NIC TX Pre Queue (2**n)*/ +#define NIC_TX_PRE_Q_TRANS 64 /* Pre Queue to Physical NIC Transfer */ + /* TODO: Needs per NIC value for these constants. */ #define RX_PTHRESH 32 /* Default values of RX prefetch threshold reg. */ #define RX_HTHRESH 32 /* Default values of RX host threshold reg. 
*/ @@ -122,8 +125,6 @@ static const struct rte_eth_txconf tx_conf = { }; enum { MAX_RX_QUEUE_LEN = 64 }; -enum { MAX_TX_QUEUE_LEN = 64 }; -enum { DRAIN_TSC = 20ULL }; static int rte_eal_init_ret = ENODEV; @@ -145,10 +146,12 @@ struct dpdk_mp { }; struct dpdk_tx_queue { -rte_spinlock_t tx_lock; +bool is_polled; +int port_id; int count; -uint64_t tsc; -struct rte_mbuf *burst_pkts[MAX_TX_QUEUE_LEN]; +struct rte_mbuf *tx_trans[NIC_TX_PRE_Q_TRANS]; +struct rte_ring *tx_preq; +rte_spinlock_t tx_lock; }; struct netdev_dpdk { @@ -360,6 +363,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev) OVS_REQUIRES(dpdk_mutex) struct ether_addr eth_addr; int diag; int i; +char qname[32]; if (dev->port_id < 0 || dev->port_id >= rte_eth_dev_count()) { return -ENODEV; @@ -372,12 +376,21 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev) OVS_REQUIRES(dpdk_mutex) } for (i = 0; i < NR_QUEUE; i++) { +dev->tx_q[i].port_id = dev->port_id; diag = rte_eth_tx_queue_setup(dev->port_id, i, NIC_PORT_TX_Q_SIZE, dev->socket_id, &tx_conf); if (diag) { VLOG_ERR("eth dev tx queue setup error %d",diag); return diag; } + +snprintf(qname, sizeof(qname),"NIC_TX_Pre_Q_%u_%u", dev->port_id, i); +dev->tx_q[i].tx_preq = rte_ring_create(qname, NIC_TX_PRE_Q_SIZE, + dev->socket_id, RING_F_SC_DEQ); +if (NULL == dev->tx_q[i].tx_preq) { +VLOG_ERR("eth dev tx pre-queue alloc error"); +return -ENOMEM; +} } for (i = 0; i < NR_QUEUE; i++) { @@ -451,6 +464,7 @@ netdev_dpdk_construct(struct netdev *netdev_) port_no = strtol(cport, 0, 0); /* string must be null terminated */ for (i = 0; i < NR_QUEUE; i++) { +netdev
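The diff above is cut off here, but the core of the change is a multi-producer/single-consumer rte_ring placed in front of the NIC TX queue. Below is a compacted sketch of that pattern using the same DPDK 1.x/2.x-era calls that appear in the patch (rte_ring_create with RING_F_SC_DEQ, rte_ring_sc_dequeue_burst, rte_eth_tx_burst). The struct and function names, the constants, and the drop-on-full behaviour in the send helper are illustrative assumptions rather than the actual OVS symbols or semantics.

/* Sketch of an SMP TX pre-queue in front of a NIC TX queue. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#include <rte_config.h>
#include <rte_ring.h>
#include <rte_mbuf.h>
#include <rte_ethdev.h>

#define TX_PRE_Q_SIZE  4096   /* Ring size; must be a power of two. */
#define TX_PRE_Q_TRANS 64     /* Mbufs moved toward the NIC per drain call. */

struct tx_pre_queue {
    uint8_t port_id;
    uint16_t queue_id;
    int count;                                /* Mbufs staged in 'trans'. */
    struct rte_mbuf *trans[TX_PRE_Q_TRANS];   /* Staging buffer for the NIC. */
    struct rte_ring *ring;                    /* Multi-producer, single-consumer. */
};

/* Creates the pre-queue ring: multi-producer enqueue, single-consumer
 * dequeue, because only the polling thread drains it. */
static struct rte_ring *
tx_pre_queue_create(uint8_t port_id, uint16_t queue_id, int socket_id)
{
    char name[32];

    snprintf(name, sizeof name, "NIC_TX_Pre_Q_%u_%u", port_id, queue_id);
    return rte_ring_create(name, TX_PRE_Q_SIZE, socket_id, RING_F_SC_DEQ);
}

/* Any core may call this; the ring serializes concurrent producers. */
static inline void
tx_pre_queue_send(struct tx_pre_queue *q, struct rte_mbuf *pkt)
{
    if (rte_ring_enqueue(q->ring, pkt) != 0) {
        rte_pktmbuf_free(pkt);                /* Pre-queue full: drop here. */
    }
}

/* Only the polling core calls this: single-consumer dequeue, then a burst to
 * the NIC.  Unsent mbufs stay staged so surges are absorbed, not dropped. */
static inline void
tx_pre_queue_drain(struct tx_pre_queue *q)
{
    if (q->count == 0) {
        q->count = rte_ring_sc_dequeue_burst(q->ring, (void **) q->trans,
                                             TX_PRE_Q_TRANS);
    }
    if (q->count != 0) {
        unsigned sent = rte_eth_tx_burst(q->port_id, q->queue_id,
                                         q->trans, q->count);

        q->count -= sent;
        if (q->count != 0 && sent > 0) {
            /* Move the unsent mbufs to the front for the next attempt. */
            memmove(&q->trans[0], &q->trans[sent],
                    sizeof(struct rte_mbuf *) * q->count);
        }
    }
}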
Re: [ovs-dev] PATCH [1/1] High speed PMD physical NIC queue size, resubmit
(n*32<4096) is the rule for valid queue sizes. The maximum size is not 4096 but 4096 - 32 = 4064. I ran into this on a different project where I needed to use the largest buffer size available. Often 2**n is used for queue sizing, but that is not the case for the PMD driver. Mike Polehn -Original Message- From: Pravin Shelar [mailto:pshe...@nicira.com] Sent: Wednesday, June 25, 2014 4:19 PM To: Polehn, Mike A Cc: dev@openvswitch.org Subject: Re: [ovs-dev] PATCH [1/1] High speed PMD physical NIC queue size, resubmit On Thu, Jun 19, 2014 at 3:58 PM, Polehn, Mike A wrote: > Large TX and RX queues are needed for high speed 10 GbE physical NICS. > Observed a 250% zero loss improvement over small NIC queue test for A > port to port flow test. > > Signed-off-by: Mike A. Polehn I am fine with the > > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index > fbdb6b3..d1bcc73 100644 > --- a/lib/netdev-dpdk.c > +++ b/lib/netdev-dpdk.c > @@ -70,6 +70,9 @@ static struct vlog_rate_limit rl = > VLOG_RATE_LIMIT_INIT(5, 20); > > #define NON_PMD_THREAD_TX_QUEUE 0 > > +#define NIC_PORT_RX_Q_SIZE 2048 /* Size of Physical NIC RX Queue > +(n*32<4096)*/ I am not sure what does "(n*32<4096)" means. Can you elaborate it bit? > +#define NIC_PORT_TX_Q_SIZE 2048 /* Size of Physical NIC TX Queue > +(n*32<4096)*/ > + > /* TODO: Needs per NIC value for these constants. */ #define > RX_PTHRESH 32 /* Default values of RX prefetch threshold reg. */ > #define RX_HTHRESH 32 /* Default values of RX host threshold reg. */ > @@ -369,7 +372,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev) > OVS_REQUIRES(dpdk_mutex) > } > > for (i = 0; i < NR_QUEUE; i++) { > -diag = rte_eth_tx_queue_setup(dev->port_id, i, MAX_TX_QUEUE_LEN, > +diag = rte_eth_tx_queue_setup(dev->port_id, i, > + NIC_PORT_TX_Q_SIZE, >dev->socket_id, &tx_conf); > if (diag) { > VLOG_ERR("eth dev tx queue setup error %d",diag); @@ > -378,7 +381,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev) > OVS_REQUIRES(dpdk_mutex) > } > > for (i = 0; i < NR_QUEUE; i++) { > -diag = rte_eth_rx_queue_setup(dev->port_id, i, MAX_RX_QUEUE_LEN, > +diag = rte_eth_rx_queue_setup(dev->port_id, i, > + NIC_PORT_RX_Q_SIZE, >dev->socket_id, >&rx_conf, dev->dpdk_mp->mp); > if (diag) { > > > ___ > dev mailing list > dev@openvswitch.org > http://openvswitch.org/mailman/listinfo/dev ___ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev
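A short standalone sketch of the rule as explained above: a size is valid if it is n*32 for some n and strictly below 4096, making 4064 the ceiling. The helper name valid_nic_q_size() and the use of assert() are illustrative, and the multiple-of-32 reading of "n*32" is an interpretation of the formula given in this thread, not something taken from the DPDK documentation.

/* Standalone check of the queue-size rule discussed above: the descriptor
 * count must be a multiple of 32 and strictly less than 4096, which makes
 * 4064 the largest legal value. */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define NIC_PORT_RX_Q_SIZE 2048
#define NIC_PORT_TX_Q_SIZE 2048

static bool
valid_nic_q_size(int n)
{
    return n > 0 && n % 32 == 0 && n < 4096;
}

int
main(void)
{
    assert(valid_nic_q_size(NIC_PORT_RX_Q_SIZE));
    assert(valid_nic_q_size(NIC_PORT_TX_Q_SIZE));
    assert(valid_nic_q_size(4064));      /* Largest size the rule allows. */
    assert(!valid_nic_q_size(4096));     /* 4096 itself is rejected. */
    printf("configured queue sizes are valid\n");
    return 0;
}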
Re: [ovs-dev] [PATCH 1/1] PMD dpdk TX output SMP dpdk queue use and packet surge absorption.
I'll rebase, study the OVS coding style, make the corrections, and repost. There is a very good reason for putting constants on the left-hand side of a comparison. For example, if (NULL = x) will be a compiler error, while the following will compile and need debugging: if (x = NULL). Although I try not to make that comparison mistake, I recently made exactly that mistake and had to debug it. If I had put the constant on the left, the compiler would have reported an error and saved the debugging time. Mike Polehn -Original Message- From: Ben Pfaff [mailto:b...@nicira.com] Sent: Wednesday, June 25, 2014 1:48 PM To: Polehn, Mike A Cc: dev@openvswitch.org Subject: Re: [ovs-dev] [PATCH 1/1] PMD dpdk TX output SMP dpdk queue use and packet surge absorption. On Fri, Jun 20, 2014 at 10:24:33PM +, Polehn, Mike A wrote: > Put in a DPDK queue to receive from multiple core SMP input from vSwitch for NIC TX output. > Eliminated the inside polling loop SMP TX output lock (DPDK queue handles SMP). > Added a SMP lock for non-polling operation to allow TX output by the > non-polling thread > when interface not being polled. Lock accessed only when polling is not > enabled. > Added new netdev subroutine to control polling lock and enable and disable > flag. > Packets do not get discarded between TX pre-queue and NIC queue to handle > surges. > > Measured improved average PMD port to port 2544 zero loss packet rate > of 268,000 for packets 512 bytes and smaller. Predict double that when using > 1 cpu core/interface. > > Observed better persistence of obtaining 100% 10 GbE for larger > packets with the added DPDK queue, consistent with other tests outside > of OVS where large surges from fast path interfaces transferring > larger sized packets from VMs were being absorbed in the NIC TX pre-queue and > TX queue and packet loss was suppressed. > > Signed-off-by: Mike A. Polehn This doesn't apply to the current tree. You'll need to rebase and repost it. I have some stylistic comments. Most of the following are cut-and-paste from CONTRIBUTING or CodingStyle (please read both). Many of them apply in multiple places, but I only pointed them out once. Please limit lines in the commit message to 79 characters in width. Comments should be written as full sentences that start with a capital letter and end with a period: > +/* get poll ownership */ Enclose single statements in braces: if (a > b) { return a; } else { return b; } > +for (i = 0; i < poll_cnt; i++) > + netdev_rxq_do_polling(poll_list[i].rx, true); > + > for (;;) { > unsigned int c_port_seq; > int i; When using a relational operator like "<" or "==", put an expression or variable argument on the left and a constant argument on the right, e.g. 
"x == 0", *not* "0 == x": > +if (NULL == dev->tx_q[i].tx_preq) { > +VLOG_ERR("eth dev tx pre-queue alloc error"); > +return -ENOMEM; > +} > } We don't generally put "inline" on functions in C files, since it suppresses otherwise useful "function not used" warnings and doesn't usually help code generation: > inline static void > -dpdk_queue_flush(struct netdev_dpdk *dev, int qid) > +dpdk_port_out(struct dpdk_tx_queue *tx_q, int qid) > { > -struct dpdk_tx_queue *txq = &dev->tx_q[qid]; > -uint32_t nb_tx; > +/* get packets from NIC tx staging queue */ > +if (likely(tx_q->count == 0)) > +tx_q->count = rte_ring_sc_dequeue_burst(tx_q->tx_preq, > +(void **)&tx_q->tx_trans[0], NIC_TX_PRE_Q_TRANS); > + > +/* send packets to NIC tx queue */ > +if (likely(tx_q->count != 0)) { > +unsigned sent = rte_eth_tx_burst(tx_q->port_id, qid, tx_q->tx_trans, > + tx_q->count); > +tx_q->count -= sent; > + > +if (unlikely((tx_q->count != 0) && (sent > 0))) > +/* move unsent packets to front of list */ > +memmove(&tx_q->tx_trans[0], &tx_q->tx_trans[sent], > +(sizeof(struct rte_mbuf *) * tx_q->count)); > +} > +} > > -if (txq->count == 0) { > -return; Put the return type, function name, and the braces that surround the function's code on separate lines, all starting in column 0: > +static void netdev_dpdk_do_poll(struct netdev_rxq *rxq_, unsigned > +enable) { > +struct netdev_rxq_dpdk *rx = netdev_rxq_dpdk_cast(rxq_); > +struct netdev *netdev = rx->up.netdev; > +struct netdev_dpdk *dev = netdev_dpdk_cast
Re: [ovs-dev] PATCH [1/1] High speed PMD physical NIC queue size, resubmit
If someone wants to change the queue size, maybe bigger, this provides a formula to allow this. There are advantages and also some disadvantages on using a larger queue. Doubling the size does not work since 4096 is invalid and will fail compilation. Unless they researched this carefully, they may think that 2048 is the largest size possible. This is to give a hint of what values the defined value can be set to. Mike Polehn -Original Message- From: Pravin Shelar [mailto:pshe...@nicira.com] Sent: Thursday, June 26, 2014 2:18 PM To: Polehn, Mike A Cc: dev@openvswitch.org Subject: Re: [ovs-dev] PATCH [1/1] High speed PMD physical NIC queue size, resubmit On Thu, Jun 26, 2014 at 7:08 AM, Polehn, Mike A wrote: > (n*32<4096) is the formula for calculating the buffer size. The > maximum size is not > 4096 but 4096 - 32 = 4064. I ran into this issue on a different > project where I needed to use the largest buffers size available. > Often 2**n is used for queue sizing, but is not the case for the PMD driver. > I still not sure how is related to the patch where you set queue size of 2048. > Mike Polehn > > -Original Message- > From: Pravin Shelar [mailto:pshe...@nicira.com] > Sent: Wednesday, June 25, 2014 4:19 PM > To: Polehn, Mike A > Cc: dev@openvswitch.org > Subject: Re: [ovs-dev] PATCH [1/1] High speed PMD physical NIC queue > size, resubmit > > On Thu, Jun 19, 2014 at 3:58 PM, Polehn, Mike A > wrote: >> Large TX and RX queues are needed for high speed 10 GbE physical NICS. >> Observed a 250% zero loss improvement over small NIC queue test for A >> port to port flow test. >> >> Signed-off-by: Mike A. Polehn > I am fine with the >> >> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index >> fbdb6b3..d1bcc73 100644 >> --- a/lib/netdev-dpdk.c >> +++ b/lib/netdev-dpdk.c >> @@ -70,6 +70,9 @@ static struct vlog_rate_limit rl = >> VLOG_RATE_LIMIT_INIT(5, 20); >> >> #define NON_PMD_THREAD_TX_QUEUE 0 >> >> +#define NIC_PORT_RX_Q_SIZE 2048 /* Size of Physical NIC RX Queue >> +(n*32<4096)*/ > > I am not sure what does "(n*32<4096)" means. Can you elaborate it bit? > >> +#define NIC_PORT_TX_Q_SIZE 2048 /* Size of Physical NIC TX Queue >> +(n*32<4096)*/ >> + >> /* TODO: Needs per NIC value for these constants. */ #define >> RX_PTHRESH 32 /* Default values of RX prefetch threshold reg. */ >> #define RX_HTHRESH 32 /* Default values of RX host threshold reg. */ >> @@ -369,7 +372,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev) >> OVS_REQUIRES(dpdk_mutex) >> } >> >> for (i = 0; i < NR_QUEUE; i++) { >> -diag = rte_eth_tx_queue_setup(dev->port_id, i, MAX_TX_QUEUE_LEN, >> +diag = rte_eth_tx_queue_setup(dev->port_id, i, >> + NIC_PORT_TX_Q_SIZE, >>dev->socket_id, &tx_conf); >> if (diag) { >> VLOG_ERR("eth dev tx queue setup error %d",diag); @@ >> -378,7 +381,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev) >> OVS_REQUIRES(dpdk_mutex) >> } >> >> for (i = 0; i < NR_QUEUE; i++) { >> -diag = rte_eth_rx_queue_setup(dev->port_id, i, MAX_RX_QUEUE_LEN, >> +diag = rte_eth_rx_queue_setup(dev->port_id, i, >> + NIC_PORT_RX_Q_SIZE, >>dev->socket_id, >>&rx_conf, dev->dpdk_mp->mp); >> if (diag) { >> >> >> ___ >> dev mailing list >> dev@openvswitch.org >> http://openvswitch.org/mailman/listinfo/dev ___ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev
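Since the worry above is that someone doubles the define to 4096 and only finds out when the build or the queue setup fails, the n*32 < 4096 rule can also be encoded as a compile-time check so the failure is immediate and self-explanatory. The sketch below uses a portable negative-array-size trick; the macro name BUILD_CHECK is made up for the example (OVS itself has a BUILD_ASSERT_DECL macro that serves the same purpose).

/* Compile-time guard for the NIC queue-size rule: building with an invalid
 * value fails right here instead of at device setup time. */
#include <stdio.h>

#define NIC_PORT_TX_Q_SIZE 2048   /* Must satisfy n*32 < 4096, i.e. at most 4064. */

/* Static assertion via a negative array size, usable with compilers that
 * predate C11 _Static_assert. */
#define BUILD_CHECK(expr, name) typedef char name[(expr) ? 1 : -1]

BUILD_CHECK(NIC_PORT_TX_Q_SIZE % 32 == 0 && NIC_PORT_TX_Q_SIZE < 4096,
            nic_tx_q_size_is_valid);

int
main(void)
{
    printf("NIC_PORT_TX_Q_SIZE = %d\n", NIC_PORT_TX_Q_SIZE);
    return 0;
}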
[ovs-dev] [Patch v2 1/1] PMD SMP DPDK TX queue added with packet output surge absorption
Version 2: Changes: Rebased due to recent changes in code. Made coding style changes based on feedback from Ben Pfaff. Put in a DPDK queue to receive multiple SMP input from vSwitch for NIC TX output. Eliminated the inside polling loop SMP TX output lock (DPDK queue handles SMP). Reused SMP tx-lock for non-polling operation to allow TX output by a non-polling thread when interface not being polled. Lock only accessed only when polling is not enabled. Added new netdev subroutine to control polling lock and enable and disable flag. Packets do not get discarded between TX pre-queue and NIC queue to handle surges. Removed new code packet buffer leak. Measured improved port to port packet rates. Measured improved average PMD port to port 2544 zero loss packet rate of 299,830 for packets 256 bytes and smaller. Predict double that when using 1 cpu core/interface. Observed better persistence of obtaining 100% 10 GbE for larger packets with the added DPDK queue, consistent with other tests outside of OVS where large surges from fast path interfaces transferring larger sized packets from VMs were being absorbed in the NIC TX pre-queue and TX queue and packet loss was suppressed. Requires earlier patch: PATCH [1/1] High speed PMD physical NIC queue size Signed-off-by: Mike A. Polehn diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c old mode 100644 new mode 100755 index f490900..2a6d79f --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -1894,6 +1894,11 @@ reload: poll_cnt = pmd_load_queues(f, &poll_list, poll_cnt); atomic_read(&f->change_seq, &port_seq); +/* Get polling ownership of interfaces. */ +for (i = 0; i < poll_cnt; i++) { + netdev_rxq_set_polling(poll_list[i].rx, true); +} + for (;;) { unsigned int c_port_seq; int i; @@ -1916,6 +1921,11 @@ reload: } } +/* Release polling ownership of interfaces */ +for (i = 0; i < poll_cnt; i++) { + netdev_rxq_set_polling(poll_list[i].rx, false); +} + if (!latch_is_set(&f->dp->exit_latch)){ goto reload; } diff --git a/lib/netdev-bsd.c b/lib/netdev-bsd.c index 65ae9f9..fc77b6a 100644 --- a/lib/netdev-bsd.c +++ b/lib/netdev-bsd.c @@ -1609,6 +1609,7 @@ netdev_bsd_update_flags(struct netdev *netdev_, enum netdev_flags off, netdev_bsd_rxq_recv, \ netdev_bsd_rxq_wait, \ netdev_bsd_rxq_drain,\ +NULL, /* rxq_set_polling */ \ } const struct netdev_class netdev_bsd_class = diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index eb06595..e26c6fb 100644 --- a/lib/netdev-dpdk.c +++ b/lib/netdev-dpdk.c @@ -73,6 +73,8 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20); #define NIC_PORT_RX_Q_SIZE 2048 /* Size of Physical NIC RX Queue (n*32<4096)*/ #define NIC_PORT_TX_Q_SIZE 2048 /* Size of Physical NIC TX Queue (n*32<4096)*/ +#define NIC_TX_PRE_Q_SIZE 4096 /* Size of Physical NIC TX Pre-Que (2**n)*/ +#define NIC_TX_PRE_Q_TRANS 64 /* Pre-Que to Physical NIC Que Transfer */ /* TODO: Needs per NIC value for these constants. */ #define RX_PTHRESH 32 /* Default values of RX prefetch threshold reg. 
*/ @@ -124,8 +126,6 @@ static const struct rte_eth_txconf tx_conf = { }; enum { MAX_RX_QUEUE_LEN = 64 }; -enum { MAX_TX_QUEUE_LEN = 64 }; -enum { DRAIN_TSC = 20ULL }; static int rte_eal_init_ret = ENODEV; @@ -147,10 +147,12 @@ struct dpdk_mp { }; struct dpdk_tx_queue { -rte_spinlock_t tx_lock; +bool is_polled; +int port_id; int count; -uint64_t tsc; -struct rte_mbuf *burst_pkts[MAX_TX_QUEUE_LEN]; +struct rte_mbuf *tx_trans[NIC_TX_PRE_Q_TRANS]; +struct rte_ring *tx_preq; +rte_spinlock_t tx_lock; }; struct netdev_dpdk { @@ -363,6 +365,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev) OVS_REQUIRES(dpdk_mutex) struct ether_addr eth_addr; int diag; int i; +char qname[32]; if (dev->port_id < 0 || dev->port_id >= rte_eth_dev_count()) { return -ENODEV; @@ -375,12 +378,21 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev) OVS_REQUIRES(dpdk_mutex) } for (i = 0; i < NR_QUEUE; i++) { +dev->tx_q[i].port_id = dev->port_id; diag = rte_eth_tx_queue_setup(dev->port_id, i, NIC_PORT_TX_Q_SIZE, dev->socket_id, &tx_conf); if (diag) { VLOG_ERR("eth dev tx queue setup error %d",diag); return diag; } + +snprintf(qname, sizeof(qname),"NIC_TX_Pre_Q_%u_%u", dev->port_id, i); +dev->tx_q[i].tx_preq = rte_ring_create(qname, NIC_TX_PRE_Q_SIZE, + dev->socket_id, RING_F_SC_DEQ); +if (dev->tx_q[i].tx_preq == NULL) { +VLOG_ERR("eth dev tx pre-queue alloc error&q
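The "lock only when not polled" behaviour in the commit message above can be condensed into a short sketch. The names tx_queue_ctl, tx_send(), and tx_set_polling(), the plain volatile flag, and the single-packet fallback path are all illustrative assumptions; the actual patch routes ownership through a new netdev rxq hook and has to be careful about ordering when ownership changes hands, which this sketch glosses over.

/* Sketch: producers use the multi-producer pre-queue while a PMD thread is
 * polling the port; when polling is disabled, a spinlock serializes direct
 * transmission instead. */
#include <stdbool.h>
#include <stdint.h>

#include <rte_config.h>
#include <rte_ring.h>
#include <rte_mbuf.h>
#include <rte_ethdev.h>
#include <rte_spinlock.h>

struct tx_queue_ctl {
    volatile bool is_polled;      /* Set and cleared by the polling thread. */
    uint8_t port_id;
    uint16_t queue_id;
    struct rte_ring *pre_q;       /* Multi-producer TX pre-queue. */
    rte_spinlock_t tx_lock;       /* Taken only on the non-polled path. */
};

void
tx_send(struct tx_queue_ctl *q, struct rte_mbuf *pkt)
{
    if (q->is_polled) {
        /* Polled: the PMD thread drains pre_q, so just enqueue (lock-free). */
        if (rte_ring_enqueue(q->pre_q, pkt) != 0) {
            rte_pktmbuf_free(pkt);
        }
    } else {
        /* Not polled: transmit directly, serialized by the spinlock. */
        rte_spinlock_lock(&q->tx_lock);
        if (rte_eth_tx_burst(q->port_id, q->queue_id, &pkt, 1) == 0) {
            rte_pktmbuf_free(pkt);
        }
        rte_spinlock_unlock(&q->tx_lock);
    }
}

/* Called when the PMD thread takes or releases ownership, in the spirit of
 * the netdev_rxq_set_polling() hook added by the patch. */
void
tx_set_polling(struct tx_queue_ctl *q, bool enable)
{
    q->is_polled = enable;
}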
Re: [ovs-dev] [PATCH v2 5/5] netdev-dpdk: Add OVS_UNLIKELY annotations in dpdk_do_tx_copy().
These are already in the git repository code. Mike Polehn -Original Message- From: dev [mailto:dev-boun...@openvswitch.org] On Behalf Of Daniele Di Proietto Sent: Monday, June 30, 2014 10:00 AM To: Ryan Wilson Cc: dev@openvswitch.org Subject: Re: [ovs-dev] [PATCH v2 5/5] netdev-dpdk: Add OVS_UNLIKELY annotations in dpdk_do_tx_copy(). Acked-by: Daniele Di Proietto On Jun 26, 2014, at 7:17 PM, Ben Pfaff wrote: > Great, thanks. Looks good. > > I'll leave it to whoever reviews the series as a whole to push this. > On Jun 26, 2014 6:36 PM, "Ryan Wilson 76511" wrote: > >> Crap, its late in the day and I can't think / type apparently. Yes >> 0.04 million is what I meant. >> >> And I ran 2 more tests in the meantime with and without the patch and >> I got a 0.03 and 0.04 million PPS increase, respectively. >> Nonetheless, the increase is fairly consistent over 5 different tests. >> >> Ryan >> >> From: Ben Pfaff >> Date: Thursday, June 26, 2014 6:26 PM >> To: Ryan Wilson >> Cc: Ryan Wilson , "dev@openvswitch.org" < >> dev@openvswitch.org> >> Subject: Re: [ovs-dev] [PATCH v2 5/5] netdev-dpdk: Add OVS_UNLIKELY >> annotations in dpdk_do_tx_copy(). >> >> .4 million or .04 million? There's a big difference. >> On Jun 26, 2014 6:24 PM, "Ryan Wilson 76511" wrote: >> >>> Its between 0.2 - 0.6 million PPS increase after running 3 tests >>> with and without this patch. So I went with the average of 0.4 :) >>> >>> And we actually use these annotations elsewhere in >>> netdev_dpdk_send() where we measure size of packets and dropped >>> packets, so it would be nice to add these annotations for code consistency >>> as well. >>> >>> Ryan >>> >>> From: Ben Pfaff >>> Date: Thursday, June 26, 2014 6:20 PM >>> To: Ryan Wilson >>> Cc: "dev@openvswitch.org" >>> Subject: Re: [ovs-dev] [PATCH v2 5/5] netdev-dpdk: Add OVS_UNLIKELY >>> annotations in dpdk_do_tx_copy(). >>> >>> That's pretty impressive. Is the performance consistent enough to >>> be sure, then? >>> >>> In either case I don't object to the patch. >>> On Jun 26, 2014 6:17 PM, "Ryan Wilson" wrote: >>> >>>> Since dropped packets due to large packet size or lack of memory >>>> are unlikely, it is best to add OVS_UNLIKELY annotations to these >>>> conditions. >>>> >>>> With DPDK fast path forwarding test, this increased throughtput >>>> from 4.12 to 4.16 million packets per second. 
>>>> >>>> Signed-off-by: Ryan Wilson >>>> --- >>>> lib/netdev-dpdk.c |4 ++-- >>>> 1 file changed, 2 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index >>>> 0aee14e..03f1e02 100644 >>>> --- a/lib/netdev-dpdk.c >>>> +++ b/lib/netdev-dpdk.c >>>> @@ -664,7 +664,7 @@ dpdk_do_tx_copy(struct netdev *netdev, struct >>>> dpif_packet ** pkts, int cnt) >>>> >>>> for (i = 0; i < cnt; i++) { >>>> int size = ofpbuf_size(&pkts[i]->ofpbuf); >>>> -if (size > dev->max_packet_len) { >>>> +if (OVS_UNLIKELY(size > dev->max_packet_len)) { >>>> VLOG_WARN_RL(&rl, "Too big size %d max_packet_len %d", >>>> (int)size , dev->max_packet_len); >>>> >>>> @@ -688,7 +688,7 @@ dpdk_do_tx_copy(struct netdev *netdev, struct >>>> dpif_packet ** pkts, int cnt) >>>> newcnt++; >>>> } >>>> >>>> -if (dropped) { >>>> +if (OVS_UNLIKELY(dropped)) { >>>> ovs_mutex_lock(&dev->mutex); >>>> dev->stats.tx_dropped += dropped; >>>> ovs_mutex_unlock(&dev->mutex); >>>> -- >>>> 1.7.9.5 >>>> >>>> ___ >>>> dev mailing list >>>> dev@openvswitch.org >>>> https://urldefense.proofpoint.com/v1/url?u=http://openvswitch.org/m >>>> ailman/listinfo/dev&k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0A&r=MV9BdLjtF >>>> IdhBDBaw5z%2BU6SSA2gAfY4L%2F1HCy3VjlKU%3D%0A&m=B%2BD2KiuphwYDp1kjSp >>>> IP5KeaBvJJGWoiQ7P6URgnkvM%3D%0A&s=9ce118c52fc0ec372ba651cd20cfd5e5b >>>> 2f4692865c242bb3adea3834b82fb5f >>>> <https://urldefense.proofpoint.com/v1/url?u=http://openvswitch.org/ >>>> mailman/listinfo/dev&k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0A&r=TfBS78Vw >>>> 3dzttvXidhbffg%3D%3D%0A&m=wtH3lN2ST0E5hR7ESg7AwzXseDogoZZdb1KOoAV5u >>>> Q0%3D%0A&s=1542518c0ff9ce83f83a308a7e942d661a79c78b4fbac3e67a27b268 >>>> c9d58df0> >>>> >>> > ___ > dev mailing list > dev@openvswitch.org > https://urldefense.proofpoint.com/v1/url?u=http://openvswitch.org/mail > man/listinfo/dev&k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0A&r=MV9BdLjtFIdhBDB > aw5z%2BU6SSA2gAfY4L%2F1HCy3VjlKU%3D%0A&m=B%2BD2KiuphwYDp1kjSpIP5KeaBvJ > JGWoiQ7P6URgnkvM%3D%0A&s=9ce118c52fc0ec372ba651cd20cfd5e5b2f4692865c24 > 2bb3adea3834b82fb5f ___ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev ___ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev
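For anyone unfamiliar with the annotations being discussed, they are thin wrappers around the compiler's branch-prediction hint. The sketch below defines local LIKELY/UNLIKELY macros on top of __builtin_expect, which is also how OVS's OVS_LIKELY/OVS_UNLIKELY are defined for GCC-compatible compilers; the send_packet() function and its parameters are invented purely to show where the hint pays off.

/* Branch-prediction hints: tell the compiler which way a test usually goes
 * so it can lay out the hot path without taken branches. */
#include <stdio.h>

#ifdef __GNUC__
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define LIKELY(x)   (x)
#define UNLIKELY(x) (x)
#endif

static int
send_packet(int size, int max_packet_len, int *dropped)
{
    /* Oversized packets are rare, so keep the common path branch-free. */
    if (UNLIKELY(size > max_packet_len)) {
        (*dropped)++;
        return -1;
    }
    return 0;                     /* Normal transmit path would go here. */
}

int
main(void)
{
    int dropped = 0;

    send_packet(64, 1518, &dropped);
    send_packet(9000, 1518, &dropped);
    printf("dropped: %d\n", dropped);
    return 0;
}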
[ovs-dev] [PATCH 1/1] netdev IVSHMEM shared memory usage documentation
This adds documentation for DPDK netdev to do an IVSHMEM shared memory to host app or VM app test using current OVS code. This example allows people to do learn how it is done, so that they can develop their own shared IVSHMEM memory applications. Also adds knowledge to better system setup for realtime task operation. Signed-off-by: Mike A. Polehn diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md index cdef6cf..64ae6f1 100644 --- a/INSTALL.DPDK.md +++ b/INSTALL.DPDK.md @@ -49,6 +49,41 @@ on Debian/Ubuntu) For further details refer to http://dpdk.org/ +1b. Alternative DPDK 2.0 install + 1. Get DPDK from git repository + + cd /usr/src + git clone git://dpdk.org/dpdk + cd /usr/src/dpdk + git checkout -b test_v2.0.0 v2.0.0 + export DPDK_DIR=/usr/src/dpdk + + 2 If DPDK already installed with different version or build parameters. + Ideally done before checking out a new version. + + cd $(DPDK_DIR) + make uninstall + + 3. Build DPDK with IVSHMEM and User Side vHost support. + Note: Split ring (optional CONFIG_RTE_RING_SPLIT_PROD_CONS=y) has + notably better performance for two simaltanious data sources, as in + the case of two simultaneous port tasks or threads, writing into an + IVSHMEM ring (in either host or VM) at the same time. However for just + one task or thread, for example one port of data being switched a full + rate into IVSHMEM ring buffer, little or no data rate difference will + be observed. + + cd $(DPDK_DIR) + make install T=x86_64-ivshmem-linuxapp-gcc CONFIG_RTE_LIBRTE_VHOST=y \ + CONFIG_RTE_BUILD_COMBINE_LIBS=y CONFIG_RTE_LIBRTE_VHOST_USER=n \ + CONFIG_RTE_RING_SPLIT_PROD_CONS=y + + Note: Any host or VM task using shared memory as in the case of IVSHMEM, + must have DPDK built and installed in exactly the same way for all + DPDK programs on the system. DPDK install in VM needs same DPDK source + and build. Any changes in DPDK build requires all apps, including OVS, + host apps, and VM apps to be rebuilt and relinked. + 2. Configure & build the Linux kernel: Refer to intel-dpdk-getting-started-guide.pdf for understanding @@ -85,9 +120,24 @@ Using the DPDK with ovs-vswitchd: - 1. Setup system boot - Add the following options to the kernel bootline: + Add the following options to the kernel bootline for both 1 GB and 2 MB support: - `default_hugepagesz=1GB hugepagesz=1G hugepages=1` + `default_hugepagesz=1GB hugepagesz=1GB hugepages=16 hugepagesz=2M hugepages=2048` + + For just 1 GB hugepage support: + + `default_hugepagesz=1GB hugepagesz=1GB hugepages=16` + + This kernel bootline will allocate half the hugepages on each NUMA node. For + the IVSHMEM test below, 4 GB of 1 GB huge pages is needed for the test (1GB + for OVS and 3 GB for VM). This requires at least 8 1 GB pages for 4 1 GB + pages of NUMA node 0 hugepage memory to be available since half will be + allocated on NUMA Node 1 (assuming 2 CPU socket system). If system has + limited amount of memory or only 1 NUMA node, may need to adjust. At this + time 1GB pages are required and 2 MB pages are optional but very desirable + to have both 1 GB and 2 MB hugepage memory available on host at same time. + Dual hugepage memory size in VM is also very desirable (see IVSHMEM VM + setup information below). 2. Setup DPDK devices: @@ -112,9 +162,14 @@ Using the DPDK with ovs-vswitchd: 3. Bind network device to vfio-pci: `$DPDK_DIR/tools/dpdk_nic_bind.py --bind=vfio-pci eth1` -3. Mount the hugetable filsystem - +3. 
Mount the hugetable filesystem + The following may or may not be needed depending on host OS: + `mkdir /dev/hugepages` + Mount for 1 GB hugepages `mount -t hugetlbfs -o pagesize=1G none /dev/hugepages` + For additional 2 MB hugepage support: + `mkdir /dev/hugepages_2mb` + `mount -t hugetlbfs nodev /dev/hugepages_2mb -o pagesize=2MB` Ref to http://www.dpdk.org/doc/quick-start for verifying DPDK setup. @@ -267,7 +322,7 @@ Using the DPDK with ovs-vswitchd: ovs-appctl dpif-netdev/pmd-stats-show ``` -DPDK Rings : +DPDK Rings: Following the steps above to create a bridge, you can now add dpdk rings @@ -299,12 +354,124 @@ The application simply receives an mbuf on the receive queue of the ethernet ring and then places that same mbuf on the transmit ring of the ethernet ring. It is a trivial loopback application. +DPDK Ring access on Host using IVSHMEM: +--- + +Use test program ring_client for IVSHMEM flow test. Requires DPDK to have +been built with IVSHMEM support. Rebuild DPDK and OVS with IVSHMEM support +(above) if not already. + +1: Move to directory with ring_client.c + + cd $(OVS_DIR)/tests/dpdk + + If desired, copy outside of OVS code tree and move to, to create + example of a DPDK host app with an IVSHMEM ring accessbile from OVS. + +2: Patch or
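The ring_client loopback referred to above boils down to a few DPDK calls. The outline below is a sketch, not the actual tests/dpdk/ring_client.c: in particular the ring names "dpdkr0_tx" and "dpdkr0_rx" are assumptions about what the OVS dpdkr port registers, so check the names your build actually creates before relying on them, and real code would use burst dequeue/enqueue rather than one packet at a time.

/* Minimal IVSHMEM/dpdkr loopback sketch: look up the shared rings created by
 * OVS and echo every packet from the RX side back to the TX side. */
#include <stdio.h>
#include <stdlib.h>

#include <rte_config.h>
#include <rte_eal.h>
#include <rte_ring.h>

int
main(int argc, char **argv)
{
    struct rte_ring *rx_ring, *tx_ring;
    void *pkt;

    if (rte_eal_init(argc, argv) < 0) {
        fprintf(stderr, "EAL initialization failed\n");
        return EXIT_FAILURE;
    }

    /* Assumed names for OVS port "dpdkr0": what OVS transmits we receive,
     * and what we enqueue OVS receives. */
    rx_ring = rte_ring_lookup("dpdkr0_tx");
    tx_ring = rte_ring_lookup("dpdkr0_rx");
    if (rx_ring == NULL || tx_ring == NULL) {
        fprintf(stderr, "shared rings not found; is the dpdkr port added?\n");
        return EXIT_FAILURE;
    }

    for (;;) {
        if (rte_ring_dequeue(rx_ring, &pkt) == 0) {
            while (rte_ring_enqueue(tx_ring, pkt) != 0) {
                ;                 /* TX ring full: retry rather than leak. */
            }
        }
    }
}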