[dpdk-dev] virtio optimization idea

2015-09-14 Thread Xie, Huawei
On 9/10/2015 3:20 PM, Michael S. Tsirkin wrote:
> On Thu, Sep 10, 2015 at 06:32:35AM +, Xie, Huawei wrote:
>> On 9/9/2015 3:34 PM, Michael S. Tsirkin wrote:
>>> On Fri, Sep 04, 2015 at 08:25:05AM +, Xie, Huawei wrote:
 Hi:

 Recently I have done a virtio optimization proof of concept. The
 optimization includes two parts:
 1) avail ring set with fixed descriptors
 2) RX vectorization
 With these optimizations, we could get a several-fold performance boost
 for pure vhost-virtio throughput.
>>> Thanks!
>>> I'm very happy to see people work on the virtio ring format
>>> optimizations.
>>>
>>> I think it's best to analyze each optimization separately,
>>> unless you see a reason why they would only give benefit when applied
>>> together.
>>  
>> Agree. Will split the patch to see each change's benefit. Of course it
>> is also very common for two changes together to give much more gain than
>> the sum of their individual gains.
>>  
>>> Also ideally, we'd need a unit test to show the performance impact.
>>> We've been using the tests in tools/virtio/ under linux,
>>> feel free to enhance these to simulate more workloads, or
>>> to suggest something else entirely.
>> If possible, we would provide a perf test case under tools/virtio.
>> I am interested in whether the optimization could help kernel virtio-net
>> driver performance, and if not, whether something else is the bottleneck.
>>> The avail ring is initialized with fixed descriptors and is never changed,
>>> i.e., the index value of the nth avail ring entry is always n, which
>>> means the virtio PMD is actually refilling the desc ring only, without
>>> having to change the avail ring.
>>> When vhost fetches the avail ring, if not evicted, it is always in its
>>> first-level cache.
>>>
>>> When RX receives packets from the used ring, we use used->idx as the
>>> desc idx. This requires that vhost processes and returns descs from the
>>> avail ring to the used ring in order, which is true for both the current
>>> dpdk vhost and the kernel vhost implementations. In my understanding,
>>> there is no necessity for vhost net to process descriptors out of order.
>>> One case could be zero copy: for example, if one descriptor doesn't meet
>>> the zero copy requirement, we could return it to the used ring directly,
>>> earlier than the descriptors in front of it.
>>> To enforce this, I want to use a reserved bit to indicate in-order
>>> processing of descriptors.
>>> So what's the point in changing the idx for the used ring?
>>> You need to communicate the length to the guest anyway, don't you?
>> For the guest virtio driver, we only use the length field in the entry of
>> the used ring and don't use the index in the entry. Instead, we use
>> used->idx & 255 as the desc idx for RX, and used->idx & 127 as the desc
>> idx for TX.
> OK but length and index are in the same cache line.
> As long as we read the length, does it make sense
> to skip reading the index?
I don't understand "skipping reading the index". Currently virtio RX
needs the length field, and the CPU will automatically fetch the adjacent
index field as part of the same cache line, though the optimized driver
doesn't use the index field.
/huawei
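A sketch of the in-order consumption described above, under the same
assumptions (256-entry RX ring, 128-slot TX ring); the id field of the used
element is deliberately never read, only the length:

    #include <stdint.h>

    struct vring_used_elem { uint32_t id; uint32_t len; };
    struct vring_used      { uint16_t flags; uint16_t idx;
                             struct vring_used_elem ring[256]; };

    /* With in-order completion the guest recovers the desc index from its
     * own running counter: cons & 255 for RX (cons & 127 would be the TX
     * variant).  'deliver' is a placeholder for the packet handoff. */
    static uint16_t
    used_ring_consume(const struct vring_used *used, uint16_t cons,
                      void (*deliver)(uint16_t desc_idx, uint32_t len))
    {
        while (cons != used->idx) {
            uint16_t slot = cons & 255;

            deliver(slot, used->ring[slot].len);
            cons++;
        }
        return cons;
    }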
>
>> For the vhost driver, as it can't assume a fixed ring layout (to support
>> legacy virtio), it needs to update the idx in the used ring entry.
 For the tx ring, the arrangement is as below. Each transmitted mbuf needs
 a desc for the virtio_net_hdr, so actually we have only 128 free slots.
>>> Just fix this one. Support ANY_LAYOUT and then you can put data
>>> linearly. And/or support INDIRECT_DESC and then you can
>>> use an indirect descriptor.
>> Will check those two features.
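For reference, a sketch of how the negotiated feature bits could steer the
TX layout; the bit numbers are from the legacy virtio spec, the selection
logic is illustrative:

    #include <stdint.h>

    #define VIRTIO_F_ANY_LAYOUT         27   /* hdr may share the data buffer */
    #define VIRTIO_RING_F_INDIRECT_DESC 28   /* one slot -> external desc table */

    enum tx_layout { TX_TWO_DESCS, TX_SINGLE_DESC, TX_INDIRECT };

    /* With ANY_LAYOUT the virtio_net_hdr can be prepended in the same
     * buffer as the packet (one desc per mbuf); with INDIRECT_DESC one
     * ring slot points to an external hdr+data descriptor table.  Either
     * way the 128-slot limit described above disappears. */
    static enum tx_layout
    tx_layout_select(uint64_t features)
    {
        if (features & (1ULL << VIRTIO_F_ANY_LAYOUT))
            return TX_SINGLE_DESC;
        if (features & (1ULL << VIRTIO_RING_F_INDIRECT_DESC))
            return TX_INDIRECT;
        return TX_TWO_DESCS;   /* separate hdr desc: only 128 usable slots */
    }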


>>> This one came out corrupted.
>> Actually I already replied to the original mail and fixed it. Copying it
>> here again.
>>
>>                       ++
>>                       ||
>>                       ||
>>    +-----+-----+-----+------++------+------+------+------+
>>    |  0  |  1  | ... |  127 ||  128 |  129 |  ... |  255 |   avail ring
>>    +--+--+--+--+-----+---+--++---+--+---+--+------+---+--+
>>       |     |            |       |      |             |
>>       v     v            v       |      v             v
>>    +--+--+--+--+-----+---+--++---+--+---+--+------+---+--+
>>    | 127 | 128 | ... |  255 ||  127 |  128 |  ... |  255 |   desc ring for virtio_net_hdr
>>    +--+--+--+--+-----+---+--++---+--+---+--+------+---+--+
>>       |     |            |       |      |             |
>>       v     v            v       |      v             v

[dpdk-dev] vhost compliant virtio based networking interface in container

2015-09-14 Thread Xie, Huawei
On 9/8/2015 12:45 PM, Tetsuya Mukawa wrote:
> On 2015/09/07 14:54, Xie, Huawei wrote:
>> On 8/26/2015 5:23 PM, Tetsuya Mukawa wrote:
>>> On 2015/08/25 18:56, Xie, Huawei wrote:
 On 8/25/2015 10:59 AM, Tetsuya Mukawa wrote:
> Hi Xie and Yanping,
>
>
> May I ask you some questions?
> It seems we are also developing almost the same thing.
 Good to know that we are tackling the same problem and have a similar
 idea.
 What is your status now? We had the POC running, and it is compliant with
 dpdk vhost.
 Interrupt-like notification isn't supported.
>>> We implemented the vhost PMD first, so we have just started implementing it.
>>>
> On 2015/08/20 19:14, Xie, Huawei wrote:
>> Added dev at dpdk.org
>>
>> On 8/20/2015 6:04 PM, Xie, Huawei wrote:
>>> Yanping:
>>> I read your mail; it seems what we did is quite similar. Here I wrote a
>>> quick mail to describe our design. Let me know if it is the same thing.
>>>
>>> Problem Statement:
>>> We don't have a high performance networking interface in containers for
>>> NFV. The current veth-pair-based interface can't easily be accelerated.
>>>
>>> The key components involved:
>>> 1. DPDK-based virtio PMD driver in the container.
>>> 2. Device simulation framework in the container.
>>> 3. DPDK (or kernel) vhost running in the host.
>>>
>>> How virtio is created?
>>> A: There is no "real" virtio-pci device in the container environment.
>>> 1) The host maintains pools of memory and shares memory with the
>>> container. This can be accomplished by having the host share a huge
>>> page file with the container.
>>> 2) The container creates virtio rings on the shared memory.
>>> 3) The container creates mbuf memory pools on the shared memory.
>>> 4) The container sends the memory and vring information to vhost
>>> through vhost messages. This can be done either through ioctl calls
>>> or vhost-user messages.
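A minimal sketch of steps 1) and 2), assuming the host shares a huge page
file; names are illustrative rather than taken from the actual POC:

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* The container maps the hugepage file shared by the host and later
     * carves vrings and mbuf pools out of this area (steps 2) and 3)). */
    static void *
    map_shared_pool(const char *hugefile, size_t size)
    {
        void *base;
        int fd = open(hugefile, O_RDWR);

        if (fd < 0)
            return NULL;
        base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return base == MAP_FAILED ? NULL : base;
    }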
>>>
>>> How vhost message is sent?
>>> A: There are two alternative ways to do this.
>>> 1) The customized virtio PMD is responsible for all vring creation
>>> and vhost message sending.
> Above is our approach so far.
> It seems Yanping also takes this kind of approach.
> We are using vhost-user functionality instead of using the vhost-net
> kernel module.
> Probably this is the difference between Yanping and us.
 In my current implementation, the device simulation layer talks to "user
 space" vhost through the cuse interface. It could also be done through the
 vhost-user socket. This isn't the key point.
 Here vhost-user is kind of confusing; maybe "user space vhost" is more
 accurate, covering either cuse or a unix domain socket. :)

 As for Yanping, they are now connecting to the vhost-net kernel module, but
 they are also trying to connect to "user space" vhost. Correct me if
 wrong.
 Yes, there is some difference between these two. The vhost-net kernel module
 can directly access another process's memory, while with user space
 vhost (cuse/user) we need to do the memory mapping.
> BTW, we are going to submit a vhost PMD for DPDK 2.2.
> This PMD is implemented on librte_vhost.
> It allows a DPDK application to handle a vhost-user (cuse) backend as a
> normal NIC port.
> This PMD should work with both Xie's and Yanping's approaches.
> (In the case of Yanping's approach, we may need vhost-cuse.)
>
>>> 2) We could do this through a lightweight device simulation framework.
>>> The device simulation creates a simple PCI bus. On the PCI bus,
>>> virtio-net PCI devices are created. The device simulation provides
>>> an IOAPI for MMIO/IO access.
> Does it mean you implemented a kernel module?
> If so, do you still need vhost-cuse functionality to handle vhost
> messages in userspace?
 The device simulation is a library running in user space in the container.
 It is linked with the DPDK app. It creates pseudo buses and virtio-net PCI
 devices.
 The virtio-container-PMD configures the virtio-net pseudo devices
 through the IOAPI provided by the device simulation rather than through IO
 instructions as in KVM.
 Why do we use device simulation?
 We could create other virtio devices in the container, and provide a common
 way to talk to the vhost-xx module.
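A hypothetical flavor of such an IOAPI; the names are invented for
illustration and the real library may differ:

    #include <stdint.h>

    struct sim_pci_dev;   /* opaque handle to a simulated virtio-net device */

    /* The PMD performs config/queue accesses through plain function calls
     * instead of IN/OUT instructions trapped by a hypervisor. */
    uint32_t sim_io_read(struct sim_pci_dev *dev, uint64_t off, int len);
    void     sim_io_write(struct sim_pci_dev *dev, uint64_t off,
                          uint32_t val, int len);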
>>> Thanks for explanation.
>>> At first reading, I thought the difference between approach 1 and
>>> approach 2 was whether we need to implement a new kernel module or not.
>>> But now I understand how you implemented it.
>>>
>>> Please let me explain our design more.
>>> We might use a kind of similar approach to handle a pseudo virtio-net
>>> device in DPDK.
>>> (Anyway, we haven't finished implementing it yet, so this overview might
>>> have some technical problems.)
>>>
>>> Step 1. Separate the virtio-net and vhost-user socket related code from
>>> QEMU, then implement it as a separate program.
>>> The program also has bel

[dpdk-dev] [PATCH] testpmd: modify the mac of csum forwarding

2015-09-14 Thread Qiu, Michael
Hi, all

any other comments about this patch?

Thanks,
Michael

On 8/26/2015 2:12 PM, Liu, Jijiang wrote:
>
>> -Original Message-
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Michael Qiu
>> Sent: Friday, August 07, 2015 11:29 AM
>> To: dev at dpdk.org
>> Subject: [dpdk-dev] [PATCH] testpmd: modify the mac of csum forwarding
>>
>> For some ethernet switches, like Intel RRC, all the packets forwarded out
>> by DPDK will be dropped on the switch side, so the packet generator will
>> never receive the packets.
>>
>> Signed-off-by: Michael Qiu 
>> ---
>>  app/test-pmd/csumonly.c | 4 
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c index
>> 1bf3485..bf8af1d 100644
>> --- a/app/test-pmd/csumonly.c
>> +++ b/app/test-pmd/csumonly.c
>> @@ -550,6 +550,10 @@ pkt_burst_checksum_forward(struct fwd_stream
>> *fs)
>>   * and inner headers */
>>
>>  eth_hdr = rte_pktmbuf_mtod(m, struct ether_hdr *);
>> +ether_addr_copy(&peer_eth_addrs[fs->peer_addr],
>> +&eth_hdr->d_addr);
>> +ether_addr_copy(&ports[fs->tx_port].eth_addr,
>> +&eth_hdr->s_addr);
>>  parse_ethernet(eth_hdr, &info);
>>  l3_hdr = (char *)eth_hdr + info.l2_len;
>>
>> --
>> 1.9.3
> The change will affect the csum fwd performance.
> But I also think the change is necessary; otherwise we cannot use csumonly
> fwd mode in a guest.
>
> Acked-by: Jijiang Liu 
>
>



[dpdk-dev] DPDK 2.2 roadmap

2015-09-14 Thread Zhang, Helin


> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Thomas Monjalon
> Sent: Thursday, September 10, 2015 8:44 PM
> To: O'Driscoll, Tim
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] DPDK 2.2 roadmap
> 
> 2015-09-09 12:56, O'Driscoll, Tim:
> > DCB for i40e & X550:  DCB support will be extended to the i40e and X550 NICs.
Yes, i40e DCB will be enabled in R2.2.

> 
> A patch for DCB on X550 is already applied but the release notes part was
> forgotten.
> 
> > IPsec Sample App:  A sample application will be provided to show how
> > the cryptodev library can be used to implement IPsec. This will be
> > based on the NetBSD IPsec implementation.
> 
> For each API, it is really interesting to have some unit tests and some 
> examples to
> show how it should be used. However adding an IPsec stack looks to be beyond
> the need of an API example. It may be big and difficult to maintain when
> updating DPDK.
> Why not spawn a new project here: http://dpdk.org/browse/ ?
> 
> > Refactor EAL for Non-PCI Devices:  This has been discussed extensively on 
> > the
> mailing list. See the RFCs for Refactor eal driver registration code
> (http://dpdk.org/ml/archives/dev/2015-September/023257.html) and Remove
> pci driver from vdevs
> (http://dpdk.org/ml/archives/dev/2015-August/023016.html).
> 
> I have the feeling we should think of it as unification work to reduce the
> differences between physical and virtual device handling.
> Probably it will have to be continued during the 2.3 cycle.
> 
> > Vhost Multi-Queue Support:  The vhost-user library will be updated to 
> > provide
> multi-queue support in the host similar to the RSS model so that guest driver 
> may
> allocate multiple rx/tx queues which then may be used to load balance between
> cores. See http://dpdk.org/ml/archives/dev/2015-August/022758.html for more
> details.
> 
> Does it require a patch on Qemu?
> 
> > Config Granularity of RSS for i40e:  All RSS hash and filter configurations 
> > for
> IPv4/v6, TCP/UDP, GRE, etc will be implemented for i40e. This includes support
> for QinQ and tunneled packets.
> 
> What is missing in i40e RSS/filtering?
The fields used for hashing are configured by firmware; some users want to
have a bit more flexibility in selecting packet fields for hashing.

> 
> > I40e 32-bit GRE Keys:  Both 24 and 32 bit keys for GRE will be supported for
> i40e.
> 
> Could you please give more details?
Currently only 24 bits of the GRE key are used for hash/filter calculation,
while 32 bits are wanted by some users. So either 24 or 32 bits can be
selected by the user from R2.2.

> 
> > X550 Enhancements:  Support VF TSO, more Entries in RSS and per VF RSS,
> network overlay support with flow director for X550.
> 
> Could you please give more details on "network overlay support with flow
> director"?
X550 hardware supports flow director on tunneled packets (VXLAN/NVGRE); this
will be enabled from R2.2.

Regards,
Helin

> 
> Thanks for sharing your plan



[dpdk-dev] DPDK 2.2 roadmap

2015-09-14 Thread Thomas Monjalon
Thanks for the details.
The roadmap is updated:
http://dpdk.org/dev/roadmap

Maybe the keep-alive feature needs some design discussion.

Anyone else to share his plan?



[dpdk-dev] ixgbe: account more Rx errors Issue

2015-09-14 Thread Tahhan, Maryam

> From: Kyle Larose [mailto:eomereadig at gmail.com] 
> Sent: Wednesday, September 9, 2015 6:43 PM
> To: Tahhan, Maryam
> Cc: Olivier MATZ; Andriy Berestovskyy; dev at dpdk.org
> Subject: Re: [dpdk-dev] ixgbe: account more Rx errors Issue
>
>
> On Mon, Sep 7, 2015 at 7:44 AM, Tahhan, Maryam  
> wrote:
> > From: Olivier MATZ [mailto:olivier.matz at 6wind.com]
> > Sent: Monday, September 7, 2015 9:30 AM
> > To: Tahhan, Maryam; Andriy Berestovskyy
> > Cc: dev at dpdk.org
> > Subject: Re: ixgbe: account more Rx errors Issue
> >
> > Hi,
> >
> > On 09/06/2015 07:15 PM, Tahhan, Maryam wrote:
> > >> From: Andriy Berestovskyy [mailto:aber at semihalf.com]
> > >> Sent: Friday, September 4, 2015 5:59 PM
> > >> To: Tahhan, Maryam
> > >> Cc: dev at dpdk.org; Olivier MATZ
> > >> Subject: Re: ixgbe: account more Rx errors Issue
> > >>
> > >> Hi Maryam,
> > >> Please see below.
> > >>
> > >>> XEC counts the number of receive IPv4, TCP, UDP or SCTP XSUM errors
> > >>
> > >> Please note that UDP checksum is optional for IPv4, but UDP packets
> > >> with a zero checksum hit XEC.
> > >>
> > >
> > > I understand, but this is what the hardware register is picking up, and
> > > what I included previously are the definitions of the registers from
> > the datasheet.
> > >
> > >>> And the general crc errors register counts the number of receive
> > >> packets with CRC errors.
> > >>
> > >> Let me explain you with an example.
> > >>
> > >> DPDK 2.0 behavior:
> > >> host A sends 10M IPv4 UDP packets (no checksum) to host B
> > >> host B stats: 9M ipackets + 1M ierrors (missed) = 10M
> > >>
> > >> DPDK 2.1 behavior:
> > >> host A sends 10M IPv4 UDP packets (no checksum) to host B
> > >> host B stats: 9M ipackets + 11M ierrors (1M missed + 10M XEC) = 20M?
> > >
> > > Because it's hitting the 2 error registers. If you had packets with
> > > multiple errors that are added up as part of ierrors, you'd still get
> > > more than 10M errors, which is why I asked for feedback on the 3
> > > suggestions below. What I'm saying is that the number of errors being >
> > > the number of received packets will be seen whenever you hit multiple
> > > error registers on the NIC.
> > >
> > >>
> > >>> So our options are:
> > >>> 1. Add only one of these into the error stats.
> > >>> 2. Introduce some cooking of the stats in this scenario, so we only
> > >>> add either one if they are equal, or the higher one otherwise.
> > >>> 3. Add them all, which means you can have more errors than the number
> > >>> of received packets, but TBH this is going to be the case if your
> > >>> packets have multiple errors anyway.
> > >>
> > >> 4. ierrors should reflect NIC drops only.
> > >
> > > I may have misinterpreted this, but ierrors in rte_ethdev.h is defined
> > > as the total number of erroneous received packets.
> > > Maybe we need a clear definition or a separate drop counter, as I see
> > uint64_t q_errors defined as: total number of queue packets received that
> > are dropped.
> > >
> > >> XEC does not count drops, so IMO it should be removed from ierrors.
> > >
> > > While it's picking up the 0 checksum as an error (which it shouldn't
> > > necessarily be doing), removing it could mean missing other valid
> > > L3/L4 checksum errors... Let me experiment some more with L3/L4
> > > checksum errors and crcerrs to see if we can cook the stats around
> > > this register in particular. I would hate to remove it and miss
> > > genuine errors
> >
> > For me, the definition that looks the most straightforward is:
> >
> > ipackets = packets successfully received by hardware
> > imissed = packets dropped by hardware because the software does
> >   not poll fast enough (= queue full)
> > ierrors = packets dropped by hardware (malformed packets, ...)
> >
> > These 3 stats never count twice the same packet.
> >
> > If we want more statistics, they could go in xstats. For instance, a 
> > counter for
> > invalid checksum. The definition of these stats would be pmd-specific.
> >
> > I agree we should clarify and have a consensus on the definitions before 
> > going
> > further.
> >
> >
> > Regards,
> > Olivier
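A minimal sketch of this non-overlapping accounting, using ixgbe-style
register names (gprc, mpc, crcerrs, xec) purely for illustration; this is
not the actual PMD code:

    #include <stdint.h>

    struct nic_regs   { uint64_t gprc, mpc, crcerrs, xec; };
    struct base_stats { uint64_t ipackets, imissed, ierrors; };

    static struct base_stats
    stats_compose(const struct nic_regs *r)
    {
        struct base_stats s;

        s.ipackets = r->gprc;      /* good packets received */
        s.imissed  = r->mpc;       /* dropped: ring full, software too slow */
        s.ierrors  = r->crcerrs;   /* dropped as malformed */
        /* xec (checksum errors) counts packets that were still delivered,
         * so under this definition it would move to xstats, not ierrors. */
        return s;
    }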
> Hi Olivier
> I think it's important to distinguish between errors and drops and provide a 
> statistics API that exposes both. This way people have access to as much 
> information as possible when things do go wrong and nothing is missed in 
> terms of errors.
>
> My suggestion for the high level registers would be:
> ipackets = Total number of packets successfully received by hardware
> imissed = Total number of packets dropped by hardware because the software
> does not poll fast enough (= queue full)
> idrops = Total number of packets dropped by hardware (malformed packets, ...).
> The # of drops can ONLY be <= the packets received (without overlap
> between registers).
> ierrors = Total number of erroneous received packets. The # of errors
> can be >= the packets received (without overlap between registers), this is

[dpdk-dev] [PATCH v4 0/2] ethdev: add port speed capability bitmap

2015-09-14 Thread Nélio Laranjeiro
On Sun, Sep 13, 2015 at 11:18:33PM +0200, Thomas Monjalon wrote:
> 2015-09-13 21:14, Marc Sune:
> > 2015-09-09 15:33 GMT+02:00 Thomas Monjalon :
> > > 2015-09-09 15:10, Nélio Laranjeiro:
> > > > I think V2 is better, maybe you can add a function to convert a single
> > > > bitmap value to the equivalent integer and get rid of ETH_SPEED_XXX
> > > macros.
> > > >
> > > > Thomas what is your opinion?
> > >
> > > Your proposal looks good Nelio.
> > 
> > I am confused, especially since you were the one advocating for having a
> > unified set of constants for speeds (discussion in v2).
> 
> Yes, my first thought was advocating a unification between capabilities and
> negotiated link properties.
> Afterwards I was convinced by Nelio's arguments: a bitmap is good for
> capabilities (especially to describe every capability in one field) but an
> integer is better for negotiated speed (especially for aggregated links).
> Converting a bitmap speed into an integer should be easy to implement in a
> function.
> 
> > In any case, as I see it, if we want to address the comments of  M. Brorup:
> > 
> > http://comments.gmane.org/gmane.comp.networking.dpdk.devel/19664

I read it.

> > 
> > we need bitmaps for rte_eth_conf link_speed to set the advertised speeds.
> 
> Yes I forgot this interesting comment. It is saying we need
>   1/ capabilities
>   2/ advertised modes (for auto-negotiation or fixed config)
>   3/ negotiated mode
> Previously we were focused only on 1/ and 3/.
> 2/ was only limited to a mode configured without negotiation and was using the
> same field as 3/.
> Maybe we should really have 3 different fields. 1/ and 2/ would use a bitmap.
> 
> > Let me know if you really want to come back to v2 or not.
> 
> It needs more discussion. What do you think of the above proposal?
> What is the opinion of Nelio? Morten?

I agree with Morten, and with this proposition (it should be possible to
disable some capabilities in the advertised field).

For that we should have these 3 fields:
  1/ Capabilities (as a bitmap), necessary for the user to configure the
     behavior of the PHY.
  2/ Advertised modes (as a bitmap) to configure the PHY.
  3/ Speed, duplex, status (as the rte_eth_link structure), necessary to
     verify that the configuration corresponds to what has been set and
     for other purposes.

I still think Marc needs to go back to V2 and continue from that one.
And as Thomas suggested, he could add a function to get rid of the
double defines and use the same ones for bitmap and non-bitmap fields.

Just for information: before I started this discussion I had worked on
a patch to change rte_eth_link.link_speed from 16 to 32 bits. I did not
push it because it should be pushed after Marc's one, and it will need
some rework after that.

-- 
Nélio Laranjeiro
6WIND


[dpdk-dev] [PATCH v4 0/2] ethdev: add port speed capability bitmap

2015-09-14 Thread Morten Brørup
It is important to consider that a multipath link (bonding etc.) is not a
physical link, but a logical link (built on top of multiple physical links),
regardless of whether it is a Layer2 link aggregate (IEEE 802.1ad, Ethernet
bonding, EtherChannel, DSL pair bonding, etc.) or a Layer3 multipath link (e.g.
simultaneously using Wi-Fi and mobile networks). So it doesn't make sense
to try to impose physical link properties on a purely logical link. Likewise,
it doesn't make sense to impose logical link properties on physical links. In
other words: don't consider bonding or any other logical link type when
designing the PHY API.

I think there is consensus that 1/ (PHY capabilities) and 2/ (PHY 
advertisements) should use the same definitions, specifically a bitmap field. 
And when you disregard bonding, I don't see any reason to use different 
definitions for 3/ (PHY negotiation result). This makes it one unified API for 
all three purposes.

Nelio suggested adding a support function to convert the bitmap field to a
speed value as an integer. I strongly support this, because you cannot expect
the bitmap to be ordered by speed. This support function will be able to
determine which speed is higher when exotic speeds are added to the bitmap.
Please extend this conversion function to give three output parameters: speed,
full/half duplex, and auto-negotiation/non-auto-negotiation, or add two
separate functions to get the duplex and the auto-negotiation.
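A sketch of such conversion helpers; the bit definitions are illustrative,
not the actual ETH_SPEED_XXX macros:

    #include <stdbool.h>
    #include <stdint.h>

    #define SPEED_BM_10M_HD   (1u << 0)
    #define SPEED_BM_10M      (1u << 1)
    #define SPEED_BM_100M_HD  (1u << 2)
    #define SPEED_BM_100M     (1u << 3)
    #define SPEED_BM_1G       (1u << 4)
    #define SPEED_BM_10G      (1u << 5)

    /* Bitmap -> Mbps by lookup, so bit ordering never has to follow speed
     * ordering, even when exotic speeds are added later. */
    static uint32_t
    speed_from_bm(uint32_t bm)
    {
        switch (bm) {
        case SPEED_BM_10M_HD:  case SPEED_BM_10M:  return 10;
        case SPEED_BM_100M_HD: case SPEED_BM_100M: return 100;
        case SPEED_BM_1G:                          return 1000;
        case SPEED_BM_10G:                         return 10000;
        default:                                   return 0; /* 0 or >1 bits set */
        }
    }

    static bool
    duplex_full_from_bm(uint32_t bm)
    {
        return !(bm & (SPEED_BM_10M_HD | SPEED_BM_100M_HD));
    }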

I haven't read the suggested code, but there should be some means in 2/
(advertisements) to disable auto-negotiation, e.g. a single bit in the bitmap
to indicate whether the speed/duplex-indicating bits in the bitmap mean forced
speed/duplex (in which case only a single speed/duplex bit should be set) or
auto-negotiation advertised speed/duplex (in which case multiple
speed/duplex bits can be set). And some means in 3/ (result), and maybe 2/
(advertisements), to select and/or indicate the physical interface in
dual-personality ports (e.g. ports where the PHY has both an SFP and an RJ45
connector, but only one of the two can be used at any time).


Med venlig hilsen / kind regards
- Morten Brørup

-Original Message-
From: Thomas Monjalon [mailto:thomas.monja...@6wind.com] 
Sent: 13. september 2015 23:19
To: Marc Sune
Cc: Nélio Laranjeiro; dev at dpdk.org; Olga Shern; Adrien Mazarguil; Morten
Brørup
Subject: Re: [dpdk-dev] [PATCH v4 0/2] ethdev: add port speed capability bitmap

2015-09-13 21:14, Marc Sune:
> 2015-09-09 15:33 GMT+02:00 Thomas Monjalon :
> > > 2015-09-09 15:10, Nélio Laranjeiro:
> > > I think V2 is better, maybe you can add a function to convert a 
> > > single bitmap value to the equivalent integer and get rid of 
> > > ETH_SPEED_XXX
> > macros.
> > >
> > > Thomas what is your opinion?
> >
> > Your proposal looks good Nelio.
> 
> I am confused, especially since you were the one advocating for having
> a unified set of constants for speeds (discussion in v2).

Yes, my first thought was advocating a unification between capabilities and
negotiated link properties.
Afterwards I was convinced by Nelio's arguments: a bitmap is good for
capabilities (especially to describe every capability in one field) but an
integer is better for negotiated speed (especially for aggregated links).
Converting a bitmap speed into an integer should be easy to implement in a
function.

> In any case, as I see it, if we want to address the comments of  M. Brorup:
> 
> http://comments.gmane.org/gmane.comp.networking.dpdk.devel/19664
> 
> we need bitmaps for rte_eth_conf link_speed to set the advertised speeds.

Yes I forgot this interesting comment. It is saying we need
1/ capabilities
2/ advertised modes (for auto-negotiation or fixed config)
3/ negotiated mode
Previously we were focused only on 1/ and 3/.
2/ was only limited to a mode configured without negotiation and was using the 
same field as 3/.
Maybe we should really have 3 different fields. 1/ and 2/ would use a bitmap.

> Let me know if you really want to come back to v2 or not.

It needs more discussion. What do you think of the above proposal?
What is the opinion of Nelio? Morten?




[dpdk-dev] DPDK 2.2 roadmap

2015-09-14 Thread Matej Vido
Hello Thomas,

CESNET would like to submit a new virtual poll mode driver for COMBO-80G and
COMBO-100G cards. This PMD requires the libsze2 library and kernel modules
(combov3, szedata2_cv3) to be installed.

We already sent a patch series in June
(http://dpdk.org/ml/archives/dev/2015-June/019736.html), but missed the
previous merge window. Now we are working on a new patchset version with
scattered packets support.

Meanwhile, we have improved the firmware and would like to share the
results of our benchmark:
https://www.liberouter.org/wp-content/uploads/2015/09/pmd_szedata2_dpdk_measurement-2015-09-11.pdf

Regards,
Matej

Mon, 14 Sep 2015, Thomas Monjalon :

> Thanks for the details.
> The roadmap is updated:
> http://dpdk.org/dev/roadmap
>
> Maybe the keep-alive feature needs some design discussion.
>
> Anyone else to share his plan?
>

--
Matej Vido
CESNET, a. l. e.


[dpdk-dev] DPDK 2.2 roadmap

2015-09-14 Thread Thomas Monjalon
Hi,

2015-09-14 15:44, Matej Vido:
> CESNET would like to submit a new virtual poll mode driver for COMBO-80G and
> COMBO-100G cards. This PMD requires the libsze2 library and kernel modules
> (combov3, szedata2_cv3) to be installed.

The name of the driver was szedata2, right?
What is the meaning? Why not have liberouter in its name?

> We already sent a patch series in June
> (http://dpdk.org/ml/archives/dev/2015-June/019736.html), but missed the
> previous merge window. Now we are working on a new patchset version with
> scattered packets support.

Please try to split the patches in order to ease the review.
One feature per patch is a good rule.

> Meanwhile, we have improved the firmware and would like to share the
> results of our 
> benchmark:https://www.liberouter.org/wp-content/uploads/2015/09/pmd_szedata2_dpdk_measurement-2015-09-11.pdf

Nice results!

Thanks


[dpdk-dev] continuous release notes

2015-09-14 Thread Thomas Monjalon
Hi,

In order to provide good release notes with little effort,
we have to update this file continuously:
http://dpdk.org/browse/dpdk/tree/doc/guides/rel_notes/release_2_2.rst

It means each patch adding a new feature or fixing a serious bug of a
previous release must include a new item in the release notes.

We missed it for some previous patches:
17e01e3 i40e: fix base driver allocation when not using first numa node
45e73f4 ixgbe: remove burst size restriction of vector Rx
b7fcd13 ixgbe: fix X550 DCB
7d49e0f hash: fix memory allocation of cuckoo key table
79db649 eal/linux: fix epoll timeout

Thanks for taking care.


[dpdk-dev] DPDK 2.2 roadmap

2015-09-14 Thread Matej Vido
Hi,

2015-09-14 16:02 GMT+02:00 Thomas Monjalon :

> Hi,
>
> 2015-09-14 15:44, Matej Vido:
> > CESNET would like to submit a new virtual poll mode driver for COMBO-80G
> > and COMBO-100G cards. This PMD requires the libsze2 library and kernel
> > modules (combov3, szedata2_cv3) to be installed.
>
> The name of the driver was szedata2, right?
> What is the meaning? Why not having liberouter in its name?
>

SZE is our Straight ZEro copy technology for fast data transfers between
hardware and RAM, 2 is its second version, and _cv3 denotes compatibility with
our third generation of FPGA cards. Liberouter is the name of our team,
which also does other, non-hardware stuff.

In the future we want to rewrite this PMD to be independent of the sze library
and kernel modules; that could also further improve performance a little.


>
> > We already sent a patch series in June
> > (http://dpdk.org/ml/archives/dev/2015-June/019736.html), but missed the
> > previous merge window. Now we are working on a new patchset version with
> > scattered packets support.
>
> Please try to split the patches in order to ease the review.
> One feature per patch is a good rule.
>
> > Meanwhile, we have improved the firmware and would like to share the
> > results of our benchmark:
> https://www.liberouter.org/wp-content/uploads/2015/09/pmd_szedata2_dpdk_measurement-2015-09-11.pdf
>
> Nice results!
>
> Thanks
>

Regards,
Matej

--
Matej Vido,
CESNET, a. l. e.


[dpdk-dev] [PATCH v4 0/2] ethdev: add port speed capability bitmap

2015-09-14 Thread Marc Sune
2015-09-14 12:52 GMT+02:00 Morten Brørup :

> It is important to consider that a multipath link (bonding etc.) is not a
> physical link, but a logical link (built on top of multiple physical
> links), regardless of whether it is a Layer2 link aggregate (IEEE 802.1ad,
> Ethernet bonding, EtherChannel, DSL pair bonding, etc.) or a Layer3
> multipath link (e.g. simultaneously using Wi-Fi and mobile networks). So it
> doesn't make sense trying to impose physical link properties on a purely
> logical link. Likewise, it doesn't make sense to impose logical link
> properties on physical links. In other words: Don't consider bonding or any
> other logical link types when designing the PHY API.
>

+1


>
> I think there is consensus that 1/ (PHY capabilities) and 2/ (PHY
> advertisements) should use the same definitions, specifically a bitmap
> field. And when you disregard bonding, I don't see any reason to use
> different definitions for 3/ (PHY negotiation result). This makes it one
> unified API for all three purposes.
>

Agree.


>
> Nelio suggested adding a support function to convert the bitmap field to a
> speed value as an integer. I strongly support this, because you cannot
> expect the bitmap to be ordered by speed.


Agree with Nelio&you. This is useful.


> This support function will be able to determine which speed is higher when
> exotic speeds are added to the bitmap. Please extend this conversion
> function to give three output parameters: speed, full/half duplex, auto
> negotiation/non-auto negotiation, or add two separate functions to get the
> duplex and auto-negotiation.
>

Since full/half duplex applies to legacy 10/100 Mbps only (afaik), I have my
doubts about using a duplex bit for all speeds. I would suggest defining
(unrolling) 100M (or 100M_FD) and 100M_HD, and the same for 10 Mbps/1 Gbps,
as Thomas was suggesting some mails ago.

This was done in v4 (implicitly 100M == 100M_FD). See below.


>
> I haven't read the suggested code, but there should be some means in 2/
> (advertisements) to disable auto negotiation, e.g. a single bit in the
> bitmap to indicate if the speed/duplex-indicating bits in the bitmap means
> forced speed/duplex (in which case only a single speed/duplex-bit should be
> set) or auto negotiation advertised speed/duplex (in which case multiple
> speed/duplex-bits can be set).


Agree.

v3/v4 of this patch adds the bitmap in the advertised field, as discussed, to
select a group of speeds. This is not implemented by drivers yet (!).

So, as of v4 of this patch, there could be: a) autoneg any supported speed
(=> bitmap == 0), b) autoneg over a group of speeds (=> more than one bit set
in the bitmap), or c) forced speed (one and only one bit set in the bitmap).

I think this is precisely what you meant, plus b) as a bonus.
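A sketch of those three cases, assuming a link_speeds-style bitmap field
(the name is illustrative):

    #include <stdint.h>

    enum adv_mode { ADV_AUTONEG_ANY, ADV_AUTONEG_GROUP, ADV_FORCED };

    /* 0 bits set = autoneg any supported speed; several bits = autoneg
     * within that group; exactly one bit = forced speed, no negotiation. */
    static enum adv_mode
    advertise_mode(uint32_t link_speeds)
    {
        if (link_speeds == 0)
            return ADV_AUTONEG_ANY;            /* a) */
        if (link_speeds & (link_speeds - 1))
            return ADV_AUTONEG_GROUP;          /* b) */
        return ADV_FORCED;                     /* c) */
    }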


> And some means in 3/ (result) and maybe 2/ (advertisements) to select
> and/or indicate physical interface in dual-personality ports (e.g. ports
> where the PHY has both an SFP and a RJ45 connector, but only one of the two
> can be used at any time).
>
>
For rte_eth_link_get() I don't have such a strong opinion. You either:

* encode the link speed and duplex as of now, separating duplex and numeric
speed. I would suggest adding the encoded speed+duplex bitmap flag for
consistency (although redundant);
* or return a single value, the bitmap with a single flag set among the
unrolled speeds, and then have the helpers int rte_eth_speed_from_bm(int
val_bm) and bool rte_eth_duplex_from_bm(int val_bm).


Marc


>
> Med venlig hilsen / kind regards
> - Morten Brørup
>
> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> Sent: 13. september 2015 23:19
> To: Marc Sune
> Cc: Nélio Laranjeiro; dev at dpdk.org; Olga Shern; Adrien Mazarguil; Morten
> Brørup
> Subject: Re: [dpdk-dev] [PATCH v4 0/2] ethdev: add port speed capability
> bitmap
>
> 2015-09-13 21:14, Marc Sune:
> > 2015-09-09 15:33 GMT+02:00 Thomas Monjalon :
> > > 2015-09-09 15:10, Nélio Laranjeiro:
> > > > I think V2 is better, maybe you can add a function to convert a
> > > > single bitmap value to the equivalent integer and get rid of
> > > > ETH_SPEED_XXX
> > > macros.
> > > >
> > > > Thomas what is your opinion?
> > >
> > > Your proposal looks good Nelio.
> >
> > I am confused, especially since you were the one advocating for having
> > a unified set of constants for speeds (discussion in v2).
>
> Yes, my first thought was advocating a unification between capabilities
> and negotiated link properties.
> Afterwards I was convinced by Nelio's arguments: a bitmap is good for
> capabilities (especially to describe every capability in one field) but an
> integer is better for negotiated speed (especially for aggregated links).
> Converting a bitmap speed into an integer should be easy to implement in a
> function.
>
> > In any case, as I see it, if we want to address the comments of  M.
> Brorup:
> >
> > http://comments.gmane.org/gmane.comp.networking.dpdk.devel/19664
> >
> > we need bitmaps for rte_eth_conf link_spe