Hi Adrien,

> Hi Konstantin,
>
> On Wed, Nov 30, 2016 at 10:54:50AM +0000, Ananyev, Konstantin wrote:
> [...]
> > > Something is definitely needed here, and only PMDs can provide it. I
> > > think applications should not have to clear checksum fields or
> > > initialize them to some magic value, same goes for any other offload
> > > or hardware limitation that needs to be worked around.
> > >
> > > tx_prep() is one possible answer to this issue, however as mentioned
> > > in the original patch it can be very expensive if exposed by the PMD.
> > >
> > > Another issue I'm more concerned about is the way limitations are
> > > managed (struct rte_eth_desc_lim). While not officially tied to
> > > tx_prep(), this structure contains new fields that are only relevant
> > > to a few devices, and I fear it will keep growing with each new
> > > hardware quirk to manage, breaking ABIs in the process.
> >
> > Well, if some new HW capability/limitation would arise and we'd like to
> > support it in DPDK, then yes we probably would need to think how to
> > incorporate it here.
> > Do you have anything particular in mind here?
>
> Nothing in particular, so for the sake of the argument, let's suppose
> that I would like to add a field to expose some limitation that only
> applies to my PMD during TX but looks generic enough to make sense,
> e.g. maximum packet size when VLAN tagging is requested.
Hmm, I didn't hear about such limitations so far, but if it is a real case -
sure, feel free to submit the patch.

> PMDs are free to set that field to some
> special value (say, 0) if they do not care.
>
> Since that field exists however, conscious applications should check its
> value for each packet that needs to be transmitted. This extra code
> causes a slowdown just by sitting in the data path. Since it is not the
> only field in that structure, the performance impact can be significant.
>
> Even though this code is inside applications, it remains unfair to PMDs
> for which these tests are irrelevant. This problem is identified and
> addressed by tx_prepare().

I suppose the question is why we need:

uint16_t nb_seg_max;
uint16_t nb_mtu_seg_max;

as we now have tx_prepare(), right?
For two reasons:

1. Some people might feel that tx_prepare() is not good (smart/fast)
   enough for them and would prefer to do the necessary preparations for
   TX offloads themselves.

2. Even if people do use tx_prepare(), they still should take this
   information into account. As an example, ixgbe can't TX packets with
   more than 40 segments. Obviously ixgbe_tx_prep() performs that check
   and returns an error. But it wouldn't try to merge/reallocate mbufs
   for you. The user still has to do that himself, or just prevent
   creating such long chains somehow.

> Thanks to tx_prepare(), these checks are moved back into PMDs where they
> belong. PMDs that do not need them do not have to provide support for
> tx_prepare() and do not suffer any performance impact as a result;
> applications only have to make sure tx_prepare() is always called at
> some point before tx_burst().
>
> Once you reach this stage, you've effectively made tx_prepare() mandatory
> before tx_burst(). If some bug occurs, then perhaps you forgot to call
> tx_prepare(), you just need to add it. The total cost for doing TX is
> therefore tx_prepare() + tx_burst().
> I'm perhaps a bit pessimistic mind you, but I do not think tx_prepare()
> will remain optional for long. Sure, PMDs that do not implement it do not
> care, I'm focusing on applications, for which the performance impact of
> calling tx_prepare() followed by tx_burst() is higher than a single
> tx_burst() performing all the necessary preparation at once.
>
> [...]
> > > Following the same logic, why can't such a thing be made part of the
> > > TX burst function as well (through a direct call to
> > > rte_phdr_cksum_fix() whenever necessary). From an application
> > > standpoint, what are the advantages of having to:
> > >
> > >  if (tx_prep()) // iterate and update mbufs as needed
> > >          tx_burst(); // iterate and send
> > >
> > > Compared to:
> > >
> > >  tx_burst(); // iterate, update as needed and send
> >
> > I think that was discussed extensively quite a lot previously here:
> > as Thomas already replied, the main motivation is to allow the user
> > to execute them on different stages of the packet TX pipeline,
> > and probably on different cores.
> > I think that provides better flexibility to the user as to when/where
> > to do these preparations and hopefully would lead to better performance.
>
> And I agree, I think this use case is valid but does not warrant such a
> high penalty when your application does not need that much flexibility.
> Simple (yet conscious) applications need the highest performance. Complex
> ones as you described already suffer quite a bit from IPCs and won't mind
> a couple of extra CPU cycles, right?

It would mean an extra cache-miss for every packet, so I think the
performance hit would be quite significant.

About the 'simple' case when tx_prep() and tx_burst() are called on the
same core, why do you believe that:

tx_prep(); tx_burst();

would be much slower than tx_burst() {tx_prep(), ...}?
tx_prep() itself is quite expensive; let's say for Intel HW it includes:
- read mbuf fields (2 cache-lines),
- read packet header (1/2 cache-lines),
- calculate pseudo-header csum,
- update packet header.
Compared to that, the price of the extra function call seems negligible
(if we TX packets in bursts, of course).

> Yes they will, therefore we need a method that satisfies both cases.
>
> As a possible solution, a special mbuf flag could be added to each mbuf
> having gone through tx_prepare(). That way, tx_burst() could skip some
> checks and things it would otherwise have done.

That's an interesting idea, but it has one drawback:
as I understand it, it means that from now on, if the user does the
preparations on his own, he has to set this flag, otherwise tx_burst()
would do extra unnecessary work. So any existing applications that use TX
offloads and do the preparation themselves would have to be modified to
avoid a performance loss.

> Another possibility, telling the PMD first that you always intend to use
> tx_prepare() and getting a simpler/faster tx_burst() callback as a result.

That's what we have right now (at least for Intel HW): it is the user's
responsibility to do the necessary preparations/checks before calling
tx_burst(). With tx_prepare() we just remove from the user the headache of
implementing it on his own. Now he can use a 'proper' PMD-provided
function. My vote would still be for that model.

Konstantin