Hi, Slava
Thanks for your detailed explanation! You are right, I didn't look
carefully!
With best regards,
Jiawei
On 2021/3/25 7:55 PM, Slava Ovsiienko wrote:
Hi, Jiawei
-----Original Message-----
From: Jiawei Zhu <17826875...@163.com>
Sent: Wednesday, March 24, 2021 18:22
To: Slava Ovsiienko <viachesl...@nvidia.com>; dev@dpdk.org
Cc: zhujiawe...@huawei.com; Matan Azrad <ma...@nvidia.com>; Shahaf
Shuler <shah...@nvidia.com>
Subject: Re: [PATCH] net/mlx5: add Rx checksum offload flag return bad
Hi, Slava
Thanks for your explanation. The multiplications and divisions are in the
TRANSPOSE macro, not in rte_be_to_cpu_16.
[SO]
Yes, TRANSPOSE is the macro with mul and div operators. But these are
translated by the compiler to simple shifts (because the operands are powers
of 2).
The only place where TRANSPOSE is used is the rxq_cq_to_ol_flags() routine.
I've compiled this one and provided the assembly listing - please see the one
in my previous reply. It shows what TRANSPOSE was compiled to in x86 code -
we see only shifts:
43 0047 48C1EA02 shrq $2,%rdx
44 004b 48C1E802 shrq $2,%rax
No mul/div at all, exactly as we expected.
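To make the point above concrete, here is a minimal sketch of a TRANSPOSE-style macro next to the explicit shift it reduces to. The macro body is a reconstruction for illustration, and the flag names and values (CQE_L3_OK, OL_IP_CKSUM_GOOD) are hypothetical: they are chosen only to match the mask (512) and shift count (2) visible in the listing, not copied from the driver headers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag values, picked to match the listing (mask 512, shift 2). */
#define CQE_L3_OK        UINT32_C(0x200) /* 512: bit in the completion entry */
#define OL_IP_CKSUM_GOOD UINT32_C(0x080) /* 128: bit in the mbuf offload flags */

/* Sketch of a TRANSPOSE-style macro: isolate bit `from` in `val` and move
 * it to the position of bit `to` using a division or a multiplication. */
#define TRANSPOSE(val, from, to) \
    (((from) >= (to)) ? \
     (((val) & (from)) / ((from) / (to))) : \
     (((val) & (from)) * ((to) / (from))))

/* The shift equivalence only holds when both flags are powers of two. */
static_assert((CQE_L3_OK & (CQE_L3_OK - 1)) == 0, "source flag must be a power of 2");
static_assert((OL_IP_CKSUM_GOOD & (OL_IP_CKSUM_GOOD - 1)) == 0, "target flag must be a power of 2");

/* Version written with the macro: the divisor is the constant 512 / 128 = 4. */
uint32_t l3_flag_via_transpose(uint32_t cqe_flags)
{
    return TRANSPOSE(cqe_flags, CQE_L3_OK, OL_IP_CKSUM_GOOD);
}

/* Version written by hand: an unsigned division by 4 is a right shift by 2. */
uint32_t l3_flag_via_shift(uint32_t cqe_flags)
{
    return (cqe_flags & CQE_L3_OK) >> 2;
}
```

Because the divisor is a compile-time constant power of two and the operand is unsigned, an optimizing compiler can be expected to emit the same shift for both functions.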
So I think using if-else directly could improve the performance.
[SO]
The if/else construction is usually compiled to conditional jumps. Branch
prediction in the CPU might not work well across varied ingress traffic
patterns (we are analyzing the flags of the received packets), and we would
get a performance penalty.
Hence, it seems the best practice is not to have conditional jumps at all.
The existing code follows this approach: as we can see from the assembly
listing, there are no conditional jumps.
With best regards,
Slava
PS. I just removed the distracting details from the listing - this is merely
the resulting code of rxq_cq_to_ol_flags(). I removed static and made it
non-inline to see the isolated piece of code:
rxq_cq_to_ol_flags:
movzwl 28(%rdi),%edx // endianness conversion optimized out entirely
movl %edx,%eax
andw $512,%dx
andw $1024,%ax
movzwl %dx,%edx
movzwl %ax,%eax
shrq $2,%rdx
shrq $2,%rax
orl %edx,%eax
ret
PPS. As we can see, the shift values are the same for both flags, so there
might be some room to optimize
(we could have only one shift and only one mask with AND).
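A minimal sketch of that idea, assuming both flags really do need the same shift: combine the two masks into one AND and shift once. The names and values are hypothetical, taken from the masks (512, 1024) and shift count (2) in the listing above rather than from the driver headers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag values inferred from the listing (masks 512 and 1024). */
#define CQE_L3_OK        UINT32_C(0x200)
#define CQE_L4_OK        UINT32_C(0x400)
#define OL_IP_CKSUM_GOOD UINT32_C(0x080)
#define OL_L4_CKSUM_GOOD UINT32_C(0x100)
#define CKSUM_SHIFT      2

/* The merged form is only valid while both flags use the same shift. */
static_assert(CQE_L3_OK / OL_IP_CKSUM_GOOD == CQE_L4_OK / OL_L4_CKSUM_GOOD,
              "both flags must be moved by the same distance");

/* Shape of the compiled code in the listing: two ANDs, two shifts, one OR. */
uint32_t cksum_flags_two_shifts(uint16_t cqe_flags)
{
    return ((cqe_flags & CQE_L3_OK) >> CKSUM_SHIFT) |
           ((cqe_flags & CQE_L4_OK) >> CKSUM_SHIFT);
}

/* Possible simplification: one AND with the combined mask, one shift. */
uint32_t cksum_flags_one_shift(uint16_t cqe_flags)
{
    return (cqe_flags & (CQE_L3_OK | CQE_L4_OK)) >> CKSUM_SHIFT;
}
```

If a later change gave the two flags different shift distances, the compile-time check would fail and the two-shift form would be needed again.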