We've recently run into a performance problem that appears to be caused by
the socket buffer accounting based on skb->truesize.
NOTE: detailed analysis follows, please bear with me...
This code in tcp_input.c prevents the receive window from growing when an
e1000 (PCI/PCI-X) adapter is the receiver. This in turn causes the sender
to send smaller chunks of data than it could, which yields lower throughput
due to less efficient transfers, especially in a bidirectional test with
multiple clients.
The code in question:
static int __tcp_grow_window(const struct sock *sk, struct tcp_sock *tp,
			     const struct sk_buff *skb)
{
	/* Optimize this! */
	int truesize = tcp_win_from_space(skb->truesize)/2;
	int window = tcp_win_from_space(sysctl_tcp_rmem[2])/2;

	while (tp->rcv_ssthresh <= window) {
		if (truesize <= skb->len)
			return 2 * inet_csk(sk)->icsk_ack.rcv_mss;

		truesize >>= 1;
		window >>= 1;
	}
	return 0;
}
This code begins to return 0 once rcv_ssthresh reaches 66608 (in my test
with 2.6.15). I have tcpdump data showing the window growth if you would
like to see it.
Assuming a 1500 MTU, this appears to be due to e1000 allocating 2k buffers
for hardware plus 2 bytes of alignment for the IP header (NET_IP_ALIGN),
to which dev_alloc_skb then adds 16 bytes. The final alloc_skb call is for
2048 + 16 + 2 bytes, which is served from a 4096-byte slab.
skb->truesize is therefore 4k; this is used to compute the socket buffer
charge, and also to limit window size, which in turn limits our future
window growth.
e1000 hardware (for PCI/PCI-X) requires that the buffers given to hardware
be power-of-two sized.
There are three solutions that I see, and I'd like some input on:
1) Give 1k buffers to hardware, and use two of them for every 1500-byte
frame. This is slightly less efficient for DMA, and will probably give
some of our customers with long latency to I/O heartburn.
2) Tell hardware to use 2k buffers, but fudge how much memory we allocate
so we don't go past the 2k boundary. Hardware enforces that frames longer
than 1522 bytes are discarded, so we shouldn't have to worry about the
adapter DMAing past the end of the allocated buffer. This is optimized for
the 1500 MTU case: it is efficient and has a better memory footprint, but
a DMA overrun can only be detected after the fact (we may BUG or WARN_ON
if we notice this occurring).
3) Fix the socket buffer accounting code to work better with DMA devices
that need power-of-two allocations. If the socket buffer accounting code
didn't mix the semantics of TCP window size with total stack memory
utilization, it would also solve this problem. All DMA hardware I've used
has to pre-allocate a fixed-size buffer for receives, guaranteeing some
waste (up to the next power of 2).
Here is a proposed patch (compile tested) to do option 2) for e1000:
=====================
e1000: optimize memory usage for 1500 MTU on pci-x legacy receive path
This patch attempts to optimize memory usage with the assumption that since
the hardware will drop packets longer than 1522 bytes when RCTL.LPE is 0,
we can cheat a little to eliminate unnecessary allocated memory. This should
approximately halve our memory usage for receive.
Signed-off-by: Jesse Brandeburg <[EMAIL PROTECTED]>
---
drivers/net/e1000/e1000_main.c | 15 ++++++++++-----
1 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -3955,6 +3955,11 @@ e1000_alloc_rx_buffers(struct e1000_adap
i = rx_ring->next_to_use;
buffer_info = &rx_ring->buffer_info[i];
+ if (adapter->netdev->mtu <= ETH_DATA_LEN) {
+ /* unreserve a little since LPE won't be set */
+ bufsz = adapter->rx_buffer_len - NET_IP_ALIGN - 16;
+ }
+
while (cleaned_count--) {
if (!(skb = buffer_info->skb))
skb = dev_alloc_skb(bufsz);
@@ -4002,26 +4007,26 @@ e1000_alloc_rx_buffers(struct e1000_adap
skb->dev = netdev;
buffer_info->skb = skb;
- buffer_info->length = adapter->rx_buffer_len;
+ buffer_info->length = bufsz;
map_skb:
buffer_info->dma = pci_map_single(pdev,
skb->data,
- adapter->rx_buffer_len,
+ buffer_info->length,
PCI_DMA_FROMDEVICE);
/* Fix for errata 23, can't cross 64kB boundary */
if (!e1000_check_64k_bound(adapter,
(void *)(unsigned long)buffer_info->dma,
- adapter->rx_buffer_len)) {
+ buffer_info->length)) {
DPRINTK(RX_ERR, ERR,
"dma align check failed: %u bytes at %p\n",
- adapter->rx_buffer_len,
+ buffer_info->length,
(void *)(unsigned long)buffer_info->dma);
dev_kfree_skb(skb);
buffer_info->skb = NULL;
pci_unmap_single(pdev, buffer_info->dma,
- adapter->rx_buffer_len,
+ buffer_info->length,
PCI_DMA_FROMDEVICE);
break; /* while !buffer_info->skb */
-