We've recently run into a performance problem that appears to be due to the socket buffer accounting based on skb->truesize.

NOTE: detailed analysis follows, please bear with me...

This code in tcp_input.c prevents the window from growing when an e1000 PCI/PCI-X adapter is the receiving client. That in turn keeps the sender from sending as large a chunk of data as it otherwise could, which yields lower throughput due to less efficient transfers, especially in a bidirectional test with multiple clients.

The code in question:

static int __tcp_grow_window(const struct sock *sk, struct tcp_sock *tp,
                             const struct sk_buff *skb)
{
        /* Optimize this! */
        int truesize = tcp_win_from_space(skb->truesize)/2;
        int window = tcp_win_from_space(sysctl_tcp_rmem[2])/2;

        while (tp->rcv_ssthresh <= window) {
                if (truesize <= skb->len)
                        return 2 * inet_csk(sk)->icsk_ack.rcv_mss;

                truesize >>= 1;
                window >>= 1;
        }
        return 0;
}

This code begins to return 0 when the rcv_ssthresh reaches (in my test with 2.6.15) 66608. I have tcpdump data if you would like to see the window growth.

Assuming a 1500 MTU, this appears to be due to e1000 allocating 2k buffers for hardware, plus 2 bytes of alignment to align the IP header (NET_IP_ALIGN), to which dev_alloc_skb then adds 16 bytes of headroom. The final alloc_skb call is therefore for 2048 + 16 + 2 bytes, which is then satisfied from a 4096-byte slab.
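
For the record, here is that arithmetic as a back-of-the-envelope C sketch. The struct skb_shared_info size is an approximation on my part (it varies by arch and config), but it doesn't change the conclusion; the 2066-byte request already overshoots the 2048-byte slab on its own:

#include <stdio.h>

#define NET_IP_ALIGN       2    /* to align the IP header */
#define DEV_ALLOC_PAD      16   /* headroom added by dev_alloc_skb() */
#define SHARED_INFO_APPROX 168  /* ~sizeof(struct skb_shared_info), a guess */

/* round a request up to the next power-of-2 general slab size */
static unsigned int slab_round(unsigned int n)
{
        unsigned int size = 32;

        while (size < n)
                size <<= 1;
        return size;
}

int main(void)
{
        unsigned int req = 2048 + DEV_ALLOC_PAD + NET_IP_ALIGN;

        /* prints: alloc_skb(2066) + shared info ~168 -> 4096-byte slab */
        printf("alloc_skb(%u) + shared info ~%u -> %u-byte slab\n",
               req, SHARED_INFO_APPROX, slab_round(req + SHARED_INFO_APPROX));
        return 0;
}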

skb->truesize is 4k as a result. That value is used to compute the socket buffer charge, and also to limit the window size, which in turn limits our future window growth.
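
To see where that bites, here is a quick userspace replay of __tcp_grow_window() above. The sysctl values are assumptions on my part -- tcp_adv_win_scale = 2 and tcp_rmem[2] = 174760, the usual 2.6.15 defaults -- and the exact ceiling seen on the wire also depends on the other growth path in tcp_grow_window(), but this halving loop is what shuts growth off early for a 4k-truesize skb:

#include <stdio.h>

#define TCP_RMEM_MAX 174760     /* assumed sysctl_tcp_rmem[2] */

static int tcp_win_from_space(int space)
{
        return space - (space >> 2);    /* tcp_adv_win_scale == 2 */
}

/* same logic as __tcp_grow_window(); nonzero means growth is allowed */
static int grow_allowed(int rcv_ssthresh, int truesize, int len)
{
        int ts = tcp_win_from_space(truesize) / 2;
        int window = tcp_win_from_space(TCP_RMEM_MAX) / 2;

        while (rcv_ssthresh <= window) {
                if (ts <= len)
                        return 1;
                ts >>= 1;
                window >>= 1;
        }
        return 0;
}

int main(void)
{
        static const int sizes[] = { 4096, 2048 };
        int i, ssthresh;

        for (i = 0; i < 2; i++) {
                for (ssthresh = 0; grow_allowed(ssthresh, sizes[i], 1500); )
                        ssthresh++;
                printf("truesize %d: growth stops at rcv_ssthresh %d\n",
                       sizes[i], ssthresh);
        }
        /* prints 32768 for a 4k truesize vs 65536 for 2k: the oversized
         * truesize halves how far the receive window can grow */
        return 0;
}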

e1000 hardware (for PCI/PCI-X) requires that the receive buffers handed to hardware be power-of-2 sized.
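
For a 1500 MTU that means 2k buffers: the largest frame the part will accept with LPE clear is 1522 bytes (1500 payload + 14 header + 4 VLAN + 4 FCS), and the next power of 2 above that is 2048. Illustration only, not the driver's actual sizing code:

/* e1000 pci/pci-x supports 256/512/1024/2048 byte rx buffers */
static unsigned int e1000_pow2_bufsz(unsigned int max_frame)
{
        unsigned int size = 256;

        while (size < max_frame)
                size <<= 1;
        return size;    /* 1522 -> 2048, ~526 bytes of guaranteed waste */
}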

I see three possible solutions, and I'd like some input on them:

1) Give 1k buffers to hardware and use two of them for every 1500-byte frame. This is slightly less efficient for DMA, and will probably give heartburn to some of our customers with long-latency I/O.

2) Tell hardware to use 2k buffers, but fudge how much memory we allocate so we don't go past the 2k boundary. Hardware enforces that frames longer than 1522 bytes are discarded, so we shouldn't have to worry about the adapter DMAing past the end of the allocated buffer. This is optimized for the common 1500 MTU case: it is efficient and has a better memory footprint, but a DMA overrun can only be detected after the fact (we could BUG or WARN_ON if we notice it occurring).

3) Fix the socket buffer accounting code to work better with DMA devices that need power-of-2 allocations. If the socket buffer accounting code didn't mix the semantics of TCP window size with total stack memory utilization, that would also solve this problem. All DMA hardware I've used has to pre-allocate a fixed-size buffer for receives, guaranteeing some waste (up to the next power of 2).

Here is a proposed patch (compile-tested) implementing option 2) for e1000:



=====================

e1000: optimize memory usage for 1500 MTU on the pci-x legacy receive path

This patch attempts to optimize memory usage with the assumption that since
the hardware will drop packets longer than 1522 bytes when RCTL.LPE is 0,
we can cheat a little and eliminate unnecessarily allocated memory.  This
should approximately halve our memory usage for receive.

Signed-off-by: Jesse Brandeburg <[EMAIL PROTECTED]>

---

 drivers/net/e1000/e1000_main.c |   15 ++++++++++-----
 1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -3955,6 +3955,11 @@ e1000_alloc_rx_buffers(struct e1000_adap
        i = rx_ring->next_to_use;
        buffer_info = &rx_ring->buffer_info[i];

+       if (adapter->netdev->mtu <= ETH_DATA_LEN) {
+               /* unreserve a little since LPE won't be set */
+               bufsz = adapter->rx_buffer_len - NET_IP_ALIGN - 16;
+       }
+
        while (cleaned_count--) {
                if (!(skb = buffer_info->skb))
                        skb = dev_alloc_skb(bufsz);
@@ -4002,26 +4007,26 @@ e1000_alloc_rx_buffers(struct e1000_adap
                skb->dev = netdev;

                buffer_info->skb = skb;
-               buffer_info->length = adapter->rx_buffer_len;
+               buffer_info->length = bufsz;
 map_skb:
                buffer_info->dma = pci_map_single(pdev,
                                                  skb->data,
-                                                 adapter->rx_buffer_len,
+                                                 buffer_info->length,
                                                  PCI_DMA_FROMDEVICE);

                /* Fix for errata 23, can't cross 64kB boundary */
                if (!e1000_check_64k_bound(adapter,
                                        (void *)(unsigned long)buffer_info->dma,
-                                       adapter->rx_buffer_len)) {
+                                       buffer_info->length)) {
                        DPRINTK(RX_ERR, ERR,
                                "dma align check failed: %u bytes at %p\n",
-                               adapter->rx_buffer_len,
+                               buffer_info->length,
                                (void *)(unsigned long)buffer_info->dma);
                        dev_kfree_skb(skb);
                        buffer_info->skb = NULL;

                        pci_unmap_single(pdev, buffer_info->dma,
-                                        adapter->rx_buffer_len,
+                                        buffer_info->length,
                                         PCI_DMA_FROMDEVICE);

                        break; /* while !buffer_info->skb */
-