We've recently run into a performance problem that appears to be due to the socket buffer accounting based on skb->truesize.

NOTE: detailed analysis follows, please bear with me...

This code in tcp_input.c prevents the window from growing when an e1000 PCI/PCI-X adapter is the receiving client. That in turn keeps the sender from sending as large a chunk of data as it otherwise could, which yields lower throughput due to less efficient transfers, especially in a bidirectional test with multiple clients.

The code in question:

static int __tcp_grow_window(const struct sock *sk, struct tcp_sock *tp,
                             const struct sk_buff *skb)
{
        /* Optimize this! */
        int truesize = tcp_win_from_space(skb->truesize)/2;
        int window = tcp_win_from_space(sysctl_tcp_rmem[2])/2;

        while (tp->rcv_ssthresh <= window) {
                if (truesize <= skb->len)
                        return 2 * inet_csk(sk)->icsk_ack.rcv_mss;

                truesize >>= 1;
                window >>= 1;
        }
        return 0;
}

This code begins to return 0 when the rcv_ssthresh reaches (in my test with 2.6.15) 66608. I have tcpdump data if you would like to see the window growth.

Assuming a 1500 MTU, this appears to be due to e1000 allocating 2k buffers for hardware, plus 2 bytes of alignment to align the IP header (NET_IP_ALIGN), to which dev_alloc_skb then adds 16 bytes of headroom. The final alloc_skb call is therefore for 2048 + 16 + 2 bytes, which is then satisfied from a 4096-byte slab.
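
For the record, here is that arithmetic as a back-of-the-envelope C sketch. The struct skb_shared_info size is an approximation on my part (it varies by arch and config), but it doesn't change the conclusion; the 2066-byte request already overshoots the 2048-byte slab on its own:

#include <stdio.h>

#define NET_IP_ALIGN       2    /* to align the IP header */
#define DEV_ALLOC_PAD      16   /* headroom added by dev_alloc_skb() */
#define SHARED_INFO_APPROX 168  /* ~sizeof(struct skb_shared_info), a guess */

/* round a request up to the next power-of-2 general slab size */
static unsigned int slab_round(unsigned int n)
{
        unsigned int size = 32;

        while (size < n)
                size <<= 1;
        return size;
}

int main(void)
{
        unsigned int req = 2048 + DEV_ALLOC_PAD + NET_IP_ALIGN;

        /* prints: alloc_skb(2066) + shared info ~168 -> 4096-byte slab */
        printf("alloc_skb(%u) + shared info ~%u -> %u-byte slab\n",
               req, SHARED_INFO_APPROX, slab_round(req + SHARED_INFO_APPROX));
        return 0;
}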

skb->truesize is 4k as a result. That value is used to compute the socket buffer charge, and also to limit the window size, which in turn limits our future window growth.
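
To see where that bites, here is a quick userspace replay of __tcp_grow_window() above. The sysctl values are assumptions on my part -- tcp_adv_win_scale = 2 and tcp_rmem[2] = 174760, the usual 2.6.15 defaults -- and the exact ceiling seen on the wire also depends on the other growth path in tcp_grow_window(), but this halving loop is what shuts growth off early for a 4k-truesize skb:

#include <stdio.h>

#define TCP_RMEM_MAX 174760     /* assumed sysctl_tcp_rmem[2] */

static int tcp_win_from_space(int space)
{
        return space - (space >> 2);    /* tcp_adv_win_scale == 2 */
}

/* same logic as __tcp_grow_window(); nonzero means growth is allowed */
static int grow_allowed(int rcv_ssthresh, int truesize, int len)
{
        int ts = tcp_win_from_space(truesize) / 2;
        int window = tcp_win_from_space(TCP_RMEM_MAX) / 2;

        while (rcv_ssthresh <= window) {
                if (ts <= len)
                        return 1;
                ts >>= 1;
                window >>= 1;
        }
        return 0;
}

int main(void)
{
        static const int sizes[] = { 4096, 2048 };
        int i, ssthresh;

        for (i = 0; i < 2; i++) {
                for (ssthresh = 0; grow_allowed(ssthresh, sizes[i], 1500); )
                        ssthresh++;
                printf("truesize %d: growth stops at rcv_ssthresh %d\n",
                       sizes[i], ssthresh);
        }
        /* prints 32768 for a 4k truesize vs 65536 for 2k: the oversized
         * truesize halves how far the receive window can grow */
        return 0;
}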

e1000 hardware (for PCI/PCI-X) requires that the receive buffers handed to hardware be power-of-2 sized.
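
For a 1500 MTU that means 2k buffers: the largest frame the part will accept with LPE clear is 1522 bytes (1500 payload + 14 header + 4 VLAN + 4 FCS), and the next power of 2 above that is 2048. Illustration only, not the driver's actual sizing code:

/* e1000 pci/pci-x supports 256/512/1024/2048 byte rx buffers */
static unsigned int e1000_pow2_bufsz(unsigned int max_frame)
{
        unsigned int size = 256;

        while (size < max_frame)
                size <<= 1;
        return size;    /* 1522 -> 2048, ~526 bytes of guaranteed waste */
}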

I see three possible solutions, and I'd like some input on them:

1) Give 1k buffers to hardware and use two of them for every 1500-byte frame. This is slightly less efficient for DMA, and will probably give heartburn to some of our customers with long-latency I/O.

2) Tell hardware to use 2k buffers, but fudge how much memory we allocate so we don't go past the 2k boundary. Hardware enforces that frames longer than 1522 bytes are discarded, so we shouldn't have to worry about the adapter DMAing past the end of the allocated buffer. This is optimized for the common 1500 MTU case: it is efficient and has a better memory footprint, but a DMA overrun can only be detected after the fact (we could BUG or WARN_ON if we notice it occurring).

3) Fix the socket buffer accounting code to work better with DMA devices that need power-of-2 allocations. If the socket buffer accounting code didn't mix the semantics of TCP window size with total stack memory utilization, that would also solve this problem. All DMA hardware I've used has to pre-allocate a fixed-size buffer for receives, guaranteeing some waste (up to the next power of 2).

Here is a proposed patch (compile-tested) implementing option 2) for e1000:



=====================

e1000: optimize memory usage for 1500 MTU on the pci-x legacy receive path

This patch attempts to optimize memory usage with the assumption that since
the hardware will drop packets longer than 1522 bytes when RCTL.LPE is 0,
we can cheat a little and eliminate unnecessarily allocated memory.  This
should approximately halve our memory usage for receive.

Signed-off-by: Jesse Brandeburg <[EMAIL PROTECTED]>

---

 drivers/net/e1000/e1000_main.c |   15 ++++++++++-----
 1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -3955,6 +3955,11 @@ e1000_alloc_rx_buffers(struct e1000_adap
        i = rx_ring->next_to_use;
        buffer_info = &rx_ring->buffer_info[i];

+       if (adapter->netdev->mtu <= ETH_DATA_LEN) {
+               /* unreserve a little since LPE won't be set */
+               bufsz = adapter->rx_buffer_len - NET_IP_ALIGN - 16;
+       }
+
        while (cleaned_count--) {
                if (!(skb = buffer_info->skb))
                        skb = dev_alloc_skb(bufsz);
@@ -4002,26 +4007,26 @@ e1000_alloc_rx_buffers(struct e1000_adap
                skb->dev = netdev;

                buffer_info->skb = skb;
-               buffer_info->length = adapter->rx_buffer_len;
+               buffer_info->length = bufsz;
 map_skb:
                buffer_info->dma = pci_map_single(pdev,
                                                  skb->data,
-                                                 adapter->rx_buffer_len,
+                                                 buffer_info->length,
                                                  PCI_DMA_FROMDEVICE);

                /* Fix for errata 23, can't cross 64kB boundary */
                if (!e1000_check_64k_bound(adapter,
                                        (void *)(unsigned long)buffer_info->dma,
-                                       adapter->rx_buffer_len)) {
+                                       buffer_info->length)) {
                        DPRINTK(RX_ERR, ERR,
                                "dma align check failed: %u bytes at %p\n",
-                               adapter->rx_buffer_len,
+                               buffer_info->length,
                                (void *)(unsigned long)buffer_info->dma);
                        dev_kfree_skb(skb);
                        buffer_info->skb = NULL;

                        pci_unmap_single(pdev, buffer_info->dma,
-                                        adapter->rx_buffer_len,
+                                        buffer_info->length,
                                         PCI_DMA_FROMDEVICE);

                        break; /* while !buffer_info->skb */
-