Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)

Milton Miller Mon, 04 Jun 2007 02:11:19 -0700


On Jun 1, 2007, at 3:45 PM, David Acker wrote:

Ok, I took a stab at coding and testing these ideas. Below is a patchagainst 2.6.22-rc3.
Let me know what you think.

I think you got most of the ideas. As Auke noted, your coding styleis showing again. And your mailer again munged whitespace (fixed bys/^<space><space>/<space>/ s/^$/<space>/).

  Testing shows that we can hit any of the following scenarios:

Find a buffer not complete with rx->el and rx->s0 set.
        I read back the status and see if the receiver is still running.


This means the hardware reached what we think is end of list, good.

Find a buffer not complete with rx->el not set and rx->s0 set.
I check the next buffer and if it is complete, I free the skb andreturn 0.

This is the case when hardware read the descriptor between el beingremoved and size being set.

If the next buffer is not complete, I read the receiver's status tosee if he is still running.

The polling of still running will save us waiting for irq or next pollwhen hardware read el before we cleared it, at the cost of an ioreadwhen there is simply no packet ready and we hit a fill mark.

Find a buffer that is complete with rx->el not set and rx->s0 set.
It appears that hardware can read the rfd's el-bit, then software canclear the rfd el-bit and set the rfd size, and then hardware can comein and read the size.

Yes, since the size is after the EL flag in the descriptor, this canhappen since the pci read is not atomic.

I am reading the status back, although I don't think that I have to inthis instance.

Actually, you are reading it when the rfd still has EL set. Since thecpu will never encounter that case, the if condition is neversatisfied.

How about creating a state unknown, for when we think we should checkthe device if its running.If we are in this state and then encounter a received packet without s0set, we can set it back

to running.   We set it when we rx a packet with s0 set.

We then move both io_status reads to the caller.

I am testing a version of this code patched against 2.6.18.4 on my PXA255 based system. I will let you all know how it goes.

I'm assuming this is why the cleanup of the receiver start to alwaysstart on rx_to_clean got dropped again. :-)

Also, I would like a few sentences in the Driver Operation section IVReceive big comment. Something like

In order to keep updates to the RFD link field from colliding withhardware writes to mark packets complete, we use the feature thathardware will not write to a size 0 descriptor and mark the previouspacket as end-of-list (EL). After updating the link, we remove EL andonly then restore the size such that hardware may use theprevious-to-end RFD.

at the end of the first paragraph, and insert software before "nolocking is required" in the second.

On the ARM, their is a race condition between software allocating anew receivebuffer and hardware writing into a buffer. The two race on touchingthe lastReceive Frame Descriptor (RFD). It has its el-bit set and its nextlink equalto 0. When hardware encounters this buffer it attempts to write datato itand then update Status Word bits and Actual Count in the RFD. At thesame timesoftware may try to clear the el-bit and set the link address to a newbuffer.
Since the entire RFD is once cache-line, the two write operations can
collide. This can lead to the receive unit stalling or freed receivebuffers
getting written to.
The fix is to set the el-bit on and the size to 0 on the next to lastbufferin the chain. When the hardware encounters this buffer it stops anddoes
not write to it at all.  The hardware issues an RNR interrupt with the
receive unit in the No Resources state. When software allocatesbuffers,
it can update the tail of the list when either it knows the hardware
has stopped or the previous to the new one to mark marked.
Once it has a new next to last buffer prepared, it can clear the el-bit
and set the size on the previous one.  The race on this buffer is safe
since the link already points to a valid next buffer.  We keep flags
in our software descriptor to note if the el bit is set and if the size
was 0.  When we clear the RFD's el bit and set its size, we also clear
the el flag but we leave the size was 0 bit set.  This was we can find
this buffer again later.

If the hardware sees the el-bit cleared without the size set, it will
move on to the next buffer and skip this one.  If it sees
the size set but the el-bit still set, it will complete that buffer
and then RNR interrupt and wait.


Signed-off-by: David Acker <[EMAIL PROTECTED]>

---
--- linux-2.6.22-rc3/drivers/net/e100.c.orig 2007-06-0116:13:16.000000000 -0400+++ linux-2.6.22-rc3/drivers/net/e100.c 2007-06-01 16:20:36.000000000-0400
@@ -281,10 +281,17 @@ struct csr {
 };

 enum scb_status {
+       rus_no_res       = 0x08,
        rus_ready        = 0x10,
        rus_mask         = 0x3C,
 };

+enum ru_state  {
+       ru_suspended = 0,
+       ru_running       = 1,
+       ru_uninitialized = -1,
+};
+

If we are going to have uninitialized, it should be 0. But its onlyset (1) during teardown, and (2) during rx list allocation. In thesame function its set to suspended.

Also, lets call this stopped, not suspended. Suspended is a specifichardware state we are not using.


See above for my comments about adding unknown or maybe_running.

 enum scb_stat_ack {
        stat_ack_not_ours    = 0x00,
        stat_ack_sw_gen      = 0x04,
@@ -395,10 +402,16 @@ struct rfd {
        u16 size;
 };

+enum rx_flags {
+       rx_el = 0x01,
+       rx_s0 = 0x02,
+};
+
 struct rx {
        struct rx *next, *prev;
        struct sk_buff *skb;
        dma_addr_t dma_addr;
+       u8 flags;
 };

 #if defined(__BIG_ENDIAN_BITFIELD)
@@ -526,6 +539,7 @@ struct nic {
        struct rx *rx_to_use;
        struct rx *rx_to_clean;
        struct rfd blank_rfd;
+       enum ru_state ru_running;

        spinlock_t cb_lock                      ____cacheline_aligned;
        spinlock_t cmd_lock;
@@ -947,7 +961,7 @@ static void e100_get_defaults(struct nic
                ((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i));

        /* Template for a freshly allocated RFD */
-       nic->blank_rfd.command = cpu_to_le16(cb_el & cb_s);
+       nic->blank_rfd.command = 0;
        nic->blank_rfd.rbd = 0xFFFFFFFF;
        nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN);

@@ -1742,11 +1756,55 @@ static int e100_alloc_cbs(struct nic *ni
        return 0;
 }

-static inline void e100_start_receiver(struct nic *nic)

+static void e100_find_mark_el(struct nic *nic, struct rx *marked_rx,int is_running)

 {
-       /* Start if RFA is non-NULL */
-       if(nic->rx_to_clean->skb)
-               e100_exec_cmd(nic, ruc_start, nic->rx_to_clean->dma_addr);
+       struct rx *rx_to_mark = nic->rx_to_use->prev->prev;
+       struct rfd *rfd_to_mark;

Let's just call this rfd or mark_rfd (since its used for both marksetting and clearing).

+
+       if(is_running && rx_to_mark->prev->flags & rx_el)
+               return;


or just rx_to_mark == marked_rx (saves the load of flags).

+
+       if(marked_rx == rx_to_mark)
+               return;
+
+       rfd_to_mark = (struct rfd *) rx_to_mark->skb->data;
+       rfd_to_mark->command |= cpu_to_le16(cb_el);
+       rfd_to_mark->size = 0;
+       pci_dma_sync_single_for_cpu(nic->pdev, rx_to_mark->dma_addr,
+               sizeof(struct rfd), PCI_DMA_TODEVICE);


for_device, BIDIRECTIONAL

+       rx_to_mark->flags |= (rx_el | rx_s0);
+
+       if(!marked_rx)
+               return;
+
+       rfd_to_mark = (struct rfd *) marked_rx->skb->data;
+       rfd_to_mark->command &= ~cpu_to_le16(cb_el);
+       pci_dma_sync_single_for_cpu(nic->pdev, marked_rx->dma_addr,
+               sizeof(struct rfd), PCI_DMA_TODEVICE);


for_device BIDIRECTIONAL

+
+       rfd_to_mark->size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
+       pci_dma_sync_single_for_cpu(nic->pdev, marked_rx->dma_addr,
+               sizeof(struct rfd), PCI_DMA_TODEVICE);

for_device BIDIRECTIONAL

+
+       if(is_running)
+               marked_rx->flags &= ~rx_el;
+       else
+               marked_rx->flags &= ~(rx_el | rx_s0);
+}
+
+static inline void e100_start_receiver(struct nic *nic, struct rx *rx)
+{
+       if(!nic->rxs) return;
+       if(ru_suspended != nic->ru_running) return;

Are these checks just paranoia? Can we call poll leading to rx_cleanwithout setting up the receive list?

+
+       /* handle init time starts */
+       if(!rx) rx = nic->rxs;

Again, as the previous patch noted, we can drop rx as a parameter andalways (re)start on rx_to_clean.

+
+       /* (Re)start RU if suspended or idle and RFA is non-NULL */
+       if(rx->skb) {
+               e100_exec_cmd(nic, ruc_start, rx->dma_addr);
+               nic->ru_running = ru_running;
+       }
 }

 #define RFD_BUF_LEN (sizeof(struct rfd) + VLAN_ETH_FRAME_LEN)
@@ -1774,8 +1832,6 @@ static int e100_rx_alloc_skb(struct nic
                struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
                put_unaligned(cpu_to_le32(rx->dma_addr),
                        (u32 *)&prev_rfd->link);
-               wmb();
-               prev_rfd->command &= ~cpu_to_le16(cb_el & cb_s);
                pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr,
                        sizeof(struct rfd), PCI_DMA_TODEVICE);
        }
@@ -1789,6 +1845,7 @@ static int e100_rx_indicate(struct nic *
        struct sk_buff *skb = rx->skb;
        struct rfd *rfd = (struct rfd *)skb->data;
        u16 rfd_status, actual_size;
+       u8 status;

        if(unlikely(work_done && *work_done >= work_to_do))
                return -EAGAIN;
@@ -1800,9 +1857,46 @@ static int e100_rx_indicate(struct nic *

        DPRINTK(RX_STATUS, DEBUG, "status=0x%04X\n", rfd_status);

-       /* If data isn't ready, nothing to indicate */
-       if(unlikely(!(rfd_status & cb_complete)))
+       /* If data isn't ready, nothing to indicate
+        * If both the el and s0 rx flags are set, we have hit the marked
+        * buffer but we don't know if hardware has seen it so we check
+        * the status.
+        * If only the s0 flag is set, we check the next buffer.
+        * If it is complete, we know that hardware saw the rfd el bit
+        * get cleared but did not see the rfd size get set so it
+        * skipped this buffer.  We just return 0 and look at the
+        * next buffer.
+        * If only the s0 flag is set but the next buffer is
+        * not complete, we cleared the el flag as hardware
+        * hit this buffer. */
+       if(unlikely(!(rfd_status & cb_complete))) {
+               u8 maskedFlags = rx->flags & (rx_el | rx_s0);
+               if(maskedFlags == (rx_el | rx_s0)) {
+                       status = ioread8(&nic->csr->scb.status);
+                       if(status & rus_no_res) {
+                               nic->ru_running = ru_suspended;
+                       }
+               } else if(maskedFlags == rx_s0) {
+                       struct rx *next_rx = rx->next;
+                       struct rfd *next_rfd = (struct rfd *)next_rx->skb->data;
+                       pci_dma_sync_single_for_cpu(nic->pdev, 
next_rx->dma_addr,
+                               sizeof(struct rfd), PCI_DMA_FROMDEVICE);
+                       if(next_rfd->status & cpu_to_le16(cb_complete)) {
+                               pci_unmap_single(nic->pdev, rx->dma_addr,
+                                       RFD_BUF_LEN, PCI_DMA_FROMDEVICE);
+                               dev_kfree_skb_any(skb);
+                               rx->skb = NULL;
+                               rx->flags &= ~rx_s0;
+                               return 0;
+                       } else {
+                               status = ioread8(&nic->csr->scb.status);
+                               if(status & rus_no_res) {
+                                       nic->ru_running = ru_suspended;
+                               }
+                       }
+               }
                return -ENODATA;
+       }

        /* Get actual data size */
        actual_size = le16_to_cpu(rfd->actual_size) & 0x3FFF;
@@ -1813,6 +1907,15 @@ static int e100_rx_indicate(struct nic *
        pci_unmap_single(nic->pdev, rx->dma_addr,
                RFD_BUF_LEN, PCI_DMA_FROMDEVICE);

+       /* This happens when hardward sees the rfd el flag set
+        * but then sees the rfd size set as well */
+       if(le16_to_cpu(rfd->command) & cb_el) {

The racing read of the dma device might see it, but our cpu will neversee this since it wrote them in the other order.


how about if rx-flags & s0 ?
Also see my comments about ru_unknown.

Where do you clear s0 when we encounter a packet with rfd statuscomplete?


Maybe we should just set flags to 0 when allocating the skb?

+               status = ioread8(&nic->csr->scb.status);
+               if(status & rus_no_res) {
+                       nic->ru_running = ru_suspended;
+               }
+       }
+
        /* Pull off the RFD and put the actual data (minus eth hdr) */
        skb_reserve(skb, sizeof(struct rfd));
        skb_put(skb, actual_size);
@@ -1842,19 +1945,46 @@ static int e100_rx_indicate(struct nic *
 static void e100_rx_clean(struct nic *nic, unsigned int *work_done,
        unsigned int work_to_do)
 {
-       struct rx *rx;
+       struct rx *rx, *marked_rx;
+       int restart_required = 0;
+       int err = 0;

        /* Indicate newly arrived packets */

for(rx = nic->rx_to_clean; rx->skb; rx = nic->rx_to_clean =rx->next) {

-               if(e100_rx_indicate(nic, rx, work_done, work_to_do))
-                       break; /* No more to clean */
+               err = e100_rx_indicate(nic, rx, work_done, work_to_do);
+               /* Hit quota or no more to clean */
+               if(-EAGAIN == err || -ENODATA == err)
+                       break;
        }

My ru_state of maybe_running would cause us to read the status andresolve it to either running or suspended here (after the clean loop).


+       /* On EAGAIN, hit quota so have more work to do, restart once
+        * cleanup is complete.
+        * Else, are we already rnr? then pay attention!!! this ensures that
+        * the state machine progression never allows a start with a
+        * partially cleaned list, avoiding a race between hardware
+        * and rx_to_clean when in NAPI mode */

It took me a long while to figure out what race you are referring tohere. At first I thought you were just moving a comment. I amtotally lost by "are we already rnr? then pay attention!!!".

This isn't a race, its more like "We don't restart the receiver when westop processing due to NAPI quota because we don't yet know which rfdis the first that already contains data. That is, rx_to_clean pointsto something that might be a used rfd instead of a known clean one."

+       if(-EAGAIN != err && ru_suspended == nic->ru_running)
+               restart_required = 1;
+
+       marked_rx = nic->rx_to_use->prev->prev;
+       while(!(marked_rx->flags & rx_el))
+               marked_rx = marked_rx->prev;
+

Not while. This should be if, followed by BUG_ON(!marked_rx->flags &rx_el). We can remove the BUG_ON later, but it points to an error inour logic if its not one of the two places.

        /* Alloc new skbs to refill list */
        for(rx = nic->rx_to_use; !rx->skb; rx = nic->rx_to_use = rx->next) {
                if(unlikely(e100_rx_alloc_skb(nic, rx)))
                        break; /* Better luck next time (see watchdog) */
        }
+
+       e100_find_mark_el(nic, marked_rx, !restart_required);

We could pass the true ru_status based value here, but since we won'tbe restoring the receiver it doesn't matter.

+
+       if(restart_required) {
+               // ack the rnr?
+               iowrite8(stat_ack_rnr, &nic->csr->scb.stat_ack);
+               e100_start_receiver(nic, nic->rx_to_clean);
+               if(work_done)
+                       (*work_done)++;

Why is restarting the receiver work? Actually receiving the packet iswork.

+       }
 }

 static void e100_rx_clean_list(struct nic *nic)
@@ -1862,6 +1992,8 @@ static void e100_rx_clean_list(struct ni
        struct rx *rx;
        unsigned int i, count = nic->params.rfds.count;

+       nic->ru_running = ru_uninitialized;
+
        if(nic->rxs) {
                for(rx = nic->rxs, i = 0; i < count; rx++, i++) {
                        if(rx->skb) {
@@ -1883,6 +2015,7 @@ static int e100_rx_alloc_list(struct nic
        unsigned int i, count = nic->params.rfds.count;

        nic->rx_to_use = nic->rx_to_clean = NULL;
+       nic->ru_running = ru_uninitialized;


Here is the short duration of uninitialized, and its not the 0 case.


        if(!(nic->rxs = kcalloc(count, sizeof(struct rx), GFP_ATOMIC)))
                return -ENOMEM;
@@ -1897,6 +2030,9 @@ static int e100_rx_alloc_list(struct nic
        }

        nic->rx_to_use = nic->rx_to_clean = nic->rxs;
+       nic->ru_running = ru_suspended;
+
+       e100_find_mark_el(nic, NULL, 0);

        return 0;
 }
@@ -1916,6 +2052,10 @@ static irqreturn_t e100_intr(int irq, vo
        /* Ack interrupt(s) */
        iowrite8(stat_ack, &nic->csr->scb.stat_ack);

+       /* We hit Receive No Resource (RNR); restart RU after cleaning */
+       if(stat_ack & stat_ack_rnr)
+               nic->ru_running = ru_suspended;
+
        if(likely(netif_rx_schedule_prep(netdev))) {
                e100_disable_irq(nic);
                __netif_rx_schedule(netdev);
@@ -2007,7 +2147,7 @@ static int e100_up(struct nic *nic)
        if((err = e100_hw_init(nic)))
                goto err_clean_cbs;
        e100_set_multicast_list(nic->netdev);
-       e100_start_receiver(nic);
+       e100_start_receiver(nic, NULL);


(remove second arg to start_receiver, this diff will go away).

        mod_timer(&nic->watchdog, jiffies);
        if((err = request_irq(nic->pdev->irq, e100_intr, IRQF_SHARED,
                nic->netdev->name, nic->netdev)))
@@ -2088,7 +2228,7 @@ static int e100_loopback_test(struct nic
                mdio_write(nic->netdev, nic->mii.phy_id, MII_BMCR,
                        BMCR_LOOPBACK);

-       e100_start_receiver(nic);
+       e100_start_receiver(nic, NULL);

(and this one).

        if(!(skb = netdev_alloc_skb(nic->netdev, ETH_DATA_LEN))) {
                err = -ENOMEM;


Thanks for doing the work David.

milton

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] fix e100 rx path on ARM (was [PATCH] e100 rx: or s and el bits)

Reply via email to