On 12.12.2018 13:27, Anssi Hannula wrote: > On 12.12.2018 12:58, claudiu.bez...@microchip.com wrote: >> >> On 11.12.2018 15:21, Anssi Hannula wrote: >>> On 10.12.2018 12:34, claudiu.bez...@microchip.com wrote: >>>> On 07.12.2018 14:00, Anssi Hannula wrote: >>>>> On 6.12.2018 16:14, claudiu.bez...@microchip.com wrote: >>>>>> Hi Anssi, >>>>> Hi! >>>>> >>>>>> On 05.12.2018 16:00, Anssi Hannula wrote: >>>>>>> On 5.12.2018 14:37, claudiu.bez...@microchip.com wrote: >>>>>>>> On 30.11.2018 20:21, Anssi Hannula wrote: >>>>>>>>> When reading buffer descriptors on RX or on TX completion, an >>>>>>>>> RX_USED/TX_USED bit is checked first to ensure that the descriptor has >>>>>>>>> been populated. However, there are no memory barriers to ensure that >>>>>>>>> the >>>>>>>>> data protected by the RX_USED/TX_USED bit is up-to-date with respect >>>>>>>>> to >>>>>>>>> that bit. >>>>>>>>> >>>>>>>>> Fix that by adding DMA read memory barriers on those paths. >>>>>>>>> >>>>>>>>> I did not observe any actual issues caused by these being missing, >>>>>>>>> though. >>>>>>>>> >>>>>>>>> Tested on a ZynqMP based system. >>>>>>>>> >>>>>>>>> Signed-off-by: Anssi Hannula <anssi.hann...@bitwise.fi> >>>>>>>>> Fixes: 89e5785fc8a6 ("[PATCH] Atmel MACB ethernet driver") >>>>>>>>> Cc: Nicolas Ferre <nicolas.fe...@microchip.com> >>>>>>>>> --- >>>>>>>>> drivers/net/ethernet/cadence/macb_main.c | 20 ++++++++++++++++---- >>>>>>>>> 1 file changed, 16 insertions(+), 4 deletions(-) >>>>>>>>> >>>>>>>>> diff --git a/drivers/net/ethernet/cadence/macb_main.c >>>>>>>>> b/drivers/net/ethernet/cadence/macb_main.c >>>>>>>>> index 430b7a0f5436..c93baa8621d5 100644 >>>>>>>>> --- a/drivers/net/ethernet/cadence/macb_main.c >>>>>>>>> +++ b/drivers/net/ethernet/cadence/macb_main.c >>>>>>>>> @@ -861,6 +861,11 @@ static void macb_tx_interrupt(struct macb_queue >>>>>>>>> *queue) >>>>>>>>> >>>>>>>>> /* First, update TX stats if needed */ >>>>>>>>> if (skb) { >>>>>>>>> + /* Ensure all of desc is at least as >>>>>>>>> up-to-date >>>>>>>>> + * as ctrl (TX_USED bit) >>>>>>>>> + */ >>>>>>>>> + dma_rmb(); >>>>>>>>> + >>>>>>>> Is this necessary? Wouldn't previous rmb() take care of this? At this >>>>>>>> time >>>>>>>> data specific to this descriptor was completed. The TX descriptors for >>>>>>>> next >>>>>>>> data to be send is updated under a locked spinlock. >>>>>>> The previous rmb() is before the TX_USED check, so my understanding is >>>>>>> that the following could happen in theory: >>>>>> We are using this IP on and ARM architecture, so, with regards to rmb(), >>>>>> I >>>>>> understand from [1] that dsb completes when: >>>>>> "All explicit memory accesses before this instruction complete. >>>>>> All Cache, Branch predictor and TLB maintenance operations before this >>>>>> instruction complete." >>>>>> >>>>>>> 1. rmb(). >>>>>> According to [1] this should end after all previous instructions (loads, >>>>>> stores) ends. >>>>>> >>>>>>> 2. Reads are reordered so that TX timestamp is read first - no barriers >>>>>>> are crossed. >>>>>> But, as per [1], no onward instruction will be reached until all >>>>>> instruction prior to dsb ends, so, after rmb() all descriptor's members >>>>>> should be updated, right? >>>>> The descriptor that triggered the TX interrupt should be visible now, yes. >>>>> However, the controller may be writing to any other descriptors at the >>>>> same time as the loop is processing through them, as there are multiple >>>>> TX buffers. >>>> Yes, but there is the "rmb()" after "desc = macb_tx_desc(queue, tail);" at >>>> the beginning of loop intended for that... In the 2nd loop you are >>>> operating with the same descriptor which was read in the first loop. >>> I'm concerned about the 2nd iteration of the first for loop in >>> macb_tx_interrupt(). >>> That operates on a different descriptor due to tail++, and that >>> different descriptor may have in-flight writes from controller as the >>> interrupt may or may not already be raised for it. >> I agree with that, there may be in-flights updates, but since ownership bit >> should be updated at the end of the TX (I expect at this moment to be >> updated all the descriptor's parts, including timestamps), if you read a >> descriptor and it hasn't been treated by the hardware the descriptor is own >> by the hardware and there is this check after ctrl is read: >> >> if (!(ctrl & MACB_BIT(TX_USED))) >> break; >> >> Next time a new tx interrupt will be raised this descriptor would be also >> treated. >> >> In case the above instruction doesn't break then it means the descriptor >> was transmitted. And since ownership should be updated at the end of the >> TX, this means no in-flights updates should be done on this descriptor, so, >> when you will read the timestamps these should be the good ones (even it is >> the first one, the 3rd, and so on). > > Agreed with above, the only problem is if the ts1/ts2 load is reordered > before ctrl load. > >> Moreover, this new barrier barrier is in the 2nd loop, you're placing a >> memory barrier after every step in the 2nd loop for the same descriptor >> read in the 1st loop (it was already loaded in a processor register). > > The barrier is inside "if (skb)", and skb is only set for the last frame
Agree, this escaped me. > as indicated by the code at the end of the loop: > > /* skb is set only for the last buffer of the frame. > * WARNING: at this point skb has been freed by > * macb_tx_unmap(). > */ > if (skb) > break; > > >> To avoid reordering of ctrl and ts1/ts2 loads wouldn't be enough to have >> memory barrier after: >> ctrl = desc->ctrl; > > Yes, it would. > My v2 moved it into the other direction, into the ts code, though, which > has the added benefit of no barrier when timestamps are disabled (which > IIRC is the default). > But I can move it to just after ctrl if you so prefer. It's ok in gem_ptp_txstamp(). > >>> Or am I missing something? >>> >>>>>>> 3. HW writes timestamp and sets TX_USED (or they become visible). >>>>>> I expect hardware to set TX_USED and timestamp before raising TX complete >>>>>> interrupt. If so, there should be no on-flight updates of this >>>>>> descriptor, >>>>>> right? Hardware raised a TX_USED bit read interrupt when it reads a >>>>>> descriptor like this and hangs TX. >>>>> For the first iteration of the loop, that is correct - there should be >>>>> no in-flight writes from controller as it already raised the interrupt. >>>>> However, the following iterations of the loop process descriptors that >>>>> may or may not have the interrupt raised yet, and therefore may still >>>>> have in-flight writes. >>>> I expect the hardware to give up ownership of the descriptor just at the >>>> end of TX operation, so that the word containing TX_USED to be the last one >>>> updated. If you reads the descriptor and TX_USED is not there the first >>>> loop breaks, queue->tx_tail is updated with the last processed descriptor >>>> in the queue (see "queue->tx_tail = tail;") then the next interrupt will >>>> start processing with this last one descriptor. >>> Agreed, that is how it works. Issues only occur if we somehow operate on >>> data before ownership was transferred. >> Could this happens? Shouldn't this: >> if (!(ctrl & MACB_BIT(TX_USED))) >> break; >> >> assure that this could not happen and the fact that the ownership is passed >> to CPU at the end of descriptor update? > > I don't see how that prevents desc->ctrl and desc->ts_1/ts_2 reordering, > which is the only issue that I can see. > By "data" above I meant ts_1/ts_2, AFAICS there is nothing else we read > from the descriptors aside from ctrl itself. > >> If load reordering b/w desc->ctrl and desc->ts_1/ts_2 wouldn't be enough to >> have a barrier after >> ctrl = desc->ctrl instruction? > > Yes, per above. > >>>>>>> 4. Code checks TX_USED. >>>>>>> 5. Code operates on timestamp that is actually garbage. >>>>>>> >>>>>>> I'm not 100% sure that there isn't some lighter/cleaner way to do this >>>>>>> than dma_rmb(), though. >>>>>> If you still think this scenario could happen why not calling a dsb in >>>>>> gem_ptp_do_timestamp(). I feel like that is a proper place to call it. >>>>> OK, I will move it there. Unless we arrive at a conclusion that it is >>>>> unnecessary altogether, of course :) >>>>> >>>>>> Moreover, there is bit 32 of desc->ctrl which tells you if a valid >>>>>> timestamp was placed in the descriptor. But, again, I expect the >>>>>> timestamp >>>>>> and TX_USED to be set by hardware before raising TX complete interrupt. >>>>> Yes, but since my concern is that without barriers in between, >>>>> desc->ctrl might be read after ts_1/ts_2, so that bit might be seen as >>>>> set even though ts_1 is not yet an actual timestamp. And per above, all >>>>> this may occur before the TX complete interrupt is raised for the >>>>> descriptor in question. >>>> If so, why not placing the barrier when reading timestamps? From my point >>>> of view the place of "dma_rmb()" in this patch doesn't guarantees that >>>> ts1/ts2 were read after the execution of barrier (correct me if I'm wrong). >>> If the timestamp ts1/ts2 reads are done after dma_rmb(), and dev->ctrl >>> is read before the barrier, my understanding is that they are guaranteed >>> to be ordered so that dev->ctrl is read before ts1/ts2. >>> >>> Per [1], "[DMB] ensures that all explicit memory accesses that appear in >>> program order before the |DMB| instruction are observed before any >>> explicit memory accesses that appear in program order after the |DMB| >>> instruction". >>> And the compiler barrier ("memory" asm clobber) included within >>> dma_rmb() ensures that in program order, ctrl is loaded before the >>> barrier, and ts1/ts2 after it. >>> >>> But as mentioned, it makes sense to have it closer to the ts1/ts2 reads >>> so I'll make that change. >>> >>> [1] >>> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHGHHIE.html >>> >>> >>>>> I agree that this TX case seems somewhat unlikely to be triggered in >>>>> real life at least at present, though, as the ts_1/ts_2 read is so far >>>>> after the desc->ctrl read, and behind function calls, so unlikely to be >>>>> reordered before desc->ctrl read. >>>>> But the similar RX cases below seem more problematic as the racy loads >>>>> in question are right after each other in code. >>>>> > [...] >