There is a bug in the DMA FIFO read logic that is likely the root cause of this. Changing the line below in axi_dma_fifo.v fixed it for me.
    OUTPUT2: begin // Replicated write logic to break a read timing critical path for read_count
      read_count <= (output_page_boundry < occupied_minus_one) ? output_page_boundry[7:0] : occupied_minus_one[7:0];
    - read_count_plus_one <= (output_page_boundry < occupied_minus_one) ? ({1'b0,output_page_boundry[7:0]} + 9'd1) : {1'b0, occupied[7:0]};
    + read_count_plus_one <= (output_page_boundry < occupied_minus_one) ? ({1'b0,output_page_boundry[7:0]} + 9'd1) : ({1'b0, occupied_minus_one[7:0]} + 9'd1);

-Juan

On Wed, Aug 29, 2018 at 9:30 AM Alan Conrad via USRP-users <usrp-users@lists.ettus.com> wrote:

> Thanks Brian, that certainly sounds like the problem I'm experiencing.
> I'll try rebuilding my FPGA and UHD as you suggest. If that doesn't work,
> or I get more information, I'll let you know.
>
> Thanks again,
>
> Al
>
> *From:* Brian Padalino <bpadal...@gmail.com>
> *Sent:* Tuesday, August 28, 2018 8:57 PM
> *To:* Alan Conrad <acon...@gogoair.com>
> *Cc:* USRP-users@lists.ettus.com
> *Subject:* Re: [USRP-users] Transmit Thread Stuck Receiving Tx Flow Control Packets
>
> On Tue, Aug 28, 2018 at 4:02 PM Alan Conrad via USRP-users <usrp-users@lists.ettus.com> wrote:
>
> Hi All,
>
> I've been working on an application that requires two receive streams and
> two transmit streams, written using the C++ API. I have run into a problem
> when transmitting packets, and I am hoping that someone has seen something
> similar and/or may be able to shed some light on this.
>
> My application streams two receive and two transmit channels, each at
> 100 Msps, over dual 10 GigE interfaces (the NIC is an Intel X520-DA2). I
> have two receive threads, each calling recv() on separate receive streams,
> and two transmit threads, each calling send(), also on separate transmit
> streams. Each receive thread copies samples into a large circular buffer.
> Each transmit thread reads samples from the buffer to be sent in the
> send() call.
> So, each receive thread is paired with a transmit thread through a shared
> circular buffer, with some mutex locking to prevent simultaneous access to
> the shared circular buffer memory.
>
> I did read in the UHD manual that recv() is not thread safe. I assumed
> that this meant that recv() is not thread safe when called on the same
> rx_streamer from two different threads, but would be OK when called on
> different rx_streamers. If this is not the case, please let me know.
>
> On to my problem…
>
> After running for several minutes, one of the transmit threads will get
> stuck in the send() call. Using strace to monitor the system calls, it
> appears that the thread is in a loop, continuously calling the poll() and
> recvfrom() system calls from within the UHD API. Here's the output of
> strace attached to one of the transmit threads after this has occurred.
> These are the only two system calls that get logged for the transmit
> thread once this problem occurs.
>
> 11:19:04.564078 poll([{fd=62, events=POLLIN}], 1, 100) = 0 (Timeout)
> 11:19:04.664276 recvfrom(62, 0x5619724e90c0, 1472, MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
> 11:19:04.664381 poll([{fd=62, events=POLLIN}], 1, 100) = 0 (Timeout)
> 11:19:04.764600 recvfrom(62, 0x5619724e90c0, 1472, MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
> 11:19:04.764699 poll([{fd=62, events=POLLIN}], 1, 100) = 0 (Timeout)
> 11:19:04.864906 recvfrom(62, 0x5619724e90c0, 1472, MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
>
> This partial stack trace shows that the transmit thread is stuck in the
> while loop in the tx_flow_ctrl() function. I think this is happening due
> to missed or missing TX flow control packets.
> #0  0x00007fdb8fe4fbf9 in __GI___poll (fds=fds@entry=0x7fdb167fb510, nfds=nfds@entry=1, timeout=timeout@entry=100)
>     at ../sysdeps/unix/sysv/linux/poll.c:29
> #1  0x00007fdb9186de45 in poll (__timeout=100, __nfds=1, __fds=0x7fdb167fb510)
>     at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
> #2  uhd::transport::wait_for_recv_ready (timeout=0.10000000000000001, sock_fd=<optimized out>)
>     at /home/aconrad/rfnoc/src/uhd/host/lib/transport/udp_common.hpp:59
> #3  udp_zero_copy_asio_mrb::get_new (index=@0x55726266f6e8: 28, timeout=<optimized out>, this=<optimized out>)
>     at /home/aconrad/rfnoc/src/uhd/host/lib/transport/udp_zero_copy.cpp:79
> #4  udp_zero_copy_asio_impl::get_recv_buff (this=0x55726266f670, timeout=<optimized out>)
>     at /home/aconrad/rfnoc/src/uhd/host/lib/transport/udp_zero_copy.cpp:226
> #5  0x00007fdb915d48cc in tx_flow_ctrl (fc_cache=..., async_xport=..., endian_conv=0x7fdb915df600 <uhd::ntohx<unsigned int>(unsigned int)>,
>     unpack=0x7fdb918b1090 <uhd::transport::vrt::chdr::if_hdr_unpack_be(unsigned int const*, uhd::transport::vrt::if_packet_info_t&)>)
>     at /home/aconrad/rfnoc/src/uhd/host/lib/usrp/device3/device3_io_impl.cpp:345
>
> The poll() and recvfrom() calls are in the udp_zero_copy_asio_mrb::get_new()
> function in udp_zero_copy.cpp.
>
> Has anyone seen this problem before, or have any suggestions on what else
> to look at to further debug this problem? I have not yet used Wireshark to
> see what's happening on the wire, but I'm planning to do that. Also note
> that, if I run a single transmit/receive pair (instead of two), I don't
> see this problem and everything works as I expect.
>
> My hardware is an X310 with the XG firmware and dual SBX-120 daughterboards.
> Here are the software versions I'm using, as displayed by the UHD API when
> the application starts.
>
> [00:00:00.000049] Creating the usrp device with: addr=192.168.30.2,second_addr=192.168.40.2...
> [INFO] [UHD] linux; GNU C++ version 7.3.0; Boost_106501; UHD_4.0.0.rfnoc-devel-788-g1f8463cc
>
> The host is a Dell PowerEdge R420 with 24 CPU cores and 24 GB of RAM. I
> think the clock speed, at 2.7 GHz, is a little lower than recommended, but
> I thought that I could distribute the workload across the various cores to
> account for that. Also, I have followed the instructions for setting up
> dual 10 GigE interfaces for the X310 here:
> https://kb.ettus.com/Using_Dual_10_Gigabit_Ethernet_on_the_USRP_X300/X310
>
> Any help is appreciated.
>
> I think you're hitting this:
>
> https://github.com/EttusResearch/uhd/issues/203
>
> Which is the same thing that I hit. I tracked it down to something
> happening in the FPGA with the DMA FIFO.
>
> I rebuilt my FPGA and UHD off the following commits, which switch over to
> byte-based flow control:
>
> UHD commit 98057752006b5c567ed331c5b14e3b8a281b83b9
> FPGA commit c7015a9a57a77c0e312f0c56e461ac479cf7f1e9
>
> And the problem disappeared, for the time being. The infinite loop still
> exists as a potential issue, but whatever was causing the lockup in the
> DMA FIFO seems to have gone away, or at least couldn't be reproduced.
>
> Give that a shot and see if it works for you, or whether you can still
> reproduce it. We never got to the root cause of the problem.
>
> Brian
> _______________________________________________
> USRP-users mailing list
> USRP-users@lists.ettus.com
> http://lists.ettus.com/mailman/listinfo/usrp-users_lists.ettus.com