Hi Simon, Hi Sylvain,
 
we've found it!
 
We're not using nonblocking connections as this never did work properly … in my 
today's debug session I've found why. In api_lib.c, line 610 this code just 
prevents using NETCONN_WRITE for nonblocking connections:
 
  dontblock = netconn_is_nonblocking(conn) || (apiflags & NETCONN_DONTBLOCK);
  if (dontblock && !bytes_written) {
    /* This implies netconn_write() cannot be used for non-blocking send, since
       it has no way to return the number of bytes written. */
    return ERR_VAL;
  }
 
 
To avoid a too long blocking we're using send timeouts. And here was the issue.
 
As Neil already correctly has found, a part of the data was not written into 
the TCP send buffer. This happened in do_writemore in api_msg.c in the send 
timeout handling:
 
    if (conn->write_offset == 0) {
      /* nothing has been written */
      err = ERR_WOULDBLOCK;
      conn->current_msg->msg.w.len = 0;
    } else {
      /* partial write */
      err = ERR_OK;
      conn->current_msg->msg.w.len = conn->write_offset;
      conn->write_offset = 0;
    }
 
If NETCONN_WRITE wants to write data and the remaining space in the TCP send 
buffer is less than the data to be written, do_writemore will write as many 
data as it can. As the outgoing flow is quite slow, it'll then run into the 
send timeout to avoid a too long blocking. We'll get the "partial write" 
condition above and it'll return an ERR_OK which is passed as the result of the 
NETCONN_WRITE, so my application routing assumes the data is all written. In 
fact, it's only partly written and the data which did not match into the 
TCP_SNDBUF is lost here (which is the loss we have faced).
 
Giving back an ERR_OK where data is definitely lost is clearly a bug, so as a 
solution we've changed the SNDTIMEO handling in do_writemore in this way:
 
#if LWIP_SO_SNDTIMEO
                if (conn->send_timeout != 0)
                {
    dataptr = (u8_t*)conn->current_msg->msg.w.dataptr + conn->write_offset;
    diff = conn->current_msg->msg.w.len - conn->write_offset;
    if (diff > 0xffffUL)  /* max_u16_t */
                               {
      len = 0xffff;
                               }
                               else
                               {
      len = (u16_t)diff;
                               }
    available = tcp_sndbuf(conn->pcb.tcp);
    if (available < len)
                               {
      err = ERR_WOULDBLOCK;
      goto err_mem;
                               }
                }
  if ((conn->send_timeout != 0) &&
      ((s32_t)(sys_now() - conn->current_msg->msg.w.time_started) >= 
conn->send_timeout)) {
    write_finished = 1;
    if (conn->write_offset == 0) {
      /* nothing has been written */
      err = ERR_WOULDBLOCK;
      conn->current_msg->msg.w.len = 0;
    } else {
      /* partial write */
      err = ERR_OK;
      conn->current_msg->msg.w.len = conn->write_offset;
      conn->write_offset = 0;
    }
  } else
#endif /* LWIP_SO_SNDTIMEO */
 
If the timeout is enabled and a sent_timeout is set for the connection, we 
check if the data to be sent will completely fit into the TCP_SNDBUF. If this 
is not the case, as we have no way to return the amount of unwritten bytes to 
the NETCONN_WRITE output, we'll not write anything but return an ERR_WOULDBLOCK 
back. That's a proper return for the my upper  layers to know that the data was 
not sent. It'll kept in my output queue and my IP-Server task will periodically 
retry to send the data out.
 
With that modification we don't see any data loss anymore.
 
This condition should however be checked from your side as it allows a 
potential data loss without the application getting aware of it.
 
Thank you for your time and ideas in that issue!
 
Marco
_______________________________________________
lwip-users mailing list
[email protected]
https://lists.nongnu.org/mailman/listinfo/lwip-users

Reply via email to