Package: librelp0 Version: 1.0.0-1 Severity: important Tags: upstream patch
Dear Maintainer, While using rsyslog with its relp plugin linked to librelp0, I encountered a bug where my central rsyslog server entered into an endless loop and it then basically stopped processing the logs. The rsyslogd process was fully using one core on the system. To help understand this behaviour, I ran rsyslog in foreground with debug flag: # rsyslogd -c5 -n -d The resulting output can be found in attached file rsyslog_debug_endless_loop_truncated.log.xz. I largely truncated the file to significantly reduce its size. Feel free to tell me if you need to have a look to original file 6MB though. You may notice in the output this repeating line : "tcpSend returns 0". After a deeper analysis I figured out that it was due to one particular "client" node whose TCP socket was in a weird state. Every send() that were performed by librelp0 on this socket returned -1 and set errno to EAGAIN, endlessly. By endlessly, I mean approximately 20 days. Here the backtrace according to GDB on the breakpoint tcp.c:488: #0 relpTcpSend (pThis=pThis@entry=0x69e1c0, pBuf=<optimized out>, pLenBuf=pLenBuf@entry=0x7ffff475ead8) at tcp.c:488 #1 0x00007ffff5f71482 in relpSendbufSend (pThis=0x69bbb0, pTcp=pTcp@entry=0x69e1c0) at sendbuf.c:131 #2 0x00007ffff5f71274 in relpSendqSend (pThis=pThis@entry=0x6992d0, pTcp=pTcp@entry=0x69e1c0) at sendq.c:249 #3 0x00007ffff5f7133a in relpSendqAddBuf (pThis=0x6992d0, pBuf=<optimized out>, pTcp=0x69e1c0) at sendq.c:170 #4 0x00007ffff5f6f2a5 in relpSessSendResponse (pThis=pThis@entry=0x69d210, txnr=<optimized out>, pData=<optimized out>, lenData=<optimized out>) at relpsess.c:248 #5 0x00007ffff5f725c9 in relpSCInit (pFrame=pFrame@entry=0x699510, pSess=pSess@entry=0x69d210) at copen.c:141 #6 0x00007ffff5f6e863 in relpEngineDispatchFrame (pThis=<optimized out>, pSess=pSess@entry=0x69d210, pFrame=pFrame@entry=0x699510) at relp.c:474 #7 0x00007ffff5f7005f in relpFrameProcessOctetRcvd (ppThis=ppThis@entry=0x69d230, c=10 '\n', pSess=pSess@entry=0x69d210) at relpframe.c:207 #8 0x00007ffff5f6ec07 in relpSessRcvData (pThis=0x69d210) at relpsess.c:224 #9 0x00007ffff5f6e709 in relpEngineRun (pThis=0x6745a0) at relp.c:405 #10 0x000000000043ae06 in ?? () #11 0x00007ffff79b0b50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #12 0x00007ffff72eee6d in clone () from /lib/x86_64-linux-gnu/libc.so.6 #13 0x0000000000000000 in ?? () Function relpSendqSend() has a while loop that loops until there is no data to send anymore. But with this weird socket state, the loop becomes endless. When used by rsyslog, this endless loop results in a DOS on the central rsyslog server. The attached patch abort-send-loop-after-max-tries.diff is a proposal to fix this bug. It counts the number of tries, and if it reaches 100 tries, relpSendqSend() returns RELP_RET_PARTIAL_WRITE. Note that this return value is already "managed" by relpSendqAddBuf() in some way. It would probably be better to tell rsyslog about this error, other than a simple debug message, so that it could then appear in rsyslog own logs and warn the sysadmins of this weird socket/node. But I'm not familiar with librepl0 and rsyslog code base so I haven't taken time to get deeper into this. -- System Information: Debian Release: 7.6 APT prefers stable APT policy: (500, 'stable') Architecture: amd64 (x86_64) Kernel: Linux 3.2.0-4-amd64 (SMP w/4 CPU cores) Locale: LANG=fr_FR.UTF-8, LC_CTYPE=fr_FR.UTF-8 (charmap=UTF-8) Shell: /bin/sh linked to /bin/dash Versions of packages librelp0 depends on: ii libc6 2.13-38+deb7u4 librelp0 recommends no packages. librelp0 suggests no packages. -- no debconf information
--- librelp-1.0.0.orig/src/sendq.c 2009-12-01 12:55:08.000000000 +0100 +++ librelp-1.0.0/src/sendq.c 2014-11-27 13:08:11.870560533 +0100 @@ -237,6 +237,8 @@ { relpSendqe_t *pEntry; relpRetVal localRet; + const int max_tries = 100; + int nb_tries = 0; ENTER_RELPFUNC; RELPOBJ_assert(pThis, Sendq); @@ -253,7 +256,13 @@ */ CHKRet(relpSendqDelFirstBuf(pThis)); pEntry = pThis->pRoot; /* as we deleted from the root, the is our next entry */ - } else if(localRet != RELP_RET_PARTIAL_WRITE) { + } else if(localRet == RELP_RET_PARTIAL_WRITE) { + if(nb_tries >= max_tries) { + pThis->pEngine->dbgprint("relpSendqSend() gives up after %d partial writes\n", nb_tries); + ABORT_FINALIZE(localRet); + } else + nb_tries++; + } else { ABORT_FINALIZE(localRet); } }
rsyslog_debug_endless_loop_truncated.log.xz
Description: Binary data