Package: librelp0
Version: 1.0.0-1
Severity: important
Tags: upstream patch

Dear Maintainer,

While using rsyslog with its relp plugin linked to librelp0, I encountered a
bug where my central rsyslog server entered into an endless loop and it then
basically stopped processing the logs. The rsyslogd process was fully using one
core on the system.

To help understand this behaviour, I ran rsyslog in foreground with debug flag:

# rsyslogd -c5 -n -d

The resulting output can be found in attached file
rsyslog_debug_endless_loop_truncated.log.xz. I largely truncated the file to
significantly reduce its size. Feel free to tell me if you need to have a look
to original file 6MB though. You may notice in the output this repeating line :
"tcpSend returns 0".

After a deeper analysis I figured out that it was due to one particular "client"
node whose TCP socket was in a weird state. Every send() that were performed by
librelp0 on this socket returned -1 and set errno to EAGAIN, endlessly. By
endlessly, I mean approximately 20 days.

Here the backtrace according to GDB on the breakpoint tcp.c:488:

#0  relpTcpSend (pThis=pThis@entry=0x69e1c0, pBuf=<optimized out>, 
pLenBuf=pLenBuf@entry=0x7ffff475ead8) at tcp.c:488
#1  0x00007ffff5f71482 in relpSendbufSend (pThis=0x69bbb0, 
pTcp=pTcp@entry=0x69e1c0) at sendbuf.c:131
#2  0x00007ffff5f71274 in relpSendqSend (pThis=pThis@entry=0x6992d0, 
pTcp=pTcp@entry=0x69e1c0) at sendq.c:249
#3  0x00007ffff5f7133a in relpSendqAddBuf (pThis=0x6992d0, pBuf=<optimized 
out>, pTcp=0x69e1c0) at sendq.c:170
#4  0x00007ffff5f6f2a5 in relpSessSendResponse (pThis=pThis@entry=0x69d210, 
txnr=<optimized out>, pData=<optimized out>, lenData=<optimized out>) at 
relpsess.c:248
#5  0x00007ffff5f725c9 in relpSCInit (pFrame=pFrame@entry=0x699510, 
pSess=pSess@entry=0x69d210) at copen.c:141
#6  0x00007ffff5f6e863 in relpEngineDispatchFrame (pThis=<optimized out>, 
pSess=pSess@entry=0x69d210, pFrame=pFrame@entry=0x699510) at relp.c:474
#7  0x00007ffff5f7005f in relpFrameProcessOctetRcvd 
(ppThis=ppThis@entry=0x69d230, c=10 '\n', pSess=pSess@entry=0x69d210) at 
relpframe.c:207
#8  0x00007ffff5f6ec07 in relpSessRcvData (pThis=0x69d210) at relpsess.c:224
#9  0x00007ffff5f6e709 in relpEngineRun (pThis=0x6745a0) at relp.c:405
#10 0x000000000043ae06 in ?? ()
#11 0x00007ffff79b0b50 in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#12 0x00007ffff72eee6d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#13 0x0000000000000000 in ?? ()

Function relpSendqSend() has a while loop that loops until there is no
data to send anymore. But with this weird socket state, the loop becomes
endless. When used by rsyslog, this endless loop results in a DOS on the
central rsyslog server.

The attached patch abort-send-loop-after-max-tries.diff is a proposal to fix
this bug. It counts the number of tries, and if it reaches 100 tries, 
relpSendqSend() returns RELP_RET_PARTIAL_WRITE. Note that this return value
is already "managed" by relpSendqAddBuf() in some way.

It would probably be better to tell rsyslog about this error, other than a
simple debug message, so that it could then appear in rsyslog own logs and
warn the sysadmins of this weird socket/node. But I'm not familiar with
librepl0 and rsyslog code base so I haven't taken time to get deeper into
this.

-- System Information:
Debian Release: 7.6
  APT prefers stable
  APT policy: (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 3.2.0-4-amd64 (SMP w/4 CPU cores)
Locale: LANG=fr_FR.UTF-8, LC_CTYPE=fr_FR.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages librelp0 depends on:
ii  libc6  2.13-38+deb7u4

librelp0 recommends no packages.

librelp0 suggests no packages.

-- no debconf information

--- librelp-1.0.0.orig/src/sendq.c	2009-12-01 12:55:08.000000000 +0100
+++ librelp-1.0.0/src/sendq.c	2014-11-27 13:08:11.870560533 +0100
@@ -237,6 +237,8 @@
 {
 	relpSendqe_t *pEntry;
 	relpRetVal localRet;
+	const int max_tries = 100;
+	int nb_tries = 0;
 
 	ENTER_RELPFUNC;
 	RELPOBJ_assert(pThis, Sendq);
@@ -253,7 +256,13 @@
 			 */
 			CHKRet(relpSendqDelFirstBuf(pThis));
 			pEntry = pThis->pRoot; /* as we deleted from the root, the is our next entry */
-		} else if(localRet != RELP_RET_PARTIAL_WRITE) {
+		} else if(localRet == RELP_RET_PARTIAL_WRITE) {
+			if(nb_tries >= max_tries) {
+				pThis->pEngine->dbgprint("relpSendqSend() gives up after %d partial writes\n", nb_tries);
+				ABORT_FINALIZE(localRet);
+			} else
+				nb_tries++;
+		} else {
 			ABORT_FINALIZE(localRet);
 		}
 	}

Attachment: rsyslog_debug_endless_loop_truncated.log.xz
Description: Binary data

Reply via email to