It's not clear whether this config and pstats output are from the sending or the receiving system.

Tue Jul 16 09:23:13 2024: action-2-builtin:omfwd queue: origin=core.queue 
size=1000 enqueued=248342 full=702 discarded.full=2 discarded.nf=0 maxqsize=1000

This indicates that the queue delivering messages out from this system filled up 702 times. Each time that happens, processing of incoming messages on this system blocks (or at least spills over to the disk files for this queue, which are considerably slower). A larger queue size may help ride out short surges in traffic.
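For example, you could raise the in-memory queue for that action well above the 1000 shown in your stats, something along these lines (the size here is illustrative, tune it to your memory budget):

action(
       type="omfwd"
       target="REDACTED"
       port="6514"
       protocol="tcp"

       queue.type="linkedlist"
       queue.size="100000"        # illustrative; in-memory queue depth, up from 1000
       queue.filename="forward_to_archive"
       queue.saveOnShutdown="on"
       action.resumeRetryCount="-1"
)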

One thing that can speed things up is to set a batch size on the order of 100-128 on the sending side, as sketched below.
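In rsyslog terms that would be the queue.dequeueBatchSize parameter on the action queue; added to your existing omfwd action it would look like this (128 is just a starting point):

       queue.dequeueBatchSize="128"   # dequeue up to 128 messages per batch instead of one at a time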

Also, naming the actions makes the pstats output much easier to read.
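For example (the name "fwd_archive1" is just illustrative):

action(
       name="fwd_archive1"    # impstats then reports "fwd_archive1" instead of "action-2-builtin:omfwd"
       type="omfwd"
       target="REDACTED"
       port="6514"
       protocol="tcp"
)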

You can run top and hit "H" to show threads (or run something like "top -H -p $(pidof rsyslogd)") and see if any single thread is using a lot of CPU.

Is there a firewall/router/switch in the path that could be dropping packets? TCP timeouts/retries could account for the delays; the retransmit counters from netstat -s (or per-connection from ss -ti) on both ends may show that.

David Lang

On Tue, 16 Jul 2024, Jesper Skou Jensen via rsyslog wrote:

Date: Tue, 16 Jul 2024 08:20:51 +0000
From: Jesper Skou Jensen via rsyslog <rsyslog@lists.adiscon.com>
To: "rsyslog@lists.adiscon.com" <rsyslog@lists.adiscon.com>
Cc: Jesper Skou Jensen <j...@jysk.com>
Subject: [rsyslog] rsyslog stops accepting TCP for a minute or two

Hi guys

I hope that someone here has some good ideas, because I've run out of things to 
tweak.

We have a few rsyslog servers in the company; the basic setup is a couple of 
collectors/relays that forward syslog to two archive servers.

We have run this setup for years, but recently we have started pushing 
considerably more syslog (3-4 times as much) towards the servers, and it appears 
they can't quite handle the extra load. The servers stop accepting TCP 
connections every now and then, for a minute or two, and then work fine again, 
without any pattern that I can see. I'm trying to figure out why and what I can 
do about it.

Each relay server receives a few thousand messages per second at about 
30-40Mbit/s of syslog and forwards that to two other rsyslog servers, for a total 
of 60-80Mbit/s outbound.

The relay servers are hovering around 5-7% CPU usage with a load average around 
0.3, so I don't think it's a hardware limitation.


Notable parts of the config include:

$MaxOpenFiles 81920 #there are usually only a few thousand open files, but 
currently this is what it sits at. I tried raising it to a MUCH higher number; 
it didn't change anything

#we accept both UDP and TCP but it appears to be only TCP that is acting up so 
I'll concentrate on that.
module(
       load="imptcp"
       Threads="6" #was 2 when we had 4 CPU cores, now we have 8 CPU cores
)
input(
       type="imptcp"
       port="514"
       KeepAlive="on" #was running without probes, I have enabled them to try 
to keep the open (idle) connections down
       KeepAlive.Probes="3"
       KeepAlive.Interval="60"
)

#we forward to two servers, here's the config for one of them. We did have zip 
compression activated, but I have disabled that in an attempt to fix our issues; 
it didn't compress much anyway.
action(
       type="omfwd"
       target="REDACTED"
       port="6514"
       protocol="tcp"

       template="format_forward"

       queue.type="linkedlist"
       queue.filename="forward_to_archive"
       action.resumeRetryCount="-1"
       queue.saveOnShutdown="on"

       #compress syslog message during transfer to save bandwidth
#        ziplevel="1"
#        compression.mode="single"

       #encrypt syslog traffic
       StreamDriverMode="1" # run driver in TLS-only mode
       StreamDriverAuthMode="x509/name"
       StreamDriverPermittedPeers="REDACTED" #only permit the certificate from 
the verified syslog server
)


#I have added the impstats module but don't quite know how to read the 
output, maybe it tells you guys something?

module(load="impstats"
       ResetCounters="on"
       interval="60"
       severity="7"
       log.syslog="off"
       log.file="/var/log/rsyslog-stats.log"
)

Tue Jul 16 09:23:13 2024: global: origin=dynstats
Tue Jul 16 09:23:13 2024: imuxsock: origin=imuxsock submitted=29 
ratelimit.discarded=0 ratelimit.numratelimiters=0
Tue Jul 16 09:23:13 2024: dynafile cache dynaFileDebug: origin=omfile 
requests=0 level0=0 missed=0 evicted=0 maxused=0 closetimeouts=0
Tue Jul 16 09:23:13 2024: action-0-builtin:omfile: origin=core.action 
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-1-builtin:omfwd: origin=core.action 
processed=248342 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-2-builtin:omfwd: origin=core.action 
processed=248342 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-3-builtin:omfile: origin=core.action 
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-4-builtin:omfile: origin=core.action 
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-5-builtin:omfile: origin=core.action 
processed=47 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-6-builtin:omfile: origin=core.action 
processed=74 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-7-builtin:omfile: origin=core.action 
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-8-builtin:omfile: origin=core.action 
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-9-builtin:omfile: origin=core.action 
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-10-builtin:omusrmsg: origin=core.action 
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: imudp(*:514): origin=imudp submitted=20084 
disallowed=0
Tue Jul 16 09:23:13 2024: imudp(*:514): origin=imudp submitted=0 disallowed=0
Tue Jul 16 09:23:13 2024: imptcp(*/514/IPv4): origin=imptcp submitted=180195 
sessions.opened=138 sessions.openfailed=130 sessions.closed=45 
bytes.received=38520208 bytes.decompressed=0
Tue Jul 16 09:23:13 2024: imptcp(*/514/IPv6): origin=imptcp submitted=0 
sessions.opened=0 sessions.openfailed=0 sessions.closed=0 bytes.received=0 
bytes.decompressed=0
Tue Jul 16 09:23:13 2024: imtcp(6514): origin=imtcp submitted=0
Tue Jul 16 09:23:13 2024: imjournal: origin=imjournal submitted=58 read=57 
discarded=0 failed=0 poll_failed=0 rotations=0 recovery_attempts=0 
ratelimit_discarded_in_interval=0 disk_usage_bytes=1543561216
Tue Jul 16 09:23:13 2024: resource-usage: origin=impstats utime=37793520044 
stime=49266959335 maxrss=308628 minflt=90210820 majflt=4472 inblock=4641624 
oublock=497968144 nvcsw=4271503687 nivcsw=272265814 openfiles=3150
Tue Jul 16 09:23:13 2024: action-1-builtin:omfwd queue[DA]: origin=core.queue 
size=0 enqueued=0 full=0 discarded.full=0 discarded.nf=0 maxqsize=287009
Tue Jul 16 09:23:13 2024: action-1-builtin:omfwd queue: origin=core.queue 
size=0 enqueued=248342 full=0 discarded.full=0 discarded.nf=0 maxqsize=1000
Tue Jul 16 09:23:13 2024: action-2-builtin:omfwd queue[DA]: origin=core.queue 
size=158847 enqueued=143360 full=0 discarded.full=0 discarded.nf=0 
maxqsize=1620216
Tue Jul 16 09:23:13 2024: action-2-builtin:omfwd queue: origin=core.queue 
size=1000 enqueued=248342 full=702 discarded.full=2 discarded.nf=0 maxqsize=1000
Tue Jul 16 09:23:13 2024: main Q: origin=core.queue size=54384 enqueued=202693 
full=0 discarded.full=2 discarded.nf=0 maxqsize=100000
Tue Jul 16 09:23:13 2024: io-work-q: origin=imptcp enqueued=723 maxqsize=7
Tue Jul 16 09:23:13 2024: imudp(w3): origin=imudp called.recvmmsg=3256 
called.recvmsg=0 msgs.received=3803
Tue Jul 16 09:23:13 2024: imudp(w0): origin=imudp called.recvmmsg=3855 
called.recvmsg=0 msgs.received=4109
Tue Jul 16 09:23:13 2024: imudp(w2): origin=imudp called.recvmmsg=4418 
called.recvmsg=0 msgs.received=4799
Tue Jul 16 09:23:13 2024: imudp(w1): origin=imudp called.recvmmsg=6247 
called.recvmsg=0 msgs.received=7373


Without knowing exactly how to read the logs, I would think that 
sessions.openfailed=130 is a bad sign? Anything else to keep a lookout for?


I have also tried to manually raise the ulimits for the rsyslogd process to 
numbers as high/unlimited as I could, but that didn't change anything.



Best regards

Jesper Skou Jensen
_______________________________________________
rsyslog mailing list
https://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.
