It's not clear whether this config and pstats output are from the sending or the receiving
system.
Tue Jul 16 09:23:13 2024: action-2-builtin:omfwd queue: origin=core.queue
size=1000 enqueued=248342 full=702 discarded.full=2 discarded.nf=0 maxqsize=1000
This indicates that the queue delivering messages out from this system filled up 702
times. That would cause the processing that receives messages on this system to block
(or at least spill over to the disk files for this queue, which are considerably
slower). A larger queue size may help ride out short surges in traffic; see the sketch
below.
One thing that can speed things up is to set a batch size on the order of 100-128 on
the sending side.
Also, naming the actions makes the pstats output much easier to read.
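For example, something along these lines on the forwarding action (just a sketch based
on your posted config; the name "fwd_archive_1" and the queue.size /
queue.dequeueBatchSize values are illustrative, tune them for your traffic):

action(
    type="omfwd"
    name="fwd_archive_1"                 # named action, shows up under this name in pstats
    target="REDACTED"
    port="6514"
    protocol="tcp"
    template="format_forward"
    queue.type="linkedlist"
    queue.filename="forward_to_archive"
    queue.size="100000"                  # larger in-memory queue to ride out surges (your pstats shows maxqsize=1000 now)
    queue.dequeueBatchSize="128"         # dequeue/forward messages in batches of ~128
    queue.saveOnShutdown="on"
    action.resumeRetryCount="-1"
    StreamDriverMode="1"                 # run driver in TLS-only mode
    StreamDriverAuthMode="x509/name"
    StreamDriverPermittedPeers="REDACTED"
)

With the action named, impstats reports that name instead of action-2-builtin:omfwd,
which makes it much easier to see which queue is filling up.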
You can run top and hit "H" to show threads and see if there is a thread that is
using a lot of CPU.
Is there a firewall/router/switch in the path that could be dropping packets?
TCP timeouts/retries could account for the delays.
David Lang
On Tue, 16 Jul 2024, Jesper Skou Jensen via rsyslog wrote:
Date: Tue, 16 Jul 2024 08:20:51 +0000
From: Jesper Skou Jensen via rsyslog <rsyslog@lists.adiscon.com>
To: "rsyslog@lists.adiscon.com" <rsyslog@lists.adiscon.com>
Cc: Jesper Skou Jensen <j...@jysk.com>
Subject: [rsyslog] rsyslog stops accepting TCP for a minute or two
Hi guys
I hope that someone here has some good ideas, because I've run out of things to
tweak.
We have a few rsyslog servers in the company, the basic setup is a couple of
collectors/relays that forwards syslog to two archive servers.
We have run this setup for years, but recently we have started pushing considerably
more syslog (3-4 times as much) towards the servers, and it appears they can't quite
handle the extra load. The servers stop accepting TCP connections every now and then,
for a minute or two, and then work fine again without any pattern that I can see. I'm
trying to figure out why and what I can do about it.
Each relay server receives a few thousand messages per second at about 30-40 Mbit/s of
syslog and forwards that to two other rsyslog servers, for a total of 60-80 Mbit/s.
The relay servers are hovering around 5-7% CPU usage with a low load average of around
0.3, so I don't think it's a hardware limitation.
Notable parts of the config includes:
$MaxOpenFiles 81920 #there are usually only a few thousand open files, but
currently this is what it sits at. I tried raising it to a MUCH higher number,
but it didn't change anything
#we accept both UDP and TCP but it appears to be only TCP that is acting up so
I'll concentrate on that.
module(
load="imptcp"
Threads="6" #was 2 when we had 4 CPU cores, now we have 8 CPU cores
)
input(
type="imptcp"
port="514"
KeepAlive="on" #was running without probes, I have enabled them to try
to keep the open (idle) connections down
KeepAlive.Probes="3"
KeepAlive.Interval="60"
)
#we forward to two servers, here's the config for one of them. We did have zip
activated, but I have disabled that in an attempt to fix our issues, and it
didn't compress much anyway.
action(
type="omfwd"
target="REDACTED"
port="6514"
protocol="tcp"
template="format_forward"
queue.type="linkedlist"
queue.filename="forward_to_archive"
action.resumeRetryCount="-1"
queue.saveOnShutdown="on"
#compress syslog messages during transfer to save bandwidth
# ziplevel="1"
# compression.mode="single"
#encrypt syslog traffic
StreamDriverMode="1" # run driver in TLS-only mode
StreamDriverAuthMode="x509/name"
StreamDriverPermittedPeers="REDACTED" #only permit the certificate from
the verified syslog server
)
#I have added the impstats module but don't quiiiite know how to read the
output, maybe it tells you guys something?
module(load="impstats"
ResetCounters="on"
interval="60"
severity="7"
log.syslog="off"
log.file="/var/log/rsyslog-stats.log"
)
Tue Jul 16 09:23:13 2024: global: origin=dynstats
Tue Jul 16 09:23:13 2024: imuxsock: origin=imuxsock submitted=29
ratelimit.discarded=0 ratelimit.numratelimiters=0
Tue Jul 16 09:23:13 2024: dynafile cache dynaFileDebug: origin=omfile
requests=0 level0=0 missed=0 evicted=0 maxused=0 closetimeouts=0
Tue Jul 16 09:23:13 2024: action-0-builtin:omfile: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-1-builtin:omfwd: origin=core.action
processed=248342 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-2-builtin:omfwd: origin=core.action
processed=248342 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-3-builtin:omfile: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-4-builtin:omfile: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-5-builtin:omfile: origin=core.action
processed=47 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-6-builtin:omfile: origin=core.action
processed=74 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-7-builtin:omfile: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-8-builtin:omfile: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-9-builtin:omfile: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: action-10-builtin:omusrmsg: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Tue Jul 16 09:23:13 2024: imudp(*:514): origin=imudp submitted=20084
disallowed=0
Tue Jul 16 09:23:13 2024: imudp(*:514): origin=imudp submitted=0 disallowed=0
Tue Jul 16 09:23:13 2024: imptcp(*/514/IPv4): origin=imptcp submitted=180195
sessions.opened=138 sessions.openfailed=130 sessions.closed=45
bytes.received=38520208 bytes.decompressed=0
Tue Jul 16 09:23:13 2024: imptcp(*/514/IPv6): origin=imptcp submitted=0
sessions.opened=0 sessions.openfailed=0 sessions.closed=0 bytes.received=0
bytes.decompressed=0
Tue Jul 16 09:23:13 2024: imtcp(6514): origin=imtcp submitted=0
Tue Jul 16 09:23:13 2024: imjournal: origin=imjournal submitted=58 read=57
discarded=0 failed=0 poll_failed=0 rotations=0 recovery_attempts=0
ratelimit_discarded_in_interval=0 disk_usage_bytes=1543561216
Tue Jul 16 09:23:13 2024: resource-usage: origin=impstats utime=37793520044
stime=49266959335 maxrss=308628 minflt=90210820 majflt=4472 inblock=4641624
oublock=497968144 nvcsw=4271503687 nivcsw=272265814 openfiles=3150
Tue Jul 16 09:23:13 2024: action-1-builtin:omfwd queue[DA]: origin=core.queue
size=0 enqueued=0 full=0 discarded.full=0 discarded.nf=0 maxqsize=287009
Tue Jul 16 09:23:13 2024: action-1-builtin:omfwd queue: origin=core.queue
size=0 enqueued=248342 full=0 discarded.full=0 discarded.nf=0 maxqsize=1000
Tue Jul 16 09:23:13 2024: action-2-builtin:omfwd queue[DA]: origin=core.queue
size=158847 enqueued=143360 full=0 discarded.full=0 discarded.nf=0
maxqsize=1620216
Tue Jul 16 09:23:13 2024: action-2-builtin:omfwd queue: origin=core.queue
size=1000 enqueued=248342 full=702 discarded.full=2 discarded.nf=0 maxqsize=1000
Tue Jul 16 09:23:13 2024: main Q: origin=core.queue size=54384 enqueued=202693
full=0 discarded.full=2 discarded.nf=0 maxqsize=100000
Tue Jul 16 09:23:13 2024: io-work-q: origin=imptcp enqueued=723 maxqsize=7
Tue Jul 16 09:23:13 2024: imudp(w3): origin=imudp called.recvmmsg=3256
called.recvmsg=0 msgs.received=3803
Tue Jul 16 09:23:13 2024: imudp(w0): origin=imudp called.recvmmsg=3855
called.recvmsg=0 msgs.received=4109
Tue Jul 16 09:23:13 2024: imudp(w2): origin=imudp called.recvmmsg=4418
called.recvmsg=0 msgs.received=4799
Tue Jul 16 09:23:13 2024: imudp(w1): origin=imudp called.recvmmsg=6247
called.recvmsg=0 msgs.received=7373
Without knowing "exactly" how to read the logs, I would think that
sessions.openfailed=130 is something bad? Anything else to keep a lookout for?
I have also tried to manually change the ulimit for the rsyslogd process to as
high/unlimited values as I could, but that didn't change anything.
Best regards
Jesper Skou Jensen
_______________________________________________
rsyslog mailing list
https://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE
THAT.