Hi.

Long time no see!

A few months ago we deployed an opensearch cluster feeded by rsyslog and
let it running without giving it any attention, to be honest.
Today we realized it wasn't working and while reviewing things I noticed a
weird thing with rsyslog.
*Rsyslog is creating MANY (so many a simple "ls" gets stuck)
queue.xxxnumber files on disk.* Probably there's something wrong with our
configuration, but definitively is not working as I expected.

find /nfs/rsyslog/ -type f | wc -l results in *649420* queue.xxx files
du -sh /nfs/rsyslog/ results in *10T* (gladly, its an old synology)


*rsyslog -v (I should probably upgrade, yeah...)*
rsyslogd  8.2302.0 (aka 2023.02) compiled with:
PLATFORM: x86_64-pc-linux-gnu
PLATFORM (lsb_release -d):
FEATURE_REGEXP: Yes
GSSAPI Kerberos 5 support: No
FEATURE_DEBUG (debug build, slow code): No
32bit Atomic operations supported: Yes
64bit Atomic operations supported: Yes
memory allocator: system default
Runtime Instrumentation (slow code): No
uuid support: Yes
systemd support: Yes
Config file: /etc/rsyslog.conf
PID file: /var/run/rsyslogd.pid
Number of Bits in RainerScript integers: 64


*Configuration (I think is better to paste it complete, sorry if I'm
wrong):*
# https://www.rsyslog.com/doc/master/configuration/modules/omstdout.html
#module(load="omstdout")
# http://www.rsyslog.com/doc/v8-stable/configuration/modules/impstats.html
# http://www.rsyslog.com/rsyslog-statistic-counter/
# http://www.rsyslog.com/impstats-analyzer/
module(load="impstats" log.file="/logs/rsyslog-stats.log" interval="60"
log.syslog="off")

# http://www.rsyslog.com/doc/master/rainerscript/global.html
global(
      workDirectory="/data"
      # http://www.rsyslog.com/doc/v8-stable/troubleshooting/debug.html
      environment=["RSYSLOG_DEBUGLOG=/logs/rsyslog-debug.log",
"RSYSLOG_DEBUG=DebugOnDemand NoStdOut"]
      MaxMessageSize="128k"
      internalmsg.ratelimit.interval="1"
      internalmsg.ratelimit.burst="100000"

      #
http://www.rsyslog.com/doc/rsconf1_escapecontrolcharactersonreceive.html
      parser.escapeControlCharactersOnReceive="off"

      # This controls whether slashes in the "programname" property are
permitted or not. This property bases on a BSD concept, and by BSD syslogd
sources, slashes are NOT permitted inside the program name. However, some
Linux tools (including most importantly the       journal) store slashes as
part of the program name inside the syslogtag. In those cases, the
programname is truncated at the first slash. If this setting is changed to
"on", slashes are permitted and will not terminate programname parsing.
      #parser.permitSlashInHostname="on"

      # This controls whether slashes in the static part of the tag are
permitted or not. If this setting is off, a value of "app/foo[1234]" in the
tag will result in a programname of "app". If an application stores an
absolute path name like "/app/foo[1234]", the programname property
will become empty (""). If you need to actually store slashes as part of
the programname, this setting should be changed to "on" to permit this.
Then, a syslogtag of "/app/foo[1234]" will result in programname being
"/app/foo".
      parser.permitSlashInProgramName="on"
)

main_queue(
      queue.filename="main"
      queue.maxdiskspace="4G"
      queue.saveonshutdown="on"
      queue.lowwatermark="10000"
      queue.highwatermark="60000"
      queue.size="100000"
      queue.type="LinkedList"
      queue.fulldelaymark="90000"
      queue.lightdelaymark="75000"
      #TODO queue.dequeuebatchsize="1
      #IIUC: writes in disk when memory reaches highwatermark
      #If disk becomes full, we have 2 options:
      # - Dont accept any more messages, using fulldelaymark
      # - Reject messages
      #WARN: if bigger than fulldelay, will never discard
      #queue.discardmark="9750"
      #queue.discardseverity="8"
)

#TODO using UDP until RELP works flawlessly
#http://www.rsyslog.com/doc/v8-stable/configuration/modules/imudp.html
module(load="imudp")
input(
  port="514"
  type="imudp"
  name="imudp"
  ruleset="ruleset"
)
module(load="imtcp")
input(
        port="601"
        type="imtcp"
        name="imtcp"
        ruleset="ruleset"
)
# http://www.rsyslog.com/doc/master/configuration/modules/imrelp.html
module(load="imrelp")
input(
  port="20514"
  type="imrelp"
  name="imrelp"
  ruleset="ruleset"
)

# http://www.rsyslog.com/doc/master/configuration/templates.html
# Debug: template="RSYSLOG_DebugFormat"
# http://www.rsyslog.com/doc/master/configuration/properties.html
ruleset(
  name="ruleset"
  queue.filename="queue"
  # Memory queue will grow until _highwatermark_
  # when reached, it will start writing in disk [DA]
  # it will stop if it goes down to _lowwatermark_
  # Memory queue will start discarding if reaches _size_
  #
  # Disk queue will grow up to _size_ (or _maxdiskspace_?)
  # Tested so far:
  # It grows a bit over _size_ but doesn't reach _maxdiskspace_
  # however it stopped growing, so i don't know if that second statement is
true
  # To me, its not clear if disk queue discards or overwrites and some
messages are lost
  queue.maxdiskspace="1G"
  queue.saveonshutdown="on"
  queue.lowwatermark="10000"
  queue.highwatermark="60000"
  queue.size="100000"
  queue.type="LinkedList"
  queue.fulldelaymark="90000"
  queue.lightdelaymark="75000"
) {
action(type="omfwd" Target="mytarget" Port="10514" Protocol="tcp")
}


*Statistics (please, notice ruleser[DA size])*
Fri Apr 19 11:15:02 2024: imudp(w0): origin=imudp called.recvmmsg=0
called.recvmsg=0 msgs.received=0
Fri Apr 19 11:16:02 2024: global: origin=dynstats
Fri Apr 19 11:16:02 2024: global: origin=percentile
Fri Apr 19 11:16:02 2024: TCP-mytarget-10514: origin=omfwd bytes.sent=0
Fri Apr 19 11:16:02 2024: action-0-builtin:omfwd: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Fri Apr 19 11:16:02 2024: imudp(*/514/IPv4): origin=imudp submitted=0
disallowed=0
Fri Apr 19 11:16:02 2024: imtcp(601): origin=imtcp submitted=0
Fri Apr 19 11:16:02 2024: imrelp(20514): origin=imrelp submitted=0
Fri Apr 19 11:16:02 2024: resource-usage: origin=impstats utime=27080
stime=262330 maxrss=7876 minflt=576 majflt=142 inblock=24824 oublock=15496
nvcsw=1798 nivcsw=266 openfiles=15
*Fri Apr 19 11:16:02 2024: ruleset[DA]: origin=core.queue size=-1917883149
enqueued=0 full=0 discarded.full=0 discarded.nf <http://discarded.nf>=0
maxqsize=0 *
Fri Apr 19 11:16:02 2024: ruleset: origin=core.queue size=0 enqueued=0
full=0 discarded.full=0 discarded.nf=0 maxqsize=0
Fri Apr 19 11:16:02 2024: main Q[DA]: origin=core.queue size=0 enqueued=0
full=0 discarded.full=0 discarded.nf=0 maxqsize=0
Fri Apr 19 11:16:02 2024: main Q: origin=core.queue size=0 enqueued=0
full=0 discarded.full=0 discarded.nf=0 maxqsize=0
Fri Apr 19 11:16:02 2024: imudp(w0): origin=imudp called.recvmmsg=0
called.recvmsg=0 msgs.received=0

Going back a few days, we can see there's a moment queue size crash:

*Mon Apr  1 16:52:46 2024: ruleset[DA]: origin=core.queue size=0 enqueued=0
full=0 discarded.full=0 discarded.nf <http://discarded.nf>=0 maxqsize=0 *
Mon Apr  1 16:52:46 2024: ruleset: origin=core.queue size=0 enqueued=0
full=0 discarded.full=0 discarded.nf=0 maxqsize=0
Mon Apr  1 16:52:46 2024: main Q[DA]: origin=core.queue size=0 enqueued=0
full=0 discarded.full=0 discarded.nf=0 maxqsize=0
Mon Apr  1 16:52:46 2024: main Q: origin=core.queue size=0 enqueued=0
full=0 discarded.full=0 discarded.nf=0 maxqsize=0
Mon Apr  1 16:52:46 2024: imudp(w0): origin=imudp called.recvmmsg=0
called.recvmsg=0 msgs.received=0
Mon Apr  1 16:55:53 2024: global: origin=dynstats
Mon Apr  1 16:55:53 2024: global: origin=percentile
Mon Apr  1 16:55:53 2024: TCP-mytarget-10514: origin=omfwd bytes.sent=0
Mon Apr  1 16:55:53 2024: action-0-builtin:omfwd: origin=core.action
processed=0 failed=0 suspended=0 suspended.duration=0 resumed=0
Mon Apr  1 16:55:53 2024: imudp(*/514/IPv4): origin=imudp submitted=0
disallowed=0
Mon Apr  1 16:55:53 2024: imtcp(601): origin=imtcp submitted=0
Mon Apr  1 16:55:53 2024: imrelp(20514): origin=imrelp submitted=0
Mon Apr  1 16:55:53 2024: resource-usage: origin=impstats utime=3259
stime=19555 maxrss=7936 minflt=558 majflt=102 inblock=19024 oublock=8
nvcsw=148 nivcsw=35 openfiles=14
*Mon Apr  1 16:55:53 2024: ruleset[DA]: origin=core.queue size=-2025213502
enqueued=0 full=0 discarded.full=0 discarded.nf <http://discarded.nf>=0
maxqsize=0*


Seems rsyslog queue counter overflowed or whatever, but its also failing to
measure disk-size usage (queue.maxdiskspace="1G" vs 10T).
Apart from upgrading rsyslog version, do you see anything that could be
improved? Perhaps stats each 60s are too much, maybe we didn't setup
discard following best practices...

Thanks in advance
_______________________________________________
rsyslog mailing list
https://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.

Reply via email to