Follow-up: Of all things, I found that ommail is configured on our primary receiver without a queue attached and this indirectly led to the problems we were encountering after the Ubuntu 16.04 to 18.04 OS upgrade. The local Postfix daemon was setup as a null-client/forwarding instance, relying on our relays to handle all cleanup and final forwarding of notifications. The Postfix version used by 18.04 introduced new requirements that our old config didn't have and we didn't initially catch this. I happened to stumble upon it after checking what feels like nearly everything else.
As I restarted Postfix to apply the updated config, rsyslog logged this: 2020-09-16T17:18:58.104710-05:00 woodchuck1 rsyslogd: action 'email-alert-test' suspended (module 'ommail'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2008.0 try https://www.rsyslog.com/e/2007 ] 2020-09-16T17:18:58.104895-05:00 woodchuck1 rsyslogd: action 'email-alert-test' suspended (module 'ommail'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2008.0 try https://www.rsyslog.com/e/2007 ] 2020-09-16T17:18:58.104968-05:00 woodchuck1 rsyslogd: action 'email-alert-test' suspended (module 'ommail'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2008.0 try https://www.rsyslog.com/e/2007 ] Immediately after the queue associated with a ruleset (that in turn called the ruleset that action was part of) immediately cleared. Unfortunately the impstats log file didn't help much, perhaps just due to ignorance of what it is trying to tell me. Wed Sep 16 16:25:35 2020: email-alert-test: origin=core.action processed=3 failed=0 suspended=0 suspended.duration=0 resumed=0 After I send this I will go back and add a queue to each ommail action in the hope that it will help surface any future issues related to that module. -----Original Message----- From: rsyslog <[email protected]> On Behalf Of Adam Chalkley via rsyslog Sent: Thursday, September 3, 2020 8:14 AM To: Rainer Gerhards <[email protected]> Cc: Adam Chalkley <[email protected]>; rsyslog-users <[email protected]> Subject: Re: [rsyslog] Upgraded receiver from Ubuntu 16.04 to 18.04, main queue filling up, imrelp-related errors Rainer, Thanks for the reply. I've not shared the config, but I can work on doing so. It's rather long and likely over complicated, but it would be great to have other eyes on it. It may take some time to sanitize the contents before sharing, but when I'm able, should I share it here as attachments, inline as plaintext or via a GitHub repo? I'm leaning towards the latter if you don't have a specific preference. Thank you for offering to take a look at the configuration! -----Original Message----- From: rsyslog <[email protected]> On Behalf Of Rainer Gerhards via rsyslog Sent: Wednesday, September 2, 2020 3:14 AM To: rsyslog-users <[email protected]> Cc: Rainer Gerhards <[email protected]> Subject: Re: [rsyslog] Upgraded receiver from Ubuntu 16.04 to 18.04, main queue filling up, imrelp-related errors I don't see the rest of this thread, can you (re?) post you config? Rainer El mié., 2 sept. 2020 a las 3:13, Adam Chalkley via rsyslog (<[email protected]>) escribió: > > Unfortunately the system is still having issues. > > I enabled debug logging earlier, copied the debug log aside and *then* > disabled debug-on-demand logging (last time I forgot that not copying the > file elsewhere first complete wipes the log file). > > I have 18 MB worth of captured details for a short timeframe. > > Out of that log file I'm seeing a lot of these entries: > > 7316.804057780:imuxsock.c : main Q: queue.c: EnqueueMsg advised worker > start > 7316.804062774:imuxsock.c : imuxsock.c: --------imuxsock calling poll() > on 2 fds > 7316.804232359:imuxsock.c : imuxsock.c: Message from UNIX socket: #3, > size 221 > 7316.804258678:imuxsock.c : datetime.c: ParseTIMESTAMP3339: invalid year: > 0, pszTS: 'e' > 7316.804264073:imuxsock.c : main Q: queue.c: EnqueueMsg advised worker > start > 7316.804267610:imuxsock.c : imuxsock.c: --------imuxsock calling poll() > on 2 fds > 7317.803939077:imrelp.c : main Q: queue.c: EnqueueMsg advised worker > start > 7317.803994833:imrelp.c : imrelp.c: relpTcpSend: send data: 14916 rsp 6 > 200 OK > > 7317.804025407:imrelp.c : imrelp.c: relpTcpSend: sock 80, lenbuf 19, > send returned -1 [errno 32] > 7317.804038885:imrelp.c : imrelp.c: librelp: generic error: ecode > 10014, emsg 'error sending relp: Broken pipe' > 7317.804049421:imrelp.c : errmsg.c: Called LogMsg, msg: imrelp[2514]: > error 'error sending relp: Broken pipe', object 'lstn 2514: conn to clt > 192.168.2.172/192.168.2.172' - input may not work as intended > 7317.804054526:imrelp.c : operatingstate.c: osf: MSG imrelp[2514]: > error 'error sending relp: Broken pipe', object 'lstn 2514: conn to clt > 192.168.2.172/192.168.2.172' - input may not work as intended: signaling new > internal message via SIGTTOU: 'imrelp[2514]: error 'error sending relp: > Broken pipe', object 'lstn 2514: conn to clt 192.168.2.172/192.168.2.172' - > input may not work as intended [v8.2006.0 try https://www.rsyslog.com/e/2353 > ]' > 7317.804097763:main thread : janitor.c: janitorRun() called > > The pattern seems to be that a remote target is interrupted and then the main > queue (main Q) starts filling up. All omrelp actions have queues attached and > none of them ever filled (going by impstats log file which was regularly > receiving updates). > > Going by the local log file on the primary receiver it noted these details: > > 2020-09-01T12:27:33.306418-05:00 woodchuck1 rsyslogd: main Q:Reg: high > activity - starting 1 additional worker thread(s), currently 1 active worker > threads. [v8.2006.0 try https://www.rsyslog.com/e/2439 ] > 2020-09-01T12:27:33.429855-05:00 woodchuck1 rsyslogd: omfwd: TCPSendBuf error > -2027, destruct TCP Connection to graylog1.example.com:514 [v8.2006.0 try > https://www.rsyslog.com/e/2027 ] > 2020-09-01T12:27:33.469576-05:00 woodchuck1 rsyslogd: action > 'ForwardToGraylog1' suspended (module 'builtin:omfwd'), retry 0. There should > be messages before this one giving the reason for suspension. [v8.2006.0 try > https://www.rsyslog.com/e/2007 ] > 2020-09-01T12:27:33.470212-05:00 woodchuck1 rsyslogd: action > 'ForwardToGraylog1' resumed (module 'builtin:omfwd') [v8.2006.0 try > https://www.rsyslog.com/e/2359 ] > 2020-09-01T12:33:02.644015-05:00 woodchuck1 rsyslogd: -- MARK -- > 2020-09-01T12:53:02.711834-05:00 woodchuck1 rsyslogd: -- MARK -- > 2020-09-01T13:13:02.802190-05:00 woodchuck1 rsyslogd: -- MARK -- > 2020-09-01T13:33:02.856874-05:00 woodchuck1 rsyslogd: -- MARK -- > 2020-09-01T13:45:10.206220-05:00 woodchuck1 rsyslogd: main Q:Reg: high > activity - starting 1 additional worker thread(s), currently 2 active worker > threads. [v8.2006.0 try https://www.rsyslog.com/e/2439 ] > 2020-09-01T13:45:10.222580-05:00 woodchuck1 rsyslogd: librelp error 10008 > forwarding to server woodchuck2.example.com:2514 - suspending [v8.2006.0 try > https://www.rsyslog.com/e/2291 ] > 2020-09-01T13:45:10.222794-05:00 woodchuck1 rsyslogd: action > 'ForwardTowoodchuck2' suspended (module 'omrelp'), retry 0. There should be > messages before this one giving the reason for suspension. [v8.2006.0 try > https://www.rsyslog.com/e/2007 ] > 2020-09-01T13:45:11.232309-05:00 woodchuck1 rsyslogd: action > 'ForwardTowoodchuck2' resumed (module 'omrelp') [v8.2006.0 try > https://www.rsyslog.com/e/2359 ] > According to the impstats log, the main queue started filling up around 11:21 > am, about the time that several of our mail relays were rebooting from > maintenance. > > This is an Ubuntu 18.04 box that was in-place upgraded from Ubuntu 16.04. > > Not sure if it is relevant, but the existing rsyslog packages were not > swapped out with replacement 18.04-based packages as part of the upgrade as > expected. > > dpkg -l | grep rsyslog > ii rsyslog 8.2006.0-0adiscon2xenial1 > amd64 a rocket-fast system for log processing > ii rsyslog-mmjsonparse 8.2006.0-0adiscon2xenial1 > amd64 Parsing/handling of CEE/Lumberjack JSON messages in > rsyslog > ii rsyslog-mmnormalize 8.2006.0-0adiscon2xenial1 > amd64 The rsyslog-mmnormalize package provides log > normalization > ii rsyslog-mmrm1stspace 8.2006.0-0adiscon2xenial1 > amd64 The mmrm1stspace module permits to strip the > leading space from > ii rsyslog-relp 8.2006.0-0adiscon2xenial1 > amd64 RELP protocol support for rsyslog > > I was planning to wait for the next rsyslog packages to be released for the > Ubuntu PPA to see if that was enough to trigger a switchover to "bionic" > based packages. I suspect this is completely unrelated to the problem > experienced, but wanted to make sure and note it here. > > Recent changes (not *exactly* flailing around, but not very confident) from a > few days ago (which didn't help today): > > * Modified forward queues (attached to actions within a ruleset) > ** increase queue size from 10K to 100K > ** increase worker threads from 1 to 4 > ** disable explicit high water mark setting, allow default setting to apply > ** disable explicit low water mark setting, allow default setting to apply > > * Modified 'main Q' (as impstats lists it) > ** adjust worker threads from 1 to 4 > ** adjust queue size from 10K to 500K > > Changes made tonight after restarting rsyslog: > > * Modified ruleset used to write out messages to local log files on the > receiver > ** increase queue size to 250K > ** increase worker threads to 4 > > * Disable lightDelayMark on main queue (as discussed on this thread) > > > I'm not sure what's going on. This box was quite stable up until we upgraded > the OS from Ubuntu 16.04 to 18.04. > > Any ideas/suggestions are welcome. > > Thanks. > > -----Original Message----- > From: Adam Chalkley > Sent: Friday, August 28, 2020 4:15 PM > To: rsyslog-users <[email protected]> > Cc: [email protected] > Subject: RE: [rsyslog] Upgraded receiver from Ubuntu 16.04 to 18.04, > connections from clients failing with a high number of CLOSE_WAIT connections > on receiver > > Hi Andre, > > Thank you for the additional feedback. > > As you suggested, the problem is likely tied back to the TCP probe. We've not > had it enabled since last Sunday and rsyslog has been running fine without > making the changes we previously discussed (I'm still interested in making > them, I've just been pulled in other directions). > > Do you happen to know of a safe way to check that the port is open remotely > without triggering a failure from rsyslog's perspective? I'm guessing that a > minimal RELP-compatible client would be the best approach. Is there such a > tool that you're aware of that could be called periodically to confirm that a > rsyslog receiver (RELP-enabled port) is functioning properly? > > Just thought I would ask. > > Thanks! > > -----Original Message----- > From: Andre Lorbach <[email protected]> > Sent: Monday, August 24, 2020 4:06 AM > To: rsyslog-users <[email protected]> > Cc: Adam Chalkley <[email protected]> > Subject: AW: [rsyslog] Upgraded receiver from Ubuntu 16.04 to 18.04, > connections from clients failing with a high number of CLOSE_WAIT connections > on receiver > > I think those errors were there all the time but not reported in older > librelp version. > I reviewed the code and we added this error output about 2 years ago in > librelp. > Ubuntu 16.04 most likely is using an older librelp version, so you did not > see the error there. > > The problem is caused by the TCP Probe, it may helps if you try to receive > data before you drop the connection. > > Best regards, > Andre Lorbach > -- > Adiscon GmbH > Mozartstr. 21 > 97950 Großrinderfeld, Germany > Ph. +49-9349-9298530 > Geschäftsführer/President: Rainer Gerhards Reg.-Gericht Mannheim, HRB > 560610 > Ust.-IDNr.: DE 81 22 04 622 > Web: www.adiscon.com - Mail: [email protected] > > Informations regarding your data privacy policy can be found here: > https://www.adiscon.com/data-privacy-policy/ > > This e-mail may contain confidential and/or privileged information. If you > are not the intended recipient or have received this e-mail in error > please notify the sender immediately and delete this e-mail. Any > unauthorized copying, disclosure or distribution of the material in this > e-mail is strictly forbidden. > > Diese E-Mail enthält vertrauliche und/oder rechtlich geschützte > Informationen. Wenn Sie nicht der richtige Adressat sind oder diese E-Mail > irrtümlich erhalten haben, informieren Sie bitte sofort den Absender und > vernichten Sie diese E-Mail. Das unerlaubte Kopieren und die unbefugte > Weitergabe dieser E-Mail sind nicht gestattet. > > > > > -----Ursprüngliche Nachricht----- > > Von: rsyslog <[email protected]> Im Auftrag von Adam > > Chalkley via rsyslog > > Gesendet: Mittwoch, 19. August 2020 18:38 > > An: rsyslog-users <[email protected]> > > Cc: Adam Chalkley <[email protected]> > > Betreff: [rsyslog] Upgraded receiver from Ubuntu 16.04 to 18.04, > connections > > from clients failing with a high number of CLOSE_WAIT connections on > > receiver > > > > Hi, > > > > We upgraded the OS on our central receiver yesterday from Ubuntu 16.04 > > (4.4 kernel) to 18.04 (4.15 kernel). > > > > We are using the upstream PPA, so running 8.2006.0 on receivers and > > endpoints. > > > > When we started getting reports from our Nagios instance that the > rsyslog > > forward queues endpoints were beginning to fill we checked our receiver > > (sawmill1) and saw 94 open TCP connections with 40 of them in CLOSE_WAIT > > from our Nagios server, most of them I suspect from the TCP port > connection > > test performed every 5 minutes. > > > > Log samples from the receiver system (which are related to port probes > from > > our Nagios instance): > > > > 2020-08-19T10:05:01.279416-05:00 lincoln rsyslogd: -- MARK -- > > 2020-08-19T10:05:08.249358-05:00 lincoln rsyslogd: imrelp[2514]: error > 'server > > closed relp session, session broken', object 'lstn 2514: conn to clt > > 192.168.2.10/192.168.2.10' - input may not work as intended [v8.2006.0 > try > > https://www.rsyslog.com/e/2353 ] > > 2020-08-19T10:05:08.249626-05:00 lincoln rsyslogd: imrelp[2514]: error > 'error > > sending relp: Bad file descriptor', object 'lstn 2514: conn to clt > > 192.168.2.10/192.168.2.10' - input may not work as intended [v8.2006.0 > try > > https://www.rsyslog.com/e/2353 ] > > 2020-08-19T10:08:08.020625-05:00 lincoln rsyslogd: imrelp[2514]: error > 'server > > closed relp session, session broken', object 'lstn 2514: conn to clt > > 192.168.2.10/192.168.2.10' - input may not work as intended [v8.2006.0 > try > > https://www.rsyslog.com/e/2353 ] > > 2020-08-19T10:08:08.021253-05:00 lincoln rsyslogd: imrelp[2514]: error > 'error > > sending relp: Bad file descriptor', object 'lstn 2514: conn to clt > > 192.168.2.10/192.168.2.10' - input may not work as intended [v8.2006.0 > try > > https://www.rsyslog.com/e/2353 ] > > 2020-08-19T10:11:08.074712-05:00 lincoln rsyslogd: imrelp[2514]: error > 'server > > closed relp session, session broken', object 'lstn 2514: conn to clt > > 192.168.2.10/192.168.2.10' - input may not work as intended [v8.2006.0 > try > > https://www.rsyslog.com/e/2353 ] > > > > Log samples from the Nagios instance: > > > > 2020-08-19T11:19:53.444953-05:00 nagios rsyslogd: > > omrelp[lincoln.lib.auburn.edu:2514]: error 'error waiting on required > session > > state, session broken', object 'conn to srvr > lincoln.lib.auburn.edu:2514' - > > action may not work as intended [v8.2006.0 try > > https://www.rsyslog.com/e/2353 ] > > 2020-08-19T11:19:53.445260-05:00 nagios rsyslogd: > > omrelp[lincoln.lib.auburn.edu:2514]: error 'error opening connection to > > remote peer', object 'conn to srvr lincoln.lib.auburn.edu:2514' - > action may > > not work as intended [v8.2006.0 try https://www.rsyslog.com/e/2353 ] > > > > Is there a setting I can apply to rsyslog to help resolve this? > > > > Is this a known bug? > > > > We didn't have the issue with v8.2006.0 on our receiver when it was > running > > Ubuntu 16.04 (the prior OS release), even though it made the same > > complaints about the TCP port probes from Nagios. > > > > Thanks in advance. > > > > _______________________________________________ > > rsyslog mailing list > > https://lists.adiscon.net/mailman/listinfo/rsyslog > > http://www.rsyslog.com/professional-services/ > > What's up with rsyslog? Follow https://twitter.com/rgerhards NOTE WELL: > > This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites > beyond > > our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT. > _______________________________________________ > rsyslog mailing list > https://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com/professional-services/ > What's up with rsyslog? Follow https://twitter.com/rgerhards > NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of > sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T > LIKE THAT. _______________________________________________ rsyslog mailing list https://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhards NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT. _______________________________________________ rsyslog mailing list https://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhards NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT. _______________________________________________ rsyslog mailing list https://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhards NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT.

