On Mon, 24 Aug 2015, Page, Jeremy wrote:

Sorry, I was not saying don't look at logs, just that logs are only reactive and only show the things you're logging (if the server crashes you may log nada, but that's definitely an issue!). I also personally find correlation easier when I have graphical data, but something like the ELK stack could help here. I have checks that look at ELK and then alert when they find pertinent data. They could also watch the logs directly, but this way everything is in a single place and the checks can also look for negatives (i.e. no one having logged in for 15 minutes is an error even if everything else is "green").

The absence of logs is also detectable by event correlation engines. I have a very simple ruleset in SEC that alerts me when things stop logging.
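
To make the idea concrete outside of SEC, here is a minimal shell sketch of "alert on silence". It is not the SEC ruleset, only an illustration: the function name stale_sources, the "name last_seen_epoch" input format, and the sample timestamps are all assumptions made for the example. The 310 second limit is borrowed from the config later in this message.

```shell
#!/bin/sh
# Sketch only: print the sources that have been silent for too long.
# stdin: one "name last_seen_epoch" pair per line (hypothetical format).
# $1 = current time in epoch seconds, $2 = allowed silence in seconds.
stale_sources() {
    awk -v now="$1" -v limit="$2" '
        # remember the newest timestamp seen for each source
        { if (!($1 in last) || $2 > last[$1]) last[$1] = $2 }
        END {
            for (name in last)
                if (now - last[name] > limit)
                    print name
        }
    ' | sort
}
```

For example, `printf 'web01 1000\ndb01 1290\n' | stale_sources 1320 310` reports only web01, because db01 was last seen 30 seconds before the check. A batch check like this has to be run on a schedule, whereas an event correlation engine keeps a timer per source and reacts as soon as it expires.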

ELK and Splunk are great for exploring your data and doing correlations manually. But once you have figured out what you are interested in, they are horribly inefficient for ongoing monitoring and alerting compared to tools that aren't database-driven.

https://www.usenix.org/publications/login/feb14/logging-reports-dashboards
https://www.usenix.org/publications/login/april14/lang (Splunk tuning, most of which is applicable to Elasticsearch with some terminology changes)

"looking at logs is 100% accurate at detecting logged problems :-)" - I'm stealing this.

Go ahead. It can be a positive statement or a negative statement, depending on the problem :-)

In this context, the point is that applications usually log internal problems, and when they do, it's far more accurate to react to the log messages than to try to detect the same problem from the application's response behavior.

David Lang

My SEC config: it sends an alert when something stops logging, and again every 4 hours until it comes back. I have rsyslog configured to pass it a single value (unless the message is the 'disable heartbeat alert' message), which is usually the hostname but is sometimes a specific application/instance.

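# Rule 1: a 'disable heartbeat alert <name>' message deletes the context for
# <name>, so no more alerts are sent until <name> logs again.
type=single
ptype=regexp
pattern= disable heartbeat alert (\S+)
context=[!SEC_INTERNAL_EVENT]
desc=clear_heartbeat_$1
action=delete heartbeat_$1

# Rule 2: the follow-up message written by an expired context re-arms the
# context for 4 hours (14400 s). On expiry it notifies and writes the same
# message again, which is what makes the alert repeat.
type=single
ptype=regexp
pattern= setup extended logging outage alert for (\S+)
context=[!SEC_INTERNAL_EVENT]
desc=long_heartbeat_$1
action=create heartbeat_$1 14400 (shellcmd /usr/local/bin/sec/notify.sh $1 '4+ hours'; udgram /dev/log " sec-alert: setup extended logging outage alert for $1 ");

# Rule 3: anything else (re)creates the context for the first token with a
# 310 s lifetime. If it expires before the next message arrives, notify and
# write the follow-up message that rule 2 matches.
type=single
ptype=regexp
pattern=(\S+)
context=[!SEC_INTERNAL_EVENT]
desc=heartbeat_$1
action=create heartbeat_$1 310 (shellcmd /usr/local/bin/sec/notify.sh $1 '5 min'; udgram /dev/log " sec-alert: setup extended logging outage alert for $1 ")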
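
The timing that falls out of the rules above (a first alert after 310 seconds of silence, then one more every 14400 seconds) can be checked with a little arithmetic. The sketch below is only an illustration: alerts_sent is a made-up helper name, and it ignores the small delay while the follow-up message travels back through syslog.

```shell
#!/bin/sh
# Sketch only: how many alerts would have been sent after a given silence.
# Assumes the first context expires after 310 s and every follow-up context
# after a further 14400 s, as in the rules above.
alerts_sent() {
    silence=$1                      # seconds since the last log message
    if [ "$silence" -lt 310 ]; then
        echo 0                      # the first timer has not expired yet
    else
        echo $(( 1 + ($silence - 310) / 14400 ))
    fi
}
```

For example, `alerts_sent 300` gives 0, `alerts_sent 310` gives 1, and a full day of silence (`alerts_sent 86400`) gives 6.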

# cat /usr/local/bin/sec/notify.sh
#!/bin/sh
# $1 = source that stopped logging (usually a hostname)
# $2 = how long it has been silent, e.g. '5 min'

(
echo "From: $(hostname)@company.com"
echo "To: m...@company.com"
echo "Subject: $1 stopped reporting"
echo
echo "System $1 was generating logs, but has not generated any logs in the last $2."
echo
echo "If this system continues to fail to log, an additional message will be generated every four hours. To disable this, create a log 'disable heartbeat alert $1'"
echo
echo "for example:"
echo "logger -t manual disable heartbeat alert $1"
# Alternate closing line (commented out): writes the message to a file instead of mailing it.
#) >/var/log/alerts.notsent
) |sendmail -t
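
If you want to look at the message without mailing it, one option is to build the text in a function and choose where to pipe it. This is only a sketch of that idea, not the script I run: build_alert, its argument order, and the ops@example.com address are invented for the example, and the body is shortened.

```shell
#!/bin/sh
# Sketch only: write the alert to stdout so it can be inspected or piped.
# $1 = source that went quiet, $2 = how long it has been silent,
# $3 = recipient address (hypothetical parameter, not in the original script)
build_alert() {
    printf 'From: %s@company.com\n' "$(hostname)"
    printf 'To: %s\n' "$3"
    printf 'Subject: %s stopped reporting\n' "$1"
    printf '\n'
    printf 'System %s was generating logs, but has not generated any logs in the last %s\n' "$1" "$2"
}

# Dry run:   build_alert web01 '5 min' ops@example.com
# Real send: build_alert web01 '5 min' ops@example.com | sendmail -t
```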

_______________________________________________
Tech mailing list
Tech@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
http://lopsa.org/
