On Monday, September 07, 2015 05:10:13 AM Richard Guy Briggs wrote:
> There are several reports of the kernel losing contact with auditd when
> it is, in fact, still running. When this happens, kernel syslogs show:
> 	"audit: *NO* daemon at audit_pid=<pid>"
> although auditd is still running, and is apparently happy, listening on
> the netlink socket. The pid in the "*NO* daemon" message matches the pid
> of the running auditd process. Restarting auditd solves this.
>
> The problem appears to happen randomly, and doesn't seem to be strongly
> correlated to the rate of audit events being logged. The problem
> happens fairly regularly (every few days), but not yet reproduced to
> order.
>
> On production kernels, BUG_ON() is a no-op, so any error will trigger
> this.
>
> Commit 34eab0a7cd45 ("audit: prevent an older auditd shutdown from
> orphaning a newer auditd startup") eliminates one possible cause. This
> isn't the case here, since the PID in the error message and the PID of
> the running auditd match.
>
> The primary expected cause of error here is -ECONNREFUSED when the audit
> daemon goes away, when netlink_getsockbyportid() can't find the auditd
> portid entry in the netlink audit table (or there is no receive
> function). If -EPERM is returned, that situation isn't likely to be
> resolved in a timely fashion without administrator intervention. In
> both cases, reset the audit_pid. This does not rule out a race
> condition. SELinux is expected to return zero since this isn't an INET
> or INET6 socket. Other LSMs may have other return codes. Log the error
> code for better diagnosis in the future.
>
> In the case of -ENOMEM, the situation could be temporary, based on local
> or general availability of buffers. -EAGAIN should never happen since
> the netlink audit (kernel) socket is set to MAX_SCHEDULE_TIMEOUT.
> -ERESTARTSYS and -EINTR are not expected since this kernel thread is not
> expected to receive signals. In these cases (or any other unexpected
> ones for now), report the error and re-schedule the thread, retrying up
> to 5 times.
>
> v2:
> Removed BUG_ON().
> Moved comma in pr_*() statements.
> Removed audit_strerror() text.
>
> Reported-by: Vipin Rathor <v.rat...@gmail.com>
> Reported-by: <ctc...@hotmail.com>
> Signed-off-by: Richard Guy Briggs <r...@redhat.com>
> ---
> kernel/audit.c | 24 +++++++++++++++++++-----
> 1 files changed, 19 insertions(+), 5 deletions(-)

Queued up for linux-audit#next as soon as 4.3-rc1 is released.

> diff --git a/kernel/audit.c b/kernel/audit.c
> index 1c13e42..18cdfe2 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -407,16 +407,30 @@ static void audit_printk_skb(struct sk_buff *skb)
>  static void kauditd_send_skb(struct sk_buff *skb)
>  {
>  	int err;
> +	int attempts = 0;
> +#define AUDITD_RETRIES 5
> +
> +restart:
>  	/* take a reference in case we can't send it and we want to hold it */
>  	skb_get(skb);
>  	err = netlink_unicast(audit_sock, skb, audit_nlk_portid, 0);
>  	if (err < 0) {
> -		BUG_ON(err != -ECONNREFUSED); /* Shouldn't happen */
> +		pr_err("netlink_unicast sending to audit_pid=%d returned error: %d\n",
> +		       audit_pid, err);
>  		if (audit_pid) {
> -			pr_err("*NO* daemon at audit_pid=%d\n", audit_pid);
> -			audit_log_lost("auditd disappeared");
> -			audit_pid = 0;
> -			audit_sock = NULL;
> +			if (err == -ECONNREFUSED || err == -EPERM
> +			    || ++attempts >= AUDITD_RETRIES) {
> +				audit_log_lost("audit_pid=%d reset");
> +				audit_pid = 0;
> +				audit_sock = NULL;
> +			} else {
> +				pr_warn("re-scheduling(#%d) write to audit_pid=%d\n",
> +					attempts, audit_pid);
> +				set_current_state(TASK_INTERRUPTIBLE);
> +				schedule();
> +				__set_current_state(TASK_RUNNING);
> +				goto restart;
> +			}
>  		}
>  		/* we might get lucky and get this in the next auditd */
>  		audit_hold_skb(skb);

-- 
paul moore
security @ redhat
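
For anyone reading along who wants to poke at the retry policy outside the kernel, here is a rough userspace sketch of the decision the patch makes on each failed send. It is not kernel code: classify_send_result(), send_with_retry(), fail_with() and the send_action enum are hypothetical names made up for illustration. Only AUDITD_RETRIES and the errno values come from the patch. The sketch leaves out the audit_pid check and the schedule() call, so it models the control flow only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Same limit as AUDITD_RETRIES in the patch. */
#define AUDITD_RETRIES 5

/* Hypothetical names for illustration; none of this is kernel API. */
enum send_action {
	SEND_OK,	/* message delivered */
	SEND_RESET,	/* give up on the daemon (patch: audit_pid = 0) */
	SEND_RETRY,	/* possibly transient error: try again */
};

/*
 * Decide what to do with the result of one send attempt.
 * *attempts is only incremented for errors that are not immediately
 * fatal, matching the short-circuit "|| ++attempts >= AUDITD_RETRIES"
 * in the patch.
 */
enum send_action classify_send_result(int err, int *attempts)
{
	assert(attempts != NULL);

	if (err >= 0)
		return SEND_OK;
	if (err == -ECONNREFUSED || err == -EPERM)
		return SEND_RESET;
	if (++(*attempts) >= AUDITD_RETRIES)
		return SEND_RESET;
	return SEND_RETRY;
}

/*
 * Call send_fn until it succeeds or the policy says to reset.
 * *calls reports how many times send_fn was invoked.
 */
enum send_action send_with_retry(int (*send_fn)(void *ctx), void *ctx,
				 int *calls)
{
	int attempts = 0;
	enum send_action action;

	*calls = 0;
	do {
		(*calls)++;
		action = classify_send_result(send_fn(ctx), &attempts);
	} while (action == SEND_RETRY);

	return action;
}

/* Stub sender for experiments: always returns the int that ctx points to. */
int fail_with(void *ctx)
{
	return *(int *)ctx;
}
```

With a sender that always fails with -ENOMEM, send_with_retry() gives up on the fifth call, while -ECONNREFUSED or -EPERM ends the loop on the first call.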