Hi-

I recently implemented a weekly "service cfengine3 restart" cron job as a
workaround for the MAX_FD bug that others and I have seen.  I neglected
to exclude the master from the restart, so when cf-serverd was killed a number
of hosts complained about in-flight transfers or about not being able to reach
the master.  That much is to be expected.  However, on one host the
limits.conf file was lost or corrupted entirely.  This essentially bricked
the system, requiring a rebuild.

Here is the sequence:

cron job restarts cf3.  cf3 reports to syslog:

Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  Received signal 15 (SIGTERM) while doing [lock.independent.server_cfengine.-cfengine3.the_server_daemon_2542_MD5=5b2c904169606aa9b27ec369fd13e016]
Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  Received signal 15 (SIGTERM) while doing [lock.independent.server_cfengine.-cfengine3.the_server_daemon_2542_MD5=5b2c904169606aa9b27ec369fd13e016]
Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  Logical start time Fri Oct 29 04:41:01 2010
Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  This sub-task started really at Thu Oct 28 12:28:34 2010
Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  Logical start time Thu Oct 28 12:28:34 2010
Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  This sub-task started really at Thu Oct 28 12:28:34 2010
Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  Received signal 15 (SIGTERM) while doing [lock.independent.server_cfengine.-cfengine3.the_server_daemon_2542_MD5=5b2c904169606aa9b27ec369fd13e016]
Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  Logical start time Fri Oct 29 04:41:01 2010
Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  This sub-task started really at Thu Oct 28 12:28:34 2010


At approximately the time of the restart, cf3 on the client host emailed me
that it had failed to copy limits.conf:


date: Sun, Nov 7, 2010 at 4:23 AM
subject: community [hap10.broadinstitute.org/192.168.32.34]

Was not able to copy /cfengine/farm/etc/security/limits.conf.crdwga to /etc/security/limits.conf
I: Made in version 'not specified' of '/var/cfengine/inputs/farm.cf' near line 279


I have seen similar failures on other hosts before, but cf3 usually notes
that it aborted the transaction:

!! New file /etc/security/limits.conf.cfnew seems to have been corrupted in transit (dest 0 and src 1844), aborting!
Was not able to copy /cfengine/farm/etc/security/limits.conf to /etc/security/limits.conf
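
For reference, the kind of check that message implies can be sketched as
below: stage the new file next to the destination, compare its size with the
source, and only then rename it into place.  This is a minimal illustration
in Python, not cf-agent's actual code; the function name safe_replace and the
size-only comparison are my assumptions.

```python
import os
import shutil
import tempfile


def safe_replace(src: str, dest: str) -> bool:
    """Install src as dest only if the staged copy matches src in size.

    Hypothetical helper for illustration; not part of CFEngine.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest))
    # Stage inside the destination directory so the final rename stays on
    # one filesystem and can therefore be atomic.
    fd, staged = tempfile.mkstemp(dir=dest_dir, prefix=os.path.basename(dest) + ".")
    os.close(fd)
    try:
        shutil.copyfile(src, staged)
        shutil.copymode(src, staged)  # carry over permission bits only
        if os.path.getsize(staged) != os.path.getsize(src):
            return False  # size mismatch: leave dest untouched
        os.replace(staged, dest)  # atomic rename on POSIX
        staged = None
        return True
    finally:
        # Remove the staged file if it was not renamed into place.
        if staged is not None and os.path.exists(staged):
            os.remove(staged)
```

With this ordering, a transfer that ends with "dest 0 and src 1844" returns
False before the rename, so the existing destination file is never replaced
or removed.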

Immediately after the failure, the host in question started reporting over
the network that limits.conf could not be read:

Nov  7 04:23:02 hap10 crond[13650]: pam_limits(crond:session): cannot read settings from /etc/security/limits.conf: No such file or directory
Nov  7 04:23:02 hap10 crond[13650]: pam_limits(crond:session): error parsing the configuration file: '/etc/security/limits.conf'

I was of course unable to log in to the system to investigate further, so I
rebuilt it.

I've since excluded the master from the weekly restart, but I am alarmed that
there is a case where cf-agent can corrupt a file.  Any ideas on how
this might have happened, and whether there are any added safeguards that can
be put in place?
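
For anyone copying the restart workaround, the exclusion amounts to a
hostname check before the restart.  A rough sketch in Python follows; the
MASTERS set and both function names are assumptions of mine, with the master's
short hostname taken from the syslog host field above.

```python
import socket
import subprocess

# Assumption: short hostnames of policy servers that must not be restarted.
MASTERS = {"cfengine3"}


def should_restart(hostname: str, masters=MASTERS) -> bool:
    """Return True unless hostname (short or fully qualified) is a master."""
    short = hostname.split(".", 1)[0].lower()
    return short not in masters


def restart_unless_master() -> int:
    """Restart the service on clients only; return the command's exit code."""
    if not should_restart(socket.gethostname()):
        return 0  # master: skip the restart so cf-serverd keeps serving
    return subprocess.call(["service", "cfengine3", "restart"])
```

The cron job would call restart_unless_master() instead of invoking the
service command directly.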

The client is running cfengine3-community 3.0.5 and the master is running
3.1.0b2.  Both are on CentOS 5.5 x86_64.

Thanks,
Frans