If he'd kill -9'd cf-execd, I'd expect corruption. Since the output he pasted shows a signal 15 (SIGTERM), I would have expected it to be caught and cleaned up after (e.g. by finishing in-progress transfers). Further, judging from his output, the cf2 behavior of copying to a temp file and then moving it into place still applies: it showed the copy going to a temporary location. That seems to be all the more reason to ask: why was there any corruption?
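To make that expectation concrete, here is a minimal Python sketch of the copy-to-temp-then-move pattern described above. It is not CFEngine's code: safe_copy, _copy_bytes and the copier hook are hypothetical names made up for illustration, and only the .cfnew suffix and the size comparison are borrowed from the output quoted in this thread.

```python
import os


def _copy_bytes(src, tmp):
    """Plain byte-for-byte copy, flushed to disk before returning."""
    with open(src, "rb") as fin, open(tmp, "wb") as fout:
        while True:
            chunk = fin.read(65536)
            if not chunk:
                break
            fout.write(chunk)
        fout.flush()
        os.fsync(fout.fileno())


def safe_copy(src, dest, copier=None):
    """Copy src to dest via a temporary '<dest>.cfnew' file.

    The destination is replaced only after the temporary copy matches the
    source size, so a truncated or interrupted transfer leaves the old
    file in place.
    """
    tmp = dest + ".cfnew"  # suffix taken from the thread; the rest is hypothetical
    try:
        if copier is None:
            _copy_bytes(src, tmp)
        else:
            copier(src, tmp)  # hook used only to simulate a broken transfer
        src_size = os.path.getsize(src)
        tmp_size = os.path.getsize(tmp)
        if tmp_size != src_size:
            raise IOError(
                "%s seems to have been corrupted in transit (dest %d and src %d)"
                % (tmp, tmp_size, src_size)
            )
        # Atomic on POSIX as long as tmp and dest are on the same filesystem.
        os.replace(tmp, dest)
    finally:
        # After a successful replace the temp file is already gone.
        if os.path.exists(tmp):
            os.remove(tmp)
```

With this ordering, a transfer that dies partway should at worst leave a stale .cfnew file behind; the real file is only ever touched by the final rename. That is why a missing or empty limits.conf looks surprising to me.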
Sure, the promise could be modified, or the schedule staggered to avoid the issue...but those seem like workarounds rather than real solutions. Question: can the condition be reliably reproduced by just kill -15'ing cf-execd with in-progress file transfers? If yes, that would seem bad to me and should get reported as a bug. The kill -9 (hard stop) vs. kill -15 (graceful shutdown) distinction is not cfengine specific; it's how many UNIX programs work. If signal 15s can cause corruption, it may very well catch people off guard.

On 11/9/10 1:22 AM, "Seva Gluschenko" <seva.glusche...@gmail.com> wrote:

> Of course, Cfengine3 acts the same way, file never gets installed
> directly in place of an older file.
>
> 2010/11/9 Bas van der Vlies <b...@sara.nl>:
>>
>> On 9 nov 2010, at 08:37, Seva Gluschenko wrote:
>>
>>> Frans,
>>>
>>> since you're terminating cf-serverd in the middle of a file transfer,
>>> the receiving agent reasonably treats it as a corruption. There's
>>> nothing wrong with it. On the other hand, why terminate cf-serverd
>>> when you just need to restart cf-execd? Modify your promise and feel
>>> safe.
>>>
>>
>> I thought cfengine has some logic for transferring files (I think this is
>> cfengine2 style, did not check it for cfengine3):
>> * first copy it to <filename>.cfnew
>> * if this succeeds and it is correct, move it to <filename>
>>
>> This is to avoid corruption like this. A server can crash and you don't
>> want the clients to suffer from this with files that are corrupted.
>>
>>
>>> 2010/11/8 Frans Lawaetz <fr...@broadinstitute.org>:
>>>> Hi-
>>>>
>>>> I recently implemented a "service cfengine3 restart" weekly cron job as a
>>>> workaround to the MAX_FD bug that others and myself have seen. I neglected
>>>> to except the master from the restart so when cf-serverd was killed a
>>>> number of hosts complained about in-flight transfers or not being able to
>>>> reach the master. This is quite reasonable; however, I found one host that
>>>> suffered a complete loss or corruption of its limits.conf file. It
>>>> essentially bricked the system, requiring a rebuild.
>>>>
>>>> Here is the sequence:
>>>>
>>>> cron job restarts cf3. cf3 reports to syslog:
>>>>
>>>> Nov 7 04:23:06 cfengine3 cf-serverd[14585]: Received signal 15 (SIGTERM) while doing [lock.independent.server_cfengine.-cfengine3.the_server_daemon_2542_MD5=5b2c904169606aa9b27ec369fd13e016]
>>>> Nov 7 04:23:06 cfengine3 cf-serverd[14585]: Received signal 15 (SIGTERM) while doing [lock.independent.server_cfengine.-cfengine3.the_server_daemon_2542_MD5=5b2c904169606aa9b27ec369fd13e016]
>>>> Nov 7 04:23:06 cfengine3 cf-serverd[14585]: Logical start time Fri Oct 29 04:41:01 2010
>>>> Nov 7 04:23:06 cfengine3 cf-serverd[14585]: This sub-task started really at Thu Oct 28 12:28:34 2010
>>>> Nov 7 04:23:06 cfengine3 cf-serverd[14585]: Logical start time Thu Oct 28 12:28:34 2010
>>>> Nov 7 04:23:06 cfengine3 cf-serverd[14585]: This sub-task started really at Thu Oct 28 12:28:34 2010
>>>> Nov 7 04:23:06 cfengine3 cf-serverd[14585]: Received signal 15 (SIGTERM) while doing [lock.independent.server_cfengine.-cfengine3.the_server_daemon_2542_MD5=5b2c904169606aa9b27ec369fd13e016]
>>>> Nov 7 04:23:06 cfengine3 cf-serverd[14585]: Logical start time Fri Oct 29 04:41:01 2010
>>>> Nov 7 04:23:06 cfengine3 cf-serverd[14585]: This sub-task started really at Thu Oct 28 12:28:34 2010
>>>>
>>>>
>>>> cf3 on the client host emailed me at approximately the time of the restart
>>>> that it failed to copy limits.conf
>>>>
>>>>
>>>> date: Sun, Nov 7, 2010 at 4:23 AM
>>>> subject: community [hap10.broadinstitute.org/192.168.32.34]
>>>>
>>>> Was not able to copy /cfengine/farm/etc/security/limits.conf.crdwga to /etc/security/limits.conf
>>>> I: Made in version 'not specified' of '/var/cfengine/inputs/farm.cf' near line 279
>>>>
>>>>
>>>> I have noticed other similar failures on other hosts before but cf3
>>>> usually makes a note that it aborted the transaction:
>>>>
>>>> !! New file /etc/security/limits.conf.cfnew seems to have been corrupted in transit (dest 0 and src 1844), aborting!
>>>> Was not able to copy /cfengine/farm/etc/security/limits.conf to /etc/security/limits.conf
>>>>
>>>> Immediately after the failure on the host in question it started reporting
>>>> over the network that limits.conf was corrupt.
>>>>
>>>> Nov 7 04:23:02 hap10 crond[13650]: pam_limits(crond:session): cannot read settings from /etc/security/limits.conf: No such file or directory
>>>> Nov 7 04:23:02 hap10 crond[13650]: pam_limits(crond:session): error parsing the configuration file: '/etc/security/limits.conf'
>>>>
>>>> I was of course unable to log in to the system to investigate further so
>>>> rebuilt it.
>>>>
>>>> I've since excepted the master from the weekly restart but I am alarmed
>>>> that there is a use case where cf-agent can corrupt a file. Any ideas on
>>>> how this might have happened and whether there are any added safeguards
>>>> that can be put in place?
>>>>
>>>> The client is running cfengine3-community 3.0.5 and the master is running
>>>> 3.1.0b2. Both are on CentOS5.5 x86_64.
>>>>
>>>> Thanks,
>>>> Frans
>>>>
>>>> _______________________________________________
>>>> Help-cfengine mailing list
>>>> Help-cfengine@cfengine.org
>>>> https://cfengine.org/mailman/listinfo/help-cfengine
>>>
>>> --
>>> SY, Seva Gluschenko.
>>
>> --
>> Bas van der Vlies
>> b...@sara.nl
>
> --
> SY, Seva Gluschenko.

--
Mike Hoskins : micho...@cisco.com : 415.506.UNIX (8649)
Peon #307 : Cisco STBU : IronPort San Bruno
Never get so busy making a living that you forget to make a life.
-- Unknown