Apologies for emailing without enough coffee. Please s/cf-execd/cf-serverd/g
in my comments.  I'm mostly curious whether this is easily reproducible or
simply an edge case that needs to be identified.
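
For reference, the copy-to-a-temp-file-then-move behavior discussed in the
quoted thread can be sketched roughly as follows. This is only an
illustrative Python sketch and not cfengine's actual code: the function name
install_atomically, the copy parameter, and the size comparison are my
assumptions. Only the .cfnew suffix is taken from the quoted output.

```python
import os
import shutil


def install_atomically(src, dest, suffix=".cfnew", copy=shutil.copyfile):
    """Copy src to dest + suffix, verify the copy, then rename it over dest.

    Hypothetical helper for illustration; not cfengine's implementation.
    """
    tmp = dest + suffix
    try:
        # Step 1: write the new contents next to dest, never into dest itself.
        copy(src, tmp)

        # Step 2: verify the copy before touching dest. Comparing sizes is an
        # assumed check; a checksum comparison would be stricter.
        src_size = os.path.getsize(src)
        tmp_size = os.path.getsize(tmp)
        if tmp_size != src_size:
            raise OSError(
                "new file %s seems corrupted in transit (dest %d and src %d)"
                % (tmp, tmp_size, src_size)
            )

        # Step 3: rename into place. On POSIX, a rename within one filesystem
        # replaces dest atomically, so readers see the old or the new file.
        os.replace(tmp, dest)
    except BaseException:
        # On any failure, remove the partial copy and leave dest as it was.
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

With a scheme like this, a transfer that is cut short should fail at step 2
and leave the old file in place, which is why a missing limits.conf is
surprising to me.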

On 11/9/10 9:16 AM, "Mike Hoskins" <micho...@cisco.com> wrote:

> If he'd kill -9'd cf-execd, I'd expect corruption.  Since the output he
> pasted showed a signal 15, I would have expected it to be caught and
> cleaned up after (e.g. by finishing in-progress transfers).  Further, the
> cf2 behavior of copying to a temp file and then moving it into place still
> appears to work, judging from his output: it showed the copy to a
> temporary location.  This seems all the more reason to ask: why was there
> any corruption?
> 
> Sure, the promise could be modified or the schedule staggered to avoid the
> issue, but those seem to be workarounds rather than real solutions.
> 
> Question: Can the condition be reliably reproduced by just kill -15'ing
> cf-execd with in-progress file transfers?
> 
> If yes, then that would seem bad to me and should be reported as a bug.
> The kill -9 (hard stop) vs. kill -15 (graceful shutdown) distinction is not
> cfengine-specific; it's how many UNIX programs work.  If a signal 15 can
> cause corruption, it may very well catch people off guard.
> 
> On 11/9/10 1:22 AM, "Seva Gluschenko" <seva.glusche...@gmail.com> wrote:
> 
>> Of course, Cfengine3 acts the same way: a file never gets installed
>> directly in place of an older file.
>> 
>> 2010/11/9 Bas van der Vlies <b...@sara.nl>:
>>> 
>>> On 9 nov 2010, at 08:37, Seva Gluschenko wrote:
>>> 
>>>> Frans,
>>>> 
>>>> since you're terminating cf-serverd in the middle of a file transfer,
>>>> the receiving agent reasonably treats it as corruption. There's
>>>> nothing wrong with that. On the other hand, why terminate cf-serverd
>>>> when you just need to restart cf-execd? Modify your promise and feel
>>>> safe.
>>>> 
>>> 
>>> I thought cfengine had some logic for transferring files (I think this is
>>> cfengine2 style; I did not check it for cfengine3):
>>>  * first copy it to <filename>.cfnew
>>>  * if this succeeds and the copy is correct, move it to <filename>
>>> 
>>> This is to avoid corruption like this. A server can crash, and you don't
>>> want the clients to suffer from it by ending up with corrupted files.
>>> 
>>> 
>>>> 2010/11/8 Frans Lawaetz <fr...@broadinstitute.org>:
>>>>> Hi-
>>>>> 
>>>>> I recently implemented a weekly "service cfengine3 restart" cron job as
>>>>> a workaround for the MAX_FD bug that others and I have seen.  I
>>>>> neglected to except the master from the restart, so when cf-serverd was
>>>>> killed a number of hosts complained about in-flight transfers or about
>>>>> not being able to reach the master.  This is quite reasonable; however,
>>>>> I found one host that suffered a complete loss or corruption of its
>>>>> limits.conf file.  It essentially bricked the system, requiring a
>>>>> rebuild.
>>>>> 
>>>>> Here is the sequence:
>>>>> 
>>>>> cron job restarts cf3.  cf3 reports to syslog:
>>>>> 
>>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  Received signal 15 (SIGTERM) while doing [lock.independent.server_cfengine.-cfengine3.the_server_daemon_2542_MD5=5b2c904169606aa9b27ec369fd13e016]
>>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  Received signal 15 (SIGTERM) while doing [lock.independent.server_cfengine.-cfengine3.the_server_daemon_2542_MD5=5b2c904169606aa9b27ec369fd13e016]
>>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  Logical start time Fri Oct 29 04:41:01 2010
>>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  This sub-task started really at Thu Oct 28 12:28:34 2010
>>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  Logical start time Thu Oct 28 12:28:34 2010
>>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  This sub-task started really at Thu Oct 28 12:28:34 2010
>>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  Received signal 15 (SIGTERM) while doing [lock.independent.server_cfengine.-cfengine3.the_server_daemon_2542_MD5=5b2c904169606aa9b27ec369fd13e016]
>>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  Logical start time Fri Oct 29 04:41:01 2010
>>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  This sub-task started really at Thu Oct 28 12:28:34 2010
>>>>> 
>>>>> 
>>>>> cf3 on the client host emailed me at approximately the time of the restart
>>>>> that it failed to copy limits.conf
>>>>> 
>>>>> 
>>>>> date: Sun, Nov 7, 2010 at 4:23 AM
>>>>> subject: community [hap10.broadinstitute.org/192.168.32.34]
>>>>> 
>>>>> Was not able to copy /cfengine/farm/etc/security/limits.conf.crdwga to /etc/security/limits.conf
>>>>> I: Made in version 'not specified' of '/var/cfengine/inputs/farm.cf' near line 279
>>>>> 
>>>>> 
>>>>> I have noticed similar failures on other hosts before, but cf3 usually
>>>>> makes a note that it aborted the transaction:
>>>>> 
>>>>> !! New file /etc/security/limits.conf.cfnew seems to have been corrupted in transit (dest 0 and src 1844), aborting!
>>>>> Was not able to copy /cfengine/farm/etc/security/limits.conf to /etc/security/limits.conf
>>>>> 
>>>>> Immediately after the failure on the host in question it started reporting
>>>>> over the network that limits.conf was corrupt.
>>>>> 
>>>>> Nov  7 04:23:02 hap10 crond[13650]: pam_limits(crond:session): cannot read settings from /etc/security/limits.conf: No such file or directory
>>>>> Nov  7 04:23:02 hap10 crond[13650]: pam_limits(crond:session): error parsing the configuration file: '/etc/security/limits.conf'
>>>>> 
>>>>> I was of course unable to log in to the system to investigate further,
>>>>> so I rebuilt it.
>>>>> 
>>>>> I've since excepted the master from the weekly restart, but I am alarmed
>>>>> that there is a case where cf-agent can corrupt a file.  Any ideas on
>>>>> how this might have happened, and whether there are any added
>>>>> safeguards that can be put in place?
>>>>> 
>>>>> The client is running cfengine3-community 3.0.5 and the master is running
>>>>> 3.1.0b2.  Both are on CentOS 5.5 x86_64.
>>>>> 
>>>>> Thanks,
>>>>> Frans

_______________________________________________
Help-cfengine mailing list
Help-cfengine@cfengine.org
https://cfengine.org/mailman/listinfo/help-cfengine
