I'm pretty sure there's some undiscovered detail behind this issue.
Usually such a detail hides in the promises. I understand that people
rarely feel happy about showing their promises in public, since they can
reveal too much about the system, but this is a case where an exact
snapshot of the promises would really shed some light.
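
For reference, the copy-to-.cfnew-then-move scheme discussed in the quoted
messages below can be sketched roughly as follows. This is only an
illustration in Python, not Cfengine's actual code: the function name
install_file is hypothetical, and the size comparison is just my reading of
the "dest 0 and src 1844" message in Frans's log.

```python
import os
import shutil


def install_file(src: str, dest: str) -> bool:
    """Stage src as dest + '.cfnew', verify it, then rename it over dest.

    Hypothetical helper for illustration; not Cfengine's implementation.
    Returns True if dest was replaced, False if the copy was abandoned.
    """
    staging = dest + ".cfnew"
    try:
        # Step 1: copy to the staging name, leaving the live file untouched.
        shutil.copyfile(src, staging)

        # Step 2: a simple sanity check (sizes must match) before going live.
        if os.path.getsize(staging) != os.path.getsize(src):
            os.remove(staging)
            return False

        # Step 3: os.replace is an atomic rename when both paths are on the
        # same filesystem, so readers see either the old or the new file.
        os.replace(staging, dest)
        return True
    except OSError:
        # Any I/O failure: drop the partial staging file, keep the old dest.
        if os.path.exists(staging):
            os.remove(staging)
        return False
```

With a scheme like this, an interrupted transfer should leave the old
destination file in place, which is why the missing limits.conf is surprising.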

2010/11/9 Mike Hoskins <micho...@cisco.com>:
> Apologies for the email without enough coffee -- please s/cf-execd/cf-serverd/g
> in my comments.  I'd mostly be curious whether this is easily reproducible, or
> simply an edge case that needs to be identified.
>
> On 11/9/10 9:16 AM, "Mike Hoskins" <micho...@cisco.com> wrote:
>
>> If he'd kill -9'd cf-execd, I'd expect corruption.  Since the output he
>> pasted looked like it was a signal 15, I would have expected it to be caught
>> and cleaned up after (e.g. Finish in-progress transfers).  Further, the cf2
>> behavior of copying to a temp file and then moving into place does still
>> work from his output as well...it showed the copy to a temporary location.
>> This seems to be all the more reason to ask -- why was there any corruption?
>>
>> Sure the promise could be modified, or the schedule staggered to avoid the
>> issue...but those are workarounds vs. real solutions it seems.
>>
>> Question: Can the condition be reliably reproduced by just kill -15'ing
>> cf-execd with in-progress file transfers?
>>
>> If yes, then that would seem bad to me and get reported as a bug.  The kill
>> -9 (hard stop) vs. kill -15 (graceful shutdown) thing is not cfengine
>> specific, it's how many UNIX programs work.  If signal 15's can cause
>> corruption, it may very well catch people off guard.
>>
>> On 11/9/10 1:22 AM, "Seva Gluschenko" <seva.glusche...@gmail.com> wrote:
>>
>>> Of course, Cfengine3 acts the same way: a file never gets installed
>>> directly in place of an older file.
>>>
>>> 2010/11/9 Bas van der Vlies <b...@sara.nl>:
>>>>
>>>> On 9 nov 2010, at 08:37, Seva Gluschenko wrote:
>>>>
>>>>> Frans,
>>>>>
>>>>> since you're terminating cf-serverd in the middle of a file transfer,
>>>>> the receiving agent reasonably treats it as a corruption. There's
>>>>> nothing wrong with that. On the other hand, why terminate cf-serverd
>>>>> when you just need to restart cf-execd? Modify your promise and feel
>>>>> safe.
>>>>>
>>>>
>>>> I thought cfengine has some logic for transferring files (I think this is
>>>> cfengine2 style; I did not check it for cfengine3):
>>>>  * first copy it to <filename>.cfnew
>>>>  * if this succeeds and the copy is correct, move it to <filename>
>>>>
>>>> This is to avoid corruption like this. A server can crash, and you don't
>>>> want the clients to suffer from this with files that are corrupted.
>>>>
>>>>
>>>>> 2010/11/8 Frans Lawaetz <fr...@broadinstitute.org>:
>>>>>> Hi-
>>>>>>
>>>>>> I recently implemented a "service cfengine3 restart" weekly cron job as a
>>>>>> workaround to the MAX_FD bug that others and myself have seen.  I
>>>>>> neglected to except the master from the restart so when cf-serverd was
>>>>>> killed a number of hosts complained about in-flight transfers or not
>>>>>> being able to reach the master.  This is quite reasonable however I found
>>>>>> one host that suffered a complete loss or corruption of its limits.conf
>>>>>> file.  It essentially bricked the system, requiring a rebuild.
>>>>>>
>>>>>> Here is the sequence:
>>>>>>
>>>>>> cron job restarts cf3.  cf3 reports to syslog:
>>>>>>
>>>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  Received signal 15 (SIGTERM) while doing [lock.independent.server_cfengine.-cfengine3.the_server_daemon_2542_MD5=5b2c904169606aa9b27ec369fd13e016]
>>>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  Received signal 15 (SIGTERM) while doing [lock.independent.server_cfengine.-cfengine3.the_server_daemon_2542_MD5=5b2c904169606aa9b27ec369fd13e016]
>>>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  Logical start time Fri Oct 29 04:41:01 2010
>>>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  This sub-task started really at Thu Oct 28 12:28:34 2010
>>>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  Logical start time Thu Oct 28 12:28:34 2010
>>>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  This sub-task started really at Thu Oct 28 12:28:34 2010
>>>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  Received signal 15 (SIGTERM) while doing [lock.independent.server_cfengine.-cfengine3.the_server_daemon_2542_MD5=5b2c904169606aa9b27ec369fd13e016]
>>>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  Logical start time Fri Oct 29 04:41:01 2010
>>>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  This sub-task started really at Thu Oct 28 12:28:34 2010
>>>>>>
>>>>>>
>>>>>> cf3 on the client host emailed me at approximately the time of the
>>>>>> restart that it failed to copy limits.conf
>>>>>>
>>>>>>
>>>>>> date: Sun, Nov 7, 2010 at 4:23 AM
>>>>>> subject: community [hap10.broadinstitute.org/192.168.32.34]
>>>>>>
>>>>>> Was not able to copy /cfengine/farm/etc/security/limits.conf.crdwga to /etc/security/limits.conf
>>>>>> I: Made in version 'not specified' of '/var/cfengine/inputs/farm.cf' near line 279
>>>>>>
>>>>>>
>>>>>> I have noticed other similar such failures on other hosts before but cf3
>>>>>> usually makes a note that it aborted the transaction:
>>>>>>
>>>>>> !! New file /etc/security/limits.conf.cfnew seems to have been corrupted in transit (dest 0 and src 1844), aborting!
>>>>>> Was not able to copy /cfengine/farm/etc/security/limits.conf to /etc/security/limits.conf
>>>>>>
>>>>>> Immediately after the failure on the host in question it started
>>>>>> reporting over the network that limits.conf was corrupt.
>>>>>>
>>>>>> Nov  7 04:23:02 hap10 crond[13650]: pam_limits(crond:session): cannot read settings from /etc/security/limits.conf: No such file or directory
>>>>>> Nov  7 04:23:02 hap10 crond[13650]: pam_limits(crond:session): error parsing the configuration file: '/etc/security/limits.conf'
>>>>>>
>>>>>> I was of course unable to login to the system to investigate further so
>>>>>> rebuilt it.
>>>>>>
>>>>>> I've since excepted the master from the weekly restart but I am alarmed
>>>>>> that there is a use case where cf-agent can corrupt a file.  Any ideas on
>>>>>> how this might have happened and whether there are any added safeguards
>>>>>> that can be put in place?
>>>>>>
>>>>>> The client is running cfengine3-community 3.0.5 and the master is running
>>>>>> 3.1.0b2.  Both are on CentOS5.5 x86_64.
>>>>>>
>>>>>> Thanks,
>>>>>> Frans
>
>



-- 
SY, Seva Gluschenko.