If he had kill -9'd cf-execd, I'd expect corruption.  Since the output he
pasted shows a signal 15 (SIGTERM), I would have expected it to be caught
and cleaned up after (e.g. by finishing in-progress transfers).  Further, the
cf2 behavior of copying to a temp file and then moving it into place still
seems to work, judging from his output: it showed the copy going to a
temporary location.  All the more reason to ask: why was there any corruption?
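
As a rough illustration of the copy-then-move pattern mentioned above, here
is a minimal sketch in Python.  It is not cfengine's actual code: the function
name safe_copy and the size check are assumptions made for the example, and
the .cfnew suffix is borrowed from the agent output quoted further down.

```python
import os
import shutil


def safe_copy(src, dest, suffix=".cfnew"):
    """Stage a copy next to dest and rename it into place only if it verifies.

    Hypothetical helper for illustration; not taken from cfengine.
    """
    staged = dest + suffix
    try:
        shutil.copyfile(src, staged)
        # Verify the staged copy before touching the live file.
        if os.path.getsize(staged) != os.path.getsize(src):
            raise IOError("staged copy %s has the wrong size, aborting" % staged)
        # A rename within one filesystem replaces dest atomically on POSIX,
        # so readers see either the old file or the complete new one.
        os.replace(staged, dest)
    except Exception:
        # On any failure, drop the staged file and leave dest untouched.
        if os.path.exists(staged):
            os.remove(staged)
        raise
```

With this pattern an interrupted transfer should leave the old destination
file in place, which is why a missing limits.conf is surprising.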

Sure, the promise could be modified, or the schedule staggered, to avoid the
issue... but those are workarounds rather than real solutions, it seems.

Question: Can the condition be reliably reproduced by just kill -15'ing
cf-execd with in-progress file transfers?

If yes, then that seems bad to me and should be reported as a bug.  The kill
-9 (hard stop) vs. kill -15 (graceful shutdown) distinction is not cfengine
specific; it's how many UNIX programs work.  If signal 15 can cause
corruption, it may very well catch people off guard.
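
To make the distinction concrete, here is a minimal sketch in Python of the
graceful-shutdown behavior many UNIX daemons implement for signal 15.  It is
only an illustration and says nothing about how cf-serverd or cf-execd are
actually written; the class GracefulWorker and its method names are made up
for the example.

```python
import signal


class GracefulWorker:
    """Toy daemon loop that finishes its current transfer after SIGTERM.

    Hypothetical class for illustration; not cfengine code.
    """

    def __init__(self):
        self.stop_requested = False
        self.completed = []

    def handle_sigterm(self, signum, frame):
        # Only record the request; the main loop decides when it is safe to exit.
        self.stop_requested = True

    def run(self, transfers):
        # SIGTERM (15) can be caught like this; SIGKILL (9) cannot be caught,
        # so a kill -9 gives the process no chance to clean up.
        signal.signal(signal.SIGTERM, self.handle_sigterm)
        for name, do_transfer in transfers:
            if self.stop_requested:
                break  # do not start new work after SIGTERM
            do_transfer()  # a transfer already started runs to completion
            self.completed.append(name)
        return self.completed
```

A daemon structured this way would not abandon a half-written file on kill
-15, which is the behavior I would have expected here.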

On 11/9/10 1:22 AM, "Seva Gluschenko" <seva.glusche...@gmail.com> wrote:

> Of course, Cfengine3 acts the same way: the file never gets installed
> directly in place of an older file.
> 
> 2010/11/9 Bas van der Vlies <b...@sara.nl>:
>> 
>> On 9 nov 2010, at 08:37, Seva Gluschenko wrote:
>> 
>>> Frans,
>>> 
>>> since you're terminating cf-serverd in the middle of a file transfer,
>>> the receiving agent reasonably treats it as a corruption. There's
>>> nothing wrong with that. On the other hand, why terminate cf-serverd
>>> when you just need to restart cf-execd? Modify your promise and feel
>>> safe.
>>> 
>> 
>> I thought cfengine has some logic for transferring files (I think this is
>> cfengine2 style; I did not check it for cfengine3):
>>  * first copy it to <filename>.cfnew
>>  * if this succeeds and the copy is correct, move it to <filename>
>>
>> This is to avoid corruption like this. A server can crash, and you don't want
>> the clients to suffer from that with files that are corrupted.
>> 
>> 
>>> 2010/11/8 Frans Lawaetz <fr...@broadinstitute.org>:
>>>> Hi-
>>>> 
>>>> I recently implemented a "service cfengine3 restart" weekly cron job as a
>>>> workaround to the MAX_FD bug that others and I have seen.  I neglected
>>>> to except the master from the restart, so when cf-serverd was killed a
>>>> number of hosts complained about in-flight transfers or not being able to
>>>> reach the master.  This is quite reasonable; however, I found one host that
>>>> suffered a complete loss or corruption of its limits.conf file.  It
>>>> essentially bricked the system, requiring a rebuild.
>>>> 
>>>> Here is the sequence:
>>>> 
>>>> cron job restarts cf3.  cf3 reports to syslog:
>>>> 
>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  Received signal 15 (SIGTERM) while doing [lock.independent.server_cfengine.-cfengine3.the_server_daemon_2542_MD5=5b2c904169606aa9b27ec369fd13e016]
>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  Received signal 15 (SIGTERM) while doing [lock.independent.server_cfengine.-cfengine3.the_server_daemon_2542_MD5=5b2c904169606aa9b27ec369fd13e016]
>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  Logical start time Fri Oct 29 04:41:01 2010
>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  This sub-task started really at Thu Oct 28 12:28:34 2010
>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  Logical start time Thu Oct 28 12:28:34 2010
>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  This sub-task started really at Thu Oct 28 12:28:34 2010
>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  Received signal 15 (SIGTERM) while doing [lock.independent.server_cfengine.-cfengine3.the_server_daemon_2542_MD5=5b2c904169606aa9b27ec369fd13e016]
>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  Logical start time Fri Oct 29 04:41:01 2010
>>>> Nov  7 04:23:06 cfengine3 cf-serverd[14585]:  This sub-task started really at Thu Oct 28 12:28:34 2010
>>>> 
>>>> 
>>>> cf3 on the client host emailed me at approximately the time of the restart
>>>> that it failed to copy limits.conf
>>>> 
>>>> 
>>>> date: Sun, Nov 7, 2010 at 4:23 AM
>>>> subject: community [hap10.broadinstitute.org/192.168.32.34]
>>>> 
>>>> Was not able to copy /cfengine/farm/etc/security/limits.conf.crdwga to
>>>> /etc/security/limits.conf
>>>> I: Made in version 'not specified' of '/var/cfengine/inputs/farm.cf' near
>>>> line 279
>>>> 
>>>> 
>>>> I have noticed other similar such failures on other hosts before but cf3
>>>> usually makes a note that it aborted the transaction:
>>>> 
>>>> !! New file /etc/security/limits.conf.cfnew seems to have been corrupted in
>>>> transit (dest 0 and src 1844), aborting!
>>>> Was not able to copy /cfengine/farm/etc/security/limits.conf to
>>>> /etc/security/limits.conf
>>>> 
>>>> Immediately after the failure on the host in question it started reporting
>>>> over the network that limits.conf was corrupt.
>>>> 
>>>> Nov  7 04:23:02 hap10 crond[13650]: pam_limits(crond:session): cannot read settings from /etc/security/limits.conf: No such file or directory
>>>> Nov  7 04:23:02 hap10 crond[13650]: pam_limits(crond:session): error parsing the configuration file: '/etc/security/limits.conf'
>>>> 
>>>> I was of course unable to log in to the system to investigate further, so
>>>> I rebuilt it.
>>>> 
>>>> I've since excepted the master from the weekly restart, but I am alarmed
>>>> that there is a use case where cf-agent can corrupt a file.  Any ideas on
>>>> how this might have happened and whether there are any added safeguards
>>>> that can be put in place?
>>>> 
>>>> The client is running cfengine3-community 3.0.5 and the master is running
>>>> 3.1.0b2.  Both are on CentOS5.5 x86_64.
>>>> 
>>>> Thanks,
>>>> Frans
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Help-cfengine mailing list
>>>> Help-cfengine@cfengine.org
>>>> https://cfengine.org/mailman/listinfo/help-cfengine
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> SY, Seva Gluschenko.
>> 
>> --
>> Bas van der Vlies
>> b...@sara.nl
>> 
>> 
>> 
>> 
> 
> 
> 
> --
> SY, Seva Gluschenko.

-- 
Mike Hoskins : micho...@cisco.com : 415.506.UNIX (8649)
Peon #307 : Cisco STBU : IronPort San Bruno

Never get so busy making a living that you forget to make a life.
                -- Unknown
