Some more reading of the source code makes me think the
|| [ "$__OCF_ACTION" != "stop" ]
part is not needed. Xen_Status_with_Retry() is only called from stop and
monitor, so we only need to check whether it's a probe; everything else
should be handled by the case statement in the loop.
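To illustrate, here is a minimal sketch of both guards side by side. It is
not the RA itself: ocf_is_probe is stubbed via a PROBE variable purely for
illustration (the real helper comes from ocf-shellfuncs), and the echoes
stand in for the return/loop paths.

```shell
#!/bin/sh
# Stand-in for the real ocf_is_probe from ocf-shellfuncs (illustration only).
ocf_is_probe() { [ "$PROBE" = "yes" ]; }

guard_current() {
    # the guard as it stands in the RA
    if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
        echo "return-early"
    else
        echo "enter-loop"
    fi
}

guard_proposed() {
    # suggested simplification: only bail out on a probe; the case
    # statement inside the loop already distinguishes stop from monitor
    if ocf_is_probe; then
        echo "return-early"
    else
        echo "enter-loop"
    fi
}

# a non-probe monitor: ocf_is_probe is false, but "monitor" != "stop" is
# true, so || makes the whole condition true and the retry loop is skipped
PROBE=no; __OCF_ACTION=monitor
guard_current    # prints "return-early"
guard_proposed   # prints "enter-loop"
```

With the probe-only guard, a domain that briefly vanishes during a regular
monitor would get the retry window instead of immediately reporting
OCF_NOT_RUNNING.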
Tom
On 10/16/2013 05:16 PM, Tom Parker wrote:
> Hi. I think there is an issue with the updated Xen RA.
>
> Specifically, the if statement below looks wrong, though I am not sure.
> I may be confused about how bash || works, but I don't see my servers
> ever entering the loop when a VM disappears.
>
> if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
>     return $rc
> fi
>
> Does this not mean that if we run a monitor operation that is not a
> probe we will have:
>
> (ocf_is_probe) return false
> (stop != monitor) return true
> (false || true) return true
>
> which will cause the if statement to return $rc and never enter the loop?
>
> Xen_Status_with_Retry() {
>     local rc cnt=5
>
>     Xen_Status $1
>     rc=$?
>     if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
>         return $rc
>     fi
>     while [ $rc -eq $OCF_NOT_RUNNING -a $cnt -gt 0 ]; do
>         case "$__OCF_ACTION" in
>         stop)
>             ocf_log debug "domain $1 reported as not running, waiting $cnt seconds ..."
>             ;;
>         monitor)
>             ocf_log warn "domain $1 reported as not running, but it is expected to be running! Retrying for $cnt seconds ..."
>             ;;
>         *) : not reachable
>             ;;
>         esac
>         sleep 1
>         Xen_Status $1
>         rc=$?
>         let cnt=$((cnt-1))
>     done
>     return $rc
> }
>
>
>
> On 10/16/2013 12:12 PM, Dejan Muhamedagic wrote:
>> Hi Tom,
>>
>> On Tue, Oct 15, 2013 at 07:55:11PM -0400, Tom Parker wrote:
>>> Hi Dejan
>>>
>>> Just a quick question. I cannot see your new log messages being logged
>>> to syslog:
>>>
>>> ocf_log warn "domain $1 reported as not running, but it is expected to
>>> be running! Retrying for $cnt seconds ..."
>>>
>>> Do you know where I can set my logging to see warn level messages? I
>>> expected to see them in my testing by default but that does not seem to
>>> be true.
>> You should see them by default. But note that these warnings may
>> not happen, depending on the circumstances on your host. In my
>> experiments they were logged only while the guest was rebooting
>> and then just once or maybe twice. If you have recent
>> resource-agents and crmsh, you can enable operation tracing (with
>> crm resource trace <rsc> monitor <interval>).
>>
>> Thanks,
>>
>> Dejan
>>
>>> Thanks
>>>
>>> Tom
>>>
>>>
>>> On 10/08/2013 05:04 PM, Dejan Muhamedagic wrote:
>>>> Hi,
>>>>
>>>> On Tue, Oct 08, 2013 at 01:52:56PM +0200, Ulrich Windl wrote:
>>>>> Hi!
>>>>>
>>>>> I thought I'd never be bitten by this bug, but I actually was! Now
>>>>> I'm wondering whether the Xen RA sees the guest if you use pygrub
>>>>> and pygrub is still counting down to the actual boot...
>>>>>
>>>>> But the reason why I'm writing is that I think I've discovered another
>>>>> bug in
>>>>> the RA:
>>>>>
>>>>> CRM decided to "recover" the guest VM "v02":
>>>>> [...]
>>>>> lrmd: [14903]: info: operation monitor[28] on prm_xen_v02 for client 14906: pid 19516 exited with return code 7
>>>>> [...]
>>>>> pengine: [14905]: notice: LogActions: Recover prm_xen_v02 (Started h05)
>>>>> [...]
>>>>> crmd: [14906]: info: te_rsc_command: Initiating action 5: stop prm_xen_v02_stop_0 on h05 (local)
>>>>> [...]
>>>>> Xen(prm_xen_v02)[19552]: INFO: Xen domain v02 already stopped.
>>>>> [...]
>>>>> lrmd: [14903]: info: operation stop[31] on prm_xen_v02 for client 14906: pid 19552 exited with return code 0
>>>>> [...]
>>>>> crmd: [14906]: info: te_rsc_command: Initiating action 78: start prm_xen_v02_start_0 on h05 (local)
>>>>> lrmd: [14903]: info: rsc:prm_xen_v02 start[32] (pid 19686)
>>>>> [...]
>>>>> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stderr) Error: Domain 'v02' already exists with ID '3'
>>>>> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stdout) Using config file "/etc/xen/vm/v02".
>>>>> [...]
>>>>> lrmd: [14903]: info: operation start[32] on prm_xen_v02 for client 14906: pid 19686 exited with return code 1
>>>>> [...]
>>>>> crmd: [14906]: info: process_lrm_event: LRM operation prm_xen_v02_start_0 (call=32, rc=1, cib-update=5271, confirmed=true) unknown error
>>>>> crmd: [14906]: WARN: status_from_rc: Action 78 (prm_xen_v02_start_0) on h05 failed (target: 0 vs. rc: 1): Error
>>>>> [...]
>>>>>
>>>>> As you can clearly see, "start" failed because the guest was found up
>>>>> already! IMHO this is a bug in the RA (SLES11 SP2: resource-agents-3.9.4-0.26.84).
>>>> Yes, I've seen that. It's basically the same issue, i.e. the
>>>> domain being gone for a while and then reappearing.
>>>>
>>>>> I guess the following test is problematic:
>>>>> ---
>>>>> xm create ${OCF_RESKEY_xmfile} name=$DOMAIN_NAME
>>>>> rc=$?
>>>>> if [ $rc -ne 0 ]; then
>>>>>     return $OCF_ERR_GENERIC
>>>>> ---
>>>>> Here "xm create" probably fails if the guest is already created...
>>>> It should fail too. Note that this is a race, but the race is
>>>> anyway caused by the strange behaviour of xen. With the recent
>>>> fix (or workaround) in the RA, this shouldn't be happening.
>>>>
>>>> Thanks,
>>>>
>>>> Dejan
>>>>
>>>>> Regards,
>>>>> Ulrich
>>>>>
>>>>>
>>>>>>>> Dejan Muhamedagic <[email protected]> wrote on 01.10.2013 at 12:24 in
>>>>> message <[email protected]>:
>>>>>> Hi,
>>>>>>
>>>>>> On Tue, Oct 01, 2013 at 12:13:02PM +0200, Lars Marowsky-Bree wrote:
>>>>>>> On 2013-10-01T00:53:15, Tom Parker <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks for paying attention to this issue (not really a bug) as I am
>>>>>>>> sure I am not the only one with this issue. For now I have set all my
>>>>>>>> VMs to destroy so that the cluster is the only thing managing them but
>>>>>>>> this is not super clean as I get failures in my logs that are not
>>>>>>>> really
>>>>>>>> failures.
>>>>>>> It is very much a severe bug.
>>>>>>>
>>>>>>> The Xen RA has gained a workaround for this now, but we're also pushing
>>>>>> Take a look here:
>>>>>>
>>>>>> https://github.com/ClusterLabs/resource-agents/pull/314
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Dejan
>>>>>>
>>>>>>> the Xen team (where the real problem is) to investigate and fix.
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Lars
>>>>>>>
>>>>>>> --
>>>>>>> Architect Storage/HA
>>>>>>> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix
>>>>>>> Imendörffer,
>>>>>> HRB 21284 (AG Nürnberg)
>>>>>>> "Experience is the name everyone gives to their mistakes." -- Oscar
>>>>>>> Wilde
>>>>>>>
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems