Hi!

I'm no expert, but check out the previous commit, apply your patch, then do "git
add --interactive" and you can pick each hunk for the next commit. The rest is
still there, but won't be committed. You may then repeat the "git add".
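As a minimal sketch of that workflow (file names and commit messages are made up for illustration): instead of checking out the previous commit, one way is to undo the mixed commit with "git reset HEAD~1", which keeps all changes in the working tree, and then stage each part separately. In a real repository you would use "git add --interactive" (or "git add -p") to pick individual hunks; plain "git add <file>" is used below only so the script runs non-interactively.

```shell
set -e
# Work in a throwaway repository so nothing real is touched.
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "base"

# Simulate one commit that mixes two unrelated changes.
printf 'logging cleanup\n' > logging.txt
printf 'meta-data change\n' > metadata.txt
git add .
git -c user.email=demo@example.com -c user.name=demo \
    commit -q -m "mixed commit"

# Undo the mixed commit but keep the changes in the working tree.
git reset -q HEAD~1

# Stage and commit the first change; interactively you would pick
# hunks here with "git add --interactive" or "git add -p".
git add logging.txt
git -c user.email=demo@example.com -c user.name=demo \
    commit -q -m "logging cleanup"

# The rest is still there for a second commit.
git add metadata.txt
git -c user.email=demo@example.com -c user.name=demo \
    commit -q -m "meta-data change"

git log --oneline
```

The history then contains two focused commits instead of one mixed one, which can be sent as two separate patches.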

Regards,
Ulrich

>>> Tom Parker <[email protected]> wrote on 21.10.2013 at 17:14 in
message <[email protected]>:
> Hi Dejan.
> 
> How can I revert my commits so that they do not include multiple
> things? I will submit one patch with the logging cleanup and then, if
> needed, another with my changes to the meta-data.
> 
> Tom
> 
> On 10/21/2013 09:39 AM, Dejan Muhamedagic wrote:
>> Hi Ulrich!
>>
>> On Mon, Oct 21, 2013 at 09:28:50AM +0200, Ulrich Windl wrote:
>>> Hi!
>>>
>>> Basically I think there should be no hard-coded constants whose value
>>> depends on some performance measurements, like 5s for rebooting a VM.
>> It's actually not 5s, but the status is run 5 times. If the load
>> is high, my guess is that the Xen tools used by the RA would
>> suffer proportionally.
>>
>>> So I support
>>> Tom's changes.
>>>
>>> However I noticed:
>>>
>>> +running; apparently, this period lasts only for a second or
>>> +two
>>>
>>> (missing full stop at end of sentence)
>> That's at the end of the comment and, typically, comments end
>> with a carriage return (as is here the case).
>>
>>> Actually I'd rephrase the description:
>>>
>>> "When the guest is rebooting, there is a short interval where the guest
>>> completely disappears from "xm list", which, in turn, will cause the
>>> monitor operation to return a "not running" status. If the guest cannot
>>> be found, this value will cause some extra delay in the monitor
>>> operation to work around the problem."
>>>
>>> (I.e. try to describe the effect, not the implementation)
>> That's the code, so the implementation is described. The very
>> top of the comment says:
>>
>>      # If the guest is rebooting, it may completely disappear from the
>>      # list of defined guests
>>
>> I was hoping that that was enough of an explanation. Look for
>> a more thorough description of the cause in the changelog. BTW,
>> note that this is a _workaround_ and that the thing should
>> eventually be fixed in Xen.
>>
>>> And yes, I appreciate consistent log formats also ;-)
>> That's always welcome, of course. It should also go in a
>> separate commit.
>>
>> Thanks,
>>
>> Dejan
>>
>>> Regards,
>>> Ulrich
>>>
>>>>>> Tom Parker <[email protected]> wrote on 18.10.2013 at 19:30 in
>>> message <[email protected]>:
>>>> Hi Dejan.  Sorry to be slow to respond to this.  I have done some
>>>> testing and everything looks good. 
>>>>
>>>> I spent some time tweaking the RA and I added a parameter called
>>>> wait_for_reboot (default 5s) to allow us to override the reboot sleep
>>>> times (in case it's more than 5 seconds on really loaded hypervisors). 
>>>> I also cleaned up a few log entries to make them consistent in the RA
>>>> and edited your entries for xen status to be a little bit more clear as
>>>> to why we think we should be waiting. 
>>>>
>>>> I have attached a patch here because I have NO idea how to create a
>>>> branch and pull request.  If there are links to a good place to start I
>>>> may be able to contribute occasionally to some other RAs that I use.
>>>>
>>>> Please let me know what you think.
>>>>
>>>> Thanks for your help
>>>>
>>>> Tom
>>>>
>>>>
>>>> On 10/17/2013 06:10 AM, Dejan Muhamedagic wrote:
>>>>> On Thu, Oct 17, 2013 at 11:45:17AM +0200, Dejan Muhamedagic wrote:
>>>>>> Hi Tom,
>>>>>>
>>>>>> On Wed, Oct 16, 2013 at 05:28:28PM -0400, Tom Parker wrote:
>>>>>>> Some more reading of the source code makes me think the
>>>>>>> '|| [ "$__OCF_ACTION" != "stop" ];' part is not needed.
>>>>>> Yes, you're right. I'll drop that part of the if statement. Many
>>>>>> thanks for testing.
>>>>> Fixed now. The if statement, which was obviously hard to follow,
>>>>> got relegated to the monitor function.  Which makes the
>>>>> Xen_Status_with_Retry really stand for what's happening in there ;-)
>>>>>
>>>>> Tom, hope you can test again.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Dejan
>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Dejan
>>>>>>
>>>>>>> Xen_Status_with_Retry() is only called from Stop and Monitor, so we
>>>>>>> only need to check if it's a probe.  Everything else should be
>>>>>>> handled in the case statement in the loop.
>>>>>>>
>>>>>>> Tom
>>>>>>>
>>>>>>> On 10/16/2013 05:16 PM, Tom Parker wrote:
>>>>>>>> Hi.  I think there is an issue with the Updated Xen RA.
>>>>>>>>
>>>>>>>> I think there is an issue with the if statement here but I am not
>>>>>>>> sure.
>>>>>>>> I may be confused about how bash || works but I don't see my servers
>>>>>>>> ever entering the loop on a vm disappearing.
>>>>>>>>
>>>>>>>> if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
>>>>>>>>         return $rc
>>>>>>>> fi
>>>>>>>>
>>>>>>>> Does this not mean that if we run a monitor operation that is not a
>>>>>>>> probe we will have:
>>>>>>>>
>>>>>>>> (ocf_is_probe) return false
>>>>>>>> (stop != monitor) return true
>>>>>>>> (false || true) return true
>>>>>>>>
>>>>>>>> which will cause the if statement to return $rc and never enter the
>>>>>>>> loop?
>>>>>>>> Xen_Status_with_Retry() {
>>>>>>>>   local rc cnt=5
>>>>>>>>
>>>>>>>>   Xen_Status $1
>>>>>>>>   rc=$?
>>>>>>>>   if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
>>>>>>>>         return $rc
>>>>>>>>   fi
>>>>>>>>   while [ $rc -eq $OCF_NOT_RUNNING -a $cnt -gt 0 ]; do
>>>>>>>>         case "$__OCF_ACTION" in
>>>>>>>>         stop)
>>>>>>>>           ocf_log debug "domain $1 reported as not running, waiting
>>>>>>>> $cnt seconds ..."
>>>>>>>>           ;;
>>>>>>>>         monitor)
>>>>>>>>           ocf_log warn "domain $1 reported as not running, but it is
>>>>>>>> expected to be running! Retrying for $cnt seconds ..."
>>>>>>>>           ;;
>>>>>>>>         *) : not reachable
>>>>>>>>                 ;;
>>>>>>>>         esac
>>>>>>>>         sleep 1
>>>>>>>>         Xen_Status $1
>>>>>>>>         rc=$?
>>>>>>>>         let cnt=$((cnt-1))
>>>>>>>>   done
>>>>>>>>   return $rc
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10/16/2013 12:12 PM, Dejan Muhamedagic wrote:
>>>>>>>>> Hi Tom,
>>>>>>>>>
>>>>>>>>> On Tue, Oct 15, 2013 at 07:55:11PM -0400, Tom Parker wrote:
>>>>>>>>>> Hi Dejan
>>>>>>>>>>
>>>>>>>>>> Just a quick question.  I cannot see your new log messages being
>>>>>>>>>> logged to syslog:
>>>>>>>>>>
>>>>>>>>>> ocf_log warn "domain $1 reported as not running, but it is
>>>>>>>>>> expected to be running! Retrying for $cnt seconds ..."
>>>>>>>>>>
>>>>>>>>>> Do you know where I can set my logging to see warn level
>>>>>>>>>> messages?  I expected to see them in my testing by default but
>>>>>>>>>> that does not seem to be true.
>>>>>>>>> You should see them by default. But note that these warnings may
>>>>>>>>> not happen, depending on the circumstances on your host. In my
>>>>>>>>> experiments they were logged only while the guest was rebooting
>>>>>>>>> and then just once or maybe twice. If you have recent
>>>>>>>>> resource-agents and crmsh, you can enable operation tracing (with
>>>>>>>>> crm resource trace <rsc> monitor <interval>).
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Dejan
>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> Tom
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 10/08/2013 05:04 PM, Dejan Muhamedagic wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 08, 2013 at 01:52:56PM +0200, Ulrich Windl wrote:
>>>>>>>>>>>> Hi!
>>>>>>>>>>>>
>>>>>>>>>>>> I thought I'd never be bitten by this bug, but I actually was!
>>>>>>>>>>>> Now I'm wondering whether the Xen RA sees the guest if you use
>>>>>>>>>>>> pygrub, and pygrub is still counting down for the actual boot...
>>>>>>>>>>>>
>>>>>>>>>>>> But the reason why I'm writing is that I think I've discovered
>>>>>>>>>>>> another bug in the RA:
>>>>>>>>>>>>
>>>>>>>>>>>> CRM decided to "recover" the guest VM "v02":
>>>>>>>>>>>> [...]
>>>>>>>>>>>> lrmd: [14903]: info: operation monitor[28] on prm_xen_v02 for
>>>>>>>>>>>> client 14906: pid 19516 exited with return code 7
>>>>>>>>>>>> [...]
>>>>>>>>>>>>  pengine: [14905]: notice: LogActions: Recover prm_xen_v02
>>>>>>>>>>>>  (Started h05)
>>>>>>>>>>>> [...]
>>>>>>>>>>>>  crmd: [14906]: info: te_rsc_command: Initiating action 5: stop
>>>>>>>>>>>> prm_xen_v02_stop_0 on h05 (local)
>>>>>>>>>>>> [...]
>>>>>>>>>>>> Xen(prm_xen_v02)[19552]: INFO: Xen domain v02 already stopped.
>>>>>>>>>>>> [...]
>>>>>>>>>>>> lrmd: [14903]: info: operation stop[31] on prm_xen_v02 for
>>>>>>>>>>>> client 14906: pid 19552 exited with return code 0
>>>>>>>>>>>> [...]
>>>>>>>>>>>> crmd: [14906]: info: te_rsc_command: Initiating action 78: start
>>>>>>>>>>>> prm_xen_v02_start_0 on h05 (local)
>>>>>>>>>>>> lrmd: [14903]: info: rsc:prm_xen_v02 start[32] (pid 19686)
>>>>>>>>>>>> [...]
>>>>>>>>>>>> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stderr)
>>>>>>>>>>>> Error: Domain 'v02' already exists with ID '3'
>>>>>>>>>>>> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stdout)
>>>>>>>>>>>> Using config file "/etc/xen/vm/v02".
>>>>>>>>>>>> [...]
>>>>>>>>>>>> lrmd: [14903]: info: operation start[32] on prm_xen_v02 for
>>>>>>>>>>>> client 14906: pid 19686 exited with return code 1
>>>>>>>>>>>> [...]
>>>>>>>>>>>> crmd: [14906]: info: process_lrm_event: LRM operation
>>>>>>>>>>>> prm_xen_v02_start_0 (call=32, rc=1, cib-update=5271,
>>>>>>>>>>>> confirmed=true) unknown error
>>>>>>>>>>>> crmd: [14906]: WARN: status_from_rc: Action 78
>>>>>>>>>>>> (prm_xen_v02_start_0) on h05 failed (target: 0 vs. rc: 1): Error
>>>>>>>>>>>> [...]
>>>>>>>>>>>>
>>>>>>>>>>>> As you can clearly see, "start" failed because the guest was
>>>>>>>>>>>> found up already! IMHO this is a bug in the RA (SLES11 SP2:
>>>>>>>>>>>> resource-agents-3.9.4-0.26.84).
>>>>>>>>>>> Yes, I've seen that. It's basically the same issue, i.e. the
>>>>>>>>>>> domain being gone for a while and then reappearing.
>>>>>>>>>>>
>>>>>>>>>>>> I guess the following test is problematic:
>>>>>>>>>>>> ---
>>>>>>>>>>>>   xm create ${OCF_RESKEY_xmfile} name=$DOMAIN_NAME
>>>>>>>>>>>>   rc=$?
>>>>>>>>>>>>   if [ $rc -ne 0 ]; then
>>>>>>>>>>>>     return $OCF_ERR_GENERIC
>>>>>>>>>>>> ---
>>>>>>>>>>>> Here "xm create" probably fails if the guest is already
>>>>>>>>>>>> created...
>>>>>>>>>>> It should fail too. Note that this is a race, but the race is
>>>>>>>>>>> anyway caused by the strange behaviour of xen. With the recent
>>>>>>>>>>> fix (or workaround) in the RA, this shouldn't be happening.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Dejan
>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Ulrich
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>>> Dejan Muhamedagic <[email protected]> wrote on 01.10.2013
>>>>>>>>>>>> at 12:24 in message <[email protected]>:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Oct 01, 2013 at 12:13:02PM +0200, Lars Marowsky-Bree
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> On 2013-10-01T00:53:15, Tom Parker <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for paying attention to this issue (not really a bug)
>>>>>>>>>>>>>>> as I am sure I am not the only one with this issue.  For now
>>>>>>>>>>>>>>> I have set all my VMs to destroy so that the cluster is the
>>>>>>>>>>>>>>> only thing managing them, but this is not super clean as I
>>>>>>>>>>>>>>> get failures in my logs that are not really failures.
>>>>>>>>>>>>>> It is very much a severe bug.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The Xen RA has gained a workaround for this now, but we're
>>>>>>>>>>>>>> also pushing
>>>>>>>>>>>>> Take a look here:
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://github.com/ClusterLabs/resource-agents/pull/314 
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Dejan
>>>>>>>>>>>>>
>>>>>>>>>>>>>> the Xen team (where the real problem is) to investigate and
>>>>>>>>>>>>>> fix.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>     Lars
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>> Architect Storage/HA
>>>>>>>>>>>>>> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild,
>>>>>>>>>>>>>> Felix Imendörffer, HRB 21284 (AG Nürnberg)
>>>>>>>>>>>>>> "Experience is the name everyone gives to their mistakes." --
>>>>>>>>>>>>>> Oscar Wilde


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
