Hi Ulrich,

Yes, `crm_verify -L` is fine.

Regards,
James
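
For reference, the checks suggested in this thread (`crm_verify -L` above, and the `corosync-cmapctl` membership check further down) can be sketched as a small script. This is a minimal sketch only: the tool names are the real Pacemaker/corosync 2.x CLIs, but their useful output requires a live cluster, and on a host without them the script just says so.

```shell
#!/bin/sh
# Sanity checks discussed in this thread. Output depends entirely on the
# live cluster; on a host without the tools installed, each check reports
# that instead of failing.

check_cluster_config() {
    # Validate the live CIB; exit 0 with no complaints means the
    # configuration parses cleanly.
    if command -v crm_verify >/dev/null 2>&1; then
        crm_verify -L -V
    else
        echo "crm_verify not available on this host"
    fi

    # Confirm corosync membership lists the real interface, not 127.0.0.1.
    if command -v corosync-cmapctl >/dev/null 2>&1; then
        corosync-cmapctl | grep member
    else
        echo "corosync-cmapctl not available on this host"
    fi
}

check_cluster_config
```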

On 10/26/2012 12:34 PM, Ulrich Windl wrote:
> Hi!
>
> Just one idea: is "crm_verify -L" fine?
>
> Regards,
> Ulrich
>
>>>> James Guthrie <[email protected]> wrote on 26.10.2012 at 11:14 in message
> <[email protected]>:
>> Hi Emmanuel,
>>
>> I should maybe have mentioned earlier that I'm not using either of the
>> subshells for pacemaker, I'm configuring everything via XML. Also, I
>> don't and won't have python compiled in my environment, so any crm
>> commands are a no-go.
>>
>> Regards,
>> James
>>
>>
>> On 10/26/2012 11:10 AM, Emmanuel Saint-Joanis wrote:
>>> Just to see the syntax (not easy in XML) and whether it shows something
>>> obviously bad, can you paste the output of: crm configure show
>>>
>>> 2012/10/26 James Guthrie <[email protected]>
>>>
>>>      Hi Emmanuel,
>>>
>>>      It might help for further debugging to attach my pacemaker config, so
>>>      here's a pastebin of `cibadmin -Ql` as it is on the cluster right now -
>>>      still in the state of one node being "offline" and the other online.
>>>
>>>      http://pastebin.com/s3kr6Fxx
>>>
>>>      As you can see in the config, I have stonith disabled.
>>>
>>>      Regards,
>>>      James
>>>
>>>      On 10/26/2012 10:48 AM, Emmanuel Saint-Joanis wrote:
>>>       > It seems like the CRMd/PEngine thinks: "I didn't manage to shoot the
>>>       > failing node, therefore I (kind of) blacklist it as soon as I regain
>>>       > control of it."
>>>       > Did you first test extensively that your config works with
>>>       > stonith-enabled="false"?
>>>       >
>>>       >
>>>       > 2012/10/26 James Guthrie <[email protected]>
>>>       >
>>>       >     Hi Emmanuel,
>>>       >
>>>       >     corosync is bound to the correct interface on both hosts.
>>>       >
>>>       >     I looked for that line in the logs, but it didn't appear.
>>>       >
>>>       >     My previous e-mail addressed to Ulrich contains logfiles and
>>>      a broad
>>>       >     explanation of the process that those logfiles capture.
>>>       >
>>>       >     Regards,
>>>       >     James
>>>       >
>>>       >     On 10/25/2012 06:34 PM, Emmanuel Saint-Joanis wrote:
>>>       >      > Looks like a common network timeout issue.
>>>       >      >
>>>       >      > See if corosync is bound to 127.0.0.1 instead of the real
>>>       >      > interface with:
>>>       >      > corosync-cmapctl | grep member
>>>       >      >
>>>       >      > Also check whether a line like this is appearing in
>>>       >      > /var/log/messages:
>>>       >      > WARN: cib_peer_callback: Discarding cib_apply_diff message
>>>       >      > (322) from server2: not in our membership
>>>       >      >
>>>       >      > Send logs to a web service such as pastebin.com.
>>>       >      >
>>>       >      > 2012/10/25 James Guthrie <[email protected]>
>>>       >      >
>>>       >      >     Hi all,
>>>       >      >
>>>       >      >     I've been battling with this problem for a few hours now,
>>>       >     I've gone over
>>>       >      >     the obvious errors that it could have been with the
>>>      guys in
>>>       >     the linux-ha
>>>       >      >     IRC. I'd really like some help in trying to solve this
>>>      problem.
>>>       >      >
>>>       >      >     I have a two node corosync/pacemaker cluster
>>>      (corosync: 2.0.1
>>>       >     pacemaker:
>>>       >      >     1.1.8). I can get the cluster to work fine, but I can
>>>       >     also very easily
>>>       >      >     get the cluster into a state from which it seems unable
>>>       >     to
>>>       >     recover. All
>>>       >      >     I have to do is reboot one of the cluster node's
>>>      hosts. When
>>>       >     doing so,
>>>       >      >     any resources that were running on it are transferred
>>>      to the
>>>       >     second
>>>       >      >     host. When the host comes back up though it appears as
>>>       >     OFFLINE in the
>>>       >      >     crm_mon of both cluster nodes.
>>>       >      >
>>>       >      >     Regardless of what I do on the "offline" host, nothing
>>>      gets
>>>       >     better. If I
>>>       >      >     however stop and restart corosync/pacemaker on the other
>>>       >     "online" host,
>>>       >      >     then everything seems to work again.
>>>       >      >
>>>       >      >     I tried waiting a while with one node offline, after a
>>>      while
>>>       >     the online
>>>       >      >     node went offline, stating that the other node was now
>>>       >     offline. For a
>>>       >      >     few minutes the output of crm_mon was different on
>>>      both hosts
>>>       >     (both
>>>       >      >     thought the other was online, they were offline). Then
>>>      finally it
>>>       >      >     settled in the exact opposite state as previously.
>>>       >      >
>>>       >      >     I've had a long look through the logs but I don't seem
>>>      to be
>>>       >     able to
>>>       >      >     pinpoint anything particular that tells me that there is
>>>       >     a
>>>       >     reason for
>>>       >      >     that host failing to be online.
>>>       >      >
>>>       >      >     I'd like to attach the logs, but thought that approx 1500
>>>       >     lines of
>>>       >      >     additional text in this e-mail might be a bit too much.
>>>       >      >
>>>       >      >     How should I best attach the logs and config files? Which
>>>       >     parts of the
>>>       >      >     logs and config files would most likely reveal the
>>>      problem in
>>>       >     this case?
>>>       >      >
>>>       >      >     Regards,
>>>       >      >     James
>>>       >      >
>>>       >      >     _______________________________________________
>>>       >      >     Linux-HA mailing list
>>>       >      >     [email protected]
>>>       >      >     http://lists.linux-ha.org/mailman/listinfo/linux-ha
>>>       >      >     See also: http://linux-ha.org/ReportingProblems
>>>       >      >
>>>       >      >
>>>       >
>>>
>>
>
>
>
