Re: [Linux-HA] New user can't get cman to recognize other systems

jayknowsunix Tue, 21 Oct 2014 13:37:28 -0700

Yep, my network engineer and I found that the multicast packets were being 
blocked by the underlying hypervisor for the VM systems. At first we thought it 
was just iptables on the servers, but I was certain I had actually turned that 
off. The issue has been bumped up to the operations team for a fix, but 
since I've gotten it to work with unicast, there's no pressure.
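
For reference, here is a minimal sketch of the unicast setting discussed in
the quoted thread. It assumes the transport attribute belongs on the <cman/>
element, as Digimer suggested, rather than on <cluster>. The thread does not
record the exact final file, so treat this as an illustration only. The
config_version value of 12 is an assumption (one higher than the 11 quoted),
and the clusternode entries are elided here because they stay unchanged.

```xml
<!-- Sketch only: same layout as the cluster.conf quoted in this thread,
     with transport="udpu" placed on <cman> instead of <cluster>.
     config_version="12" is an assumed value, one higher than the 11 shown. -->
<cluster config_version="12" name="pgdb_cluster">
  <fence_daemon/>
  <clusternodes>
    <!-- clusternode entries for csgha1, csgha2 and csgha3 stay as quoted -->
  </clusternodes>
  <!-- udpu selects UDP unicast, so no multicast group is needed -->
  <cman transport="udpu"/>
  <fencedevices>
    <fencedevice agent="fence_pcmk" name="pcmk"/>
  </fencedevices>
  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
</cluster>
```

The same file would need to be present on all three nodes, and cman restarted
on each, before "cman_tool nodes -a" could be expected to show the other
members.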



> On Oct 21, 2014, at 3:15 PM, Digimer <[email protected]> wrote:
> 
> Glad you sorted it out!
> 
> So then, it was almost certainly a multicast issue. I would still strongly 
> recommend trying to source and fix the problem, and reverting to mcast if you 
> can. More efficient. :)
> 
> digimer
> 
>> On 21/10/14 02:59 PM, John Scalia wrote:
>> Ok, got it working after a little more effort, and the cluster is now
>> properly reporting.
>> 
>>> On Tue, Oct 21, 2014 at 1:34 PM, John Scalia <[email protected]> wrote:
>>> 
>>> So, I set transport="udpu" in the cluster.conf file, and it now looks
>>> like this:
>>> 
>>> <cluster config_version="11" name="pgdb_cluster" transport="udpu">
>>> 
>>>   <fence_daemon/>
>>>   <clusternodes>
>>>     <clusternode name="csgha1" nodeid="1">
>>>       <fence>
>>>         <method name="pcmk-redirect">
>>>           <device name="pcmk" port="csgha1"/>
>>>         </method>
>>>       </fence>
>>>     </clusternode>
>>>     <clusternode name="csgha2" nodeid="2">
>>>       <fence>
>>>         <method name="pcmk-redirect">
>>>           <device name="pcmk" port="csgha2"/>
>>>         </method>
>>>       </fence>
>>>     </clusternode>
>>>     <clusternode name="csgha3" nodeid="3">
>>>       <fence>
>>>         <method name="pcmk-redirect">
>>>           <device name="pcmk" port="csgha3"/>
>>>         </method>
>>>       </fence>
>>>     </clusternode>
>>>   </clusternodes>
>>>   <cman/>
>>>   <fencedevices>
>>>     <fencedevice agent="fence_pcmk" name="pcmk"/>
>>>   </fencedevices>
>>>   <rm>
>>>     <failoverdomains/>
>>>     <resources/>
>>>   </rm>
>>> </cluster>
>>> 
>>> But, after restarting the cluster I don't see any difference. Did I do
>>> something wrong?
>>> --
>>> Jay
>>> 
>>>> On Tue, Oct 21, 2014 at 12:25 PM, Digimer <[email protected]> wrote:
>>>> 
>>>> No, you don't need to specify anything in cluster.conf for unicast to
>>>> work. Corosync will divine the IPs by resolving the node names to IPs. If
>>>> you set multicast and don't want to use the auto-selected mcast IP, then
>>>> you can specify the mcast IP group to use via <multicast... />.
>>>> 
>>>> digimer
>>>> 
>>>> 
>>>>> On 21/10/14 12:22 PM, John Scalia wrote:
>>>>> 
>>>>> OK, looking at the cman man page on this system, I see the line saying
>>>>> "the corosync.conf file is not used." So, I'm guessing I need to set a
>>>>> unicast address somewhere in the cluster.conf file, but the man page
>>>>> only mentions the <multicast addr="..."/> parameter. What can I use to
>>>>> set this to a unicast address for ports 5404 and 5405? I'm assuming I
>>>>> can't just put a unicast address for the multicast parameter, and the
>>>>> man page for cluster.conf wasn't much help either.
>>>>> 
>>>>> We're still working on having the security team permit these 3 systems
>>>>> to use multicast.
>>>>> 
>>>>>> On 10/21/2014 11:51 AM, Digimer wrote:
>>>>>> 
>>>>>> Keep us posted. :)
>>>>>> 
>>>>>>> On 21/10/14 08:40 AM, John Scalia wrote:
>>>>>>> 
>>>>>>> I've been checking hostname resolution this morning, and all the systems
>>>>>>> are listed in each /etc/hosts file (No DNS in this environment.) and
>>>>>>> ping works on every system both to itself and all the other systems. At
>>>>>>> least it's working on the 10.10.1.0/24 network.
>>>>>>> 
>>>>>>> I ran tcpdump trying to see what traffic is on port 5405 on each
>>>>>>> system, and I'm only seeing outbound on each, even though netstat
>>>>>>> shows each is
>>>>>>> listening on the multicast address. My suspicion is that the router is
>>>>>>> eating the multicast broadcasts, so I may try the unicast address
>>>>>>> instead, but I'm waiting on one of our network engineers to see if my
>>>>>>> suspicion is correct about the router. He volunteered to help late
>>>>>>> yesterday.
>>>>>>> 
>>>>>>>> On 10/20/2014 4:34 PM, Digimer wrote:
>>>>>>>> 
>>>>>>>> It looks sane on the surface. The 'gethostip' tool comes from the
>>>>>>>> 'syslinux' package, and it's really handy! The '-d' says to give the
>>>>>>>> IP in dotted-decimal notation only.
>>>>>>>> 
>>>>>>>> What I was trying to see was whether the 'uname -n' resolved to the IP
>>>>>>>> on the same network card as the other nodes. This is how corosync
>>>>>>>> decides which interface to send cluster traffic onto. I suspect you
>>>>>>>> might have a general network issue, possibly related to multicast.
>>>>>>>> (Some switches and some hypervisor virtual networks don't play nice
>>>>>>>> with corosync).
>>>>>>>> 
>>>>>>>> Have you tried unicast? If not, try setting the <cman ../> element to
>>>>>>>> have the <cman transport="udpu" ... /> attribute. Do note that unicast
>>>>>>>> isn't as efficient as multicast, so though it might work, I'd
>>>>>>>> personally treat it as a debug tool to isolate the source of the
>>>>>>>> problem.
>>>>>>>> 
>>>>>>>> cheers
>>>>>>>> 
>>>>>>>> digimer
>>>>>>>> 
>>>>>>>> PS - Can you share your pacemaker configuration?
>>>>>>>> 
>>>>>>>>> On 20/10/14 03:40 PM, John Scalia wrote:
>>>>>>>>> 
>>>>>>>>> Sure, and thanks for helping.
>>>>>>>>> 
>>>>>>>>> Here's the /etc/cluster/cluster.conf file and it is identical on all
>>>>>>>>> three
>>>>>>>>> systems:
>>>>>>>>> 
>>>>>>>>> <cluster config_version="11" name="pgdb_cluster">
>>>>>>>>>    <fence_daemon/>
>>>>>>>>>    <clusternodes>
>>>>>>>>>      <clusternode name="csgha1" nodeid="1">
>>>>>>>>>        <fence>
>>>>>>>>>          <method name="pcmk-redirect">
>>>>>>>>>            <device name="pcmk" port="csgha1"/>
>>>>>>>>>          </method>
>>>>>>>>>        </fence>
>>>>>>>>>      </clusternode>
>>>>>>>>>      <clusternode name="csgha2" nodeid="2">
>>>>>>>>>        <fence>
>>>>>>>>>          <method name="pcmk-redirect">
>>>>>>>>>            <device name="pcmk" port="csgha2"/>
>>>>>>>>>          </method>
>>>>>>>>>        </fence>
>>>>>>>>>      </clusternode>
>>>>>>>>>      <clusternode name="csgha3" nodeid="3">
>>>>>>>>>        <fence>
>>>>>>>>>          <method name="pcmk-redirect">
>>>>>>>>>            <device name="pcmk" port="csgha3"/>
>>>>>>>>>          </method>
>>>>>>>>>        </fence>
>>>>>>>>>      </clusternode>
>>>>>>>>>    </clusternodes>
>>>>>>>>>    <cman/>
>>>>>>>>>    <fencedevices>
>>>>>>>>>      <fencedevice agent="fence_pcmk" name="pcmk"/>
>>>>>>>>>    </fencedevices>
>>>>>>>>>    <rm>
>>>>>>>>>      <failoverdomains/>
>>>>>>>>>      <resources/>
>>>>>>>>>    </rm>
>>>>>>>>> </cluster>
>>>>>>>>> 
>>>>>>>>> uname -n reports "csgha1" on that system, "csgha2" on its system, and
>>>>>>>>> "csgha3" on the last system.
>>>>>>>>> I don't seem to have gethostip on any of these systems, so I don't
>>>>>>>>> know if
>>>>>>>>> the next section helps or not.
>>>>>>>>> "ifconfig -a" reports:
>>>>>>>>>   csgha1: eth0 = 172.17.1.21
>>>>>>>>>           eth1 = 10.10.1.128
>>>>>>>>>   csgha2: eth0 = 10.10.1.129
>>>>>>>>>           eth1 = 172.17.1.3
>>>>>>>>>   csgha3: eth0 = 172.17.1.23
>>>>>>>>>           eth1 = 10.10.1.130
>>>>>>>>> Yeah, I know csgha2 looks a little weird, but it was the way our
>>>>>>>>> automated VM control did the interfaces.
>>>>>>>>> The /etc/hosts file on each system only has the 10.10.1.0/24
>>>>>>>>> address for each system in it.
>>>>>>>>> iptables is not running on these systems.
>>>>>>>>> 
>>>>>>>>> Let me know if you need more information, and I very much appreciate
>>>>>>>>> your
>>>>>>>>> assistance.
>>>>>>>>> --
>>>>>>>>> Jay
>>>>>>>>> 
>>>>>>>>> On Mon, Oct 20, 2014 at 3:18 PM, Digimer <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>>  On 20/10/14 02:50 PM, John Scalia wrote:
>>>>>>>>>> 
>>>>>>>>>>  Hi all,
>>>>>>>>>>> 
>>>>>>>>>>> I'm trying to build my first ever HA cluster and I'm using 3 VMs
>>>>>>>>>>> running
>>>>>>>>>>> CentOS 6.5. I followed the instructions to the letter at:
>>>>>>>>>>> 
>>>>>>>>>>> http://clusterlabs.org/quickstart-redhat.html
>>>>>>>>>>> 
>>>>>>>>>>> and everything appears to start normally, but if I run "cman_tool
>>>>>>>>>>> nodes
>>>>>>>>>>> -a", I only see:
>>>>>>>>>>> 
>>>>>>>>>>> Node  Sts   Inc   Joined               Name
>>>>>>>>>>>    1   M     64   2014-10-20 14:00:00  csgha1
>>>>>>>>>>>        Addresses: 10.10.1.128
>>>>>>>>>>>    2   X      0                        csgha2
>>>>>>>>>>>    3   X      0                        csgha3
>>>>>>>>>>> 
>>>>>>>>>>> In the other systems, the output is the same except for which
>>>>>>>>>>> system is
>>>>>>>>>>> shown as joined. Each shows just itself as belonging to the
>>>>>>>>>>> cluster.
>>>>>>>>>>> Also, "pcs status" reflects similarly with non-self systems showing
>>>>>>>>>>> offline. I've checked "netstat -an" and see each machine listening on
>>>>>>>>>>> ports 5404 and 5405. And the logs are rather involved, but I'm not
>>>>>>>>>>> seeing errors in them.
>>>>>>>>>>> 
>>>>>>>>>>> Any ideas for where to look for what's causing them to not
>>>>>>>>>>> communicate?
>>>>>>>>>>> --
>>>>>>>>>>> Jay
>>>>>>>>>> Can you share your cluster.conf file please? Also, for each node:
>>>>>>>>>> 
>>>>>>>>>> * uname -n
>>>>>>>>>> * gethostip -d $(uname -n)
>>>>>>>>>> * ifconfig | grep -B 1 $(gethostip -d $(uname -n)) | grep HWaddr | awk '{ print $1 }'
>>>>>>>>>> * iptables-save | grep -i multi
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> Digimer
>>>>>>>>>> Papers and Projects: https://alteeve.ca/w/
>>>>>>>>>> What if the cure for cancer is trapped in the mind of a person
>>>>>>>>>> without
>>>>>>>>>> access to education?
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Linux-HA mailing list
>>>>>>>>>> [email protected]
>>>>>>>>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>>>>>>>>>> See also: http://linux-ha.org/ReportingProblems
>>>>>>>>>> 
