Yep, my network engineer and I found that the multicast packets were being blocked by the underlying hypervisor for the VM systems. At first we thought it was just iptables on the servers, but I was certain I had actually turned that off. The issue has been bumped up to the operations team for a fix, but since I've gotten it to work with unicast, there's no pressure.
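For anyone reading the archive later, here is a minimal sketch of the unicast setting Digimer describes further down the thread, assuming the transport attribute belongs on the <cman> element rather than on <cluster>. The thread does not say exactly what was finally changed, so treat this as an illustration only. The config_version of 12 is an assumption (one higher than the 11 quoted in this thread), and the rest of the file is left as it was.

```xml
<!-- Fragment of /etc/cluster/cluster.conf; unchanged sections are elided. -->
<!-- config_version="12" is an assumed increment from the 11 shown in the thread. -->
<cluster config_version="12" name="pgdb_cluster">
  <!-- fence_daemon and clusternodes sections unchanged -->
  <!-- UDP unicast is requested on the cman element, not on cluster -->
  <cman transport="udpu"/>
  <!-- fencedevices and rm sections unchanged -->
</cluster>
```

With this layout no unicast addresses need to be listed; as noted below, corosync resolves the node names to IPs itself.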
Sent from my iPad

> On Oct 21, 2014, at 3:15 PM, Digimer <[email protected]> wrote:
>
> Glad you sorted it out!
>
> So then, it was almost certainly a multicast issue. I would still strongly
> recommend trying to source and fix the problem, and reverting to mcast if you
> can. More efficient. :)
>
> digimer
>
>> On 21/10/14 02:59 PM, John Scalia wrote:
>> Ok, got it working after a little more effort, and the cluster is now
>> properly reporting.
>>
>>> On Tue, Oct 21, 2014 at 1:34 PM, John Scalia <[email protected]> wrote:
>>>
>>> So, I set transport="udpu" in the cluster.conf file, and it now looks
>>> like this:
>>>
>>> <cluster config_version="11" name="pgdb_cluster" transport="udpu">
>>>
>>>   <fence_daemon/>
>>>   <clusternodes>
>>>     <clusternode name="csgha1" nodeid="1">
>>>       <fence>
>>>         <method name="pcmk-redirect">
>>>           <device name="pcmk" port="csgha1"/>
>>>         </method>
>>>       </fence>
>>>     </clusternode>
>>>     <clusternode name="csgha2" nodeid="2">
>>>       <fence>
>>>         <method name="pcmk-redirect">
>>>           <device name="pcmk" port="csgha2"/>
>>>         </method>
>>>       </fence>
>>>     </clusternode>
>>>     <clusternode name="csgha3" nodeid="3">
>>>       <fence>
>>>         <method name="pcmk-redirect">
>>>           <device name="pcmk" port="csgha3"/>
>>>         </method>
>>>       </fence>
>>>     </clusternode>
>>>   </clusternodes>
>>>   <cman/>
>>>   <fencedevices>
>>>     <fencedevice agent="fence_pcmk" name="pcmk"/>
>>>   </fencedevices>
>>>   <rm>
>>>     <failoverdomains/>
>>>     <resources/>
>>>   </rm>
>>> </cluster>
>>>
>>> But, after restarting the cluster I don't see any difference. Did I do
>>> something wrong?
>>> --
>>> Jay
>>>
>>>> On Tue, Oct 21, 2014 at 12:25 PM, Digimer <[email protected]> wrote:
>>>>
>>>> No, you don't need to specify anything in cluster.conf for unicast to
>>>> work. Corosync will divine the IPs by resolving the node names to IPs. If
>>>> you set multicast and don't want to use the auto-selected mcast IP, then
>>>> you can specify the mcast IP group to use via <multicast... />.
>>>>
>>>> digimer
>>>>
>>>>
>>>>> On 21/10/14 12:22 PM, John Scalia wrote:
>>>>>
>>>>> OK, looking at the cman man page on this system, I see the line saying
>>>>> "the corosync.conf file is not used." So, I'm guessing I need to set a
>>>>> unicast address somewhere in the cluster.conf file, but the man page
>>>>> only mentions the <multicast addr="..."/> parameter. What can I use to
>>>>> set this to a unicast address for ports 5404 and 5405? I'm assuming I
>>>>> can't just put a unicast address for the multicast parameter, and the
>>>>> man page for cluster.conf wasn't much help either.
>>>>>
>>>>> We're still working on having the security team permit these 3 systems
>>>>> to use multicast.
>>>>>
>>>>>> On 10/21/2014 11:51 AM, Digimer wrote:
>>>>>>
>>>>>> Keep us posted. :)
>>>>>>
>>>>>>> On 21/10/14 08:40 AM, John Scalia wrote:
>>>>>>>
>>>>>>> I've been checking hostname resolution this morning, and all the systems
>>>>>>> are listed in each /etc/hosts file (No DNS in this environment.) and
>>>>>>> ping works on every system both to itself and all the other systems. At
>>>>>>> least it's working on the 10.10.1.0/24 network.
>>>>>>>
>>>>>>> I ran tcpdump trying to see what traffic is on port 5405 on each system,
>>>>>>> and I'm only seeing outbound on each, even though netstat shows each is
>>>>>>> listening on the multicast address. My suspicion is that the router is
>>>>>>> eating the multicast broadcasts, so I may try the unicast address
>>>>>>> instead, but I'm waiting on one of our network engineers to see if my
>>>>>>> suspicion is correct about the router. He volunteered to help late
>>>>>>> yesterday.
>>>>>>>
>>>>>>>> On 10/20/2014 4:34 PM, Digimer wrote:
>>>>>>>>
>>>>>>>> It looks sane on the surface. The 'gethostip' tool comes from the
>>>>>>>> 'syslinux' package, and it's really handy! The '-d' says to give the
>>>>>>>> IP in dotted-decimal notation only.
>>>>>>>>
>>>>>>>> What I was trying to see was whether the 'uname -n' resolved to the IP
>>>>>>>> on the same network card as the other nodes. This is how corosync
>>>>>>>> decides which interface to send cluster traffic onto. I suspect you
>>>>>>>> might have a general network issue, possibly related to multicast.
>>>>>>>> (Some switches and some hypervisor virtual networks don't play nice
>>>>>>>> with corosync).
>>>>>>>>
>>>>>>>> Have you tried unicast? If not, try setting the <cman ../> element to
>>>>>>>> have the <cman transport="udpu" ... /> attribute. Do note that unicast
>>>>>>>> isn't as efficient as multicast, so though it might work, I'd
>>>>>>>> personally treat it as a debug tool to isolate the source of the
>>>>>>>> problem.
>>>>>>>>
>>>>>>>> cheers
>>>>>>>>
>>>>>>>> digimer
>>>>>>>>
>>>>>>>> PS - Can you share your pacemaker configuration?
>>>>>>>>
>>>>>>>>> On 20/10/14 03:40 PM, John Scalia wrote:
>>>>>>>>>
>>>>>>>>> Sure, and thanks for helping.
>>>>>>>>>
>>>>>>>>> Here's the /etc/cluster/cluster.conf file and it is identical on all
>>>>>>>>> three systems:
>>>>>>>>>
>>>>>>>>> <cluster config_version="11" name="pgdb_cluster">
>>>>>>>>>   <fence_daemon/>
>>>>>>>>>   <clusternodes>
>>>>>>>>>     <clusternode name="csgha1" nodeid="1">
>>>>>>>>>       <fence>
>>>>>>>>>         <method name="pcmk-redirect">
>>>>>>>>>           <device name="pcmk" port="csgha1"/>
>>>>>>>>>         </method>
>>>>>>>>>       </fence>
>>>>>>>>>     </clusternode>
>>>>>>>>>     <clusternode name="csgha2" nodeid="2">
>>>>>>>>>       <fence>
>>>>>>>>>         <method name="pcmk-redirect">
>>>>>>>>>           <device name="pcmk" port="csgha2"/>
>>>>>>>>>         </method>
>>>>>>>>>       </fence>
>>>>>>>>>     </clusternode>
>>>>>>>>>     <clusternode name="csgha3" nodeid="3">
>>>>>>>>>       <fence>
>>>>>>>>>         <method name="pcmk-redirect">
>>>>>>>>>           <device name="pcmk" port="csgha3"/>
>>>>>>>>>         </method>
>>>>>>>>>       </fence>
>>>>>>>>>     </clusternode>
>>>>>>>>>   </clusternodes>
>>>>>>>>>   <cman/>
>>>>>>>>>   <fencedevices>
>>>>>>>>>     <fencedevice agent="fence_pcmk" name="pcmk"/>
>>>>>>>>>   </fencedevices>
>>>>>>>>>   <rm>
>>>>>>>>>     <failoverdomains/>
>>>>>>>>>     <resources/>
>>>>>>>>>   </rm>
>>>>>>>>> </cluster>
>>>>>>>>>
>>>>>>>>> uname -n reports "csgha1" on that system, "csgha2" on its system, and
>>>>>>>>> "csgha3" on the last system.
>>>>>>>>> I don't seem to have gethostip on any of these systems, so I don't
>>>>>>>>> know if the next section helps or not.
>>>>>>>>> "ifconfig -a" reports:
>>>>>>>>>   csgha1: eth0 = 172.17.1.21
>>>>>>>>>           eth1 = 10.10.1.128
>>>>>>>>>   csgha2: eth0 = 10.10.1.129
>>>>>>>>>           eth1 = 172.17.1.3
>>>>>>>>>   (Yeah, I know csgha2 looks a little weird, but it was the way our
>>>>>>>>>   automated VM control did the interfaces.)
>>>>>>>>>   csgha3: eth0 = 172.17.1.23
>>>>>>>>>           eth1 = 10.10.1.130
>>>>>>>>> The /etc/hosts file on each system only has the 10.10.1.0/24
>>>>>>>>> address for each system in it.
>>>>>>>>> iptables is not running on these systems.
>>>>>>>>>
>>>>>>>>> Let me know if you need more information, and I very much appreciate
>>>>>>>>> your assistance.
>>>>>>>>> --
>>>>>>>>> Jay
>>>>>>>>>
>>>>>>>>> On Mon, Oct 20, 2014 at 3:18 PM, Digimer <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On 20/10/14 02:50 PM, John Scalia wrote:
>>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> I'm trying to build my first ever HA cluster and I'm using 3 VMs
>>>>>>>>>>> running CentOS 6.5. I followed the instructions to the letter at:
>>>>>>>>>>>
>>>>>>>>>>> http://clusterlabs.org/quickstart-redhat.html
>>>>>>>>>>>
>>>>>>>>>>> and everything appears to start normally, but if I run "cman_tool
>>>>>>>>>>> nodes -a", I only see:
>>>>>>>>>>>
>>>>>>>>>>> Node  Sts   Inc   Joined               Name
>>>>>>>>>>>    1   M     64   2014-10-20 14:00:00  csgha1
>>>>>>>>>>>        Addresses: 10.10.1.128
>>>>>>>>>>>    2   X      0                        csgha2
>>>>>>>>>>>    3   X      0                        csgha3
>>>>>>>>>>>
>>>>>>>>>>> In the other systems, the output is the same except for which
>>>>>>>>>>> system is shown as joined.
>>>>>>>>>>> Each shows just itself as belonging to the cluster.
>>>>>>>>>>> Also, "pcs status" reflects similarly with non-self systems showing
>>>>>>>>>>> offline. I've checked "netstat -an" and see each machine listening on
>>>>>>>>>>> ports 5404 and 5405. And the logs are rather involved, but I'm not
>>>>>>>>>>> seeing errors in them.
>>>>>>>>>>>
>>>>>>>>>>> Any ideas for where to look for what's causing them to not
>>>>>>>>>>> communicate?
>>>>>>>>>>> --
>>>>>>>>>>> Jay
>>>>>>>>>> Can you share your cluster.conf file please? Also, for each node:
>>>>>>>>>>
>>>>>>>>>> * uname -n
>>>>>>>>>> * gethostip -d $(uname -n)
>>>>>>>>>> * ifconfig |grep -B 1 $(gethostip -d $(uname -n)) | grep HWaddr | awk '{ print $1 }'
>>>>>>>>>> * iptables-save | grep -i multi
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Digimer
>>>>>>>>>> Papers and Projects: https://alteeve.ca/w/
>>>>>>>>>> What if the cure for cancer is trapped in the mind of a person
>>>>>>>>>> without access to education?
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Linux-HA mailing list
>>>>>>>>>> [email protected]
>>>>>>>>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>>>>>>>>>> See also: http://linux-ha.org/ReportingProblems
