Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless
(putting this back on the list where others can reply as well, and if we solve it, the solution will be google-ized)

According to your debug output:

>> [apex-backpack:31956] btl: tcp: attempting to connect() to address
>> 10.11.14.203 on port 9360

It *is* trying to connect to the right IP address. Are you able to ping .203 from apex-backpack?

I also notice that your ethernet configuration does not exactly match between linux and osx:

en0: flags=8863 mtu 1500
        inet 10.11.14.203 netmask 0xf000 broadcast 10.11.15.255

wlan0     Link encap:Ethernet  HWaddr 00:21:79:c2:54:c7
          inet addr:10.11.14.205  Bcast:10.11.14.255  Mask:255.255.240.0

On Sep 22, 2009, at 9:26 PM, Pallab Datta wrote:

> There is no firewall running between the machines. I tried using the IP
> address instead of localhost but it gave me the same output. MPI is not
> even timing out; it just keeps hanging forever.. :(
>
> I have disabled the ethernet interface on the linux box, keeping only the
> wireless up. On the mac I only have the ethernet turned on. My mac is an
> 8-core mac pro.
>
> Please help me debug this..
> thanks in advance, regards,
> pallab
>
>> (only replying to users list)
>>
>> Some suggestions:
>>
>> - MPI seems to start up, but the additional TCP connections required for
>>   MPI connections seem to be failing / timing out / some other error.
>> - Are you running firewalls between your machines? If so, can you
>>   disable them?
>> - I see that you're specifying "--mca btl_tcp_port_min_v4 36900" but
>>   one of the debug lines reads:
>>   [apex-backpack:31956] btl: tcp: attempting to connect() to address
>>   10.11.14.203 on port 9360
>> - Try not using the name "localhost", but rather the IP address of the
>>   local machine
>>
>> On Sep 22, 2009, at 5:27 PM, Pallab Datta wrote:
>>
>>> The following are the ifconfig for both the Mac and the Linux
>>> respectively:
>>>
>>> fuji:openmpi-1.3.3 pallabdatta$ ifconfig
>>> lo0: flags=8049 mtu 16384
>>>         inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
>>>         inet 127.0.0.1 netmask 0xff00
>>>         inet6 ::1 prefixlen 128
>>> gif0: flags=8010 mtu 1280
>>> stf0: flags=0<> mtu 1280
>>> en0: flags=8863 mtu 1500
>>>         inet6 fe80::21f:5bff:fe3d:eaac%en0 prefixlen 64 scopeid 0x4
>>>         inet 10.11.14.203 netmask 0xf000 broadcast 10.11.15.255
>>>         ether 00:1f:5b:3d:ea:ac
>>>         media: autoselect (100baseTX ) status: active
>>>         supported media: autoselect 10baseT/UTP 10baseT/UTP
>>>         10baseT/UTP 10baseT/UTP 100baseTX 100baseTX 100baseTX
>>>         100baseTX 1000baseT 1000baseT 1000baseT
>>> en1: flags=8863 mtu 1500
>>>         ether 00:1f:5b:3d:ea:ad
>>>         media: autoselect status: inactive
>>>         supported media: autoselect 10baseT/UTP 10baseT/UTP
>>>         10baseT/UTP 10baseT/UTP 100baseTX 100baseTX 100baseTX
>>>         100baseTX 1000baseT 1000baseT 1000baseT
>>> fw0: flags=8863 mtu 4078
>>>         lladdr 00:22:41:ff:fe:ed:7d:a8
>>>         media: autoselect status: inactive
>>>         supported media: autoselect
>>>
>>> LINUX:
>>>
>>> pallabdatta@apex-backpack:~/backpack/src$ ifconfig
>>> lo        Link encap:Local Loopback
>>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>>           inet6 addr: ::1/128 Scope:Host
>>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>>           RX packets:116 errors:0 dropped:0 overruns:0 frame:0
>>>           TX packets:116 errors:0 dropped:0 overruns:0 carrier:0
>>>           collisions:0 txqueuelen:0
>>>           RX bytes:11788 (11.7 KB)  TX bytes:11788 (11.7 KB)
>>>
>>> wlan0     Link encap:Ethernet  HWaddr 00:21:79:c2:54:c7
>>>           inet addr:10.11.14.205  Bcast:10.11.14.255  Mask:255.255.240.0
>>>           inet6 addr: fe80::221:79ff:fec2:54c7/64 Scope:Link
>>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>           RX packets:72531 errors:0 dropped:0 overruns:0 frame:0
>>>           TX packets:28894 errors:0 dropped:0 overruns:0 carrier:0
>>>           collisions:0 txqueuelen:1000
>>>           RX bytes:5459312 (5.4 MB)  TX bytes:7264193 (7.2 MB)
>>>
>>> wmaster0  Link encap:UNSPEC  HWaddr 00-21-79-C2-54-C7-34-63-00-00-00-00-00-00-00-00
>>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>>           collisions:0 txqueuelen:1000
>>>           RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
>>>
>>> The Mac is a Mac Pro with two 2.26GHz Quad-Core Intel Xeons, and the
>>> Linux box runs Ubuntu Server Edition 9.04. The Mac connects to the
>>> network over its ethernet interface and the Linux box connects via a
>>> wireless adapter (IOGEAR).
>>>
>>> Please help me fix this issue in any way you can. It really needs to
>>> work for our project.
>>> thanks in advance, regards,
>>> pallab
>>>
>>>> My other concern was the following but I am not sure it applies here.
>>>> If you have multiple interfaces on the node, and they are on the same
>>>> subnet, then you cannot actually select what IP address to go out of.
>>>> You can only select the IP address you want to connect to. In these
>>>> cases, I have seen a hang because we think we are selecting an IP
>>>> address to go out of, but it actually goes out the other one. Perhaps
>>>> you can send the User's list the
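
For the broadcast mismatch noted above, it may help to derive the broadcast address from the IP address and netmask rather than compare the two ifconfig outputs by eye. The following is a minimal sketch using Python's standard `ipaddress` module. It assumes both interfaces use the 255.255.240.0 mask shown for wlan0; the Mac's netmask looks truncated in the archived output, so treating it as the same mask is an assumption. The helper name `broadcast_for` is made up for this illustration.

```python
import ipaddress


def broadcast_for(address, netmask):
    """Return the broadcast address implied by an IPv4 address and netmask."""
    iface = ipaddress.ip_interface(f"{address}/{netmask}")
    return iface.network.broadcast_address


# Addresses taken from the ifconfig output in this thread.
# The shared netmask is an assumption (see the note above).
mac_bcast = broadcast_for("10.11.14.203", "255.255.240.0")
linux_bcast = broadcast_for("10.11.14.205", "255.255.240.0")

print("Mac   en0   expected broadcast:", mac_bcast)
print("Linux wlan0 expected broadcast:", linux_bcast)
```

Under that assumption both hosts sit in the same /20 network and the derived broadcast address is 10.11.15.255, which is the value en0 reports. Whether the differing value on wlan0 has anything to do with the hang is not established in this thread.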
Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless
Hi All,

Yes, I can ping and ssh from apex-backpack to my Mac (fuji.local). I fixed the wireless broadcast address so that both ends now show the same value (10.11.14.255), but the problem still persists.

I have tried other wireless adapters as well, but no luck so far. Please let me know what can be done...

regards, pallab

> (putting this back on the list where others can reply as well, and if
> we solve it, the solution will be google-ized)
>
> According to your debug output:
>
>>> [apex-backpack:31956] btl: tcp: attempting to connect() to address
>>> 10.11.14.203 on port 9360
>
> It *is* trying to connect to the right IP address. Are you able to
> ping to .203 from apex-backpack?
>
> I also notice that your ethernet configuration does not exactly match
> between linux and osx:
>
> en0: flags=8863 mtu 1500
>         inet 10.11.14.203 netmask 0xf000 broadcast 10.11.15.255
>
> wlan0     Link encap:Ethernet  HWaddr 00:21:79:c2:54:c7
>           inet addr:10.11.14.205  Bcast:10.11.14.255  Mask:255.255.240.0
Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless
Have you tried running without limiting the port range?

On Sep 24, 2009, at 12:39 PM, Pallab Datta wrote:

> Hi All,
>
> Yes I can ping and ssh from apex-backpack to my Mac (fuji.local).
> I fixed the wireless broadcast to reflect the same on both ends
> (10.11.14.255) but still the problem persists.
>
> I have tried other wireless adapters as well. But no luck till far.
> Please let me know what can be done...
> regards, pallab
Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless
Yes, I had tried that. Initially it (apex-backpack) was trying to connect to the Mac (10.11.14.203) at port number 4, which is too low. That's why I set the port range higher.

> Have you tried running without limiting the port range?
>
> On Sep 24, 2009, at 12:39 PM, Pallab Datta wrote:
>
>> Hi All,
>>
>> Yes I can ping and ssh from apex-backpack to my Mac (fuji.local).
>> I fixed the wireless broadcast to reflect the same on both ends
>> (10.11.14.255) but still the problem persists.
>>
>> I have tried other wireless adapters as well. But no luck till far.
>> Please let me know what can be done...
>> regards, pallab
Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless
On Sep 24, 2009, at 12:54 PM, Pallab Datta wrote:

> Yes I had tried that initially it (apex-backpack) was trying to connect
> the Mac (10.11.14.203) at port number 4 which is too low. So that's why I
> made the port range higher..

Port 4? OMPI should never connect at port 4; it's privileged. Was that in the debug output?

--
Jeff Squyres
jsquy...@cisco.com
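
One arithmetic observation on the two odd port numbers in this thread, offered as something to check rather than as a diagnosis: 9360 is exactly 36900 with its two bytes swapped, and 4 is exactly 1024 with its two bytes swapped. 36900 is the value passed to btl_tcp_port_min_v4 in the quoted command line. That 1024 is the default minimum port is an assumption here, not something stated in the thread. If so, the debug line may simply be printing a port in network byte order, but only the Open MPI source can confirm whether that is a display quirk or a real connect() to the wrong port. The helper `swap16` below is purely illustrative.

```python
def swap16(port):
    """Swap the two bytes of a 16-bit port number (host <-> network order
    on a little-endian machine)."""
    return int.from_bytes(port.to_bytes(2, "big"), "little")


# 36900 is the requested btl_tcp_port_min_v4; 1024 is assumed to be the default.
for requested in (36900, 1024):
    print(f"{requested:5d} (0x{requested:04x}) byte-swapped is "
          f"{swap16(requested):5d} (0x{swap16(requested):04x})")
```

Swapping twice returns the original value, so the same helper converts in both directions.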
Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless
Yes, it came up in the debug output when I turned on verbose mode. I knew it's privileged, so that's why I explicitly asked it to connect to a higher port, but it still blocks there.. :(

> On Sep 24, 2009, at 12:54 PM, Pallab Datta wrote:
>
>> Yes I had tried that initially it (apex-backpack) was trying to
>> connect the Mac (10.11.14.203) at port number 4 which is too low.
>> So that's why I made the port range higher..
>
> Port 4? OMPI should never connect at port 4; it's privileged. Was
> that in the debug output?
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless
Ok, I think we're outta options here :-( -- our debugging output is not sufficient to tell us what is going wrong. If I make a mercurial repo with some extra debugging output in it, can you check it out and build it? That way we can run it and add relevant printf's in the right places to see exactly what is failing.

Here are the requirements for building a developer's copy of Open MPI (if you can do this, I'll send you a specific Mercurial repository URL):

    http://www.open-mpi.org/svn/building.php

On Sep 24, 2009, at 1:19 PM, Pallab Datta wrote:

> Yes it came up when i put the verbose mode in i.e. the debug output..
> yes i knew its privileged so thats why i explicity asked it to connect
> to a higher port but still it blocks there..:(

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless
Hmm,

On another angle, could this be a name resolution issue? Perhaps apex-backpack isn't able to resolve fuji.local and vice versa. Can you ping between the two of them using their hostnames rather than their IPs?

-Joshua Bernstein
Senior Software Engineer
Penguin Computing

Pallab Datta wrote:
> Yes it came up when i put the verbose mode in i.e. the debug output..
> yes i knew its privileged so thats why i explicity asked it to connect
> to a higher port but still it blocks there..:(
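
Besides ping, name resolution can be checked directly on each machine. The sketch below is a minimal example using Python's standard `socket` module; the function names `resolve` and `report` are made up for this illustration, and the hostnames in the usage comment are the ones mentioned in this thread.

```python
import socket


def resolve(hostname):
    """Return the IPv4 address the local resolver gives for hostname,
    or None if the name cannot be resolved."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def report(hostnames):
    """Print one line per hostname showing what it resolves to."""
    for name in hostnames:
        addr = resolve(name)
        print(f"{name}: {addr if addr is not None else 'NOT RESOLVABLE'}")


# Example usage, run on both machines and compare with ifconfig:
#   report(["fuji.local", "apex-backpack"])
```

If a name resolves on one host but not the other, or resolves to an address that does not match the ifconfig output, that would be consistent with the name resolution theory above.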
Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless
I can do it. Please send me the URL; I can rebuild OMPI and see what the output looks like.

> Ok, I think we're outta options here :-( -- our debugging output is
> not sufficient to tell us what is going wrong. If I make a mercurial
> repo with some extra debugging output in it, can you check it out and
> build it? That way we can run it and add relevant printf's in the
> Right places to see exactly what is failing.
>
> Here's the requirements we need for you to be able to build a
> developer's copy of Open MPI (if you can do this, I'll send you a
> specific Mercurial repository URL):
>
> http://www.open-mpi.org/svn/building.php
Re: [OMPI users] Random hangs using btl sm with OpenMPI 1.3.2/1.3.3 + gcc4.4?
Jonathan Dursi wrote:

> So to summarize:
>
> OpenMPI 1.3.2 + gcc4.4.0
>
> Test problem with periodic (left neighbour of proc 0 is proc N-1)
> Sendrecv()s:
>   Default always hangs in Sendrecv after random number of iterations
>   Turning off sm (-mca btl self,tcp) not observed to hang
>   Using -mca btl_sm_num_fifos 5 (for a 6 task job) not observed to hang
>   Using fewer than 5 fifos hangs in Sendrecv after random number of
>   iterations or Finalize
>
> OpenMPI 1.3.3 + gcc4.4.0
>
> Test problem with periodic (left neighbour of proc 0 is proc N-1)
> Sendrecv()s:
>   Default sometimes (~20% of time) hangs in Sendrecv after random
>   number of iterations
>   Turning off sm (-mca btl self,tcp) not observed to hang
>   Using -mca btl_sm_num_fifos 5 (for a 6 task job) not observed to hang
>   Using fewer than 5 fifos but more than 2 not observed to hang
>   Using 2 fifos sometimes (~20% of time) hangs in Finalize or Sendrecv
>   after random number of iterations but sometimes completes
>
> OpenMPI 1.3.2 + intel 11.0 compilers
>
> We are seeing a problem which we believe to be related; ~1% of certain
> single-node jobs hang, turning off sm or setting num_fifos to NP-1
> eliminates this.

I can reproduce this with just Barriers, which keeps the processes all in sync. So, this has nothing to do with processes outrunning one another (which wasn't likely in the first place given that you had Sendrecv calls).

The problem is fickle. E.g., building OMPI with -g seems to make the problem go away.

I did observe that the sm FIFO would fill up. That's weird since there aren't ever a lot of in-flight messages. I tried adding a line of code that would make a process pause if ever it tried to write to a FIFO that seemed full. That pretty much made the problem go away. So, I guess it's a memory coherency problem: receive clears the FIFO, but writer thinks it's congested.

I tried all sorts of GCC compilers. The problem seems to set in with 4.4.0. I don't know what's significant about that. It requires moving to the 2.18 assembler, but I tried the 2.18 assembler with 4.3.3 and that worked okay.

I'd think this has to do with GCC 4.4.x, but you say you see the problem with Intel compilers as well. Hmm. Maybe an OMPI problem that's better exposed with GCC 4.4.x?
Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless
I did try to ping using the hostname, but I can't. Can that be an issue? Both of them are sitting on the same subnet! Let me check if I can resolve this.

> Hmm,
>
> On another angle, could this be a name resolution issue? Perhaps
> apex-backpack isn't able to resolve fuji.local and visa versa. Can you
> ping between the two of them using their hostnames rather then their
> IPs?
>
> -Joshua Bernstein
> Senior Software Engineer
> Penguin Computing
[OMPI users] best way to ALLREDUCE multi-dimensional arrays in Fortran?
(I apologize in advance for the simplistic/newbie question.)

I'm performing an ALLREDUCE operation on a multi-dimensional array. This operation is the biggest bottleneck in the code, and I'm wondering if there's a way to do it more efficiently than what I'm doing now. Here's a representative example of what's happening:

      ir=1
      do ikl=1,km
        do ij=1,jm
          do ii=1,im
            albuf(ir)=array(ii,ij,ikl,nl,0,ng)
            ir=ir+1
          enddo
        enddo
      enddo
      agbuf=0.0
      call mpi_allreduce(albuf,agbuf,im*jm*kmloc(coords(2)+1),mpi_real,mpi_sum,ang_com,ierr)
      ir=1
      do ikl=1,km
        do ij=1,jm
          do ii=1,im
            phim(ii,ij,ikl,nl,0,ng)=agbuf(ir)
            ir=ir+1
          enddo
        enddo
      enddo

Is there any way to just do this in one fell swoop, rather than buffering, transmitting, and unbuffering? This operation is looped over many times. Are there savings to be had here?

Thanks,
Greg
Re: [OMPI users] best way to ALLREDUCE multi-dimensional arrays in Fortran?
Greg Fischer wrote:

> I'm performing an ALLREDUCE operation on a multi-dimensional array.
> This operation is the biggest bottleneck in the code, and I'm wondering
> if there's a way to do it more efficiently than what I'm doing now.
>
> Is there any way to just do this in one fell swoop, rather than
> buffering, transmitting, and unbuffering? This operation is looped
> over many times. Are there savings to be had here?

There are three steps here: buffering, transmitting, and unbuffering. Any idea how the run time is distributed among those three steps? E.g., if most time is spent in the MPI call, then combining all three steps into one is unlikely to buy you much... and might even hurt. If most of the time is spent in the MPI call, then there may be some tuning of collective algorithms to do. I don't have any experience doing this with OMPI. I'm just saying it makes some sense to isolate the problem a little bit more.
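
On the "one fell swoop" part of the question, one thing that can be checked without MPI is whether the packing loop is just a contiguous copy. Fortran stores arrays in column-major order, and the loop nest in the question varies ii fastest, then ij, then ikl. The sketch below (Python, used only to do the index arithmetic) computes the linear storage offsets the loop would visit. The declared shape is hypothetical: it assumes the first two dimensions of `array` are exactly im and jm, which the original post does not state.

```python
def column_major_offset(idx, lower, extent):
    """Linear offset of a multi-index in Fortran (column-major) storage."""
    offset, stride = 0, 1
    for i, lo, n in zip(idx, lower, extent):
        offset += (i - lo) * stride
        stride *= n
    return offset


# Hypothetical declaration: array(im, jm, km, 2, 0:1, 2), sizes chosen small.
im, jm, km = 3, 4, 2
lower = (1, 1, 1, 1, 0, 1)
extent = (im, jm, km, 2, 2, 2)
nl, ng = 2, 1

# Same loop order as the packing loop: ikl outermost, ii innermost.
offsets = [
    column_major_offset((ii, ij, ikl, nl, 0, ng), lower, extent)
    for ikl in range(1, km + 1)
    for ij in range(1, jm + 1)
    for ii in range(1, im + 1)
]

base = offsets[0]
is_contiguous = offsets == list(range(base, base + im * jm * km))
print("first offset:", base, "contiguous:", is_contiguous)
```

If the real declarations satisfy that assumption, the elements copied into albuf already sit next to each other in memory, so it may be possible to pass the array section itself as the send buffer and the matching section of phim as the receive buffer, skipping both copy loops. If the first two extents are larger than im and jm, the data is strided and that shortcut does not apply. Either way, timing the three steps separately, as suggested above, shows whether the copies matter at all.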
Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless
I resolved the name resolution issue and re-ran it, but it still hangs at the send-receive calls. I ran it using:

/usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H localhost,10.11.14.205 /tmp/hello

> Hmm,
>
> On another angle, could this be a name resolution issue? Perhaps
> apex-backpack isn't able to resolve fuji.local and visa versa. Can you
> ping between the two of them using their hostnames rather then their
> IPs?
>
> -Joshua Bernstein
> Senior Software Engineer
> Penguin Computing
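
Since ping and ssh work but the MPI traffic hangs, it may be worth testing plain TCP connections on the ports given in the command above, independently of Open MPI. The sketch below is a minimal example using Python's standard `socket` module. It assumes that btl_tcp_port_range_v4 counts ports starting at btl_tcp_port_min_v4 (so 36900 through 36931 here), and a probe can only succeed if something is actually listening on the target port. The function names are made up for this illustration.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe_range(host, port_min, port_count, timeout=2.0):
    """Try every port in [port_min, port_min + port_count) and return
    a dict mapping each port to True or False."""
    return {
        port: can_connect(host, port, timeout)
        for port in range(port_min, port_min + port_count)
    }


# Example usage with the values from the mpirun command in this thread
# (start a listener on the remote side first, otherwise every probe fails):
#   results = probe_range("10.11.14.205", 36900, 32)
#   print([port for port, ok in results.items() if ok])
```

A probe that times out rather than being refused quickly would point toward filtering somewhere on the path, whereas a quick refusal only means nothing was listening on that port.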