On Nov 14, 2014, at 10:52 AM, Reuti wrote:
> I appreciate your replies and will read them thoroughly. I think it's best to
> continue with the discussion after SC14. I don't want to put any burden on
> anyone when time is tight.
Cool; many thanks. This is complicated stuff; we might not have
Jeff, Gus, Gilles,
On 14.11.2014 at 15:56, Jeff Squyres (jsquyres) wrote:
> I lurked on this thread for a while, but I have some thoughts on the many
> issues that were discussed on this thread (sorry, I'm still pretty under
> water trying to get ready for SC next week...).
I appreciate your replies and will read them thoroughly. I think it's best to
continue with the discussion after SC14. I don't want to put any burden on
anyone when time is tight.
I lurked on this thread for a while, but I have some thoughts on the many
issues that were discussed on this thread (sorry, I'm still pretty under water
trying to get ready for SC next week...). These points are in no particular
order...
0. Two fundamental points have been missed in this thread:
My 0.02 US$
first, the root cause of the problem was that a default gateway was
configured on the node, but this gateway was unreachable.
imho, this is an incorrect system setting that can lead to unpredictable
results:
- openmpi 1.8.1 works (you are lucky, good for you)
- openmpi 1.8.3 fails (no luck this time)
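For anyone tracking down the same symptom, a quick way to confirm a misconfigured default gateway is to look at the node's routing table and test the listed gateway directly (the address below is only a placeholder):

    # show the routing table; the "default via ..." line names the gateway in question
    $ ip route show
    # check whether that gateway actually answers (replace 192.0.2.254 with the one listed)
    $ ping -c 3 192.0.2.254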
On 11/13/2014 11:14 AM, Ralph Castain wrote:
Hmmm…I’m beginning to grok the issue. It is a tad unusual for people to
assign different hostnames to their interfaces - I’ve seen it in the
Hadoop world, but not in HPC. Still, no law against it.
No, not so unusual.
I have clusters from respectable
Hi Reuti
See below, please.
On 11/13/2014 07:19 AM, Reuti wrote:
Gus,
On 13.11.2014 at 02:59, Gus Correa wrote:
On 11/12/2014 05:45 PM, Reuti wrote:
On 12.11.2014 at 17:27, Reuti wrote:
On 11.11.2014 at 02:25, Ralph Castain wrote:
Another thing you can do is (a) ensure you built with —enable-debug, and
then (b) run it with -mca oob_base_verbose 100
> On Nov 13, 2014, at 9:20 AM, Reuti wrote:
>
> On 13.11.2014 at 17:14, Ralph Castain wrote:
>
>> Hmmm…I’m beginning to grok the issue. It is a tad unusual for people to
>> assign different hostnames to their interfaces - I’ve seen it in the Hadoop
>> world, but not in HPC. Still, no law against it.
On 13.11.2014 at 17:14, Ralph Castain wrote:
> Hmmm…I’m beginning to grok the issue. It is a tad unusual for people to
> assign different hostnames to their interfaces - I’ve seen it in the Hadoop
> world, but not in HPC. Still, no law against it.
Maybe it depends on the background to do it this way.
Hmmm…I’m beginning to grok the issue. It is a tad unusual for people to assign
different hostnames to their interfaces - I’ve seen it in the Hadoop world, but
not in HPC. Still, no law against it.
This will take a little thought to figure out a solution. One problem that
immediately occurs is i
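As a purely illustrative sketch of the configuration being described (all names and addresses invented), such a node might carry a different hostname on each interface in /etc/hosts, with the batch system handing out only the first one:

    $ cat /etc/hosts
    # hypothetical node with one name per network
    192.168.1.10   node01        # 1GbE, the name the resource manager reports
    10.0.1.10      node01-ib     # IPoIB interface
    172.16.1.10    node01-mgmt   # management network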
On 13.11.2014 at 00:34, Ralph Castain wrote:
>> On Nov 12, 2014, at 2:45 PM, Reuti wrote:
>>
>> On 12.11.2014 at 17:27, Reuti wrote:
>>
>>> On 11.11.2014 at 02:25, Ralph Castain wrote:
>>>
Another thing you can do is (a) ensure you built with —enable-debug, and
then (b) run it with -mca oob_base_verbose 100
Gus,
On 13.11.2014 at 02:59, Gus Correa wrote:
> On 11/12/2014 05:45 PM, Reuti wrote:
>> On 12.11.2014 at 17:27, Reuti wrote:
>>
>>> On 11.11.2014 at 02:25, Ralph Castain wrote:
>>>
Another thing you can do is (a) ensure you built with —enable-debug,
>> and then (b) run it with -mca oob_base_verbose 100
On 11/12/2014 05:45 PM, Reuti wrote:
On 12.11.2014 at 17:27, Reuti wrote:
On 11.11.2014 at 02:25, Ralph Castain wrote:
Another thing you can do is (a) ensure you built with —enable-debug,
and then (b) run it with -mca oob_base_verbose 100
(without the tcp_if_include option) so we can watch the connection handshake
> On Nov 12, 2014, at 2:45 PM, Reuti wrote:
>
> On 12.11.2014 at 17:27, Reuti wrote:
>
>> On 11.11.2014 at 02:25, Ralph Castain wrote:
>>
>>> Another thing you can do is (a) ensure you built with —enable-debug, and
>>> then (b) run it with -mca oob_base_verbose 100 (without the tcp_if_include
On 12.11.2014 at 17:27, Reuti wrote:
> On 11.11.2014 at 02:25, Ralph Castain wrote:
>
>> Another thing you can do is (a) ensure you built with —enable-debug, and
>> then (b) run it with -mca oob_base_verbose 100 (without the tcp_if_include
>> option) so we can watch the connection handshake and see what it is doing.
On 11.11.2014 at 02:25, Ralph Castain wrote:
> Another thing you can do is (a) ensure you built with —enable-debug, and then
> (b) run it with -mca oob_base_verbose 100 (without the tcp_if_include
> option) so we can watch the connection handshake and see what it is doing.
> The —hetero-nodes will have no effect here and can be ignored.
On 11.11.2014 at 02:12, Gilles Gouaillardet wrote:
> Hi,
>
> IIRC there were some bug fixes between 1.8.1 and 1.8.2 in order to really use
> all the published interfaces.
>
> by any chance, are you running a firewall on your head node?
Yes, but only for the interface to the outside world. N
Another thing you can do is (a) ensure you built with —enable-debug, and then
(b) run it with -mca oob_base_verbose 100 (without the tcp_if_include option)
so we can watch the connection handshake and see what it is doing. The
—hetero-nodes will have no effect here and can be ignored.
Ralph
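For reference, the suggested debug run would look roughly like this (the install prefix, hostfile and program names are placeholders); the --enable-debug build is what makes the verbose output available:

    # rebuild with debugging support (add your usual configure options)
    $ ./configure --enable-debug --prefix=/opt/openmpi-1.8.3
    $ make all install
    # launch with OOB verbosity turned up and without any tcp_if_include setting
    $ mpiexec -mca oob_base_verbose 100 -n 4 --hostfile machines ./hello_mpi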
Hi,
IIRC there were some bug fixes between 1.8.1 and 1.8.2 in order to really
use all the published interfaces.
by any chance, are you running a firewall on your head node?
one possible explanation is that the compute node tries to access the public
interface of the head node, and the packets get dropped
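One way to test that hypothesis (addresses are placeholders) is to check from a compute node which route would be used towards the head node's public address, and to look at the firewall counters on the head node for dropped packets:

    # on the compute node: which interface/route is chosen for the head node's public IP?
    # (192.0.2.1 stands in for that public address)
    $ ip route get 192.0.2.1
    # on the head node: list the firewall rules with per-rule packet counters
    $ iptables -L -n -v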
Hi,
On 10.11.2014 at 16:39, Ralph Castain wrote:
> That is indeed bizarre - we haven’t heard of anything similar from other
> users. What is your network configuration? If you use oob_tcp_if_include or
> exclude, can you resolve the problem?
Thx - this option helped to get it working.
These
That is indeed bizarre - we haven’t heard of anything similar from other users.
What is your network configuration? If you use oob_tcp_if_include or exclude,
can you resolve the problem?
> On Nov 10, 2014, at 4:50 AM, Reuti wrote:
>
> On 10.11.2014 at 12:50, Jeff Squyres (jsquyres) wrote:
>
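For completeness, restricting the out-of-band (daemon/handshake) traffic to a single interface looks like this; eth0, the hostfile and the program are placeholders, and the interface chosen must be one on which all nodes can reach each other:

    # pin the OOB traffic to one mutually reachable interface
    $ mpiexec -mca oob_tcp_if_include eth0 -n 4 --hostfile machines ./hello_mpi

The oob_tcp_if_exclude variant works the same way, naming the interfaces to avoid instead.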
On 10.11.2014 at 12:50, Jeff Squyres (jsquyres) wrote:
> Wow, that's pretty terrible! :(
>
> Is the behavior BTL-specific, perchance? E.G., if you only use certain BTLs,
> does the delay disappear?
You mean something like:
reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile m
Wow, that's pretty terrible! :(
Is the behavior BTL-specific, perchance? E.G., if you only use certain BTLs,
does the delay disappear?
FWIW: the use-all-IP interfaces approach has been in OMPI forever.
Sent from my phone. No type good.
> On Nov 10, 2014, at 6:42 AM, Reuti wrote:
>
>> Am
On 10.11.2014 at 12:24, Reuti wrote:
> Hi,
>
> On 09.11.2014 at 05:38, Ralph Castain wrote:
>
>> FWIW: during MPI_Init, each process “publishes” all of its interfaces. Each
>> process receives a complete map of that info for every process in the job.
>> So when the TCP btl sets itself up, it attempts to connect across -all- the
>> interfaces published by the other end.
Hi,
On 09.11.2014 at 05:38, Ralph Castain wrote:
> FWIW: during MPI_Init, each process “publishes” all of its interfaces. Each
> process receives a complete map of that info for every process in the job. So
> when the TCP btl sets itself up, it attempts to connect across -all- the
> interfaces published by the other end.
FWIW: during MPI_Init, each process “publishes” all of its interfaces. Each
process receives a complete map of that info for every process in the job. So
when the TCP btl sets itself up, it attempts to connect across -all- the
interfaces published by the other end.
So it doesn’t matter what hostname
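Which is why the usual remedy, when some of the published interfaces are not mutually reachable, is to tell the TCP BTL which networks it may use; interface names or (in the 1.8 series) CIDR subnets are accepted, and the values below are examples only:

    # limit MPI point-to-point (BTL) traffic to the cluster-internal interface
    $ mpiexec -mca btl_tcp_if_include eth0 -n 4 --hostfile machines ./hello_mpi
    # or name the allowed network by subnet instead of by interface
    $ mpiexec -mca btl_tcp_if_include 192.168.1.0/24 -n 4 --hostfile machines ./hello_mpi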
Ok, I figured; I'm going to have to read some more for my own curiosity. The
reason I mention the Resource Manager we use, and that the hostnames given by
PBS/Torque match the 1gig-e interfaces, is that I'm curious what path it would
take to get to a peer node when the entries in the node list all match the
1gig-e interfaces
Ralph is right: OMPI aggressively uses all Ethernet interfaces by default.
This short FAQ has links to 2 other FAQs that provide detailed information
about reachability:
http://www.open-mpi.org/faq/?category=tcp#tcp-multi-network
The usNIC BTL uses UDP for its wire transport and actually
OMPI discovers all active interfaces and automatically considers them available
for its use unless instructed otherwise via the params. I’d have to look at the
TCP BTL code to see the loadbalancing algo - I thought we didn’t have that “on”
by default across BTLs, but I don’t know if the TCP one
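The parameters Ralph refers to can be listed with ompi_info; depending on the exact 1.8.x version you may need to raise the level to see all of them:

    # TCP BTL parameters (btl_tcp_if_include, btl_tcp_if_exclude, ...)
    $ ompi_info --param btl tcp --level 9
    # and the matching OOB parameters (oob_tcp_if_include, ...)
    $ ompi_info --param oob tcp --level 9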