Hi Jody,

I believe this was intended for the Users mailing list, so I'm sending the reply there.
We do plan to provide more explanation of these in the 1.3 release - believe me, you are not alone in puzzling over all the configuration params! Many of us in the developer community also sometimes wonder what they all do.

Sorry it is confusing - hopefully it will become clearer soon.

Ralph


------ Forwarded Message
> From: jody <jody....@gmail.com>
> Date: Mon, 14 Apr 2008 10:23:00 +0200
> To: Ralph Castain <r...@lanl.gov>
> Subject: Re: [OMPI users] problems with hostfile when doing MPMD
>
> Ralph, Rolf
>
> Thanks for your suggestion.
> After I rebuilt with --enable-heterogeneous it did indeed work.
> Fortunately I have only 8 machines in my cluster, otherwise listing all
> the nodes on the command line would be quite exhausting.
>
> BTW: is there a sort of overview of all the parameters one can pass
> to configure?
> In the FAQ some of them are spread across all the questions, and
> searching on the MPI site also did not work.
> Given an Open MPI installation, are all parameters given to configure
> listed there?
> I found "Heterogeneous support: yes" and "Prefix: /opt/openmpi", which
> are the ones I used. As for the others, I assume they are default
> settings. But if I wanted to change any of those settings, it would be
> difficult to find the parameter name for it.
>
> Jody
>
> On Mon, Apr 14, 2008 at 1:14 AM, Ralph Castain <r...@lanl.gov> wrote:
>> I believe this -should- work, but can't verify it myself. The most
>> important thing is to be sure you built with --enable-heterogeneous
>> or else it will definitely fail.
>>
>> Ralph
>>
>>
>> On 4/10/08 7:17 AM, "Rolf Vandevaart" <rolf.vandeva...@sun.com> wrote:
>>
>>> On a CentOS Linux box, I see the following:
>>>
>>>> grep 113 /usr/include/asm-i386/errno.h
>>> #define EHOSTUNREACH    113  /* No route to host */
>>>
>>> I have also seen folks do this to figure out the errno:
>>>
>>>> perl -e 'die$!=113'
>>> No route to host at -e line 1.
>>>
>>> I am not sure why this is happening, but you could also check the Open
>>> MPI User's Mailing List Archives, where there are other examples of
>>> people running into this error. A search for "113" had a few hits.
>>>
>>> http://www.open-mpi.org/community/lists/users
>>>
>>> Also, I assume you would see this problem with or without the
>>> MPI_Barrier if you add this parameter to your mpirun line:
>>>
>>> --mca mpi_preconnect_all 1
>>>
>>> The MPI_Barrier triggers the bad behavior because by default
>>> connections are set up lazily. Only when the MPI_Barrier call is made
>>> and we start communicating and establishing connections do we start
>>> seeing the communication problems.
>>>
>>> Rolf
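The grep and perl lines above decode the error number; a minimal C sketch does the same lookup with strerror(). The value 113 is the one from the connect() message, and on Linux/x86 it corresponds to EHOSTUNREACH ("No route to host"):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* errno reported by mca_btl_tcp_endpoint_complete_connect */
        int err = 113;
        printf("errno %d: %s\n", err, strerror(err));
        return 0;
    }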
>>>
>>> jody wrote:
>>>> Rolf,
>>>> I was able to run hostname on the two nodes that way, and also a
>>>> simplified version of my test program (without a barrier) works.
>>>> Only MPI_Barrier shows bad behaviour.
>>>>
>>>> Do you know what this message means?
>>>> [aim-plankton][0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>>>> connect() failed with errno=113
>>>> Does it give an idea what could be the problem?
>>>>
>>>> Jody
>>>>
>>>> On Thu, Apr 10, 2008 at 2:20 PM, Rolf Vandevaart
>>>> <rolf.vandeva...@sun.com> wrote:
>>>>> This worked for me, although I am not sure how extensive our 32/64
>>>>> interoperability support is. I tested on Solaris using the TCP
>>>>> interconnect and a 1.2.5 version of Open MPI. Also, we configure with
>>>>> the --enable-heterogeneous flag, which may make a difference here.
>>>>> Also, this did not work for me over the sm btl.
>>>>>
>>>>> By the way, can you run a simple /bin/hostname across the two nodes?
>>>>>
>>>>> burl-ct-v20z-4 61 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m32 simple.c -o simple.32
>>>>> burl-ct-v20z-4 62 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m64 simple.c -o simple.64
>>>>> burl-ct-v20z-4 63 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -gmca btl_tcp_if_include bge1 -gmca btl sm,self,tcp -host burl-ct-v20z-4 -np 3 simple.32 : -host burl-ct-v20z-5 -np 3 simple.64
>>>>> [burl-ct-v20z-4]I am #0/6 before the barrier
>>>>> [burl-ct-v20z-5]I am #3/6 before the barrier
>>>>> [burl-ct-v20z-5]I am #4/6 before the barrier
>>>>> [burl-ct-v20z-4]I am #1/6 before the barrier
>>>>> [burl-ct-v20z-4]I am #2/6 before the barrier
>>>>> [burl-ct-v20z-5]I am #5/6 before the barrier
>>>>> [burl-ct-v20z-5]I am #3/6 after the barrier
>>>>> [burl-ct-v20z-4]I am #1/6 after the barrier
>>>>> [burl-ct-v20z-5]I am #5/6 after the barrier
>>>>> [burl-ct-v20z-5]I am #4/6 after the barrier
>>>>> [burl-ct-v20z-4]I am #2/6 after the barrier
>>>>> [burl-ct-v20z-4]I am #0/6 after the barrier
>>>>> burl-ct-v20z-4 64 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -V
>>>>> mpirun (Open MPI) 1.2.5r16572
>>>>>
>>>>> Report bugs to http://www.open-mpi.org/community/help/
>>>>> burl-ct-v20z-4 65 =>
>>>>>
>>>>> jody wrote:
>>>>>> I narrowed it down:
>>>>>> the majority of processes get stuck in MPI_Barrier.
>>>>>> My test application looks like this:
>>>>>>
>>>>>> #include <stdio.h>
>>>>>> #include <unistd.h>
>>>>>> #include "mpi.h"
>>>>>>
>>>>>> int main(int iArgC, char *apArgV[]) {
>>>>>>     int iResult = 0;
>>>>>>     int iRank1;
>>>>>>     int iNum1;
>>>>>>
>>>>>>     char sName[256];
>>>>>>     gethostname(sName, 255);
>>>>>>
>>>>>>     MPI_Init(&iArgC, &apArgV);
>>>>>>
>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &iRank1);
>>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &iNum1);
>>>>>>
>>>>>>     printf("[%s]I am #%d/%d before the barrier\n", sName, iRank1, iNum1);
>>>>>>     MPI_Barrier(MPI_COMM_WORLD);
>>>>>>     printf("[%s]I am #%d/%d after the barrier\n", sName, iRank1, iNum1);
>>>>>>
>>>>>>     MPI_Finalize();
>>>>>>
>>>>>>     return iResult;
>>>>>> }
>>>>>>
>>>>>> If I make this call:
>>>>>>
>>>>>> mpirun -np 3 --debug-daemons --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest32 : -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64
>>>>>>
>>>>>> (run_gdb.sh is a script which starts gdb in an xterm for each process)
>>>>>> process 0 (on aim-plankton) passes the barrier and gets stuck in
>>>>>> PMPI_Finalize, and all other processes get stuck in PMPI_Barrier.
>>>>>> Process 1 (on aim-plankton) displays the message
>>>>>> [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>>>>>> connect() failed with errno=113
>>>>>> Process 2 (on aim-plankton) displays the same message twice.
>>>>>>
>>>>>> Any ideas?
>>>>>>
>>>>>> Thanks
>>>>>> Jody
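As Rolf explains above, connections are opened lazily, which is why the failed connect() only surfaces once MPI_Barrier makes every rank talk to every other rank. A minimal sketch of forcing all connections right after MPI_Init instead - roughly the effect that --mca mpi_preconnect_all 1 is meant to have; this is an illustration, not Open MPI's actual implementation:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, size, i;
        char sbuf = 0, rbuf = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Circular-shift exchange: in round i every rank sends one byte to
           rank+i and receives one byte from rank-i, so every pair of ranks
           ends up opening a connection.  A failing connect() therefore
           shows up here, right after MPI_Init, instead of at the first
           barrier. */
        for (i = 1; i < size; i++) {
            int dest = (rank + i) % size;
            int src  = (rank - i + size) % size;
            MPI_Sendrecv(&sbuf, 1, MPI_CHAR, dest, 0,
                         &rbuf, 1, MPI_CHAR, src,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        printf("rank %d/%d: all connections established\n", rank, size);

        MPI_Finalize();
        return 0;
    }

Run across the two hosts above, the "No route to host" failure would then be reported during this loop rather than deep inside the application.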
>>>>>> On Thu, Apr 10, 2008 at 1:05 PM, jody <jody....@gmail.com> wrote:
>>>>>>> Hi
>>>>>>> Using a more realistic application than a simple "Hello, world",
>>>>>>> even the --host version doesn't work correctly.
>>>>>>> Called this way,
>>>>>>>
>>>>>>> mpirun -np 3 --host aim-plankton ./QHGLauncher --read-config=pureveg_new.cfg -o output.txt : -np 3 --host aim-fanta4 ./QHGLauncher_64 --read-config=pureveg_new.cfg -o output.txt
>>>>>>>
>>>>>>> the application starts but seems to hang after a while.
>>>>>>>
>>>>>>> Running the application in gdb:
>>>>>>>
>>>>>>> mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./QHGLauncher --read-config=pureveg_new.cfg -o output.txt : -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./QHGLauncher_64 --read-config=pureveg_new.cfg -o bruzlopf -n 12 --seasonality=3,data/cai_temp2.clim,data/cai_precip2.clim
>>>>>>>
>>>>>>> I can see that the processes on aim-fanta4 have indeed gotten stuck
>>>>>>> after a few initial outputs, and the processes on aim-plankton all
>>>>>>> have a message:
>>>>>>>
>>>>>>> [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>>>>>>> connect() failed with errno=113
>>>>>>>
>>>>>>> If I only use aim-plankton alone or aim-fanta4 alone, everything
>>>>>>> runs as expected.
>>>>>>>
>>>>>>> BTW: I'm using Open MPI 1.2.2.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Jody
>>>>>>>
>>>>>>> On Thu, Apr 10, 2008 at 12:40 PM, jody <jody....@gmail.com> wrote:
>>>>>>>> Hi
>>>>>>>> In my network I have some 32-bit machines and some 64-bit machines.
>>>>>>>> With --host I successfully call my application:
>>>>>>>>
>>>>>>>> mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest : -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64
>>>>>>>>
>>>>>>>> (MPITest64 has the same code as MPITest, but was compiled on the
>>>>>>>> 64-bit machine.)
>>>>>>>>
>>>>>>>> But when I use hostfiles:
>>>>>>>>
>>>>>>>> mpirun -np 3 --hostfile hosts32 -x DISPLAY ./run_gdb.sh ./MPITest : -np 3 --hostfile hosts64 -x DISPLAY ./run_gdb.sh ./MPITest64
>>>>>>>>
>>>>>>>> all 6 processes are started on the 64-bit machine aim-fanta4.
>>>>>>>>
>>>>>>>> hosts32:
>>>>>>>>   aim-plankton slots=3
>>>>>>>> hosts64:
>>>>>>>>   aim-fanta4 slots
>>>>>>>>
>>>>>>>> Is this a bug or a feature? ;)
>>>>>>>>
>>>>>>>> Jody
>>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> us...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>
>>>>> --
>>>>> =========================
>>>>> rolf.vandeva...@sun.com
>>>>> 781-442-3043
>>>>> =========================
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>>
------ End of Forwarded Message
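For anyone reproducing the mixed 32/64-bit MPMD runs discussed above, here is a minimal sketch of a program that can be compiled twice (e.g. with -m32 and -m64, as in Rolf's simple.32/simple.64 test) and launched as a single job. It is an illustration, not code posted in the thread; it exchanges only plain ints, which are 4 bytes in both builds:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, size;
        int value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0)
            value = 42;  /* set on rank 0, broadcast to every rank */

        /* int is 4 bytes in both the -m32 and the -m64 build, so the
           message layout is identical on every rank; a type like long
           (4 vs. 8 bytes) would additionally rely on the library's
           heterogeneous support. */
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

        printf("rank %d/%d (sizeof(long)=%lu) received %d\n",
               rank, size, (unsigned long)sizeof(long), value);

        MPI_Finalize();
        return 0;
    }

It could be launched the same way as the MPMD commands above, for example (het32/het64 are hypothetical binary names):

    mpirun -np 3 --host aim-plankton ./het32 : -np 3 --host aim-fanta4 ./het64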