Re: [OMPI users] Qlogic & openmpi

2011-12-05 Thread arnaud Heritier
Hello,

I found the solution, thanks to Qlogic support.

The "can't open /dev/ipath, network down (err=26)" message from the ipath
driver is really misleading.

Actually, this is a hardware context problem on the QLogic PSM. PSM can't
allocate any hardware context for the job because other MPI jobs have
already used all of the available contexts. To avoid this problem, every
MPI job has to set the PSM_SHAREDCONTEXTS_MAX variable to the right value,
according to the number of processes that will run on the node. If
we don't use this variable, PSM will "greedily" use all contexts with the
first MPI job spawned on the node.

Regards,

Arnaud


On Tue, Nov 29, 2011 at 6:44 PM, Jeff Squyres  wrote:

> On Nov 28, 2011, at 11:53 PM, arnaud Heritier wrote:
>
> > I do have a contract and i tried to open a case, but their support is
> ..
>
> What happens if you put a delay between the two jobs?  E.g., if you just
> delay a few seconds before the 2nd job starts?  Perhaps the ipath device
> just needs a little time before it will be available...?  (that's a total
> guess)
>
> I suggest this because the PSM device will definitely give you better
> overall performance than the QLogic verbs support.  Their verbs support
> basically barely works -- PSM is their primary device and the one that we
> always recommend.
>
> > Anyway. I'm still working on the strange error message from mpirun saying
> it can't allocate memory when at the same time it also reports that the
> memory is unlimited ...
> >
> >
> > Arnaud
> >
> > On Tue, Nov 29, 2011 at 4:23 AM, Jeff Squyres 
> wrote:
> > I'm afraid we don't have any contacts left at QLogic to ask them any
> more... do you have a support contract, perchance?
> >
> > On Nov 27, 2011, at 3:11 PM, Arnaud Heritier wrote:
> >
> > > Hello,
> > >
> > > I run into a strange problem with QLogic OFED and Open MPI. When I
> submit (through SGE) 2 jobs on the same node, the second job ends up with:
> > >
> > > (ipath/PSM)[10292]: can't open /dev/ipath, network down (err=26)
> > >
> > > I'm pretty sure the InfiniBand fabric is working well, as the other job
> runs fine.
> > >
> > > Here are the details of the configuration:
> > >
> > > Qlogic HCA: InfiniPath_QMH7342 (2 ports but only one connected to a
> switch)
> > > qlogic_ofed-1.5.3-7.0.0.0.35 (rocks cluster roll)
> > > openmpi 1.5.4 (./configure --with-psm --with-openib --with-sge)
> > >
> > > -
> > >
> > > In order to fix this problem I recompiled Open MPI without PSM support,
> but I faced another problem:
> > >
> > > The OpenFabrics (openib) BTL failed to initialize while trying to
> > > allocate some locked memory.  This typically can indicate that the
> > > memlock limits are set too low.  For most HPC installations, the
> > > memlock limits should be set to "unlimited".  The failure occured
> > > here:
> > >
> > >   Local host:    compute-0-6.local
> > >   OMPI source:   btl_openib.c:329
> > >   Function:      ibv_create_srq()
> > >   Device:        qib0
> > >   Memlock limit: unlimited
> > >
> > >
> >
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> >
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>


Re: [OMPI users] Qlogic & openmpi

2011-12-05 Thread Ralph Castain

On Dec 5, 2011, at 5:49 AM, arnaud Heritier wrote:

> Hello,
> 
> I found the solution, thanks to Qlogic support.
> 
> The "can't open /dev/ipath, network down (err=26)" message from the ipath 
> driver is really misleading.
> 
> Actually, this is a hardware context problem on the QLogic PSM. PSM can't 
> allocate any hardware context for the job because other MPI jobs have 
> already used all of the available contexts. To avoid this problem, every 
> MPI job has to set the PSM_SHAREDCONTEXTS_MAX variable to the right value, 
> according to the number of processes that will run on the node. If we 
> don't use this variable, PSM will "greedily" use all contexts with the first 
> MPI job spawned on the node.

Sounds like we should be setting this value when starting the process - yes? If 
so, what is the "good" value, and how do we compute it?

> 
> Regards,
> 
> Arnaud
> 
> 
> On Tue, Nov 29, 2011 at 6:44 PM, Jeff Squyres  wrote:
> On Nov 28, 2011, at 11:53 PM, arnaud Heritier wrote:
> 
> > I do have a contract and i tried to open a case, but their support is ..
> 
> What happens if you put a delay between the two jobs?  E.g., if you just 
> delay a few seconds before the 2nd job starts?  Perhaps the ipath device just 
> needs a little time before it will be available...?  (that's a total guess)
> 
> I suggest this because the PSM device will definitely give you better overall 
> performance than the QLogic verbs support.  Their verbs support basically 
> barely works -- PSM is their primary device and the one that we always 
> recommend.
> 
> > Anyway. I'm still working on the strange error message from mpirun saying it 
> > can't allocate memory when at the same time it also reports that the memory 
> > is unlimited ...
> >
> >
> > Arnaud
> >
> > On Tue, Nov 29, 2011 at 4:23 AM, Jeff Squyres  wrote:
> > I'm afraid we don't have any contacts left at QLogic to ask them any 
> > more... do you have a support contract, perchance?
> >
> > On Nov 27, 2011, at 3:11 PM, Arnaud Heritier wrote:
> >
> > > Hello,
> > >
> > > I run into a strange problem with QLogic OFED and Open MPI. When I submit 
> > > (through SGE) 2 jobs on the same node, the second job ends up with:
> > >
> > > (ipath/PSM)[10292]: can't open /dev/ipath, network down (err=26)
> > >
> > > I'm pretty sure the InfiniBand fabric is working well, as the other job runs fine.
> > >
> > > Here are the details of the configuration:
> > >
> > > Qlogic HCA: InfiniPath_QMH7342 (2 ports but only one connected to a 
> > > switch)
> > > qlogic_ofed-1.5.3-7.0.0.0.35 (rocks cluster roll)
> > > openmpi 1.5.4 (./configure --with-psm --with-openib --with-sge)
> > >
> > > -
> > >
> > > In order to fix this problem I recompiled Open MPI without PSM support, 
> > > but I faced another problem:
> > >
> > > The OpenFabrics (openib) BTL failed to initialize while trying to
> > > allocate some locked memory.  This typically can indicate that the
> > > memlock limits are set too low.  For most HPC installations, the
> > > memlock limits should be set to "unlimited".  The failure occured
> > > here:
> > >
> > >   Local host:    compute-0-6.local
> > >   OMPI source:   btl_openib.c:329
> > >   Function:      ibv_create_srq()
> > >   Device:        qib0
> > >   Memlock limit: unlimited
> > >
> > >
> >
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> >
> 
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 



Re: [OMPI users] Qlogic & openmpi

2011-12-05 Thread arnaud Heritier
-
Arnaud HERITIER
Meteo France International
+33 561432940
arnaud.herit...@mfi.fr
--


On Mon, Dec 5, 2011 at 6:12 PM, Ralph Castain  wrote:

>
> On Dec 5, 2011, at 5:49 AM, arnaud Heritier wrote:
>
> Hello,
>
> I found the solution, thanks to Qlogic support.
>
> The "can't open /dev/ipath, network down (err=26)" message from the ipath
> driver is really misleading.
>
> Actually, this is a hardware context problem on the QLogic PSM. PSM can't
> allocate any hardware context for the job because other MPI jobs have
> already used all of the available contexts. To avoid this problem, every
> MPI job has to set the PSM_SHAREDCONTEXTS_MAX variable to the right value,
> according to the number of processes that will run on the node. If
> we don't use this variable, PSM will "greedily" use all contexts with the
> first MPI job spawned on the node.
>
>
> Sounds like we should be setting this value when starting the process -
> yes? If so, what is the "good" value, and how do we compute it?
>

The good value is roundup($OMPI_COMM_WORLD_LOCAL_SIZE / context-sharing
ratio), where the ratio is at most 4 on my HCA.
QLogic provided me with a simple script to compute this value. I just changed
my mpirun wrapper script to call that script, set PSM_SHAREDCONTEXTS_MAX to
the returned value, and then call the MPI binary.
Script attached.
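
For illustration, a minimal sketch of such a wrapper, assuming a fixed
context-sharing ratio of 4 and using the OMPI_COMM_WORLD_LOCAL_SIZE
variable that Open MPI sets in the environment of launched processes (the
script name and the hard-coded ratio are assumptions for the example, not
the attached QLogic script):

#!/bin/sh
# psm_ctx_wrapper.sh (hypothetical) -- compute PSM_SHAREDCONTEXTS_MAX as
# roundup(local ranks / sharing ratio), then exec the real MPI binary.
# Usage: mpirun -np N psm_ctx_wrapper.sh ./my_mpi_app [args...]
RATIO=4                      # max MPI ranks per hardware context (HCA-dependent)
NLOCAL=${OMPI_COMM_WORLD_LOCAL_SIZE:?must be run under Open MPI mpirun}
PSM_SHAREDCONTEXTS_MAX=$(( (NLOCAL + RATIO - 1) / RATIO ))   # integer round-up
export PSM_SHAREDCONTEXTS_MAX
exec "$@"                    # replace the shell with the MPI program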

Arnaud


> Regards,
>
> Arnaud
>
>
> On Tue, Nov 29, 2011 at 6:44 PM, Jeff Squyres  wrote:
>
>> On Nov 28, 2011, at 11:53 PM, arnaud Heritier wrote:
>>
>> > I do have a contract and i tried to open a case, but their support is
>> ..
>>
>> What happens if you put a delay between the two jobs?  E.g., if you just
>> delay a few seconds before the 2nd job starts?  Perhaps the ipath device
>> just needs a little time before it will be available...?  (that's a total
>> guess)
>>
>> I suggest this because the PSM device will definitely give you better
>> overall performance than the QLogic verbs support.  Their verbs support
>> basically barely works -- PSM is their primary device and the one that we
>> always recommend.
>>
>> > Anyway. I'm still working on the strange error message from mpirun
>> saying it can't allocate memory when at the same time it also reports that
>> the memory is unlimited ...
>> >
>> >
>> > Arnaud
>> >
>> > On Tue, Nov 29, 2011 at 4:23 AM, Jeff Squyres 
>> wrote:
>> > I'm afraid we don't have any contacts left at QLogic to ask them any
>> more... do you have a support contract, perchance?
>> >
>> > On Nov 27, 2011, at 3:11 PM, Arnaud Heritier wrote:
>> >
>> > > Hello,
>> > >
>> > > I run into a strange problem with QLogic OFED and Open MPI. When I
>> submit (through SGE) 2 jobs on the same node, the second job ends up with:
>> > >
>> > > (ipath/PSM)[10292]: can't open /dev/ipath, network down (err=26)
>> > >
>> > > I'm pretty sure the InfiniBand fabric is working well, as the other job
>> runs fine.
>> > >
>> > > Here are the details of the configuration:
>> > >
>> > > Qlogic HCA: InfiniPath_QMH7342 (2 ports but only one connected to a
>> switch)
>> > > qlogic_ofed-1.5.3-7.0.0.0.35 (rocks cluster roll)
>> > > openmpi 1.5.4 (./configure --with-psm --with-openib --with-sge)
>> > >
>> > > -
>> > >
>> > > In order to fix this problem I recompiled Open MPI without PSM
>> support, but I faced another problem:
>> > >
>> > > The OpenFabrics (openib) BTL failed to initialize while trying to
>> > > allocate some locked memory.  This typically can indicate that the
>> > > memlock limits are set too low.  For most HPC installations, the
>> > > memlock limits should be set to "unlimited".  The failure occured
>> > > here:
>> > >
>> > >   Local host:    compute-0-6.local
>> > >   OMPI source:   btl_openib.c:329
>> > >   Function:      ibv_create_srq()
>> > >   Device:        qib0
>> > >   Memlock limit: unlimited
>> > >
>> > >
>> >
>> >
>> > --
>> > Jeff Squyres
>> > jsquy...@cisco.com
>> > For corporate legal information go to:
>> > http://www.cisco.com/web/about/doing_business/legal/cri/
>> >
>> >
>>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>

Re: [OMPI users] Open MPI and DAPL 2.0.34 are incompatible?

2011-12-05 Thread Jeff Squyres
We've never recommended the use of dapl on Linux.  I think it might have worked 
at one time, but I don't think anyone bothered to maintain it.  

On Linux, you should probably use the native verbs support instead.


On Dec 2, 2011, at 1:21 PM, Paul Kapinos wrote:

> Dear Open MPI developer,
> 
> OFED 1.5.4 will contain DAPL 2.0.34.
> 
> I tried to compile the newest release of Open MPI (1.5.4) with this DAPL 
> release and I was not successful.
> 
> Configuring with --with-udapl=/path/to/2.0.34/dapl
> got the error "/path/to/2.0.34/dapl/include/dat/udat.h not found"
> Looking into the include dir: there is no 'dat' subdir, but there is 'dat2'.
> 
> Just for fun I also tried to move 'dat2' back to 'dat' (dirty hack, I know :-) 
> - the configure stage was then successful, but the compilation failed. The 
> headers seem to have really changed, not just moved.
> 
> The question: are the Open MPI developers aware of these changes, and when 
> will a version of Open MPI be available with support for DAPL 2.0.34?
> 
> (Background: we have some trouble with Intel MPI and the current DAPL which we 
> do not have with DAPL 2.0.34, so our dream is to update as soon as possible.)
> 
> Best wishes and a nice weekend,
> 
> Paul
> 
> 
> 
> 
> 
> 
> http://www.openfabrics.org/downloads/OFED/release_notes/OFED_1.5.4_release_notes
> 
> 
> 
> 
> -- 
> Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
> RWTH Aachen University, Center for Computing and Communication
> Seffenter Weg 23,  D 52074  Aachen (Germany)
> Tel: +49 241/80-24915


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Qlogic & openmpi

2011-12-05 Thread Bogdan Costescu
On Mon, Dec 5, 2011 at 16:12, Ralph Castain  wrote:
> Sounds like we should be setting this value when starting the process - yes?
> If so, what is the "good" value, and how do we compute it?

I've also been looking at this for the past few days. What I came
up with is a small script, psm_shctx, which sets the envvar and then
execs the MPI binary; it is inserted between mpirun and the MPI binary:

mpirun psm_shctx my_mpi_app

Of course the same effect could be obtained if orted set the
envvar before starting the processes. There is however a problem:
deciding how many contexts to use. For maximum performance, one should
use a ratio of 1:1 between MPI ranks and contexts; the highest ratio
possible (but with the lowest performance) is 4 MPI ranks per context;
another restriction is that each job should have at least 1 context.

For example, on AMD cluster nodes with 4 CPUs of 12 cores each (48
cores in total) one gets 16 contexts; assigning all 16 contexts to 48
ranks would mean a ratio of 1:3, but this can only apply if the
allocation of cores is done in multiples of 4; with a less advantageous
allocation strategy, more contexts are lost to rounding up. At the
extreme, if there is only one rank per job, there can be at most 16
jobs - using all 16 contexts - and the remaining 32 cores have to stay
idle or be used for other jobs that don't require communication over
InfiniPath.
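
As a concrete sketch of that round-up arithmetic, using the figures
above (the helper name is made up for the example):

#!/bin/sh
# Context budgeting on the example node: 48 cores, 16 hardware contexts,
# sharing ratios from 1:1 (fastest) to 1:4 (most ranks per context).
contexts_needed() {                 # usage: contexts_needed RANKS RATIO
  echo $(( ($1 + $2 - 1) / $2 ))    # integer round-up of RANKS/RATIO
}
contexts_needed 48 3   # -> 16: one 48-rank job exactly fills the node at 1:3
contexts_needed 4 4    # ->  1: a 4-rank job fits into a single shared context
contexts_needed 1 4    # ->  1: even a 1-rank job occupies a whole context,
                       #        hence at most 16 single-rank jobs per node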

There is a further issue though: MPI-2 dynamic creation of processes -
if it's not known how many ranks there will be, I guess one should use
the highest context sharing ratio (1:4) to be on the safe side.

I've found a mention of this envvar being handled in the changelog for
MVAPICH2 1.4.1 - maybe that can serve as a source of inspiration? (But
I haven't looked at it...)

Hope this helps,
Bogdan


[OMPI users] MPI_Comm_spawn problem

2011-12-05 Thread Fernanda Oliveira
Hi,

I'm working with MPI_Comm_spawn and I have some error messages.

The code is relatively simple:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <math.h>
#include <mpi.h>

int main(int argc, char **argv){

    int rank, child_rank;
    char nomehost[20];
    MPI_Comm parent, intercomm1, intercomm2;
    int erro;
    int level, curr_level;

    MPI_Init(&argc, &argv);
    level = atoi(argv[1]);

    MPI_Comm_get_parent(&parent);

    if(parent == MPI_COMM_NULL){
        rank = 0;                /* no parent: this is the root of the tree */
    }
    else{
        /* the parent sends this process its rank within the tree */
        MPI_Recv(&rank, 1, MPI_INT, 0, 0, parent, MPI_STATUS_IGNORE);
    }

    curr_level = (int) log2(rank + 1);

    printf(" --> rank: %d and curr_level: %d\n", rank, curr_level);

    /* Node propagation */
    if(curr_level < level){

        /* 2^(curr_level+1) - 1 + 2*(rank - (2^curr_level - 1)) = 2*rank + 1 */
        child_rank = 2*rank + 1;
        printf("(%d) Before create rank %d\n", rank, child_rank);
        MPI_Comm_spawn(argv[0], &argv[1], 1, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &intercomm1, &erro);
        printf("(%d) After create rank %d\n", rank, child_rank);

        MPI_Send(&child_rank, 1, MPI_INT, 0, 0, intercomm1);

        /* sleep(1); */

        child_rank = child_rank + 1;
        printf("(%d) Before create rank %d\n", rank, child_rank);
        MPI_Comm_spawn(argv[0], &argv[1], 1, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &intercomm2, &erro);
        printf("(%d) After create rank %d\n", rank, child_rank);

        MPI_Send(&child_rank, 1, MPI_INT, 0, 0, intercomm2);
    }

    gethostname(nomehost, 20);
    printf("(%d) in %s\n", rank, nomehost);

    MPI_Finalize();
    return 0;
}

The program creates a binary tree of processes down to a specific
level, determined by the variable "level". If the level is 2, the tree
will be:

             (0)
           /     \
        (1)       (2)
       /   \     /   \
     (3)   (4) (5)   (6)

The error messages are (when I use 1 host):

Compiling: mpicc test.c -o test -lm
Running: mpirun -np 1 ./test 3

 --> rank: 0 and curr_level: 0
(0) Before create rank 1
(0) After create rank 1
(0) Before create rank 2
 --> rank: 1 and curr_level: 1
(1) Before create rank 3
[cacau.ic.uff.br:17892] [[31928,0],0] ORTE_ERROR_LOG: Not found in
file base/plm_base_launch_support.c at line 75

When I use 2 hosts, the error is worse. The code is similar to the one
shown here (I have to set the hosts before the spawn via MPI_Info_set).
Using LAM/MPI, the program runs normally.

I think something goes wrong when I try to use 2 MPI_Comm_spawn calls
consecutively and the child processes spawn further processes too.
It seems to be a race condition, because the error does not always
happen (when the level is 2, for example). With 3 levels or more, the
error is recurrent.

A similar error was previously posted in another thread:
http://www.open-mpi.org/community/lists/users/2009/12/11601.php
However, I used the stable version 1.4.4 and this problem still happens.
Do the developers plan to fix it?

Thanks,
Fernanda