Re: [OMPI users] Performance Issues with OpenMPI v1.5.1 MPI_Barrier on Windows XP SP3

2011-02-24 Thread David Zhang
How many cores does your processor have?

On Wed, Feb 23, 2011 at 8:52 PM, Li Zuwei  wrote:

>  Dear Users,
>
> I'm measuring barrier synchronization performance on the v1.5.1 build of
> OpenMPI. I am currently trying to measure synchronization performance on a
> single node, with 5 processes. I'm getting pretty weak results as follows:
>
> Testing procedure - initialize the timer at the start of the barrier, stop
> the timer when the process breaks from the barrier. Cycle through this N
> times and calculate the average.
>
> 1 Node 5 processes: 299.38ms
> 1 Node 7 processes: 513.95ms
> 1 Node 10 processes: 749.94ms
>
> I am wondering if this is the expected performance on a single node. I
> presume Open MPI automatically uses shared memory for barrier
> synchronization on a single node, which I think should be able to provide
> better performance when running on a single node. Is there a way to
> determine which transport layer I am using? I would greatly appreciate
> tips on how I can tune this performance.
>
> Regards,
> Zuwei
>
>
>
>
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
David Zhang
University of California, San Diego


[OMPI users] multicast not available

2011-02-24 Thread Vasiliy G Tolstov
Hello. I'm using VPS hosting under Xen.
Multicast is not available; only one-to-one TCP/UDP works. How can I use
Open MPI to send and receive data from many nodes in this environment?

-- 
Vasiliy G Tolstov 
Selfip.Ru



Re: [OMPI users] multicast not available

2011-02-24 Thread Jeff Squyres (jsquyres)
I'm not sure what you're asking. Open MPI should work just fine in a Xen 
environment. 

If you're unsure about how to use the MPI API, you might want to take a 
tutorial to get you familiar with MPI concepts, etc. Google around; there are a 
bunch available. My personal favorite is at the UIUC NCSA web site; if you sign 
up for a free account, they have a beginners and intermediate MPI tutorials 
available. 

Sent from my PDA. No type good. 

On Feb 24, 2011, at 4:37 AM, "Vasiliy G Tolstov"  wrote:

> Hello. I'm use vps hosting under xen.
> Multicast not available, only tcp/udp one to one worked. How can i use
> openmpi to send and recive data from many nodes in this environment?
> 
> -- 
> Vasiliy G Tolstov 
> Selfip.Ru
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Performance Issues with OpenMPI v1.5.1 MPI_Barrier on Windows XP SP3

2011-02-24 Thread Jeff Squyres (jsquyres)
You should:

- do N warmup barriers
- start the timers
- do M barriers (M should be a lot)
- stop the timers
- divide the time by M

Benchmarking is tricky to get right. 
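
In code, that recipe looks roughly like the following (a minimal sketch; the
warmup and iteration counts are arbitrary placeholders):

#include <stdio.h>
#include <mpi.h>

#define N_WARMUP 100      /* placeholder warmup count */
#define N_ITERS  10000    /* placeholder measurement count; should be large */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* warmup barriers: let connections and shared-memory setup happen first */
    for (int i = 0; i < N_WARMUP; i++)
        MPI_Barrier(MPI_COMM_WORLD);

    /* timed barriers */
    double t0 = MPI_Wtime();
    for (int i = 0; i < N_ITERS; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("average MPI_Barrier time: %g us\n",
               (t1 - t0) / N_ITERS * 1.0e6);

    MPI_Finalize();
    return 0;
}

Timing the whole loop once and dividing by M, as above, also avoids the
resolution limits of timing a single barrier.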

Sent from my PDA. No type good. 

On Feb 23, 2011, at 11:54 PM, "Li Zuwei"  wrote:

> Dear Users,
> 
> I'm measuring barrier synchronization performance on the v1.5.1 build of 
> OpenMPI. I am currently trying to measure synchronization performance on a 
> single node, with 5 processes. I'm getting pretty weak results as follows:
> 
> Testing procedure - initialize the timer at the start of the barrier, stop 
> the timer when the process break from the barrier. Cycle through N number of 
> times and calculate the average.
> 
> 1 Node 5 processes: 299.38ms
> 1 Node 7 processes: 513.95ms
> 1 Node 10 processes: 749.94ms
> 
> I am wondering if this is the expected performance on a single nodes. I 
> presume Open MPI automatically uses Shared Memory for barrier synchronization 
> on a single node which I think should be able to provide better performance 
> when running on a single node. Is there a way to determine what transport 
> layer I am using and I would greatly appreciate tips on how can I tune this 
> performance.
> 
> Regards,
> Zuwei
> 
> 
> 
> 
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] multicast not available

2011-02-24 Thread Vasiliy G Tolstov
On Thu, 2011-02-24 at 06:32 -0500, Jeff Squyres (jsquyres) wrote:
> I'm not sure what you're asking. Open MPI should work just fine in a Xen 
> environment. 
> 
> If you're unsure about how to use the MPI API, you might want to take a 
> tutorial to get you familiar with MPI concepts, etc. Google around; there are 
> a bunch available. My personal favorite is at the UIUC NCSA web site; if you 
> sign up for a free account, they have a beginners and intermediate MPI 
> tutorials available. 
> 
> Sent from my PDA. No type good. 
> 
> On Feb 24, 2011, at 4:37 AM, "Vasiliy G Tolstov"  wrote:

This is not a Xen-specific problem but a routing-specific one. The admins deny
multicast packets.

-- 
Vasiliy G Tolstov 
Selfip.Ru



Re: [OMPI users] nonblock alternative to MPI_Win_complete

2011-02-24 Thread Toon Knapen
In that case, I have a small question concerning design:

Suppose task-based parallelism where one node (master) distributes
work/tasks to 2 other nodes (slaves) by means of an MPI_Put. The master
allocates 2 buffers locally in which it stores all the data that a slave
needs to perform its task. So I do an MPI_Put on each of my 2 buffers to
send each buffer to a specific slave. Now I need to know when I can reuse
one of my buffers to store the next task (that I will MPI_Put later on).
The only way to know this is to call MPI_Win_complete. But since that call
is blocking, if this buffer is not yet ready to be reused, I cannot even
check whether the other buffer is already available to me again (in the
same thread).
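
To make the pattern concrete, here is a minimal sketch of the scenario just
described (rank numbers, buffer sizes, and the window setup are illustrative
only, not taken from the actual application):

#include <stdio.h>
#include <mpi.h>

#define COUNT 1024                                /* illustrative task size */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (nprocs != 3) {                            /* 1 master + 2 slaves */
        if (rank == 0) fprintf(stderr, "run with exactly 3 processes\n");
        MPI_Finalize();
        return 1;
    }

    double task[COUNT];                           /* target buffer on every rank */
    MPI_Win win;
    MPI_Win_create(task, COUNT * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Group world_group, peer_group;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    if (rank == 0) {                              /* master */
        int slaves[2] = {1, 2};
        MPI_Group_incl(world_group, 2, slaves, &peer_group);

        double buf0[COUNT], buf1[COUNT];          /* the two task buffers */
        for (int i = 0; i < COUNT; i++) { buf0[i] = i; buf1[i] = 2 * i; }

        MPI_Win_start(peer_group, 0, win);        /* open the access epoch */
        MPI_Put(buf0, COUNT, MPI_DOUBLE, 1, 0, COUNT, MPI_DOUBLE, win);
        MPI_Put(buf1, COUNT, MPI_DOUBLE, 2, 0, COUNT, MPI_DOUBLE, win);
        MPI_Win_complete(win);                    /* blocks until both Puts are
                                                     locally complete; only then
                                                     may buf0/buf1 be reused */
    } else {                                      /* slaves */
        int master[1] = {0};
        MPI_Group_incl(world_group, 1, master, &peer_group);

        MPI_Win_post(peer_group, 0, win);         /* expose the window     */
        MPI_Win_wait(win);                        /* task data has arrived */
        printf("rank %d received task starting with %g\n", rank, task[0]);
    }

    MPI_Group_free(&peer_group);
    MPI_Group_free(&world_group);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

The sticking point is the final MPI_Win_complete: it returns only once both
Puts are locally complete, so there is no way to learn that just one of the
two buffers has become reusable.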

I would very much appreciate input on how to solve such an issue!

thanks in advance,

toon

On Tue, Feb 22, 2011 at 7:21 PM, Barrett, Brian W wrote:

>  On Feb 18, 2011, at 8:59 AM, Toon Knapen wrote:
>
> > (Probably this issue has been discussed at length before but
> unfortunately I did not find any threads (on this site or anywhere else) on
> this topic, if you are able to provide me with links to earlier discussions
> on this topic, please do not hesitate)
> >
> > Is there an alternative to MPI_Win_complete that does not 'enforce
> completion of preceding RMS calls at the origin' (as said on pag 353 of the
> mpi-2.2 standard) ?
> >
> > I would like to know if I can reuse the buffer I gave to MPI_Put but
> without blocking on it, if the MPI lib is still using it, I want to be able
> to continue (and use another buffer).
>
>
> There is not.   MPI_Win_complete is the only way to finish a MPI_Win_start
> epoch, and is always blocking until local completion of all messages started
> during the epoch.
>
> Brian
>
> --
>  Brian W. Barrett
>  Dept. 1423: Scalable System Software
>  Sandia National Laboratories
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] RoCE (IBoE) & OpenMPI

2011-02-24 Thread Michael Shuey
Late yesterday I did have a chance to test the patch Jeff provided
(against 1.4.3 - testing 1.5.x is on the docket for today).  While it
works, in that I can specify a gid_index, it doesn't do everything
required - my traffic won't match a lossless CoS on the ethernet
switch.  Specifying a GID is only half of it; I really need to also
specify a service level.

The bottom 3 bits of the IB SL are mapped to ethernet's PCP bits in
the VLAN tag.  With a non-default gid, I can select an available VLAN
(so RoCE's packets will include the PCP bits), but the only way to
specify a priority is to use an SL.  So far, the only RoCE-enabled app
I've been able to make work correctly (such that traffic matches a
lossless CoS on the switch) is ibv_rc_pingpong - and then, I need to
use both a specific GID and a specific SL.

The slides Pavel found seem a little misleading to me.  The VLAN isn't
determined by the bound netdev; all VLAN netdevs map to the same IB
adapter for RoCE.  The VLAN is determined by the GID index.  Also, the SL
isn't determined by a set kernel policy; it's provided via the IB
interfaces.  As near as I can tell from Mellanox's documentation, OFED
test apps, and the driver source, a RoCE adapter is an InfiniBand card
in almost all respects (even more so than an iWARP adapter).

--
Mike Shuey



On Wed, Feb 23, 2011 at 5:03 PM, Jeff Squyres  wrote:
> On Feb 23, 2011, at 3:54 PM, Shamis, Pavel wrote:
>
>> I remember that I updated the trunk to select the RDMACM connection 
>> manager by default for RoCE ports - https://svn.open-mpi.org/trac/ompi/changeset/22311
>>
>> I'm not sure if the change made its way into any production version. I don't 
>> work on that part of the code anymore :-)
>
> Mellanox -- can you follow up on this?
>
> Also, in addition to the patches I provided for selecting an arbitrary GID (I 
> was planning on committing them when Mike tested them at Purdue, but perhaps 
> I should just commit to the trunk anyway), perhaps we should check if a 
> non-default SL is supplied via MCA param in the RoCE case and output an 
> orte_show_help to warn that it will have no effect (i.e., principle of least 
> surprise and all that).
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



Re: [OMPI users] nonblock alternative to MPI_Win_complete

2011-02-24 Thread James Dinan

Hi Toon,

Can you use non-blocking send/recv?  It sounds like this will give you 
the completion semantics you want.
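
As a rough sketch of what that could look like for the two-buffer case 
described earlier (buffer sizes, destination ranks, and tags are 
placeholders):

double buf0[1024], buf1[1024];        /* the master's two task buffers      */
MPI_Request req[2];
int done0 = 0, done1 = 0;

/* ... fill buf0 and buf1 with task data ... */

MPI_Isend(buf0, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req[0]);
MPI_Isend(buf1, 1024, MPI_DOUBLE, 2, 0, MPI_COMM_WORLD, &req[1]);

/* Later, still in the same thread, each buffer can be polled on its own;   */
/* a buffer may be refilled as soon as its own send has completed.          */
MPI_Test(&req[0], &done0, MPI_STATUS_IGNORE);   /* buf0 reusable if done0   */
MPI_Test(&req[1], &done1, MPI_STATUS_IGNORE);   /* buf1 reusable if done1   */

MPI_Waitany or MPI_Testany over the request array would likewise report 
whichever buffer frees up first.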


Best,
 ~Jim.

On 2/24/11 6:07 AM, Toon Knapen wrote:

In that case, I have a small question concerning design:
Suppose task-based parallellism where one node (master) distributes
work/tasks to 2 other nodes (slaves) by means of an MPI_Put. The master
allocates 2 buffers locally in which it will store all necessary data
that is needed by the slave to perform the task. So I do an MPI_Put on
each of my 2 buffers to send each buffer to a specific slave. Now I need
to know when I can reuse one of my buffers to already store the next
task (that I will MPI_Put later on). The only way to know this is call
MPI_Complete. But since this is blocking and if this buffer is not ready
to be reused yet, I can neither verify if the other buffer is already
available to me again (in the same thread).
I would very much appreciate input on how to solve such issue !
thanks in advance,
toon
On Tue, Feb 22, 2011 at 7:21 PM, Barrett, Brian W <bwba...@sandia.gov> wrote:

On Feb 18, 2011, at 8:59 AM, Toon Knapen wrote:

 > (Probably this issue has been discussed at length before but
unfortunately I did not find any threads (on this site or anywhere
else) on this topic, if you are able to provide me with links to
earlier discussions on this topic, please do not hesitate)
 >
 > Is there an alternative to MPI_Win_complete that does not
'enforce completion of preceding RMS calls at the origin' (as said
on pag 353 of the mpi-2.2 standard) ?
 >
 > I would like to know if I can reuse the buffer I gave to MPI_Put
but without blocking on it, if the MPI lib is still using it, I want
to be able to continue (and use another buffer).


There is not.   MPI_Win_complete is the only way to finish a
MPI_Win_start epoch, and is always blocking until local completion
of all messages started during the epoch.

Brian

--
  Brian W. Barrett
  Dept. 1423: Scalable System Software
  Sandia National Laboratories



___
users mailing list
us...@open-mpi.org 
http://www.open-mpi.org/mailman/listinfo.cgi/users




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] nonblock alternative to MPI_Win_complete

2011-02-24 Thread Toon Knapen
But that is what surprises me. Indeed, the scenario I described can be
implemented using two-sided communication, but it seems not to be possible
when using one-sided communication.

Additionally, the MPI 2.2 standard describes on page 356 the matching rules
for post and start, complete and wait, and there it says:
"MPI_WIN_COMPLETE(win)
initiate a nonblocking send with tag tag1 to each process in the group of
the preceding start call. No need to wait for the completion of these
sends."
The wording 'nonblocking send' startles me somehow!?

toon


On Thu, Feb 24, 2011 at 2:05 PM, James Dinan  wrote:

> Hi Toon,
>
> Can you use non-blocking send/recv?  It sounds like this will give you the
> completion semantics you want.
>
> Best,
>  ~Jim.
>
>
> On 2/24/11 6:07 AM, Toon Knapen wrote:
>
>> In that case, I have a small question concerning design:
>> Suppose task-based parallellism where one node (master) distributes
>> work/tasks to 2 other nodes (slaves) by means of an MPI_Put. The master
>> allocates 2 buffers locally in which it will store all necessary data
>> that is needed by the slave to perform the task. So I do an MPI_Put on
>> each of my 2 buffers to send each buffer to a specific slave. Now I need
>> to know when I can reuse one of my buffers to already store the next
>> task (that I will MPI_Put later on). The only way to know this is call
>> MPI_Complete. But since this is blocking and if this buffer is not ready
>> to be reused yet, I can neither verify if the other buffer is already
>> available to me again (in the same thread).
>> I would very much appreciate input on how to solve such issue !
>> thanks in advance,
>> toon
>> On Tue, Feb 22, 2011 at 7:21 PM, Barrett, Brian W > > wrote:
>>
>>On Feb 18, 2011, at 8:59 AM, Toon Knapen wrote:
>>
>> > (Probably this issue has been discussed at length before but
>>unfortunately I did not find any threads (on this site or anywhere
>>else) on this topic, if you are able to provide me with links to
>>earlier discussions on this topic, please do not hesitate)
>> >
>> > Is there an alternative to MPI_Win_complete that does not
>>'enforce completion of preceding RMS calls at the origin' (as said
>>on pag 353 of the mpi-2.2 standard) ?
>> >
>> > I would like to know if I can reuse the buffer I gave to MPI_Put
>>but without blocking on it, if the MPI lib is still using it, I want
>>to be able to continue (and use another buffer).
>>
>>
>>There is not.   MPI_Win_complete is the only way to finish a
>>MPI_Win_start epoch, and is always blocking until local completion
>>of all messages started during the epoch.
>>
>>Brian
>>
>>--
>>  Brian W. Barrett
>>  Dept. 1423: Scalable System Software
>>  Sandia National Laboratories
>>
>>
>>
>>___
>>users mailing list
>>us...@open-mpi.org 
>>
>>http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] multicast not available

2011-02-24 Thread Jeff Squyres
I'm still not sure what you're asking -- are you asking how to get Open MPI to 
work if multicast is disabled in your network?

If so, not to worry; Open MPI doesn't currently use multicast.


On Feb 24, 2011, at 6:39 AM, Vasiliy G Tolstov wrote:

> On Thu, 2011-02-24 at 06:32 -0500, Jeff Squyres (jsquyres) wrote:
>> I'm not sure what you're asking. Open MPI should work just fine in a Xen 
>> environment. 
>> 
>> If you're unsure about how to use the MPI API, you might want to take a 
>> tutorial to get you familiar with MPI concepts, etc. Google around; there 
>> are a bunch available. My personal favorite is at the UIUC NCSA web site; if 
>> you sign up for a free account, they have a beginners and intermediate MPI 
>> tutorials available. 
>> 
>> Sent from my PDA. No type good. 
>> 
>> On Feb 24, 2011, at 4:37 AM, "Vasiliy G Tolstov"  wrote:
> 
> This is not xen specific problem. but routing specific. Admins deny
> multicast packets.
> 
> -- 
> Vasiliy G Tolstov 
> Selfip.Ru
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] nonblock alternative to MPI_Win_complete

2011-02-24 Thread Jeff Squyres
I personally find the entire MPI one-sided chapter to be incredibly confusing 
and subject to arbitrary interpretation.  I have consistently advised people to 
not use it since the late '90s.

That being said, the MPI one-sided chapter is being overhauled in the MPI-3 
forum; the standardization process for that chapter is getting pretty close to 
consensus.  The new chapter is promised to be much better.

My $0.02 is that you might be better served staying away from the MPI-2 
one-sided stuff because of exactly the surprises and limitations that you've 
run into, and wait for MPI-3 implementations for real one-sided support.


On Feb 24, 2011, at 8:21 AM, Toon Knapen wrote:

> But that is what surprises me. Indeed the scenario I described can be 
> implemented using two-sided communication, but it seems not to be possible 
> when using one sided communication.
>  
> Additionally the MPI 2.2. standard describes on page 356 the matching rules 
> for post and start, complete and wait and there it says : 
> "MPI_WIN_COMPLETE(win) initiate a nonblocking send with tag tag1 to each 
> process in the group of the preceding start call. No need to wait for the 
> completion of these sends." 
> The wording 'nonblocking send' startles me somehow !?
>  
> toon
> 
>  
> On Thu, Feb 24, 2011 at 2:05 PM, James Dinan  wrote:
> Hi Toon,
> 
> Can you use non-blocking send/recv?  It sounds like this will give you the 
> completion semantics you want.
> 
> Best,
>  ~Jim.
> 
> 
> On 2/24/11 6:07 AM, Toon Knapen wrote:
> In that case, I have a small question concerning design:
> Suppose task-based parallellism where one node (master) distributes
> work/tasks to 2 other nodes (slaves) by means of an MPI_Put. The master
> allocates 2 buffers locally in which it will store all necessary data
> that is needed by the slave to perform the task. So I do an MPI_Put on
> each of my 2 buffers to send each buffer to a specific slave. Now I need
> to know when I can reuse one of my buffers to already store the next
> task (that I will MPI_Put later on). The only way to know this is call
> MPI_Complete. But since this is blocking and if this buffer is not ready
> to be reused yet, I can neither verify if the other buffer is already
> available to me again (in the same thread).
> I would very much appreciate input on how to solve such issue !
> thanks in advance,
> toon
> On Tue, Feb 22, 2011 at 7:21 PM, Barrett, Brian W  > wrote:
> 
>On Feb 18, 2011, at 8:59 AM, Toon Knapen wrote:
> 
> > (Probably this issue has been discussed at length before but
>unfortunately I did not find any threads (on this site or anywhere
>else) on this topic, if you are able to provide me with links to
>earlier discussions on this topic, please do not hesitate)
> >
> > Is there an alternative to MPI_Win_complete that does not
>'enforce completion of preceding RMS calls at the origin' (as said
>on pag 353 of the mpi-2.2 standard) ?
> >
> > I would like to know if I can reuse the buffer I gave to MPI_Put
>but without blocking on it, if the MPI lib is still using it, I want
>to be able to continue (and use another buffer).
> 
> 
>There is not.   MPI_Win_complete is the only way to finish a
>MPI_Win_start epoch, and is always blocking until local completion
>of all messages started during the epoch.
> 
>Brian
> 
>--
>  Brian W. Barrett
>  Dept. 1423: Scalable System Software
>  Sandia National Laboratories
> 
> 
> 
>___
>users mailing list
>us...@open-mpi.org 
> 
>http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] RoCE (IBoE) & OpenMPI

2011-02-24 Thread Jeff Squyres
On Feb 24, 2011, at 8:00 AM, Michael Shuey wrote:

> Late yesterday I did have a chance to test the patch Jeff provided
> (against 1.4.3 - testing 1.5.x is on the docket for today).  While it
> works, in that I can specify a gid_index,

Great!  I'll commit that to the trunk and start the process of moving it to the 
v1.5.x series (I know you haven't tested it yet, but it's essentially the same 
patch, just slightly adjusted for each of the 3 branches).

> it doesn't do everything
> required - my traffic won't match a lossless CoS on the ethernet
> switch.  Specifying a GID is only half of it; I really need to also
> specify a service level.

RoCE requires the use of the RDMA CM (I think?), and I didn't think there was a 
way to request a specific SL via the RDMA CM...?  (I could certainly be wrong 
here)

I think Mellanox will need to follow up with these questions...

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] SLURM environment variables at runtime

2011-02-24 Thread Jeff Squyres
I'm afraid I don't see the problem.  Let's get 4 nodes from slurm:

$ salloc -N 4 

Now let's run env and see what SLURM_ env variables we see:

$ srun env | egrep ^SLURM_ | head
SLURM_JOB_ID=95523
SLURM_JOB_NUM_NODES=4
SLURM_JOB_NODELIST=svbu-mpi[001-004]
SLURM_JOB_CPUS_PER_NODE=4(x4)
SLURM_JOBID=95523
SLURM_NNODES=4
SLURM_NODELIST=svbu-mpi[001-004]
SLURM_TASKS_PER_NODE=1(x4)
SLURM_PRIO_PROCESS=0
SLURM_UMASK=0002
$ srun env | egrep ^SLURM_ | wc -l
144

Good -- there's 144 of them.  Let's save them to a file for comparison, later.

$ srun env | egrep ^SLURM_ | sort > srun.out

Now let's repeat the process with mpirun.  Note that mpirun defaults to running 
one process per core (vs. srun's default of running one per node).  So let's 
tone mpirun down to use one process per node and look for the SLURM_ env 
variables.

$ mpirun -np 4 --bynode env | egrep ^SLURM_ | head
SLURM_JOB_ID=95523
SLURM_JOB_NUM_NODES=4
SLURM_JOB_NODELIST=svbu-mpi[001-004]
SLURM_JOB_ID=95523
SLURM_JOB_NUM_NODES=4
SLURM_JOB_CPUS_PER_NODE=4(x4)
SLURM_JOBID=95523
SLURM_NNODES=4
SLURM_NODELIST=svbu-mpi[001-004]
SLURM_TASKS_PER_NODE=1(x4)
$ mpirun -np 4 --bynode env | egrep ^SLURM_ | wc -l
144

Good -- we also got 144.  Save them to a file.

$ mpirun -np 4 --bynode env | egrep ^SLURM_ | sort > mpirun.out

Now let's compare what we got from srun and from mpirun:

$ diff srun.out mpirun.out 
93,108c93,108
< SLURM_SRUN_COMM_PORT=33571
< SLURM_SRUN_COMM_PORT=33571
< SLURM_SRUN_COMM_PORT=33571
< SLURM_SRUN_COMM_PORT=33571
< SLURM_STEP_ID=15
< SLURM_STEP_ID=15
< SLURM_STEP_ID=15
< SLURM_STEP_ID=15
< SLURM_STEPID=15
< SLURM_STEPID=15
< SLURM_STEPID=15
< SLURM_STEPID=15
< SLURM_STEP_LAUNCHER_PORT=33571
< SLURM_STEP_LAUNCHER_PORT=33571
< SLURM_STEP_LAUNCHER_PORT=33571
< SLURM_STEP_LAUNCHER_PORT=33571
---
> SLURM_SRUN_COMM_PORT=54184
> SLURM_SRUN_COMM_PORT=54184
> SLURM_SRUN_COMM_PORT=54184
> SLURM_SRUN_COMM_PORT=54184
> SLURM_STEP_ID=18
> SLURM_STEP_ID=18
> SLURM_STEP_ID=18
> SLURM_STEP_ID=18
> SLURM_STEPID=18
> SLURM_STEPID=18
> SLURM_STEPID=18
> SLURM_STEPID=18
> SLURM_STEP_LAUNCHER_PORT=54184
> SLURM_STEP_LAUNCHER_PORT=54184
> SLURM_STEP_LAUNCHER_PORT=54184
> SLURM_STEP_LAUNCHER_PORT=54184
125,128c125,128
< SLURM_TASK_PID=3899
< SLURM_TASK_PID=3907
< SLURM_TASK_PID=3908
< SLURM_TASK_PID=3997
---
> SLURM_TASK_PID=3924
> SLURM_TASK_PID=3933
> SLURM_TASK_PID=3934
> SLURM_TASK_PID=4039
$

They're identical except for per-step values (ports, PIDs, etc.) -- these 
differences are expected.

What version of OMPI are you running?  What happens if you repeat this 
experiment?

I would find it very strange if Open MPI's mpirun is filtering some SLURM env 
variables to some processes and not to all -- your output shows disparate 
output between the different processes.  That's just plain weird.



On Feb 23, 2011, at 12:05 PM, Henderson, Brent wrote:

> SLURM seems to be doing this in the case of a regular srun:
>  
> [brent@node1 mpi]$ srun -N 2 -n 4 env | egrep 
> SLURM_NODEID\|SLURM_PROCID\|SLURM_LOCALID | sort
> SLURM_LOCALID=0
> SLURM_LOCALID=0
> SLURM_LOCALID=1
> SLURM_LOCALID=1
> SLURM_NODEID=0
> SLURM_NODEID=0
> SLURM_NODEID=1
> SLURM_NODEID=1
> SLURM_PROCID=0
> SLURM_PROCID=1
> SLURM_PROCID=2
> SLURM_PROCID=3
> [brent@node1 mpi]$
>  
> Since srun is not supported currently by OpenMPI, I have to use salloc – 
> right?  In this case, it is up to OpenMPI to interpret the SLURM environment 
> variables it sees in the one process that is launched and ‘do the right 
> thing’ – whatever that means in this case.  How does OpenMPI start the 
> processes on the remote nodes under the covers?  (use srun, generate a 
> hostfile and launch as you would outside SLURM, …)  This may be the 
> difference between HP-MPI and OpenMPI.
>  
> Thanks,
>  
> Brent
>  
>  
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> Behalf Of Ralph Castain
> Sent: Wednesday, February 23, 2011 10:07 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] SLURM environment variables at runtime
>  
> Resource managers generally frown on the idea of any program passing 
> RM-managed envars from one node to another, and this is certainly true of 
> slurm. The reason is that the RM reserves those values for its own use when 
> managing remote nodes. For example, if you got an allocation and then used 
> mpirun to launch a job across only a portion of that allocation, and then ran 
> another mpirun instance in parallel on the remainder of the nodes, the slurm 
> envars for those two mpirun instances -need- to be quite different. Having 
> mpirun forward the values it sees would cause the system to become very 
> confused.
>  
> We learned the hard way never to cross that line :-(
>  
> You have two options:
>  
> (a) you could get your sys admin to configure slurm correctly to provide your 
> desired envars on the remote nodes. This is the recommended (by slurm and 
> other RMs) way of getting what you requested. It is a simple configuration 
> option - if he needs he

Re: [OMPI users] SLURM environment variables at runtime

2011-02-24 Thread Ralph Castain
Like I said, this isn't an OMPI problem. You have your slurm configured to
pass certain envars to the remote nodes, and Brent doesn't. It truly is just
that simple.

I've seen this before with other slurm installations. Which envars get set
on the backend is configurable, that's all.

Has nothing to do with OMPI.


On Thu, Feb 24, 2011 at 7:18 AM, Jeff Squyres  wrote:

> I'm afraid I don't see the problem.  Let's get 4 nodes from slurm:
>
> $ salloc -N 4
>
> Now let's run env and see what SLURM_ env variables we see:
>
> $ srun env | egrep ^SLURM_ | head
> SLURM_JOB_ID=95523
> SLURM_JOB_NUM_NODES=4
> SLURM_JOB_NODELIST=svbu-mpi[001-004]
> SLURM_JOB_CPUS_PER_NODE=4(x4)
> SLURM_JOBID=95523
> SLURM_NNODES=4
> SLURM_NODELIST=svbu-mpi[001-004]
> SLURM_TASKS_PER_NODE=1(x4)
> SLURM_PRIO_PROCESS=0
> SLURM_UMASK=0002
> $ srun env | egrep ^SLURM_ | wc -l
> 144
>
> Good -- there's 144 of them.  Let's save them to a file for comparison,
> later.
>
> $ srun env | egrep ^SLURM_ | sort > srun.out
>
> Now let's repeat the process with mpirun.  Note that mpirun defaults to
> running one process per core (vs. srun's default of running one per node).
>  So let's tone mpirun down to use one process per node and look for the
> SLURM_ env variables.
>
> $ mpirun -np 4 --bynode env | egrep ^SLURM_ | head
> SLURM_JOB_ID=95523
> SLURM_JOB_NUM_NODES=4
> SLURM_JOB_NODELIST=svbu-mpi[001-004]
> SLURM_JOB_ID=95523
> SLURM_JOB_NUM_NODES=4
> SLURM_JOB_CPUS_PER_NODE=4(x4)
> SLURM_JOBID=95523
> SLURM_NNODES=4
> SLURM_NODELIST=svbu-mpi[001-004]
> SLURM_TASKS_PER_NODE=1(x4)
> $ mpirun -np 4 --bynode env | egrep ^SLURM_ | wc -l
> 144
>
> Good -- we also got 144.  Save them to a file.
>
> $ mpirun -np 4 --bynode env | egrep ^SLURM_ | sort > mpirun.out
>
> Now let's compare what we got from srun and from mpirun:
>
> $ diff srun.out mpirun.out
> 93,108c93,108
> < SLURM_SRUN_COMM_PORT=33571
> < SLURM_SRUN_COMM_PORT=33571
> < SLURM_SRUN_COMM_PORT=33571
> < SLURM_SRUN_COMM_PORT=33571
> < SLURM_STEP_ID=15
> < SLURM_STEP_ID=15
> < SLURM_STEP_ID=15
> < SLURM_STEP_ID=15
> < SLURM_STEPID=15
> < SLURM_STEPID=15
> < SLURM_STEPID=15
> < SLURM_STEPID=15
> < SLURM_STEP_LAUNCHER_PORT=33571
> < SLURM_STEP_LAUNCHER_PORT=33571
> < SLURM_STEP_LAUNCHER_PORT=33571
> < SLURM_STEP_LAUNCHER_PORT=33571
> ---
> > SLURM_SRUN_COMM_PORT=54184
> > SLURM_SRUN_COMM_PORT=54184
> > SLURM_SRUN_COMM_PORT=54184
> > SLURM_SRUN_COMM_PORT=54184
> > SLURM_STEP_ID=18
> > SLURM_STEP_ID=18
> > SLURM_STEP_ID=18
> > SLURM_STEP_ID=18
> > SLURM_STEPID=18
> > SLURM_STEPID=18
> > SLURM_STEPID=18
> > SLURM_STEPID=18
> > SLURM_STEP_LAUNCHER_PORT=54184
> > SLURM_STEP_LAUNCHER_PORT=54184
> > SLURM_STEP_LAUNCHER_PORT=54184
> > SLURM_STEP_LAUNCHER_PORT=54184
> 125,128c125,128
> < SLURM_TASK_PID=3899
> < SLURM_TASK_PID=3907
> < SLURM_TASK_PID=3908
> < SLURM_TASK_PID=3997
> ---
> > SLURM_TASK_PID=3924
> > SLURM_TASK_PID=3933
> > SLURM_TASK_PID=3934
> > SLURM_TASK_PID=4039
> $
>
> They're identical except for per-step values (ports, PIDs, etc.) -- these
> differences are expected.
>
> What version of OMPI are you running?  What happens if you repeat this
> experiment?
>
> I would find it very strange if Open MPI's mpirun is filtering some SLURM
> env variables to some processes and not to all -- your output shows
> disparate output between the different processes.  That's just plain weird.
>
>
>
> On Feb 23, 2011, at 12:05 PM, Henderson, Brent wrote:
>
> > SLURM seems to be doing this in the case of a regular srun:
> >
> > [brent@node1 mpi]$ srun -N 2 -n 4 env | egrep
> SLURM_NODEID\|SLURM_PROCID\|SLURM_LOCALID | sort
> > SLURM_LOCALID=0
> > SLURM_LOCALID=0
> > SLURM_LOCALID=1
> > SLURM_LOCALID=1
> > SLURM_NODEID=0
> > SLURM_NODEID=0
> > SLURM_NODEID=1
> > SLURM_NODEID=1
> > SLURM_PROCID=0
> > SLURM_PROCID=1
> > SLURM_PROCID=2
> > SLURM_PROCID=3
> > [brent@node1 mpi]$
> >
> > Since srun is not supported currently by OpenMPI, I have to use salloc –
> right?  In this case, it is up to OpenMPI to interpret the SLURM environment
> variables it sees in the one process that is launched and ‘do the right
> thing’ – whatever that means in this case.  How does OpenMPI start the
> processes on the remote nodes under the covers?  (use srun, generate a
> hostfile and launch as you would outside SLURM, …)  This may be the
> difference between HP-MPI and OpenMPI.
> >
> > Thanks,
> >
> > Brent
> >
> >
> > From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Ralph Castain
> > Sent: Wednesday, February 23, 2011 10:07 AM
> > To: Open MPI Users
> > Subject: Re: [OMPI users] SLURM environment variables at runtime
> >
> > Resource managers generally frown on the idea of any program passing
> RM-managed envars from one node to another, and this is certainly true of
> slurm. The reason is that the RM reserves those values for its own use when
> managing remote nodes. For example, if you got an allocation and then used
> mpirun to launch a job across only a portion of that allocation, and then
>

Re: [OMPI users] SLURM environment variables at runtime

2011-02-24 Thread Jeff Squyres
The weird thing is that when running his test, he saw different results with HP 
MPI vs. Open MPI.

What his test didn't say was whether those were the same exact nodes or not.  
It would be good to repeat my experiment with the same exact nodes (e.g., 
inside one SLURM salloc job, or use the -w param to specify the same nodes for 
salloc for OMPI and srun for HP MPI).


On Feb 24, 2011, at 10:02 AM, Ralph Castain wrote:

> Like I said, this isn't an OMPI problem. You have your slurm configured to 
> pass certain envars to the remote nodes, and Brent doesn't. It truly is just 
> that simple.
> 
> I've seen this before with other slurm installations. Which envars get set on 
> the backend is configurable, that's all.
> 
> Has nothing to do with OMPI.
> 
> 
> On Thu, Feb 24, 2011 at 7:18 AM, Jeff Squyres  wrote:
> I'm afraid I don't see the problem.  Let's get 4 nodes from slurm:
> 
> $ salloc -N 4
> 
> Now let's run env and see what SLURM_ env variables we see:
> 
> $ srun env | egrep ^SLURM_ | head
> SLURM_JOB_ID=95523
> SLURM_JOB_NUM_NODES=4
> SLURM_JOB_NODELIST=svbu-mpi[001-004]
> SLURM_JOB_CPUS_PER_NODE=4(x4)
> SLURM_JOBID=95523
> SLURM_NNODES=4
> SLURM_NODELIST=svbu-mpi[001-004]
> SLURM_TASKS_PER_NODE=1(x4)
> SLURM_PRIO_PROCESS=0
> SLURM_UMASK=0002
> $ srun env | egrep ^SLURM_ | wc -l
> 144
> 
> Good -- there's 144 of them.  Let's save them to a file for comparison, later.
> 
> $ srun env | egrep ^SLURM_ | sort > srun.out
> 
> Now let's repeat the process with mpirun.  Note that mpirun defaults to 
> running one process per core (vs. srun's default of running one per node).  
> So let's tone mpirun down to use one process per node and look for the SLURM_ 
> env variables.
> 
> $ mpirun -np 4 --bynode env | egrep ^SLURM_ | head
> SLURM_JOB_ID=95523
> SLURM_JOB_NUM_NODES=4
> SLURM_JOB_NODELIST=svbu-mpi[001-004]
> SLURM_JOB_ID=95523
> SLURM_JOB_NUM_NODES=4
> SLURM_JOB_CPUS_PER_NODE=4(x4)
> SLURM_JOBID=95523
> SLURM_NNODES=4
> SLURM_NODELIST=svbu-mpi[001-004]
> SLURM_TASKS_PER_NODE=1(x4)
> $ mpirun -np 4 --bynode env | egrep ^SLURM_ | wc -l
> 144
> 
> Good -- we also got 144.  Save them to a file.
> 
> $ mpirun -np 4 --bynode env | egrep ^SLURM_ | sort > mpirun.out
> 
> Now let's compare what we got from srun and from mpirun:
> 
> $ diff srun.out mpirun.out
> 93,108c93,108
> < SLURM_SRUN_COMM_PORT=33571
> < SLURM_SRUN_COMM_PORT=33571
> < SLURM_SRUN_COMM_PORT=33571
> < SLURM_SRUN_COMM_PORT=33571
> < SLURM_STEP_ID=15
> < SLURM_STEP_ID=15
> < SLURM_STEP_ID=15
> < SLURM_STEP_ID=15
> < SLURM_STEPID=15
> < SLURM_STEPID=15
> < SLURM_STEPID=15
> < SLURM_STEPID=15
> < SLURM_STEP_LAUNCHER_PORT=33571
> < SLURM_STEP_LAUNCHER_PORT=33571
> < SLURM_STEP_LAUNCHER_PORT=33571
> < SLURM_STEP_LAUNCHER_PORT=33571
> ---
> > SLURM_SRUN_COMM_PORT=54184
> > SLURM_SRUN_COMM_PORT=54184
> > SLURM_SRUN_COMM_PORT=54184
> > SLURM_SRUN_COMM_PORT=54184
> > SLURM_STEP_ID=18
> > SLURM_STEP_ID=18
> > SLURM_STEP_ID=18
> > SLURM_STEP_ID=18
> > SLURM_STEPID=18
> > SLURM_STEPID=18
> > SLURM_STEPID=18
> > SLURM_STEPID=18
> > SLURM_STEP_LAUNCHER_PORT=54184
> > SLURM_STEP_LAUNCHER_PORT=54184
> > SLURM_STEP_LAUNCHER_PORT=54184
> > SLURM_STEP_LAUNCHER_PORT=54184
> 125,128c125,128
> < SLURM_TASK_PID=3899
> < SLURM_TASK_PID=3907
> < SLURM_TASK_PID=3908
> < SLURM_TASK_PID=3997
> ---
> > SLURM_TASK_PID=3924
> > SLURM_TASK_PID=3933
> > SLURM_TASK_PID=3934
> > SLURM_TASK_PID=4039
> $
> 
> They're identical except for per-step values (ports, PIDs, etc.) -- these 
> differences are expected.
> 
> What version of OMPI are you running?  What happens if you repeat this 
> experiment?
> 
> I would find it very strange if Open MPI's mpirun is filtering some SLURM env 
> variables to some processes and not to all -- your output shows disparate 
> output between the different processes.  That's just plain weird.
> 
> 
> 
> On Feb 23, 2011, at 12:05 PM, Henderson, Brent wrote:
> 
> > SLURM seems to be doing this in the case of a regular srun:
> >
> > [brent@node1 mpi]$ srun -N 2 -n 4 env | egrep 
> > SLURM_NODEID\|SLURM_PROCID\|SLURM_LOCALID | sort
> > SLURM_LOCALID=0
> > SLURM_LOCALID=0
> > SLURM_LOCALID=1
> > SLURM_LOCALID=1
> > SLURM_NODEID=0
> > SLURM_NODEID=0
> > SLURM_NODEID=1
> > SLURM_NODEID=1
> > SLURM_PROCID=0
> > SLURM_PROCID=1
> > SLURM_PROCID=2
> > SLURM_PROCID=3
> > [brent@node1 mpi]$
> >
> > Since srun is not supported currently by OpenMPI, I have to use salloc – 
> > right?  In this case, it is up to OpenMPI to interpret the SLURM 
> > environment variables it sees in the one process that is launched and ‘do 
> > the right thing’ – whatever that means in this case.  How does OpenMPI 
> > start the processes on the remote nodes under the covers?  (use srun, 
> > generate a hostfile and launch as you would outside SLURM, …)  This may be 
> > the difference between HP-MPI and OpenMPI.
> >
> > Thanks,
> >
> > Brent
> >
> >
> > From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> > Behalf Of Ralph Castain
> > Sent: Wednesday, Februar

Re: [OMPI users] multicast not available

2011-02-24 Thread Ralph Castain
If you are trying to use OMPI as the base for ORCM, then you can tell ORCM
to use OMPI's "tcp" multicast module - it fakes multicast using pt-2-pt tcp
messaging.

-mca rmcast tcp

will do the trick.


On Thu, Feb 24, 2011 at 6:27 AM, Jeff Squyres  wrote:

> I'm still not sure what you're asking -- are you asking how to get Open MPI
> to work if multicast is disabled in your network?
>
> If so, not to worry; Open MPI doesn't currently use multicast.
>
>
> On Feb 24, 2011, at 6:39 AM, Vasiliy G Tolstov wrote:
>
> > On Thu, 2011-02-24 at 06:32 -0500, Jeff Squyres (jsquyres) wrote:
> >> I'm not sure what you're asking. Open MPI should work just fine in a Xen
> environment.
> >>
> >> If you're unsure about how to use the MPI API, you might want to take a
> tutorial to get you familiar with MPI concepts, etc. Google around; there
> are a bunch available. My personal favorite is at the UIUC NCSA web site; if
> you sign up for a free account, they have a beginners and intermediate MPI
> tutorials available.
> >>
> >> Sent from my PDA. No type good.
> >>
> >> On Feb 24, 2011, at 4:37 AM, "Vasiliy G Tolstov" 
> wrote:
> >
> > This is not xen specific problem. but routing specific. Admins deny
> > multicast packets.
> >
> > --
> > Vasiliy G Tolstov 
> > Selfip.Ru
> >
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] SLURM environment variables at runtime

2011-02-24 Thread Ralph Castain
On Thu, Feb 24, 2011 at 8:30 AM, Jeff Squyres  wrote:

> The weird thing is that when running his test, he saw different results
> with HP MPI vs. Open MPI.
>

It sounded quite likely that HP MPI is picking up and moving the envars
itself - that possibility was implied, but not clearly stated.


>
> What his test didn't say was whether those were the same exact nodes or
> not.  It would be good to repeat my experiment with the same exact nodes
> (e.g., inside one SLURM salloc job, or use the -w param to specify the same
> nodes for salloc for OMPI and srun for HP MPI).
>

We should note that you -can- directly srun an OMPI job now. I believe that
capability was released in the 1.5 series. It takes a minimum slurm release
level plus a slurm configuration setting to do so.



>
>
> On Feb 24, 2011, at 10:02 AM, Ralph Castain wrote:
>
> > Like I said, this isn't an OMPI problem. You have your slurm configured
> to pass certain envars to the remote nodes, and Brent doesn't. It truly is
> just that simple.
> >
> > I've seen this before with other slurm installations. Which envars get
> set on the backend is configurable, that's all.
> >
> > Has nothing to do with OMPI.
> >
> >
> > On Thu, Feb 24, 2011 at 7:18 AM, Jeff Squyres 
> wrote:
> > I'm afraid I don't see the problem.  Let's get 4 nodes from slurm:
> >
> > $ salloc -N 4
> >
> > Now let's run env and see what SLURM_ env variables we see:
> >
> > $ srun env | egrep ^SLURM_ | head
> > SLURM_JOB_ID=95523
> > SLURM_JOB_NUM_NODES=4
> > SLURM_JOB_NODELIST=svbu-mpi[001-004]
> > SLURM_JOB_CPUS_PER_NODE=4(x4)
> > SLURM_JOBID=95523
> > SLURM_NNODES=4
> > SLURM_NODELIST=svbu-mpi[001-004]
> > SLURM_TASKS_PER_NODE=1(x4)
> > SLURM_PRIO_PROCESS=0
> > SLURM_UMASK=0002
> > $ srun env | egrep ^SLURM_ | wc -l
> > 144
> >
> > Good -- there's 144 of them.  Let's save them to a file for comparison,
> later.
> >
> > $ srun env | egrep ^SLURM_ | sort > srun.out
> >
> > Now let's repeat the process with mpirun.  Note that mpirun defaults to
> running one process per core (vs. srun's default of running one per node).
>  So let's tone mpirun down to use one process per node and look for the
> SLURM_ env variables.
> >
> > $ mpirun -np 4 --bynode env | egrep ^SLURM_ | head
> > SLURM_JOB_ID=95523
> > SLURM_JOB_NUM_NODES=4
> > SLURM_JOB_NODELIST=svbu-mpi[001-004]
> > SLURM_JOB_ID=95523
> > SLURM_JOB_NUM_NODES=4
> > SLURM_JOB_CPUS_PER_NODE=4(x4)
> > SLURM_JOBID=95523
> > SLURM_NNODES=4
> > SLURM_NODELIST=svbu-mpi[001-004]
> > SLURM_TASKS_PER_NODE=1(x4)
> > $ mpirun -np 4 --bynode env | egrep ^SLURM_ | wc -l
> > 144
> >
> > Good -- we also got 144.  Save them to a file.
> >
> > $ mpirun -np 4 --bynode env | egrep ^SLURM_ | sort > mpirun.out
> >
> > Now let's compare what we got from srun and from mpirun:
> >
> > $ diff srun.out mpirun.out
> > 93,108c93,108
> > < SLURM_SRUN_COMM_PORT=33571
> > < SLURM_SRUN_COMM_PORT=33571
> > < SLURM_SRUN_COMM_PORT=33571
> > < SLURM_SRUN_COMM_PORT=33571
> > < SLURM_STEP_ID=15
> > < SLURM_STEP_ID=15
> > < SLURM_STEP_ID=15
> > < SLURM_STEP_ID=15
> > < SLURM_STEPID=15
> > < SLURM_STEPID=15
> > < SLURM_STEPID=15
> > < SLURM_STEPID=15
> > < SLURM_STEP_LAUNCHER_PORT=33571
> > < SLURM_STEP_LAUNCHER_PORT=33571
> > < SLURM_STEP_LAUNCHER_PORT=33571
> > < SLURM_STEP_LAUNCHER_PORT=33571
> > ---
> > > SLURM_SRUN_COMM_PORT=54184
> > > SLURM_SRUN_COMM_PORT=54184
> > > SLURM_SRUN_COMM_PORT=54184
> > > SLURM_SRUN_COMM_PORT=54184
> > > SLURM_STEP_ID=18
> > > SLURM_STEP_ID=18
> > > SLURM_STEP_ID=18
> > > SLURM_STEP_ID=18
> > > SLURM_STEPID=18
> > > SLURM_STEPID=18
> > > SLURM_STEPID=18
> > > SLURM_STEPID=18
> > > SLURM_STEP_LAUNCHER_PORT=54184
> > > SLURM_STEP_LAUNCHER_PORT=54184
> > > SLURM_STEP_LAUNCHER_PORT=54184
> > > SLURM_STEP_LAUNCHER_PORT=54184
> > 125,128c125,128
> > < SLURM_TASK_PID=3899
> > < SLURM_TASK_PID=3907
> > < SLURM_TASK_PID=3908
> > < SLURM_TASK_PID=3997
> > ---
> > > SLURM_TASK_PID=3924
> > > SLURM_TASK_PID=3933
> > > SLURM_TASK_PID=3934
> > > SLURM_TASK_PID=4039
> > $
> >
> > They're identical except for per-step values (ports, PIDs, etc.) -- these
> differences are expected.
> >
> > What version of OMPI are you running?  What happens if you repeat this
> experiment?
> >
> > I would find it very strange if Open MPI's mpirun is filtering some SLURM
> env variables to some processes and not to all -- your output shows
> disparate output between the different processes.  That's just plain weird.
> >
> >
> >
> > On Feb 23, 2011, at 12:05 PM, Henderson, Brent wrote:
> >
> > > SLURM seems to be doing this in the case of a regular srun:
> > >
> > > [brent@node1 mpi]$ srun -N 2 -n 4 env | egrep
> SLURM_NODEID\|SLURM_PROCID\|SLURM_LOCALID | sort
> > > SLURM_LOCALID=0
> > > SLURM_LOCALID=0
> > > SLURM_LOCALID=1
> > > SLURM_LOCALID=1
> > > SLURM_NODEID=0
> > > SLURM_NODEID=0
> > > SLURM_NODEID=1
> > > SLURM_NODEID=1
> > > SLURM_PROCID=0
> > > SLURM_PROCID=1
> > > SLURM_PROCID=2
> > > SLURM_PROCID=3
> > > [brent@node1 mpi]$
> > >
> > > Since srun is not supported

Re: [OMPI users] SLURM environment variables at runtime

2011-02-24 Thread Henderson, Brent
I'm running OpenMPI v1.4.3 and slurm v2.2.1.  I built both with the default 
configuration except setting the prefix.  The tests were run on the exact same 
nodes (I only have two).

When I run the test you outline below, I am still missing a bunch of env 
variables with OpenMPI.  I ran the extra test of using HP-MPI and they are all 
present as with the srun invocation.  I don't know if this is my slurm setup or 
not, but I find this really weird.  If anyone knows the magic to make the fix 
that Ralph is referring to, I'd appreciate a pointer.

My guess was that there is a subtle way that the launch differs between the two 
products.  But, since it works for Jeff, maybe there really is a slurm option 
that I need to compile in or set to make this work the way I want.  It is not 
as simple as HP-MPI moving the environment variables itself as some of the 
numbers will change per process created on the remote nodes.

Thanks,

Brent

[brent@node2 mpi]$ salloc -N 2
salloc: Granted job allocation 29
[brent@node2 mpi]$ srun env | egrep ^SLURM_ | head
SLURM_NODELIST=node[1-2]
SLURM_NNODES=2
SLURM_JOBID=29
SLURM_TASKS_PER_NODE=1(x2)
SLURM_JOB_ID=29
SLURM_NODELIST=node[1-2]
SLURM_NNODES=2
SLURM_JOBID=29
SLURM_TASKS_PER_NODE=1(x2)
SLURM_JOB_ID=29
[brent@node2 mpi]$ srun env | egrep ^SLURM_ | wc -l
66
[brent@node2 mpi]$ srun env | egrep ^SLURM_ | sort > srun.out
[brent@node2 mpi]$ which mpirun
~/bin/openmpi143/bin/mpirun
[brent@node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | head
SLURM_NODELIST=node[1-2]
SLURM_NNODES=2
SLURM_JOBID=29
SLURM_TASKS_PER_NODE=8(x2)
SLURM_JOB_ID=29
SLURM_SUBMIT_DIR=/mnt/node1/home/brent/src/mpi
SLURM_JOB_NODELIST=node[1-2]
SLURM_JOB_CPUS_PER_NODE=8(x2)
SLURM_JOB_NUM_NODES=2
SLURM_NODELIST=node[1-2]
[brent@node2 mpi]$ which mpirun
~/bin/openmpi143/bin/mpirun
[brent@node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | wc -l
42  <-- note, not 66 as above!
[brent@node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | sort > mpirun.out
[brent@node2 mpi]$ diff srun.out mpirun.out
2d1
< SLURM_CHECKPOINT_IMAGE_DIR=/mnt/node1/home/brent/src/mpi
4,5d2
< SLURM_CPUS_ON_NODE=8
< SLURM_CPUS_PER_TASK=1
8d4
< SLURM_DISTRIBUTION=cyclic
10d5
< SLURM_GTIDS=1
22,23d16
< SLURM_LAUNCH_NODE_IPADDR=10.0.205.134
< SLURM_LOCALID=0
25c18
< SLURM_NNODES=2
---
> SLURM_NNODES=1
28d20
< SLURM_NODEID=1
31,35c23,24
< SLURM_NPROCS=2
< SLURM_NPROCS=2
< SLURM_NTASKS=2
< SLURM_NTASKS=2
< SLURM_PRIO_PROCESS=0
---
> SLURM_NPROCS=1
> SLURM_NTASKS=1
38d26
< SLURM_PROCID=1
40,56c28,35
< SLURM_SRUN_COMM_HOST=10.0.205.134
< SLURM_SRUN_COMM_PORT=43247
< SLURM_SRUN_COMM_PORT=43247
< SLURM_STEP_ID=2
< SLURM_STEP_ID=2
< SLURM_STEPID=2
< SLURM_STEPID=2
< SLURM_STEP_LAUNCHER_PORT=43247
< SLURM_STEP_LAUNCHER_PORT=43247
< SLURM_STEP_NODELIST=node[1-2]
< SLURM_STEP_NODELIST=node[1-2]
< SLURM_STEP_NUM_NODES=2
< SLURM_STEP_NUM_NODES=2
< SLURM_STEP_NUM_TASKS=2
< SLURM_STEP_NUM_TASKS=2
< SLURM_STEP_TASKS_PER_NODE=1(x2)
< SLURM_STEP_TASKS_PER_NODE=1(x2)
---
> SLURM_SRUN_COMM_PORT=45154
> SLURM_STEP_ID=5
> SLURM_STEPID=5
> SLURM_STEP_LAUNCHER_PORT=45154
> SLURM_STEP_NODELIST=node1
> SLURM_STEP_NUM_NODES=1
> SLURM_STEP_NUM_TASKS=1
> SLURM_STEP_TASKS_PER_NODE=1
59,62c38,40
< SLURM_TASK_PID=1381
< SLURM_TASK_PID=2288
< SLURM_TASKS_PER_NODE=1(x2)
< SLURM_TASKS_PER_NODE=1(x2)
---
> SLURM_TASK_PID=1429
> SLURM_TASKS_PER_NODE=1
> SLURM_TASKS_PER_NODE=8(x2)
64,65d41
< SLURM_TOPOLOGY_ADDR=node2
< SLURM_TOPOLOGY_ADDR_PATTERN=node
[brent@node2 mpi]$
[brent@node2 mpi]$
[brent@node2 mpi]$
[brent@node2 mpi]$
[brent@node2 mpi]$ /opt/hpmpi/bin/mpirun -srun -n 2 -N 2 env | egrep ^SLURM_ | sort > hpmpi.out
[brent@node2 mpi]$ diff srun.out hpmpi.out
20a21,22
> SLURM_KILL_BAD_EXIT=1
> SLURM_KILL_BAD_EXIT=1
41,48c43,50
< SLURM_SRUN_COMM_PORT=43247
< SLURM_SRUN_COMM_PORT=43247
< SLURM_STEP_ID=2
< SLURM_STEP_ID=2
< SLURM_STEPID=2
< SLURM_STEPID=2
< SLURM_STEP_LAUNCHER_PORT=43247
< SLURM_STEP_LAUNCHER_PORT=43247
---
> SLURM_SRUN_COMM_PORT=33347
> SLURM_SRUN_COMM_PORT=33347
> SLURM_STEP_ID=8
> SLURM_STEP_ID=8
> SLURM_STEPID=8
> SLURM_STEPID=8
> SLURM_STEP_LAUNCHER_PORT=33347
> SLURM_STEP_LAUNCHER_PORT=33347
59,60c61,62
< SLURM_TASK_PID=1381
< SLURM_TASK_PID=2288
---
> SLURM_TASK_PID=1592
> SLURM_TASK_PID=2590
[brent@node2 mpi]$
[brent@node2 mpi]$
[brent@node2 mpi]$ grep SLURM_PROCID srun.out
SLURM_PROCID=0
SLURM_PROCID=1
[brent@node2 mpi]$ grep SLURM_PROCID mpirun.out
SLURM_PROCID=0
[brent@node2 mpi]$ grep SLURM_PROCID hpmpi.out
SLURM_PROCID=0
SLURM_PROCID=1
[brent@node2 mpi]$


> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Jeff Squyres
> Sent: Thursday, February 24, 2011 9:31 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] SLURM environment variables at runtime
>
> The weird thing is that when running his test, he saw different results
> with HP MPI vs. Open MPI.
>
> What his test didn't say was whether those were the same exact nodes or
> not.  It would be good to repeat my experiment 

Re: [OMPI users] SLURM environment variables at runtime

2011-02-24 Thread Ralph Castain
I would talk to the slurm folks about it - I don't know anything about the
internals of HP-MPI, but I do know the relevant OMPI internals. OMPI doesn't
do anything with respect to the envars. We just use "srun -hostlist "
to launch the daemons. Each daemon subsequently gets a message telling it
what local procs to run, and then fork/exec's those procs. The environment
set for those procs is a copy of that given to the daemon, including any and
all slurm values.

So whatever slurm sets, your procs get.

My guess is that HP-MPI is doing something with the envars to create the
difference.

As for running OMPI procs directly from srun: the slurm folks put out a faq
(or its equivalent) on it, I believe. I don't recall the details (even
though I wrote the integration...). If you google our user and/or devel
mailing lists, though, you'll see threads discussing it. Look for "slurmd"
in the text - that's the ORTE integration module for that feature.
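
For reference, a hedged sketch of that kind of setup, assuming SLURM's
reserved-ports support (the srun --resv-ports option together with the
MpiParams setting in slurm.conf); the port range is an arbitrary example and
the exact requirements depend on the SLURM and Open MPI versions in use:

# slurm.conf: reserve a port range for Open MPI's out-of-band communication
MpiParams=ports=12000-12999

# launch the MPI executable directly under srun, reserving ports for the step
srun --resv-ports -n 4 ./my_mpi_app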



On Thu, Feb 24, 2011 at 8:55 AM, Henderson, Brent wrote:

> I'm running OpenMPI v1.4.3 and slurm v2.2.1.  I built both with the default
> configuration except setting the prefix.  The tests were run on the exact
> same nodes (I only have two).
>
> When I run the test you outline below, I am still missing a bunch of env
> variables with OpenMPI.  I ran the extra test of using HP-MPI and they are
> all present as with the srun invocation.  I don't know if this is my slurm
> setup or not, but I find this really weird.  If anyone knows the magic to
> make the fix that Ralph is referring to, I'd appreciate a pointer.
>
> My guess was that there is a subtle way that the launch differs between the
> two products.  But, since it works for Jeff, maybe there really is a slurm
> option that I need to compile in or set to make this work the way I want.
>  It is not as simple as HP-MPI moving the environment variables itself as
> some of the numbers will change per process created on the remote nodes.
>
> Thanks,
>
> Brent
>
> [brent@node2 mpi]$ salloc -N 2
> salloc: Granted job allocation 29
> [brent@node2 mpi]$ srun env | egrep ^SLURM_ | head
> SLURM_NODELIST=node[1-2]
> SLURM_NNODES=2
> SLURM_JOBID=29
> SLURM_TASKS_PER_NODE=1(x2)
> SLURM_JOB_ID=29
> SLURM_NODELIST=node[1-2]
> SLURM_NNODES=2
> SLURM_JOBID=29
> SLURM_TASKS_PER_NODE=1(x2)
> SLURM_JOB_ID=29
> [brent@node2 mpi]$ srun env | egrep ^SLURM_ | wc -l
> 66
> [brent@node2 mpi]$ srun env | egrep ^SLURM_ | sort > srun.out
> [brent@node2 mpi]$ which mpirun
> ~/bin/openmpi143/bin/mpirun
> [brent@node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | head
> SLURM_NODELIST=node[1-2]
> SLURM_NNODES=2
> SLURM_JOBID=29
> SLURM_TASKS_PER_NODE=8(x2)
> SLURM_JOB_ID=29
> SLURM_SUBMIT_DIR=/mnt/node1/home/brent/src/mpi
> SLURM_JOB_NODELIST=node[1-2]
> SLURM_JOB_CPUS_PER_NODE=8(x2)
> SLURM_JOB_NUM_NODES=2
> SLURM_NODELIST=node[1-2]
> [brent@node2 mpi]$ which mpirun
> ~/bin/openmpi143/bin/mpirun
> [brent@node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | wc -l
> 42  <-- note, not 66 as above!
> [brent@node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | sort >
> mpirun.out
> [brent@node2 mpi]$ diff srun.out mpirun.out
> 2d1
> < SLURM_CHECKPOINT_IMAGE_DIR=/mnt/node1/home/brent/src/mpi
> 4,5d2
> < SLURM_CPUS_ON_NODE=8
> < SLURM_CPUS_PER_TASK=1
> 8d4
> < SLURM_DISTRIBUTION=cyclic
> 10d5
> < SLURM_GTIDS=1
> 22,23d16
> < SLURM_LAUNCH_NODE_IPADDR=10.0.205.134
> < SLURM_LOCALID=0
> 25c18
> < SLURM_NNODES=2
> ---
> > SLURM_NNODES=1
> 28d20
> < SLURM_NODEID=1
> 31,35c23,24
> < SLURM_NPROCS=2
> < SLURM_NPROCS=2
> < SLURM_NTASKS=2
> < SLURM_NTASKS=2
> < SLURM_PRIO_PROCESS=0
> ---
> > SLURM_NPROCS=1
> > SLURM_NTASKS=1
> 38d26
> < SLURM_PROCID=1
> 40,56c28,35
> < SLURM_SRUN_COMM_HOST=10.0.205.134
> < SLURM_SRUN_COMM_PORT=43247
> < SLURM_SRUN_COMM_PORT=43247
> < SLURM_STEP_ID=2
> < SLURM_STEP_ID=2
> < SLURM_STEPID=2
> < SLURM_STEPID=2
> < SLURM_STEP_LAUNCHER_PORT=43247
> < SLURM_STEP_LAUNCHER_PORT=43247
> < SLURM_STEP_NODELIST=node[1-2]
> < SLURM_STEP_NODELIST=node[1-2]
> < SLURM_STEP_NUM_NODES=2
> < SLURM_STEP_NUM_NODES=2
> < SLURM_STEP_NUM_TASKS=2
> < SLURM_STEP_NUM_TASKS=2
> < SLURM_STEP_TASKS_PER_NODE=1(x2)
> < SLURM_STEP_TASKS_PER_NODE=1(x2)
> ---
> > SLURM_SRUN_COMM_PORT=45154
> > SLURM_STEP_ID=5
> > SLURM_STEPID=5
> > SLURM_STEP_LAUNCHER_PORT=45154
> > SLURM_STEP_NODELIST=node1
> > SLURM_STEP_NUM_NODES=1
> > SLURM_STEP_NUM_TASKS=1
> > SLURM_STEP_TASKS_PER_NODE=1
> 59,62c38,40
> < SLURM_TASK_PID=1381
> < SLURM_TASK_PID=2288
> < SLURM_TASKS_PER_NODE=1(x2)
> < SLURM_TASKS_PER_NODE=1(x2)
> ---
> > SLURM_TASK_PID=1429
> > SLURM_TASKS_PER_NODE=1
> > SLURM_TASKS_PER_NODE=8(x2)
> 64,65d41
> < SLURM_TOPOLOGY_ADDR=node2
> < SLURM_TOPOLOGY_ADDR_PATTERN=node
> [brent@node2 mpi]$
> [brent@node2 mpi]$
> [brent@node2 mpi]$
> [brent@node2 mpi]$
> [brent@node2 mpi]$ /opt/hpmpi/bin/mpirun -srun -n 2 -N 2 env | egrep
> ^SLURM_ | sort > hpmpi.out
> [brent@node2 mpi]$ diff srun.out hpmpi.out
> 20a21,22
> > SLURM_KILL_BAD_EXIT=1
> > SLURM_KILL_BAD_EXIT=1
> 41

[OMPI users] Setup of a two nodes cluster?

2011-02-24 Thread Xianglong Kong
Hi, all,

I asked for help with a code problem here a few days ago (
http://www.open-mpi.org/community/lists/users/2011/02/15656.php ).
I then found that the code runs without any issue on another cluster, so I
suspected that something might be wrong with my cluster environment
configuration. I therefore reconfigured NFS, SSH, and the other related
pieces and reinstalled the Open MPI library. The cluster consists of two
desktops connected by a crossover cable. Both desktops have an Intel Core 2
Duo CPU and run Ubuntu 10.04 LTS, and the version of Open MPI installed on
the NFS share (located on the master node) is 1.4.3.

Now things seem to be getting worse. I can't successfully run any code that
is more complicated than the "MPI hello world". But if all of the processes
are launched on the same node, the code runs without any issue.

For example, the following code (only one line added to the "MPI hello
world") crashes at the MPI_Barrier. However, if I delete the MPI_Barrier
line, the code runs successfully.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char** argv) {

    int myrank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    printf("First hello from processor %d of %d\n", myrank, nprocs);

    MPI_Barrier(MPI_COMM_WORLD);

    printf("Second hello from processor %d of %d\n", myrank, nprocs);

    MPI_Finalize();
    return 0;
}


The output of the above code is:

[kongdragon-master:16119] *** An error occurred in MPI_Barrier
[kongdragon-master:16119] *** on communicator MPI_COMM_WORLD
[kongdragon-master:16119] *** MPI_ERR_IN_STATUS: error code in status
[kongdragon-master:16119] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
First hello from processor 0 of 2
--
mpirun has exited due to process rank 0 with PID 16119 on
node kongdragon-master exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--
First hello from processor 1 of 2



Can anyone help point out why this doesn't work?

Thanks!


Kong


Re: [OMPI users] SLURM environment variables at runtime

2011-02-24 Thread Henderson, Brent
Sorry Ralph, I have to respectfully disagree with you on this one.   I believe 
that the output below shows that the issue is that the two different MPIs 
launch things differently.   On one node, I ran:

[brent@node2 mpi]$ which mpirun
~/bin/openmpi143/bin/mpirun
[brent@node2 mpi]$ mpirun -np 4 --bynode sleep 300

And then checked the process tree on the remote node:

[brent@node1 mpi]$ ps -fu brent
UIDPID  PPID  C STIME TTY  TIME CMD
brent 1709  1706  0 10:00 ?00:00:00 /mnt/node1/home/brent/bin/openmpi143/bin/orted -mca
brent 1712  1709  0 10:00 ?00:00:00 sleep 300
brent 1713  1709  0 10:00 ?00:00:00 sleep 300
brent 1714 18458  0 10:00 pts/000:00:00 ps -fu brent
brent13282 13281  0 Feb17 pts/000:00:00 -bash
brent18458 13282  0 Feb23 pts/000:00:00 -csh
[brent@node1 mpi]$ ps -fp 1706
UIDPID  PPID  C STIME TTY  TIME CMD
root  1706 1  0 10:00 ?00:00:00 slurmstepd: [29.9]
[brent@node1 mpi]$

Note that the parent of the sleep processes is orted and that orted was started 
by slurmstepd.  Unless orted is updating the slurm variables for the children 
(which is doubtful), they will not contain the specific settings that I see 
when I run srun directly.  I launch with HP-MPI like this:

[brent@node2 mpi]$ /opt/hpmpi/bin/mpirun -srun -N 2 -n 4 sleep 300

I then see the following in the process tree on the remote node:

[brent@node1 mpi]$ ps -fu brent
UIDPID  PPID  C STIME TTY  TIME CMD
brent 1741  1738  0 10:02 ?00:00:00 /bin/sleep 300
brent 1742  1738  0 10:02 ?00:00:00 /bin/sleep 300
brent 1745 18458  0 10:02 pts/000:00:00 ps -fu brent
brent13282 13281  0 Feb17 pts/000:00:00 -bash
brent18458 13282  0 Feb23 pts/000:00:00 -csh
[brent@node1 mpi]$ ps -fp 1738
UIDPID  PPID  C STIME TTY  TIME CMD
root  1738 1  0 10:02 ?00:00:00 slurmstepd: [29.10]
[brent@node1 mpi]$

Since the parent of both of the sleep processes is slurmstepd, it is setting 
things up as I would expect.  This lineage is the same as I find by running 
srun directly.

Now, the question still is, why does this work for Jeff?  :)  Is there a way to 
get orted out of the way so the sleep processes are launched directly by srun?

brent




From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ralph Castain
Sent: Thursday, February 24, 2011 10:05 AM
To: Open MPI Users
Subject: Re: [OMPI users] SLURM environment variables at runtime

I would talk to the slurm folks about it - I don't know anything about the 
internals of HP-MPI, but I do know the relevant OMPI internals. OMPI doesn't do 
anything with respect to the envars. We just use "srun -hostlist " to 
launch the daemons. Each daemon subsequently gets a message telling it what 
local procs to run, and then fork/exec's those procs. The environment set for 
those procs is a copy of that given to the daemon, including any and all slurm 
values.

So whatever slurm sets, your procs get.

My guess is that HP-MPI is doing something with the envars to create the 
difference.

As for running OMPI procs directly from srun: the slurm folks put out a faq (or 
its equivalent) on it, I believe. I don't recall the details (even though I 
wrote the integration...). If you google our user and/or devel mailing lists, 
though, you'll see threads discussing it. Look for "slurmd" in the text - 
that's the ORTE integration module for that feature.


On Thu, Feb 24, 2011 at 8:55 AM, Henderson, Brent  
wrote:
I'm running OpenMPI v1.4.3 and slurm v2.2.1.  I built both with the default 
configuration except setting the prefix.  The tests were run on the exact same 
nodes (I only have two).

When I run the test you outline below, I am still missing a bunch of env 
variables with OpenMPI.  I ran the extra test of using HP-MPI and they are all 
present as with the srun invocation.  I don't know if this is my slurm setup or 
not, but I find this really weird.  If anyone knows the magic to make the fix 
that Ralph is referring to, I'd appreciate a pointer.

My guess was that there is a subtle way that the launch differs between the two 
products.  But, since it works for Jeff, maybe there really is a slurm option 
that I need to compile in or set to make this work the way I want.  It is not 
as simple as HP-MPI moving the environment variables itself as some of the 
numbers will change per process created on the remote nodes.

Thanks,

Brent

[brent@node2 mpi]$ salloc -N 2
salloc: Granted job allocation 29
[brent@node2 mpi]$ srun env | egrep ^SLURM_ | head
SLURM_NODELIST=node[1-2]
SLURM_NNODES=2
SLURM_JOBID=29
SLURM_TASKS_PER_NODE=1(x2)
SLURM_JOB_ID=29
SLURM_NODELIST=node[1-2]
SLURM_NNODES=2
SLURM_JOBID=29
SLURM_TASKS_PER_NODE=1(x2)
SLURM_JOB_ID=29
[brent@node2 mpi]$ srun env | egrep ^SLURM_ | wc -l
66
[brent@node2 mpi]$ srun env | egrep ^SLURM_ | sort > srun.out
[brent@node2 mpi]$ which mpirun
~/bi

Re: [OMPI users] SLURM environment variables at runtime

2011-02-24 Thread Jeff Squyres
FWIW, I'm running Slurm 2.1.0 -- I haven't updated to 2.2.x. yet.

Just to be sure, I re-ran my test with OMPI 1.4.3 (I was using the OMPI 
development SVN trunk before) and got the same results:


$ srun env | egrep ^SLURM_ | wc -l
144
$ mpirun -np 4 --bynode env | egrep ^SLURM_ | wc -l
144


I find it strange that "srun env ..." and HPMPI's "mpirun env..." return 
(effectively) the same results, but OMPI's "mpirun env ..." returns something 
different.

Perhaps SLURM changed something in 2.2.x...?  As Ralph mentioned, OMPI 
*shouldn't* be altering the environment w.r.t. SLURM variables that you get -- 
whatever SLURM sets, that's what you should get in an OMPI-launched process.



On Feb 24, 2011, at 10:55 AM, Henderson, Brent wrote:

> I'm running OpenMPI v1.4.3 and slurm v2.2.1.  I built both with the default 
> configuration except setting the prefix.  The tests were run on the exact 
> same nodes (I only have two).
> 
> When I run the test you outline below, I am still missing a bunch of env 
> variables with OpenMPI.  I ran the extra test of using HP-MPI and they are 
> all present as with the srun invocation.  I don't know if this is my slurm 
> setup or not, but I find this really weird.  If anyone knows the magic to 
> make the fix that Ralph is referring to, I'd appreciate a pointer.
> 
> My guess was that there is a subtle way that the launch differs between the 
> two products.  But, since it works for Jeff, maybe there really is a slurm 
> option that I need to compile in or set to make this work the way I want.  It 
> is not as simple as HP-MPI moving the environment variables itself as some of 
> the numbers will change per process created on the remote nodes.
> 
> Thanks,
> 
> Brent
> 
> [brent@node2 mpi]$ salloc -N 2
> salloc: Granted job allocation 29
> [brent@node2 mpi]$ srun env | egrep ^SLURM_ | head
> SLURM_NODELIST=node[1-2]
> SLURM_NNODES=2
> SLURM_JOBID=29
> SLURM_TASKS_PER_NODE=1(x2)
> SLURM_JOB_ID=29
> SLURM_NODELIST=node[1-2]
> SLURM_NNODES=2
> SLURM_JOBID=29
> SLURM_TASKS_PER_NODE=1(x2)
> SLURM_JOB_ID=29
> [brent@node2 mpi]$ srun env | egrep ^SLURM_ | wc -l
> 66
> [brent@node2 mpi]$ srun env | egrep ^SLURM_ | sort > srun.out
> [brent@node2 mpi]$ which mpirun
> ~/bin/openmpi143/bin/mpirun
> [brent@node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | head
> SLURM_NODELIST=node[1-2]
> SLURM_NNODES=2
> SLURM_JOBID=29
> SLURM_TASKS_PER_NODE=8(x2)
> SLURM_JOB_ID=29
> SLURM_SUBMIT_DIR=/mnt/node1/home/brent/src/mpi
> SLURM_JOB_NODELIST=node[1-2]
> SLURM_JOB_CPUS_PER_NODE=8(x2)
> SLURM_JOB_NUM_NODES=2
> SLURM_NODELIST=node[1-2]
> [brent@node2 mpi]$ which mpirun
> ~/bin/openmpi143/bin/mpirun
> [brent@node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | wc -l
> 42  <-- note, not 66 as above!
> [brent@node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | sort > 
> mpirun.out
> [brent@node2 mpi]$ diff srun.out mpirun.out
> 2d1
> < SLURM_CHECKPOINT_IMAGE_DIR=/mnt/node1/home/brent/src/mpi
> 4,5d2
> < SLURM_CPUS_ON_NODE=8
> < SLURM_CPUS_PER_TASK=1
> 8d4
> < SLURM_DISTRIBUTION=cyclic
> 10d5
> < SLURM_GTIDS=1
> 22,23d16
> < SLURM_LAUNCH_NODE_IPADDR=10.0.205.134
> < SLURM_LOCALID=0
> 25c18
> < SLURM_NNODES=2
> ---
>> SLURM_NNODES=1
> 28d20
> < SLURM_NODEID=1
> 31,35c23,24
> < SLURM_NPROCS=2
> < SLURM_NPROCS=2
> < SLURM_NTASKS=2
> < SLURM_NTASKS=2
> < SLURM_PRIO_PROCESS=0
> ---
>> SLURM_NPROCS=1
>> SLURM_NTASKS=1
> 38d26
> < SLURM_PROCID=1
> 40,56c28,35
> < SLURM_SRUN_COMM_HOST=10.0.205.134
> < SLURM_SRUN_COMM_PORT=43247
> < SLURM_SRUN_COMM_PORT=43247
> < SLURM_STEP_ID=2
> < SLURM_STEP_ID=2
> < SLURM_STEPID=2
> < SLURM_STEPID=2
> < SLURM_STEP_LAUNCHER_PORT=43247
> < SLURM_STEP_LAUNCHER_PORT=43247
> < SLURM_STEP_NODELIST=node[1-2]
> < SLURM_STEP_NODELIST=node[1-2]
> < SLURM_STEP_NUM_NODES=2
> < SLURM_STEP_NUM_NODES=2
> < SLURM_STEP_NUM_TASKS=2
> < SLURM_STEP_NUM_TASKS=2
> < SLURM_STEP_TASKS_PER_NODE=1(x2)
> < SLURM_STEP_TASKS_PER_NODE=1(x2)
> ---
>> SLURM_SRUN_COMM_PORT=45154
>> SLURM_STEP_ID=5
>> SLURM_STEPID=5
>> SLURM_STEP_LAUNCHER_PORT=45154
>> SLURM_STEP_NODELIST=node1
>> SLURM_STEP_NUM_NODES=1
>> SLURM_STEP_NUM_TASKS=1
>> SLURM_STEP_TASKS_PER_NODE=1
> 59,62c38,40
> < SLURM_TASK_PID=1381
> < SLURM_TASK_PID=2288
> < SLURM_TASKS_PER_NODE=1(x2)
> < SLURM_TASKS_PER_NODE=1(x2)
> ---
>> SLURM_TASK_PID=1429
>> SLURM_TASKS_PER_NODE=1
>> SLURM_TASKS_PER_NODE=8(x2)
> 64,65d41
> < SLURM_TOPOLOGY_ADDR=node2
> < SLURM_TOPOLOGY_ADDR_PATTERN=node
> [brent@node2 mpi]$
> [brent@node2 mpi]$
> [brent@node2 mpi]$
> [brent@node2 mpi]$
> [brent@node2 mpi]$ /opt/hpmpi/bin/mpirun -srun -n 2 -N 2 env | egrep ^SLURM_ 
> | sort > hpmpi.out
> [brent@node2 mpi]$ diff srun.out hpmpi.out
> 20a21,22
>> SLURM_KILL_BAD_EXIT=1
>> SLURM_KILL_BAD_EXIT=1
> 41,48c43,50
> < SLURM_SRUN_COMM_PORT=43247
> < SLURM_SRUN_COMM_PORT=43247
> < SLURM_STEP_ID=2
> < SLURM_STEP_ID=2
> < SLURM_STEPID=2
> < SLURM_STEPID=2
> < SLURM_STEP_LAUNCHER_PORT=43247
> < SLURM_STEP_LAUNCHER_PORT=43247
> ---
>> SLURM_SRUN_COMM_PORT=33347
>> S

Re: [OMPI users] SLURM environment variables at runtime

2011-02-24 Thread Jeff Squyres
On Feb 24, 2011, at 11:15 AM, Henderson, Brent wrote:

> Note that the parent of the sleep processes is orted and that orted was 
> started by slurmstepd.  Unless orted is updating the slurm variables for the 
> children (which is doubtful) then they will not contain the specific settings 
> that I see when I run srun directly.  

I'm not sure what you mean by that statement.  The orted passes its environment 
to its children; so whatever the slurm stepd set in the environment for the 
orted, the children should be getting.

Clearly, something is different here -- maybe we do have a bug -- but as you 
stated below, why does it work for me?  Is SLURM 2.2.x the difference?  I don't 
know.

> Now, the question still is, why does this work for Jeff?  :)  Is there a way 
> to get orted out of the way so the sleep processes are launched directly by 
> srun?

Yes; see Ralph's prior mail about direct srun support in Open MPI 1.5.x.  You 
lose some functionality / features that way, though.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] SLURM environment variables at runtime

2011-02-24 Thread Henderson, Brent
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Jeff Squyres
> Sent: Thursday, February 24, 2011 10:20 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] SLURM environment variables at runtime
> 
> On Feb 24, 2011, at 11:15 AM, Henderson, Brent wrote:
> 
> > Note that the parent of the sleep processes is orted and that orted
> was started by slurmstepd.  Unless orted is updating the slurm
> variables for the children (which is doubtful) then they will not
> contain the specific settings that I see when I run srun directly.
> 
> I'm not sure what you mean by that statement.  The orted passes its
> environment to its children; so whatever the slurm stepd set in the
> environment for the orted, the children should be getting.
> 

While you are correct that the environment is inherited by the children, 
sometimes that does not make sense.  Take SLURM_PROCID, for example.  If 
slurmstepd starts the orted and sets its SLURM_PROCID, then the child sleep 
processes (of orted) get that value as well, exactly as it is in orted.  That 
is misleading at best.  For example:

[brent@node2 mpi]$ mpirun -np 4 --bynode sleep 300

Then looking at the remote node:

[brent@node1 mpi]$ ps -fu brent
UIDPID  PPID  C STIME TTY  TIME CMD
brent 2853  2850  0 13:23 ?00:00:00 
/mnt/node1/home/brent/bin/openmpi143/bin/orted -mca
brent 2856  2853  0 13:23 ?00:00:00 sleep 300
brent 2857  2853  0 13:23 ?00:00:00 sleep 300
(snip)

And the SLURM_PROCID from each process:

[brent@node1 mpi]$ perl -p -e 's/\0/\n/g' /proc/2853/environ | egrep ^SLURM_ | 
grep PROCID
SLURM_PROCID=0
[brent@node1 mpi]$ perl -p -e 's/\0/\n/g' /proc/2856/environ | egrep ^SLURM_ | 
grep PROCID
SLURM_PROCID=0
[brent@node1 mpi]$ perl -p -e 's/\0/\n/g' /proc/2857/environ | egrep ^SLURM_ | 
grep PROCID
SLURM_PROCID=0
[brent@node1 mpi]$

They really can't be all SLURM_PROCID=0 - that is supposed to be unique for the 
job - right?  It appears that the SLURM_PROCID is inherited from the orted 
parent - which makes a fair amount of sense given how things are launched.  If 
I use HP-MPI, the slurmstepd starts each of the sleep processes and it does set 
SLURM_PROCID uniquely when launching each child.  This is the crux of my issue.

I did find that there are OMPI_* variables that I can map internally back to 
what I think the slurm variables should be:

[brent@node1 mpi]$ perl -p -e 's/\0/\n/g' /proc/2853/environ | egrep ^OMPI | 
grep WORLD
[brent@node1 mpi]$ perl -p -e 's/\0/\n/g' /proc/2856/environ | egrep ^OMPI | 
grep WORLD
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_COMM_WORLD_RANK=1
OMPI_COMM_WORLD_LOCAL_RANK=0
[brent@node1 mpi]$ perl -p -e 's/\0/\n/g' /proc/2857/environ | egrep ^OMPI | 
grep WORLD
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_COMM_WORLD_RANK=3
OMPI_COMM_WORLD_LOCAL_RANK=1
[brent@node1 mpi]$

So I think that if I combine some OMPI_* values with SLURM_* values, I should 
be o.k. for what I need.
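
For what it's worth, the kind of fallback I have in mind looks roughly like
the sketch below (something I typed up for this mail, not code I am actually
running; the SLURM_* fallback only makes sense for processes launched directly
by srun rather than by orted):

#include <stdio.h>
#include <stdlib.h>

/* Sketch: prefer the portable OMPI_COMM_WORLD_* variables when they are
   present (i.e. when launched through Open MPI's orted), and fall back to
   the SLURM_* variables when the process was started directly by srun. */
static int get_rank(void)
{
    const char *s = getenv("OMPI_COMM_WORLD_RANK");
    if (s == NULL)
        s = getenv("SLURM_PROCID");
    return s ? atoi(s) : -1;
}

static int get_size(void)
{
    const char *s = getenv("OMPI_COMM_WORLD_SIZE");
    if (s == NULL)
        s = getenv("SLURM_NTASKS");
    return s ? atoi(s) : -1;
}

int main(void)
{
    printf("rank %d of %d\n", get_rank(), get_size());
    return 0;
}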

Now to answer the other question - why some variables are missing.  It appears 
that when the orted processes are launched via srun - but only one per node - 
they run as a subset of the main allocation, and thus some of the environment 
variables are not the same (or are missing entirely) compared to launching the 
tasks directly with srun on the full allocation.  This also makes sense to me 
at some level, so I'm at peace with it now.  :)

> Clearly, something is different here -- maybe we do have a bug -- but
> as you stated below, why does it work for me?  Is SLURM 2.2.x the
> difference?  I don't know.
> 
I'm tempted to try the older version of slurm as this might be the cause of the 
missing environment variables, but that is an experiment for another day.  I'll 
see if I can make do with what I see currently.

> > Now, the question still is, why does this work for Jeff?  :)  Is
> there a way to get orted out of the way so the sleep processes are
> launched directly by srun?
> 
> Yes; see Ralph's prior mail about direct srun support in Open MPI
> 1.5.x.  You lose some functionality / features that way, though.
> 
Maybe that will be an answer, but I'll see if I can make things work with 1.4.3 
for now.

Last thing before I go.  Please let me apologize for not being clear on what I 
disagreed with Ralph about in my last note.  Clearly he nailed the orted 
launching process and spelled it out very clearly, but I don't believe that 
HP-MPI is doing anything special to copy/fix up the SLURM environment 
variables.  Hopefully that was clear from the body of that message.

I think we are done here, as I can now make something work with the various 
environment variables.  Many thanks to Jeff and Ralph for their suggestions 
and insight on this issue!

Brent




Re: [OMPI users] nonblock alternative to MPI_Win_complete

2011-02-24 Thread James Dinan

Just to follow up on Jeff's comments:

I'm a member of the MPI-3 RMA committee and we are working on improving 
the current state of the RMA spec.  Right now it's not possible to ask 
for local completion of specific RMA operations.  Part of the current 
RMA proposal is an extension that would allow you to ask for 
per-operation completion.  However, this is strictly in the context of 
passive mode RMA which is asynchronous.
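
For reference, a minimal sketch of what MPI-2 passive target already gives
you today - completion at the granularity of a whole per-target epoch rather
than per operation (put_one_task and its arguments are names I made up for
the sketch):

#include <mpi.h>

/* Sketch only: MPI-2 passive-target access.  MPI_Win_unlock completes the
   put at both the origin and the target, so buf may be reused afterwards --
   but only with whole-epoch, per-target granularity, not per operation. */
static void put_one_task(double *buf, int n, int target_rank, MPI_Win win)
{
    MPI_Win_lock(MPI_LOCK_SHARED, target_rank, 0, win);
    MPI_Put(buf, n, MPI_DOUBLE, target_rank, 0, n, MPI_DOUBLE, win);
    MPI_Win_unlock(target_rank, win);
    /* buf is now safe to refill with the next task for this target. */
}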


Active mode RMA implies an explicit synchronization across all processes 
involved in the communication before anything can complete.  This allows 
for very efficient implementation and use.  If you don't synchronize 
across all processes in the active communication group then active mode 
RMA is probably not the right construct for your algorithm; 
point-to-point send/recv synchronization/completion is still the right 
choice.
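
To make that concrete for the master/slave pattern quoted further down, a
minimal sketch with nonblocking sends might look like the following (buffer
size, slave ranks and tag are made up for the example; each slave would just
post a matching MPI_Recv, and MPI_Testany can replace MPI_Waitany if you only
want to poll):

#include <mpi.h>
#include <string.h>

#define TASK_BYTES 1024   /* made-up task size */

static void master(int ntasks)
{
    char buf[2][TASK_BYTES];
    MPI_Request req[2];
    int i, idx;

    /* Prime both slaves (ranks 1 and 2 in this sketch). */
    for (i = 0; i < 2; ++i) {
        memset(buf[i], 0, TASK_BYTES);             /* fill in task i here */
        MPI_Isend(buf[i], TASK_BYTES, MPI_BYTE, i + 1, 0,
                  MPI_COMM_WORLD, &req[i]);
    }

    /* Whichever send completes first frees its buffer; only that buffer is
       refilled and sent again -- there is no need to block on the other. */
    for (i = 2; i < ntasks; ++i) {
        MPI_Waitany(2, req, &idx, MPI_STATUS_IGNORE);
        memset(buf[idx], 0, TASK_BYTES);           /* fill in task i here */
        MPI_Isend(buf[idx], TASK_BYTES, MPI_BYTE, idx + 1, 0,
                  MPI_COMM_WORLD, &req[idx]);
    }
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}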


Jeff, I respectfully disagree on advising users to avoid MPI-2 RMA.  We 
unfortunately don't get to pitch the existing chapter and rewrite it for 
MPI-3.  Even if we could, I don't think we would because it's actually 
not that bad.  So, all of MPI-2 RMA will pass unscathed into MPI-3; 
anything you write now will still work under MPI-3.  Our work will be 
adding new constructs and improved semantics to RMA so that it is more 
featureful, flexible, and better performing.  I do grant that the spec 
is challenging to read, but there are much better books available to 
users who are interested in their algorithms and not MPI nuts and bolts.


Part of the reason why RMA hasn't enjoyed as much success as MPI 
two-sided is that users have been shy about using it, so implementers 
haven't prioritized it.  As a result, implementations aren't that great, 
so users avoid it, and the cycle continues.  So, please do use MPI RMA. 
 I'm all ears if you have any feedback.


All the best,
 ~Jim.

On 02/24/2011 07:36 AM, Jeff Squyres wrote:

I personally find the entire MPI one-sided chapter to be incredibly confusing 
and subject to arbitrary interpretation.  I have consistently advised people to 
not use it since the late '90s.

That being said, the MPI one-sided chapter is being overhauled in the MPI-3 
forum; the standardization process for that chapter is getting pretty close to 
consensus.  The new chapter is promised to be much better.

My $0.02 is that you might be better served staying away from the MPI-2 
one-sided stuff because of exactly the surprises and limitations that you've 
run into, and waiting for MPI-3 implementations for real one-sided support.


On Feb 24, 2011, at 8:21 AM, Toon Knapen wrote:


But that is what surprises me. Indeed the scenario I described can be 
implemented using two-sided communication, but it seems not to be possible when 
using one-sided communication.

Additionally, the MPI 2.2 standard describes on page 356 the matching rules for post and 
start, complete and wait, and there it says: "MPI_WIN_COMPLETE(win) initiate a 
nonblocking send with tag tag1 to each process in the group of the preceding start call. 
No need to wait for the completion of these sends."
The wording 'nonblocking send' startles me somehow!?

toon


On Thu, Feb 24, 2011 at 2:05 PM, James Dinan  wrote:
Hi Toon,

Can you use non-blocking send/recv?  It sounds like this will give you the 
completion semantics you want.

Best,
  ~Jim.


On 2/24/11 6:07 AM, Toon Knapen wrote:
In that case, I have a small question concerning design:
Suppose task-based parallellism where one node (master) distributes
work/tasks to 2 other nodes (slaves) by means of an MPI_Put. The master
allocates 2 buffers locally in which it will store all necessary data
that is needed by the slave to perform the task. So I do an MPI_Put on
each of my 2 buffers to send each buffer to a specific slave. Now I need
to know when I can reuse one of my buffers to already store the next
task (that I will MPI_Put later on). The only way to know this is call
MPI_Complete. But since this is blocking and if this buffer is not ready
to be reused yet, I can neither verify if the other buffer is already
available to me again (in the same thread).
I would very much appreciate input on how to solve such issue !
thanks in advance,
toon
On Tue, Feb 22, 2011 at 7:21 PM, Barrett, Brian W <bwba...@sandia.gov> wrote:

On Feb 18, 2011, at 8:59 AM, Toon Knapen wrote:

 >  (Probably this issue has been discussed at length before but
unfortunately I did not find any threads (on this site or anywhere
else) on this topic, if you are able to provide me with links to
earlier discussions on this topic, please do not hesitate)
 >
 >  Is there an alternative to MPI_Win_complete that does not
'enforce completion of preceding RMA calls at the origin' (as said
on page 353 of the mpi-2.2 standard)?
 >
 >  I would like to know if I can reuse the buffer I gave to MPI_Put
but without blocking on it, if the MPI lib is still using it, I want
to be able to continue (and use anoth

Re: [OMPI users] SLURM environment variables at runtime

2011-02-24 Thread Jeff Squyres
On Feb 24, 2011, at 2:59 PM, Henderson, Brent wrote:

> [snip]
> They really can't be all SLURM_PROCID=0 - that is supposed to be unique for 
> the job - right?  It appears that the SLURM_PROCID is inherited from the 
> orted parent - which makes a fair amount of sense given how things are 
> launched.  

That's correct, and I can agree with your sentiment.  

However, our design goals were to provide a consistent *Open MPI* experience 
across different launchers. Providing native access to the actual underlying 
launcher was a secondary goal.  Balancing those two, you can see why we chose 
the model we did: our orted provides  (nearly) the same functionality across 
all environments.  

In SLURM's case, we propagate [seemingly] non-sensical SLURM_PROCID values to 
the individual processes - but they only look non-sensical if you are making an 
assumption about how Open MPI is using SLURM's launcher.

More specifically, our goal is to provide consistent *Open MPI information* 
(e.g., through the OMPI_COMM_WORLD* env variables) -- not emulate what SLURM 
would have done if MPI processes had been launched individually through srun.  
Even more specifically: we don't think that the exact underlying launching 
mechanism that OMPI uses is of interest to most users; we encourage them to use 
our portable mechanisms that work even if they move to another cluster with a 
different launcher.  Admittedly, that does make it a little more challenging if 
you have to support multiple MPI implementations, and although that's an 
important consideration to us, it's not our first priority.

> Now to answer the other question - why are there some variables missing.  It 
> appears that when the orted processes are launched - via srun but only one 
> per node, it is a subset of the main allocation and thus some of the 
> environment variables are not the same (or missing entirely) as compared to 
> launching them directly with srun on the full allocation.  This also makes 
> sense to me at some level, so I'm at peace with it now.  :)

Ah, good.

> Last thing before I go.  Please let me apologize for not being clear on what 
> I disagreed with Ralph about in my last note.  Clearly he nailed the orted 
> launching process and spelled it out very clearly, but I don't believe that 
> HP-MPI is not doing anything special to copy/fix up the SLURM environment 
> variables.  Hopefully that was clear by the body of that message.  

No worries; you were perfectly clear.  Thanks!

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] MPI_ERR_IN_STATUS from MPI_Bcast?

2011-02-24 Thread George Bosilca
The issue has been identified deep in the tuned collective component. It was 
fixed in the trunk and in 1.5 a while back, but never pushed into the 1.4 
series. I attached a patch to the ticket and will force it into the next 1.4 
release.

  Thanks,
george.

On Feb 14, 2011, at 13:11 , Jeff Squyres wrote:

> Thanks Jeremiah; I filed the following ticket about this:
> 
>https://svn.open-mpi.org/trac/ompi/ticket/2723
> 
> 
> On Feb 10, 2011, at 3:24 PM, Jeremiah Willcock wrote:
> 
>> I forgot to mention that this was tested with 3 or 4 ranks, connected via 
>> TCP.
>> 
>> -- Jeremiah Willcock
>> 
>> On Thu, 10 Feb 2011, Jeremiah Willcock wrote:
>> 
>>> Here is a small test case that hits the bug on 1.4.1:
>>> 
>>> #include <mpi.h>
>>> 
>>> int arr[1142];
>>> 
>>> int main(int argc, char** argv) {
>>> int rank, my_size;
>>> MPI_Init(&argc, &argv);
>>> MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>> my_size = (rank == 1) ? 1142 : 1088;
>>> MPI_Bcast(arr, my_size, MPI_INT, 0, MPI_COMM_WORLD);
>>> MPI_Finalize();
>>> return 0;
>>> }
>>> 
>>> I tried it on 1.5.1, and I get MPI_ERR_TRUNCATE instead, so this might have 
>>> already been fixed.
>>> 
>>> -- Jeremiah Willcock
>>> 
>>> 
>>> On Thu, 10 Feb 2011, Jeremiah Willcock wrote:
>>> 
 FYI, I am having trouble finding a small test case that will trigger this 
 on 1.5; I'm either getting deadlocks or MPI_ERR_TRUNCATE, so it could have 
 been fixed.  What are the triggering rules for different broadcast 
 algorithms?  It could be that only certain sizes or only certain BTLs 
 trigger it.
 -- Jeremiah Willcock
 On Thu, 10 Feb 2011, Jeff Squyres wrote:
> Nifty!  Yes, I agree that that's a poor error message.  It's probably 
> (unfortunately) being propagated up from the underlying point-to-point 
> system, where an ERR_IN_STATUS would actually make sense.
> I'll file a ticket about this.  Thanks for the heads up.
> On Feb 9, 2011, at 4:49 PM, Jeremiah Willcock wrote:
>> On Wed, 9 Feb 2011, Jeremiah Willcock wrote:
>>> I get the following Open MPI error from 1.4.1:
>>> *** An error occurred in MPI_Bcast
>>> *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
>>> *** MPI_ERR_IN_STATUS: error code in status
>>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>>> (hostname and port removed from each line).  There is no MPI_Status 
>>> returned by MPI_Bcast, so I don't know what the error is?  Is this 
>>> something that people have seen before?
>> For the record, this appears to be caused by specifying inconsistent 
>> data sizes on the different ranks in the broadcast operation.  The error 
>> message could still be improved, though.
>> -- Jeremiah Willcock
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

"I disapprove of what you say, but I will defend to the death your right to say 
it"
  -- Evelyn Beatrice Hall




Re: [OMPI users] SLURM environment variables at runtime

2011-02-24 Thread Ralph Castain
I guess I wasn't clear earlier - I don't know anything about how HP-MPI
works. I was only theorizing that perhaps they did something different that
results in some other slurm vars showing up in Brent's tests. From Brent's
comments, I guess they don't - but they launch jobs in a different manner
that results in some difference in the slurm envars seen by application
procs.

I don't believe we have a bug in OMPI. What we have is behavior that
reflects how the proc is launched. If an app has integrated itself tightly
with slurm, then OMPI may not be a good choice - or they can try the
"slurm-direct" launch method in 1.5 and see if that meets their needs.

There may be something going on with slurm 2.2.x - as I've said before,
slurm makes major changes in even minor releases, and trying to track them
is a nearly impossible task, especially as many of these features are
configuration dependent. What we have in OMPI is the level of slurm
integration required by the three DOE weapons labs as (a) they represent the
largest component of the very small slurm community, and (b) in the past,
they provided the majority of the slurm integration effort within ompi. It
works as they need it to, given the way they configure slurm (which may not
be how others do).

I'm always willing to help other slurm users, but within the guidelines
expressed in an earlier thread - the result must be driven by the DOE
weapons labs' requirements, and cannot interfere with their usage models.

As for slurm_procid - if an application is looking for it, it sounds like
OMPI may not be a good choice for them. Under OMPI, slurm does not see
the application procs and has no idea they exist. Slurm's knowledge of an
OMPI job is limited solely to the daemons. This has tradeoffs, as most
design decisions do - in the case of the DOE labs, the tradeoffs were judged
favorable...at least, as far as LANL was concerned, and they were my boss
when I wrote the code :-) At LLNL's request, I did create the ability to run
jobs directly under srun - but as Jeff noted, with reduced capability.

Hope that helps clarify what is in the code, and why. I'm not sure what
motivated the original question, but hopefully ompi's slurm support is a
little bit clearer?

Ralph




On Thu, Feb 24, 2011 at 2:08 PM, Jeff Squyres  wrote:

> On Feb 24, 2011, at 2:59 PM, Henderson, Brent wrote:
>
> > [snip]
> > They really can't be all SLURM_PROCID=0 - that is supposed to be unique
> for the job - right?  It appears that the SLURM_PROCID is inherited from the
> orted parent - which makes a fair amount of sense given how things are
> launched.
>
> That's correct, and I can agree with your sentiment.
>
> However, our design goals were to provide a consistent *Open MPI*
> experience across different launchers. Providing native access to the actual
> underlying launcher was a secondary goal.  Balancing those two, you can see
> why we chose the model we did: our orted provides  (nearly) the same
> functionality across all environments.
>
> In SLURM's case, we propagate a [seemingly] non-sensical SLURM_PROCID
> values to the individual processes, but only if you are making an assumption
> about how Open MPI is using SLURM's launcher.
>
> More specifically, our goal is to provide consistent *Open MPI information*
> (e.g., through the OMPI_COMM_WORLD* env variables) -- not emulate what SLURM
> would have done if MPI processes had been launched individually through
> srun.  Even more specifically: we don't think that the exact underlying
> launching mechanism that OMPI uses is of interest to most users; we
> encourage them to use our portable mechanisms that work even if they move to
> another cluster with a different launcher.  Admittedly, that does make it a
> little more challenging if you have to support multiple MPI implementations,
> and although that's an important consideration to us, it's not our first
> priority.
>
> > Now to answer the other question - why are there some variables missing.
>  It appears that when the orted processes are launched - via srun but only
> one per node, it is a subset of the main allocation and thus some of the
> environment variables are not the same (or missing entirely) as compared to
> launching them directly with srun on the full allocation.  This also makes
> sense to me at some level, so I'm at peace with it now.  :)
>
> Ah, good.
>
> > Last thing before I go.  Please let me apologize for not being clear on
> what I disagreed with Ralph about in my last note.  Clearly he nailed the
> orted launching process and spelled it out very clearly, but I don't believe
> that HP-MPI is not doing anything special to copy/fix up the SLURM
> environment variables.  Hopefully that was clear by the body of that
> message.
>
> No worries; you were perfectly clear.  Thanks!
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>