Re: [OMPI users] Fwd: problem for multiple clusters using mpirun

2014-03-25 Thread Reuti
Hi,

Am 25.03.2014 um 08:34 schrieb Hamid Saeed:

> Is it possible to change the port number for the MPI communication?
> 
> I can see that my program uses port 4 for the MPI communication.
> 
> [karp:23756] btl: tcp: attempting to connect() to address 134.106.3.252 on 
> port 4
> [karp][[4612,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
>  connect() to 134.106.3.252 failed: Connection refused (111)
> 
> In my case the ports from 1 to 1024 are reserved. 
> MPI tries to use one of the reserved ports and gets the connection refused 
> error.
> 
> I will be very glad for any kind suggestions.

There are certain parameters to set the range of ports used, but using any port up 
to 1024 should not be the default:

http://www.open-mpi.org/community/lists/users/2011/11/17732.php
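
For reference, a minimal sketch of how such a range could be pinned on the mpirun 
command line (parameter names as in the 1.6/1.7 series - please verify them against 
your installation with `ompi_info --param btl tcp` and `ompi_info --param oob tcp`, 
the values below are only examples):

mpirun --mca btl_tcp_port_min_v4 2000 --mca btl_tcp_port_range_v4 100 \
   --mca oob_tcp_port_min_v4 2000 --mca oob_tcp_port_range_v4 100 ...

The point is to keep both the BTL and the OOB ports above the reserved range below 
1024.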

Are any of these set by accident beforehand by your environment?

-- Reuti


> Regards.
> 
> 
> 
> 
> 
> On Mon, Mar 24, 2014 at 5:32 PM, Hamid Saeed  wrote:
> Hello Jeff,
> 
> Thanks for your cooperation.
> 
> --mca btl_tcp_if_include br0 
> 
> worked out of the box.
> 
> The problem was from the network administrator. The machines on the network 
> side were halting the mpi...
> 
> so cleaning and killing every thing worked.
> 
> :)
> 
> regards. 
> 
> 
> On Mon, Mar 24, 2014 at 4:34 PM, Jeff Squyres (jsquyres)  
> wrote:
> There is no "self" IP interface in the Linux kernel.
> 
> Try using btl_tcp_if_include and list just the interface(s) that you want to 
> use.  From your prior email, I'm *guessing* it's just br2 (i.e., the 10.x 
> address inside your cluster).
> 
> Also, it looks like you didn't set up your SSH keys properly for logging in to 
> remote nodes automatically.
> 
> 
> 
> On Mar 24, 2014, at 10:56 AM, Hamid Saeed  wrote:
> 
> > Hello,
> >
> > I added the "self" e.g
> >
> > hsaeed@karp:~/Task4_mpi/scatterv$ mpirun -np 8 --mca btl ^openib --mca 
> > btl_tcp_if_exclude sm,self,lo,br0,br1,ib0,br2 --host karp,wirth ./scatterv
> >
> > Enter passphrase for key '/home/hsaeed/.ssh/id_rsa':
> > --
> >
> > ERROR::
> >
> > At least one pair of MPI processes are unable to reach each other for
> > MPI communications.  This means that no Open MPI device has indicated
> > that it can be used to communicate between these processes.  This is
> > an error; Open MPI requires that all MPI processes be able to reach
> > each other.  This error can sometimes be the result of forgetting to
> > specify the "self" BTL.
> >
> >   Process 1 ([[15751,1],7]) is on host: wirth
> >   Process 2 ([[15751,1],0]) is on host: karp
> >   BTLs attempted: self sm
> >
> > Your MPI job is now going to abort; sorry.
> > --
> > --
> > MPI_INIT has failed because at least one MPI process is unreachable
> > from another.  This *usually* means that an underlying communication
> > plugin -- such as a BTL or an MTL -- has either not loaded or not
> > allowed itself to be used.  Your MPI job will now abort.
> >
> > You may wish to try to narrow down the problem;
> >
> >  * Check the output of ompi_info to see which BTL/MTL plugins are
> >available.
> >  * Run your application with MPI_THREAD_SINGLE.
> >  * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
> >if using MTL-based communications) to see exactly which
> >communication plugins were considered and/or discarded.
> > --
> > [wirth:40329] *** An error occurred in MPI_Init
> > [wirth:40329] *** on a NULL communicator
> > [wirth:40329] *** Unknown error
> > [wirth:40329] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> > --
> > An MPI process is aborting at a time when it cannot guarantee that all
> > of its peer processes in the job will be killed properly.  You should
> > double check that everything has shut down cleanly.
> >
> >   Reason: Before MPI_INIT completed
> >   Local host: wirth
> >   PID:40329
> > --
> > --
> > mpirun has exited due to process rank 7 with PID 40329 on
> > node wirth exitin

Re: [OMPI users] busy waiting and oversubscriptions

2014-03-27 Thread Reuti
Am 27.03.2014 um 16:31 schrieb Gus Correa:

> On 03/27/2014 05:05 AM, Andreas Schäfer wrote:
>>> >Queue systems won't allow resources to be oversubscribed.
>> I'm fairly confident that you can configure Slurm to oversubscribe
>> nodes: just specify more cores for a node than are actually present.
>> 
> 
> That is true.
> If you lie to the queue system about your resources,
> it will believe you and oversubscribe.
> Torque has this same feature.
> I don't know about SGE.

It's possible in SGE too.
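
For example (a minimal sketch, queue and host names are only placeholders): declaring 
more slots for a host than it has cores in the cluster queue definition will let SGE 
oversubscribe that host.

$ qconf -mq all.q
...
slots    8,[node01=16]
...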

-- Reuti


> You may choose to set some or all nodes with more cores than they actually 
> have, if that is a good choice for the codes you run.
> However, for our applications oversubscribing is bad, hence my mindset.
> 
> Gus Correa
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 



Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)

2014-03-27 Thread Reuti
Hi,

Am 27.03.2014 um 20:15 schrieb Gus Correa:

> 
>> Awesome, but now here is my concern.
> If we have OpenMPI-based applications launched as batch jobs
> via a batch scheduler like SLURM, PBS, LSF, etc.
> (which decides the placement of the app and dispatches it to the compute 
> hosts),
> then will including "--report-bindings --bind-to-core" cause problems?

Do all of them have an internal bookkeeping of granted cores to slots - i.e. not 
only the number of scheduled slots per job per node, but also which core was granted 
to which job? Whether Open MPI reads this information would then be the next 
question.


> I don't know all resource managers and schedulers.
> 
> I use Torque+Maui here.
> OpenMPI is built with Torque support, and will use the nodes and cpus/cores 
> provided by Torque.

Same question here.


> My understanding is that Torque delegates to OpenMPI the process placement 
> and binding (beyond the list of nodes/cpus available for
> the job).
> 
> My guess is that OpenPBS behaves the same as Torque.
> 
> SLURM and SGE/OGE *probably* have pretty much the same behavior.

SGE/OGE: no, any binding request is only a soft request.
UGE: here you can request a hard binding. But I have no clue whether this 
information is read by Open MPI too.

If in doubt: use only complete nodes for each job (which is often done for 
massively parallel jobs anyway).

-- Reuti


> A cursory reading of the SLURM web page suggested to me that it
> does core binding by default, but don't quote me on that.
> 
> I don't know what LSF does, but I would guess there is a
> way to do the appropriate bindings, either at the resource manager level, or 
> at the OpenMPI level (or a combination of both).
> 
> 
> Certainly I can test this, but concerned there may be a case where inclusion 
> of
> --bind-to-core would cause an unexpected problem I did not account for.
>> 
>> --john
>> 
> 
> Well, testing and failing is part of this game!
> Would the GE manager buy that? :)
> 
> I hope this helps,
> Gus Correa
> 
>> 
>> -Original Message-
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
>> Sent: Thursday, March 27, 2014 2:06 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)
>> 
>> Hi John
>> 
>> Take a look at the mpiexec/mpirun options:
>> 
>> -report-bindings (this one should report what you want)
>> 
>> and maybe also also:
>> 
>> -bycore, -bysocket, -bind-to-core, -bind-to-socket, ...
>> 
>> and similar, if you want more control on where your MPI processes run.
>> 
>> "man mpiexec" is your friend!
>> 
>> I hope this helps,
>> Gus Correa
>> 
>> On 03/27/2014 01:53 PM, Sasso, John (GE Power & Water, Non-GE) wrote:
>>> When a piece of software built against OpenMPI fails, I will see an
>>> error referring to the rank of the MPI task which incurred the failure.
>>> For example:
>>> 
>>> MPI_ABORT was invoked on rank 1236 in communicator MPI_COMM_WORLD
>>> 
>>> with errorcode 1.
>>> 
>>> Unfortunately, I do not have access to the software code, just the
>>> installation directory tree for OpenMPI.  My question is:  Is there a
>>> flag that can be passed to mpirun, or an environment variable set,
>>> which would reveal the mapping of ranks to the hosts they are on?
>>> 
>>> I do understand that one could have multiple MPI ranks running on the
>>> same host, but finding a way to determine which rank ran on what host
>>> would go a long way in help troubleshooting problems which may be
>>> central to the host.  Thanks!
>>> 
>>>--john
>>> 
>>> 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)

2014-03-27 Thread Reuti
Am 27.03.2014 um 23:59 schrieb Dave Love:

> Reuti  writes:
> 
>> Do all of them have an internal bookkeeping of granted cores to slots
>> - i.e. not only the number of scheduled slots per job per node, but
>> also which core was granted to which job? Does Open MPI read this
>> information would be the next question then.
> 
> OMPI works with the bindings it's handed via orted (if the processes are
> started that way).
> 
>>> My understanding is that Torque delegates to OpenMPI the process placement 
>>> and binding (beyond the list of nodes/cpus available for
>>> the job).
> 
> Can't/doesn't torque start the MPI processes itself?  Otherwise, yes,
> since orted gets the binding.
> 
>>> My guess is that OpenPBS behaves the same as Torque.
>>> 
>>> SLURM and SGE/OGE *probably* have pretty much the same behavior.
>> 
>> SGE/OGE: no, any binding request is only a soft request.
> 
> I don't understand that.  Does it mean the system-specific "strict" and
> "non-strict" binding in hwloc, in which case I don't see how UGE can do
> anything different?
> 
>> UGE: here you can request a hard binding. But I have no clue whether this 
>> information is read by Open MPI too.
>> 
>> If in doubt: use only complete nodes for each job (which is often done
>> for massively parallel jobs anyway).
> 
> There's no need with a recent SGE.  All our jobs get core bindings --
> unless they use all the cores, since binding them all is equivalent to
> binding none -- and OMPI inherits them.  See
> <http://arc.liv.ac.uk/SGE/howto/sge-configs.html#_core_binding> for the
> SGE+OMPI configuration.

To avoid any misunderstanding I first discuss this last paragraph. I read 
http://www.slideshare.net/jsquyres/open-mpi-explorations-in-process-affinity-eurompi13-presentation
 which was posted on this list yesterday. And so I would phrase it: mapping to 
all is like mapping to none. And as they are only mapped, the kernel scheduler 
is free to move them around inside this set of (granted) cores.

But maybe I got it wrong.

-- Reuti



> -- 
> Community Grid Engine:  http://arc.liv.ac.uk/SGE/
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] openmpi query

2014-04-04 Thread Reuti
Am 04.04.2014 um 05:55 schrieb Ralph Castain:

> On Apr 3, 2014, at 8:03 PM, Nisha Dhankher -M.Tech(CSE) 
>  wrote:
> 
>> thankyou Ralph.
>> Yes cluster is heterogenous...
> 
> And did you configure OMPI --enable-heterogeneous? And are you running it 
> with --hetero-nodes? What version of OMPI are you using anyway?
> 
> Note that we don't care if the host pc's are hetero - what we care about is 
> the VM. If all the VMs are the same, then it shouldn't matter. However, most 
> VM technologies don't handle hetero hardware very well - i.e., you can't 
> emulate an x86 architecture on top of a Sparc or Power chip or vice versa.

Well - you have to emulate the CPU. There were products running a virtual x86 
PC on a Mac with PowerPC chip. And IBM has a product called PowerVM Lx86 to run 
software compiled for Linux x86 directly on a PowerLinux machine.
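
Just for completeness, a rough sketch of what Ralph is asking about above (the exact 
flags depend on the Open MPI version, see `./configure --help` and `mpirun --help`):

$ ./configure --enable-heterogeneous --prefix=$HOME/local/openmpi
$ mpirun --hetero-nodes -np 64 -machinefile mf mpiblast -d all.fas -p blastn -i query.fas -o output.txt

But as Ralph noted, if all the VMs are the same this shouldn't be necessary at all.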

-- Reuti


>> And i haven't made compute nodes on direct physical nodes (pc's) becoz in 
>> college it is not possible to take whole lab of 32 pc's for your work  so i 
>> ran on vm.
> 
> Yes, but at least it would let you test the setup to run MPI across even a 
> couple of pc's - this is simple debugging practice.
> 
>> In Rocks cluster, frontend give the same kickstart to all the pc's so 
>> openmpi version should be same i guess.
> 
> Guess? or know? Makes a difference - might be worth testing.
> 
>> Sir 
>> mpiformatdb is a command to distribute database fragments to different 
>> compute nodes after partitioning od database.
>> And sir have you done mpiblast ?
> 
> Nope - but that isn't the issue, is it? The issue is with the MPI setup.
> 
>> 
>> 
>> On Fri, Apr 4, 2014 at 4:48 AM, Ralph Castain  wrote:
>> What is "mpiformatdb"? We don't have an MPI database in our system, and I 
>> have no idea what that command means
>> 
>> As for that error - it means that the identifier we exchange between 
>> processes is failing to be recognized. This could mean a couple of things:
>> 
>> 1. the OMPI version on the two ends is different - could be you aren't 
>> getting the right paths set on the various machines
>> 
>> 2. the cluster is heterogeneous
>> 
>> You say you have "virtual nodes" running on various PC's? That would be an 
>> unusual setup - VM's can be problematic given the way they handle TCP 
>> connections, so that might be another source of the problem if my 
>> understanding of your setup is correct. Have you tried running this across 
>> the PCs directly - i.e., without any VMs?
>> 
>> 
>> On Apr 3, 2014, at 10:13 AM, Nisha Dhankher -M.Tech(CSE) 
>>  wrote:
>> 
>>> i first formatted my database with mpiformatdb command then i ran command :
>>> mpirun -np 64 -machinefile mf mpiblast -d all.fas -p blastn -i query.fas -o 
>>> output.txt
>>> but then it gave this error 113 from some hosts and continue to run for 
>>> other but with no  results even after 2 hours lapsed.on rocks 6.0 
>>> cluster with 12 virtual nodes on pc's ...2 on each using virt-manger , 1 gb 
>>> ram to each
>>> 
>>> 
>>> On Thu, Apr 3, 2014 at 10:41 PM, Nisha Dhankher -M.Tech(CSE) 
>>>  wrote:
>>> i also made machine file which contain ip adresses of all compute nodes + 
>>> .ncbirc file for path to mpiblast and shared ,local storage path
>>> Sir
>>> I ran the same command of mpirun on my college supercomputer 8 nodes each 
>>> having 24 processors but it just runninggave no result uptill 3 hours...
>>> 
>>> 
>>> On Thu, Apr 3, 2014 at 10:39 PM, Nisha Dhankher -M.Tech(CSE) 
>>>  wrote:
>>> i first formatted my database with mpiformatdb command then i ran command :
>>> mpirun -np 64 -machinefile mf mpiblast -d all.fas -p blastn -i query.fas -o 
>>> output.txt
>>> but then it gave this error 113 from some hosts and continue to run for 
>>> other but with results even after 2 hours lapsed.on rocks 6.0 cluster 
>>> with 12 virtual nodes on pc's ...2 on each using virt-manger , 1 gb ram to 
>>> each
>>>  
>>> 
>>> 
>>> On Thu, Apr 3, 2014 at 8:37 PM, Ralph Castain  wrote:
>>> I'm having trouble understanding your note, so perhaps I am getting this 
>>> wrong. Let's see if I can figure out what you said:
>>> 
>>> * your perl command fails with "no route to host" - but I don't see any 
>>> host in your cmd. Maybe I'm just missing something.
>>

Re: [OMPI users] Problem with shell when launching jobs with OpenMPI 1.6.5 rsh

2014-04-07 Thread Reuti
Am 07.04.2014 um 22:04 schrieb Blosch, Edwin L:

> I am submitting a job for execution under SGE.  My default shell is /bin/csh.

Where is this the default - in SGE, or in the interactive shell you get?


>  The script that is submitted has #!/bin/bash at the top.  The script runs on 
> the 1st node allocated to the job.  The script runs a Python wrapper that 
> ultimately issues the following mpirun command:
>  
> /apps/local/test/openmpi/bin/mpirun --machinefile mpihosts.914 -np 48 -x 
> LD_LIBRARY_PATH -x MPI_ENVIRONMENT=1 --mca btl ^tcp --mca 
> shmem_mmap_relocate_backing_file -1 --bind-to-core --bycore --mca 
> orte_rsh_agent /usr/bin/rsh --mca plm_rsh_disable_qrsh 1 
> /apps/local/test/solver/bin/solver_openmpi -cycles 50 -ri restart.0 -i 
> flow.inp >& output
>  
> Just so there’s no confusion, OpenMPI is built without support for SGE.  It 
> should be using rsh to launch.
>  
> There are 4 nodes involved (each 12 cores, 48 processes total).  In the 
> output file, I see 3 sets of messages as shown below.  I assume I am seeing 1 
> set of messages for each of the 3 remote nodes where processes need to be 
> launched:
>  
> /bin/.: Permission denied.
> OPAL_PREFIX=/apps/local/falcon2014/openmpi: Command not found.
> export: Command not found.
> PATH=/apps/local/test/openmpi/bin:/bin:/usr/bin:/usr/ccs/bin:/usr/local/bin:/usr/openwin/bin:/usr/local/etc:/home/bloscel/bin:/usr/ucb:/usr/bsd:
>  Command not found.
> export: Command not found.
> LD_LIBRARY_PATH: Undefined variable.

This looks really like csh is trying to interpret bash commands. In case SGE's 
queue is set up to have "shell_start_mode posix_compliant" set, the first line 
of the script is not treated in a special way. You can change the shell only by 
"-S /bin/bash" then (or redefine the queue to have "shell_start_mode 
unix_behavior" set and get the expected behavior when starting a script [side 
effect: the shell is not started as login shell any longer. See also `man 
sge_conf` => "login_shells" for details]).
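
A minimal sketch of the two options (the queue name is only an example):

# per job, force bash for the job script:
$ qsub -S /bin/bash job.sh

# or once in the queue definition:
$ qconf -mq all.q
...
shell_start_mode    unix_behavior
...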

BTW: you don't want a tight integration by intention?

-- Reuti


>  These look like errors you get when csh is trying to parse commands intended 
> for bash. 
>  
> Does anyone know what may be going on here?
>  
> Thanks,
>  
> Ed
>  
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] EXTERNAL: Re: Problem with shell when launching jobs with OpenMPI 1.6.5 rsh

2014-04-07 Thread Reuti
Am 07.04.2014 um 22:36 schrieb Blosch, Edwin L:

> I guess this is not OpenMPI related anymore.  I can repeat the essential 
> problem interactively:
> 
> % echo $SHELL
> /bin/csh
> 
> % echo $SHLVL
> 1
> 
> % cat hello
> echo Hello
> 
> % /bin/bash hello
> Hello
> 
> % /bin/csh hello
> Hello
> 
> %  . hello
> /bin/.: Permission denied

. as a bash shortcut for `source` will also be interpreted by `csh` and generates 
this error. You can try to change your interactive shell with `chsh`.
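
E.g. something like the following (the new shell must be listed in /etc/shells):

$ chsh -s /bin/bash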

-- Reuti


> I think I need to hope the administrator can fix it.  Sorry for the bother...
> 
> 
> -Original Message-----
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Reuti
> Sent: Monday, April 07, 2014 3:27 PM
> To: Open MPI Users
> Subject: EXTERNAL: Re: [OMPI users] Problem with shell when launching jobs 
> with OpenMPI 1.6.5 rsh
> 
> Am 07.04.2014 um 22:04 schrieb Blosch, Edwin L:
> 
>> I am submitting a job for execution under SGE.  My default shell is /bin/csh.
> 
> Where is this the default - in SGE, or in the interactive shell you get?
> 
> 
>> The script that is submitted has #!/bin/bash at the top.  The script runs on 
>> the 1st node allocated to the job.  The script runs a Python wrapper that 
>> ultimately issues the following mpirun command:
>> 
>> /apps/local/test/openmpi/bin/mpirun --machinefile mpihosts.914 -np 48 -x 
>> LD_LIBRARY_PATH -x MPI_ENVIRONMENT=1 --mca btl ^tcp --mca 
>> shmem_mmap_relocate_backing_file -1 --bind-to-core --bycore --mca 
>> orte_rsh_agent /usr/bin/rsh --mca plm_rsh_disable_qrsh 1 
>> /apps/local/test/solver/bin/solver_openmpi -cycles 50 -ri restart.0 -i 
>> flow.inp >& output
>> 
>> Just so there's no confusion, OpenMPI is built without support for SGE.  It 
>> should be using rsh to launch.
>> 
>> There are 4 nodes involved (each 12 cores, 48 processes total).  In the 
>> output file, I see 3 sets of messages as shown below.  I assume I am seeing 
>> 1 set of messages for each of the 3 remote nodes where processes need to be 
>> launched:
>> 
>> /bin/.: Permission denied.
>> OPAL_PREFIX=/apps/local/falcon2014/openmpi: Command not found.
>> export: Command not found.
>> PATH=/apps/local/test/openmpi/bin:/bin:/usr/bin:/usr/ccs/bin:/usr/local/bin:/usr/openwin/bin:/usr/local/etc:/home/bloscel/bin:/usr/ucb:/usr/bsd:
>>  Command not found.
>> export: Command not found.
>> LD_LIBRARY_PATH: Undefined variable.
> 
> This looks really like csh is trying to interpret bash commands. In case 
> SGE's queue is set up to have "shell_start_mode posix_compliant" set, the 
> first line of the script is not treated in a special way. You can change the 
> shell only by "-S /bin/bash" then (or redefine the queue to have 
> "shell_start_mode unix_behavior" set and get the expected behavior when 
> starting a script [side effect: the shell is not started as login shell any 
> longer. See also `man sge_conf` => "login_shells" for details]).
> 
> BTW: you don't want a tight integration by intention?
> 
> -- Reuti
> 
> 
>> These look like errors you get when csh is trying to parse commands intended 
>> for bash. 
>> 
>> Does anyone know what may be going on here?
>> 
>> Thanks,
>> 
>> Ed
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Different output ubuntu /mac

2014-04-14 Thread Reuti
Am 13.04.2014 um 09:58 schrieb Kamal:

> I have a code which uses both mpicc and mpif90.
> 
> The code is to read a file from the directory, It works properly on my 
> desktop (ubuntu) but when I run the same code on my Macbook I get fopen 
> failure errno : 2 ( file does not exist )

Without more information it looks like an error in the application and not Open 
MPI. Worth noting is that Mac home directories are under /Users and not /home.

-- Reuti


> Could some one please tell me what might be the problem ?
> 
> 
> Thanks,
> Bow
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Determining what parameters a scheduler passes to OpenMPI

2014-06-06 Thread Reuti
Am 06.06.2014 um 18:58 schrieb Sasso, John (GE Power & Water, Non-GE):

> OK, so at the least, how can I get the node and slots/node info that is 
> passed from PBS?
>  
> I ask because I’m trying to troubleshoot a problem w/ PBS and the build of 
> OpenMPI 1.6 I noted.  If I submit a 24-process simple job through PBS using a 
> script which has:
>  
> /usr/local/openmpi/bin/orterun -n 24 --hostfile /home/sasso/TEST/hosts.file 
> --mca orte_rsh_agent rsh --mca btl openib,tcp,self --mca 
> orte_base_help_aggregate 0 -x PATH -x LD_LIBRARY_PATH 
> /home/sasso/TEST/simplempihello.exe

Using --hostfile on your own means violating the slot allocation granted by PBS. 
Just leave this option out. How do you submit your job?
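
A minimal sketch (assuming your Open MPI build is really PBS/Torque-aware): inside 
the job script the granted allocation is already known to mpirun, so something like

$ cat $PBS_NODEFILE     # the allocation PBS granted, one line per slot
$ /usr/local/openmpi/bin/orterun -n 24 --mca btl openib,tcp,self -x PATH -x LD_LIBRARY_PATH /home/sasso/TEST/simplempihello.exe

should distribute the 24 tasks according to what PBS granted, without any --hostfile.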

-- Reuti


> And the hostfile /home/sasso/TEST/hosts.file contains 24 entries (the first 
> 16 being host node0001 and the last 8 being node0002), it appears that 24 MPI 
> tasks try to start on node0001 instead of getting distributed as 16 on 
> node0001 and 8 on node0002.   Hence, I am curious what is being passed by PBS.
>  
> --john
>  
>  
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Friday, June 06, 2014 12:31 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Determining what parameters a scheduler passes to 
> OpenMPI
>  
> We currently only get the node and slots/node info from PBS - we don't get 
> any task placement info at all. We then use the mpirun cmd options and 
> built-in mappers to map the tasks to the nodes.
>  
> I suppose we could do more integration in that regard, but haven't really 
> seen a reason to do so - the OMPI mappers are generally more flexible than 
> anything in the schedulers.
>  
>  
> On Jun 6, 2014, at 9:08 AM, Sasso, John (GE Power & Water, Non-GE) 
>  wrote:
> 
> 
> For the PBS scheduler and using a build of OpenMPI 1.6 built against PBS 
> include files + libs, is there a way to determine (perhaps via some debugging 
> flags passed to mpirun) what job placement parameters are passed from the PBS 
> scheduler to OpenMPI?  In particular, I am talking about task placement info 
> such as nodes to place on, etc.   Thanks!
>  
>   --john
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>  
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] --preload-binary does not work

2014-06-06 Thread Reuti
Am 06.06.2014 um 21:04 schrieb Ralph Castain:

> Supposed to, yes - but I don't know how much testing it has seen. I can try 
> to take a look

Wasn't it on the list recently, that 1.8.1 should do it even without 
passphraseless SSH between the nodes?

-- Reuti


> On Jun 6, 2014, at 12:02 PM, E.O.  wrote:
> 
>> Hello
>> I am using OpenMPI ver 1.8.1 on a cluster of 4 machines.
>> One Redhat 6.2 and three busybox machine. They are all 64bit environment.
>> 
>> I want to use --preload-binary option to send the binary file to hosts but 
>> it's not working.
>> 
>> # /mpi/bin/mpirun --prefix /mpi --preload-files ./a.out --allow-run-as-root 
>> --np 4 --host box0101,box0103 --preload-binary ./a.out
>> --
>> mpirun was unable to launch the specified application as it could not access
>> or execute an executable:
>> 
>> Executable: ./a.out
>> Node: box0101
>> 
>> while attempting to start process rank 17.
>> --
>> 17 total processes failed to start
>> #
>> 
>> If I sent the binary by SCP beforehand, the command works fine. SCP is 
>> working fine without password between the hosts.
>> Is the option supposed to work?
>> Thank you,
>> 
>> Eiichi
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] connect() fails - inhomogeneous cluster

2014-06-17 Thread Reuti
Hi,

Am 17.06.2014 um 13:00 schrieb Borno Knuttelski:

> this is the first time I contact this list. I'm using OpenMPI 1.6.5 on an 
> inhomogeneous cluster with 2 machines. Short: With few processes everything 
> works fine but with some more my application crashes. (Yes, I can guarantee 
> that in every scenario I start processes on both machines).  I posted the 
> problem already with all details on stackoverflow 
> (http://stackoverflow.com/questions/24164825/mpi-connect-fails-inhomogeneous-cluster).
>  Please have a look at it. What exactly is the problem and how can I fix it?

How do you start the program - just with `mpiexec` and a proper hostfile and 
number of slots?

-- Reuti


> Every help and guess is appreciated and will be tested...
> Thanks in advance,
>  
> Kurt
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/06/24662.php



Re: [OMPI users] connect() fails - inhomogeneous cluster

2014-06-17 Thread Reuti
Am 17.06.2014 um 14:53 schrieb borno_bo...@gmx.de:

> I should have written that...
>  
> mpirun -np n --hostfile host.cfg
>  
> mpi@Ries   slots=n_1 max_slots=n_1
> mpi@Euler  slots=n_2 max_slots=n_2

Although hostnames are defined to be case insensitive, my experience is that not 
all calls handle this properly. To avoid any confusion because of this, it's best to 
have them all in lowercase. I don't know whether this is related to your observation.
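
I.e. the same hostfile, just with the hostnames written in lowercase (slot counts as 
before):

mpi@ries   slots=n_1 max_slots=n_1
mpi@euler  slots=n_2 max_slots=n_2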

-- Reuti


> It is arranged that the sum over the n_i is equal to n.
>  
> Kurt
> Gesendet: Dienstag, 17. Juni 2014 um 14:25 Uhr
> Von: Reuti 
> An: "Open MPI Users" 
> Betreff: Re: [OMPI users] connect() fails - inhomogeneous cluster
> Hi,
> 
> Am 17.06.2014 um 13:00 schrieb Borno Knuttelski:
> 
> > this is the first time I contact this list. I'm using OpenMPI 1.6.5 on an 
> > inhomogeneous cluster with 2 machines. Short: With few processes everything 
> > works fine but with some more my application crashes. (Yes, I can guarantee 
> > that in every scenario I start processes on both machines). I posted the 
> > problem already with all details on stackoverflow 
> > (http://stackoverflow.com/questions/24164825/mpi-connect-fails-inhomogeneous-cluster).
> >  Please have a look at it. What exactly is the problem and how can I fix it?
> 
> How do you start the program - just with `mpiexec` and a proper hostfile and 
> number of slots?
> 
> -- Reuti
> 
> 
> > Every help and guess is appreciated and will be tested...
> > Thanks in advance,
> >
> > Kurt
> > ___
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/users/2014/06/24662.php
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/06/24663.php
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/06/24664.php



Re: [OMPI users] Problem moving from 1.4 to 1.6

2014-06-27 Thread Reuti
Hi,

Am 27.06.2014 um 19:56 schrieb Jeffrey A Cummings:

> I appreciate your response and I understand the logic behind your suggestion, 
> but you and the other regular expert contributors to this list are frequently 
> working under a misapprehension.  Many of your openMPI users don't have any 
> control over what version of openMPI is available on their system.  I'm stuck 
> with whatever version my IT people choose to bless, which in general is the 
> (possibly old and/or moldy) version that is bundled with some larger package 
> (i.e., Rocks, Linux).  The fact that I'm only now seeing this 1.4 to 1.6 
> problem illustrates the situation I'm in.  I really need someone to did into 
> their memory archives to see if they can come up with a clue for me.

You can freely download the Open MPI source and install it for example in your 
personal ~/local/openmpi-1.8 or similar. Pointing your $PATH and $LD_LIBRARY_PATH 
to your own version will supersede the installed system one.
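
A minimal sketch of such a personal installation (version and paths are only 
examples):

$ tar xjf openmpi-1.8.1.tar.bz2
$ cd openmpi-1.8.1
$ ./configure --prefix=$HOME/local/openmpi-1.8
$ make -j4 && make install
$ export PATH=$HOME/local/openmpi-1.8/bin:$PATH
$ export LD_LIBRARY_PATH=$HOME/local/openmpi-1.8/lib:$LD_LIBRARY_PATH

Afterwards `which mpirun` should point to your own installation and 
`mpirun --version` should report 1.8.1.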

-- Reuti


> Jeffrey A. Cummings
> Engineering Specialist
> Performance Modeling and Analysis Department
> Systems Analysis and Simulation Subdivision
> Systems Engineering Division
> Engineering and Technology Group
> The Aerospace Corporation
> 571-307-4220
> jeffrey.a.cummi...@aero.org 
> 
> 
> 
> From:Gus Correa  
> To:Open MPI Users , 
> Date:06/27/2014 01:45 PM 
> Subject:Re: [OMPI users] Problem moving from 1.4 to 1.6 
> Sent by:"users"  
> 
> 
> 
> It may be easier to install the latest OMPI from the tarball,
> rather than trying to sort out the error.
> 
> http://www.open-mpi.org/software/ompi/v1.8/
> 
> The packaged built of (somewhat old) OMPI 1.6.2 that came with
> Linux may not have built against the same IB libraries, hardware,
> and configuration you have.
> [The error message reference to udapl is ominous.]
> 
> > The mpirun command line contains the argument '--mca btl ^openib', which
> > I thought told mpi to not look for the ib interface.
> 
> As you said, the mca parameter above tells OMPI not to use openib,
> although it may not be the only cause of the problem.
> If you want to use openib switch to
> --mca btl openib,sm,self
> 
> Another thing to check is whether there is a mixup of environment 
> variables, PATH and LD_LIBRARY_PATH perhaps pointing to the old OMPI 
> version you may have installed.
> 
> My two cents,
> Gus Correa
> 
> On 06/27/2014 12:53 PM, Jeffrey A Cummings wrote:
> > We have recently upgraded our cluster to a version of Linux which comes
> > with openMPI version 1.6.2.
> >
> > An application which ran previously (using some version of 1.4) now
> > errors out with the following messages:
> >
> >  librdmacm: Fatal: no RDMA devices found
> >  librdmacm: Fatal: no RDMA devices found
> >  librdmacm: Fatal: no RDMA devices found
> >
> > --
> >  WARNING: Failed to open "OpenIB-cma" [DAT_INTERNAL_ERROR:].
> >  This may be a real error or it may be an invalid entry in the
> > uDAPL
> >  Registry which is contained in the dat.conf file. Contact your
> > local
> >  System Administrator to confirm the availability of the
> > interfaces in
> >  the dat.conf file.
> >
> > --
> >  [tupile:25363] 2 more processes have sent help message
> > help-mpi-btl-udapl.txt / dat_ia_open fail
> >  [tupile:25363] Set MCA parameter "orte_base_help_aggregate" to
> > 0 to see all help / error messages
> >
> > The mpirun command line contains the argument '--mca btl ^openib', which
> > I thought told mpi to not look for the ib interface.
> >
> > Can anyone suggest what the problem might be?  Did the relevant syntax
> > change between versions 1.4 and 1.6?
> >
> >
> > Jeffrey A. Cummings
> > Engineering Specialist
> > Performance Modeling and Analysis Department
> > Systems Analysis and Simulation Subdivision
> > Systems Engineering Division
> > Engineering and Technology Group
> > The Aerospace Corporation
> > 571-307-4220
> > jeffrey.a.cummi...@aero.org
> >
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/users/2014/06/24721.php
> >
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/06/24722.php
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/06/24723.php



Re: [OMPI users] Newbie query - mpirun will not run if it's previously been killed with Control-C

2014-08-07 Thread Reuti
Am 07.08.2014 um 17:28 schrieb Gus Correa:

> I guess Control-C will kill only the mpirun process.
> You may need to kill the (two) jules.exe processes separately,
> say, with kill -9.
> ps -u "yourname"
> will show what you have running.

Shouldn't Open MPI clean this up in a proper way when Control-C is pressed?

But maybe there is something left in /tmp like "openmpi-sessions-...@..." which 
needs to be removed.
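
E.g. a quick sketch to check for leftovers (the exact directory names depend on the 
Open MPI version and user):

$ ps -u yourname                     # any stray jules.exe or orted processes?
$ ls -d /tmp/openmpi-sessions-*      # stale session directories?
$ rm -rf /tmp/openmpi-sessions-yourname*

Open MPI also ships an `orte-clean` utility which tries to clean up such leftover 
session files and processes.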

-- Reuti


> On 08/07/2014 11:16 AM, Jane Lewis wrote:
>> Hi all,
>> 
>> This is a really simple problem (I hope) where I’ve introduced MPI to a
>> complex numerical model which I have to kill occasionally with Control-C
>> as I don’t want it running forever.
>> 
>> I have only used mpi_init(), mpi_comm_size(), mpi_comm_rank() and
>> mpi_finalize()– there are no send/receive calls going on at the moment –
>> and I only have two instances. My startup command is:
>> 
>> #/bin/bash
>> 
>> /opt/openmpi/bin/mpirun  -np 2 -hostfile hostfile jules.exe
>> 
>> where hostfile has one entry : localhost
>> 
>> The result of terminating the process with Control-C at the command
>> prompt from where I launched it, is that I am then unable to run it
>> again. I get the
>> 
>> “mpirun has exited due to process rank 0 with PID 10094 on node
>> metclcv10.local exiting improperly. There are two reasons this could
>> occur:…” error each time despite checking running processes for
>> stragglers, closing my terminal, or changing node.
>> 
>> I have spent several hours searching for an answer to this, if it’s
>> already somewhere then please point me in the right direction.
>> 
>> many thanks in advance
>> 
>> Jane
>> 
>> For info:
>> 
>> #ompi_info -v ompi full --parsable
>> 
>> package:Open MPI root@centos-6-3.localdomain Distribution
>> 
>> ompi:version:full:1.6.2
>> 
>> ompi:version:svn:r27344
>> 
>> ompi:version:release_date:Sep 18, 2012
>> 
>> orte:version:full:1.6.2
>> 
>> orte:version:svn:r27344
>> 
>> orte:version:release_date:Sep 18, 2012
>> 
>> opal:version:full:1.6.2
>> 
>> opal:version:svn:r27344
>> 
>> opal:version:release_date:Sep 18, 2012
>> 
>> mpi-api:version:full:2.1
>> 
>> ident:1.6.2
>> 
>> I’m using centos-6-3 and FORTRAN.
>> 
>> Jane Lewis
>> 
>> Deputy Technical Director, Reading e-Science Centre
>> 
>> Department of Meteorology
>> 
>> University of Reading, UK
>> 
>> Tel: +44 (0)118 378 5173
>> 
>> http://www.resc.reading.ac.uk <http://www.resc.reading.ac.uk/>
>> 
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2014/08/24938.php
>> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/08/24939.php
> 



Re: [OMPI users] Multiple runs interaction

2014-08-12 Thread Reuti
Am 12.08.2014 um 16:57 schrieb Antonio Rago:

> Brilliant, this works!
> However I have to say that it seems the code becomes slightly 
> less performant.
> Is there a way to instruct mpirun on which core to use, and maybe create this 
> map automatically with grid engine?

In the open source version of SGE the requested core binding is only a soft 
request. The Univa version can handle this as a hard request though, as the 
scheduler will do the assignment and knows which cores are used. I have no 
information whether this will be forwarded to Open MPI automatically. I assume not, 
and it must be read out of the machine file (there ought to be an extra column for 
it in their version) and fed to Open MPI by some means.

-- Reuti


> Thanks in advance
> Antonio
> 
> 
> 
> 
> On 12 Aug 2014, at 14:10, Jeff Squyres (jsquyres)  wrote:
> 
>> The quick and dirty answer is that in the v1.8 series, Open MPI started 
>> binding MPI processes to cores by default.
>> 
>> When you run 2 independent jobs on the same machine in the way in which you 
>> described, the two jobs won't have knowledge of each other, and therefore 
>> they will both starting binging MPI processes starting with logical core 0.
>> 
>> The easy workaround is to disable bind-to-core behavior.  For example, 
>> "mpirun --bind-to none ...".  In this way, the OS will (more or less) load 
>> balance your MPI jobs to available cores (assuming you don't run more MPI 
>> processes than cores).
>> 
>> 
>> On Aug 12, 2014, at 7:05 AM, Antonio Rago  
>> wrote:
>> 
>>> Dear mailing list
>>> I’m running into trouble in the configuration of the small cluster I’m 
>>> managing.
>>> I’ve installed openmpi-1.8.1 with gcc 4.7 on a Centos 6.5 with infiniband 
>>> support.
>>> Compile and installation were all ok and i can compile and actually run 
>>> parallel jobs, both directly or by submitting them with the queue manager 
>>> (gridengine).
>>> My problem is that when two different subsets of two job end on the same 
>>> node, they will not spread equally and use all the cores of the node but 
>>> instead they will run on a common subset of cores leaving some other 
>>> totally empty.
>>> For example two 4 core jobs on a 8 core node will result in only 4 core 
>>> running on the node (all of them being oversubscribed) and the other 4 
>>> cores being empty.
>>> Clearly there must be an error in the way I’ve configured stuffs but i 
>>> cannot find any hint on how to solve the problem.
>>> I’ve tried to do different map (map by core or by slot) but I’ve never 
>>> succeeded.
>>> Could you give a me suggestion on this issue?
>>> Regards
>>> Antonio
>>> 
>>> 
>>> [http://www.plymouth.ac.uk/images/email_footer.gif]<http://www.plymouth.ac.uk/worldclass>
>>> 
>>> This email and any files with it are confidential and intended solely for 
>>> the use of the recipient to whom it is addressed. If you are not the 
>>> intended recipient then copying, distribution or other use of the 
>>> information contained is strictly prohibited and you should not rely on it. 
>>> If you have received this email in error please let the sender know 
>>> immediately and delete it from your system(s). Internet emails are not 
>>> necessarily secure. While we take every care, Plymouth University accepts 
>>> no responsibility for viruses and it is your responsibility to scan emails 
>>> and their attachments. Plymouth University does not accept responsibility 
>>> for any changes made after it was sent. Nothing in this email or its 
>>> attachments constitutes an order for goods or services unless accompanied 
>>> by an official order form.
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/users/2014/08/24986.php
>> 
>> 
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to: 
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2014/08/24991.php
> 
> _

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-14 Thread Reuti
Hi,

Am 14.08.2014 um 15:50 schrieb Oscar Mojica:

> I am trying to run a hybrid mpi + openmp program in a cluster.  I created a 
> queue with 14 machines, each one with 16 cores. The program divides the work 
> among the 14 processors with MPI and within each processor a loop is also 
> divided into 8 threads for example, using openmp. The problem is that when I 
> submit the job to the queue the MPI processes don't divide the work into 
> threads and the program prints the number of threads  that are working within 
> each process as one. 
> 
> I made a simple test program that uses openmp and  I logged in one machine of 
> the fourteen. I compiled it using gfortran -fopenmp program.f -o exe,  set 
> the OMP_NUM_THREADS environment variable equal to 8  and when I ran directly 
> in the terminal the loop was effectively divided among the cores and for 
> example in this case the program printed the number of threads equal to 8
> 
> This is my Makefile
>  
> # Start of the makefile
> # Defining variables
> objects = inv_grav3d.o funcpdf.o gr3dprm.o fdjac.o dsvd.o
> #f90comp = /opt/openmpi/bin/mpif90
> f90comp = /usr/bin/mpif90
> #switch = -O3
> executable = inverse.exe
> # Makefile
> all : $(executable)
> $(executable) : $(objects)
>   $(f90comp) -fopenmp -g -O -o $(executable) $(objects)
>   rm $(objects)
> %.o: %.f
>   $(f90comp) -c $<
> # Cleaning everything
> clean:
>   rm $(executable) 
> # rm $(objects)
> # End of the makefile
> 
> and the script that i am using is 
> 
> #!/bin/bash
> #$ -cwd
> #$ -j y
> #$ -S /bin/bash
> #$ -pe orte 14

What is the output of `qconf -sp orte`?


> #$ -N job
> #$ -q new.q

Looks like you are using SGE. Was the installed Open MPI compiled with 
"--with-sge" to achieve a Tight Integration*, and are the processes distributed 
to all machines correctly (disregarding the thread issue here, just a plain MPI 
job)?

Note also that in either case the generated $PE_HOSTFILE needs to be adjusted, and 
you have to request 14 times 8 cores in total for your computation, to avoid SGE 
oversubscribing the machines.

-- Reuti

* This will also forward the environment variables to the slave machines. 
Without the Tight Integration there is the option "-x OMP_NUM_THREADS" to 
`mpirun` in Open MPI.
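
I.e. in that case something like:

/opt/openmpi/bin/mpirun -x OMP_NUM_THREADS -np $NSLOTS ./inverse.exe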


> export OMP_NUM_THREADS=8
> /usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v -np $NSLOTS ./inverse.exe 
> 
> am I forgetting something?
> 
> Thanks,
> 
> Oscar Fabian Mojica Ladino
> Geologist M.S. in  Geophysics
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/08/25016.php



Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-14 Thread Reuti
Hi,

I think this is a broader issue whenever an MPI library is used in conjunction 
with threads while running inside a queuing system. First, you can check whether 
your actual installation of Open MPI is SGE-aware with:

$ ompi_info | grep grid
 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.5)

Then we can look at the definition of your PE: "allocation_rule $fill_up". 
This means that SGE will grant you 14 slots in total in any combination on the 
available machines, i.e. an 8+4+2 slot allocation is an allowed combination, like 
4+4+3+3 and so on. Depending on the SGE-awareness it's a question: will your 
application just start processes on all nodes and completely disregard the granted 
allocation, or, as the other extreme, does it stay on one and the same machine for 
all started processes? On the master node of the parallel job you can issue:

$ ps -e f

(f w/o -) to have a look whether `ssh` or `qrsh -inherit ...` is used to reach 
other machines and their requested process count.


Now to the common problem in such a set up:

AFAICS: for now there is no way in the Open MPI + SGE combination to specify 
the number of MPI processes and intended number of threads which are 
automatically read by Open MPI while staying inside the granted slot count and 
allocation. So it seems to be necessary to have the intended number of threads 
being honored by Open MPI too.

Hence specifying e.g. "allocation_rule 8" in such a setup while requesting 32 
slots would for now already start 32 MPI processes, as Open MPI reads the 
$PE_HOSTFILE and acts accordingly.

Open MPI would have to read the generated machine file in a slightly different 
way regarding threads: a) read the $PE_HOSTFILE, b) divide the granted slots 
per machine by OMP_NUM_THREADS, c) throw an error in case it's not divisible by 
OMP_NUM_THREADS. Then start one process per quotient.
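
Done by hand inside a job script, a) to c) could be sketched like this (assuming the 
usual $PE_HOSTFILE format "host slots queue processor-range" and that OMP_NUM_THREADS 
is already exported):

awk -v t=$OMP_NUM_THREADS '{ if ($2 % t) exit 1; $2 /= t; print }' $PE_HOSTFILE > $TMPDIR/machines || exit 1
export PE_HOSTFILE=$TMPDIR/machines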

Would this work for you?

-- Reuti

PS: This would also mean to have a couple of PEs in SGE having a fixed 
"allocation_rule". While this works right now, an extension in SGE could be 
"$fill_up_omp"/"$round_robin_omp" and using  OMP_NUM_THREADS there too, hence 
it must not be specified as an `export` in the job script but either on the 
command line or inside the job script in #$ lines as job requests. This would 
mean to collect slots in bunches of OMP_NUM_THREADS on each machine to reach 
the overall specified slot count. Whether OMP_NUM_THREADS or n times 
OMP_NUM_THREADS is allowed per machine needs to be discussed.
 
PS2: As Univa SGE can also supply a list of granted cores in the $PE_HOSTFILE, 
it would be an extension to feed this to Open MPI to allow any UGE aware 
binding.


Am 14.08.2014 um 21:52 schrieb Oscar Mojica:

> Guys
> 
> I changed the line to run the program in the script with both options
> /usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v --bind-to-none -np $NSLOTS 
> ./inverse.exe
> /usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v --bind-to-socket -np $NSLOTS 
> ./inverse.exe
> 
> but I got the same results. When I use man mpirun appears:
> 
>-bind-to-none, --bind-to-none
>   Do not bind processes.  (Default.)
> 
> and the output of 'qconf -sp orte' is
> 
pe_name            orte
> slots  
> user_lists NONE
> xuser_listsNONE
> start_proc_args/bin/true
> stop_proc_args /bin/true
> allocation_rule$fill_up
> control_slaves TRUE
> job_is_first_task  FALSE
> urgency_slots  min
> accounting_summary TRUE
> 
> I don't know if the installed Open MPI was compiled with '--with-sge'. How 
> can i know that?
> before to think in an hybrid application i was using only MPI and the program 
> used few processors (14). The cluster possesses 28 machines, 15 with 16 cores 
> and 13 with 8 cores totalizing 344 units of processing. When I submitted the 
> job (only MPI), the MPI processes were spread to the cores directly, for that 
> reason I created a new queue with 14 machines trying to gain more time.  the 
> results were the same in both cases. In the last case i could prove that the 
> processes were distributed to all machines correctly.
> 
> What I must to do?
> Thanks 
> 
> Oscar Fabian Mojica Ladino
> Geologist M.S. in  Geophysics
> 
> 
> > Date: Thu, 14 Aug 2014 10:10:17 -0400
> > From: maxime.boissonnea...@calculquebec.ca
> > To: us...@open-mpi.org
> > Subject: Re: [OMPI users] Running a hybrid MPI+openMP program
> > 
> > Hi,
> > You DEFINITELY need to disable OpenMPI's new default binding. Otherwise, 
> > your N threads will run on a single core. --bind-to socket would be my 
> > recommendation for hybrid jobs.
> > 
> > Maxime
> > 

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-15 Thread Reuti
> [qstat output, wrapped and truncated at the top in the archive: job 2726 "job" of 
> user oscar, started 08/15/2014 12:38:21, with SLAVE tasks running in one.q on 
> compute-1-14, compute-1-10, compute-1-15, compute-1-8 and compute-1-4]

Yes, here you got 10 slots (= cores) granted by SGE. So there is no free core 
left inside the allocation of SGE to allow the use of additional cores for your 
threads. If you use more cores than granted by SGE, it will oversubscribe the 
machines.

The issue is now:

a) If you want 8 threads per MPI process, your job will use 80 cores in total - 
for now SGE isn't aware of it.

b) Although you specified $fill_up as allocation rule, it looks like 
$round_robin. Is there more than one slot defined in the queue definition of 
one.q to get exclusive access?

c) What version of SGE are you using? Certain ones use cgroups or bind 
processes directly to cores (although it usually needs to be requested by the 
job: first line of `qconf -help`).


In case you are alone in the cluster, you could bypass the allocation with b) 
(unless you are hit by c)). But with a mixture of users and jobs, a different 
handling would be necessary to do this in a proper way IMO:

a) having a PE with a fixed allocation rule of 8

b) requesting this PE with an overall slot count of 80

c) copy and alter the $PE_HOSTFILE to show only (granted core count per 
machine) divided by (OMP_NUM_THREADS) per entry, change $PE_HOSTFILE so that it 
points to the altered file

d) Open MPI with a Tight Integration will now start only N processes per machine 
according to the altered hostfile, in your case one

e) Your application can start the desired threads and you stay inside the 
granted allocation

-- Reuti


> I accessed to the MASTER processor with 'ssh compute-1-2.local' , and with $ 
> ps -e f and got this, I'm showing only the last lines  
> 
>  2506 ?Ss 0:00 /usr/sbin/atd
>  2548 tty1 Ss+0:00 /sbin/mingetty /dev/tty1
>  2550 tty2 Ss+0:00 /sbin/mingetty /dev/tty2
>  2552 tty3 Ss+0:00 /sbin/mingetty /dev/tty3
>  2554 tty4 Ss+0:00 /sbin/mingetty /dev/tty4
>  2556 tty5 Ss+0:00 /sbin/mingetty /dev/tty5
>  2558 tty6 Ss+0:00 /sbin/mingetty /dev/tty6
>  3325 ?Sl 0:04 /opt/gridengine/bin/linux-x64/sge_execd
> 17688 ?S  0:00  \_ sge_shepherd-2726 -bg
> 17695 ?Ss 0:00  \_ -bash 
> /opt/gridengine/default/spool/compute-1-2/job_scripts/2726
> 17797 ?S  0:00  \_ /usr/bin/time -f %E 
> /opt/openmpi/bin/mpirun -v -np 10 ./inverse.exe
> 17798 ?S  0:01  \_ /opt/openmpi/bin/mpirun -v -np 10 
> ./inverse.exe
> 17799 ?Sl 0:00  \_ 
> /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-5.local 
> PATH=/opt/openmpi/bin:$PATH ; expo
> 17800 ?Sl 0:00  \_ 
> /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-9.local 
> PATH=/opt/openmpi/bin:$PATH ; expo
> 17801 ?Sl 0:00  \_ 
> /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-12.local
>  PATH=/opt/openmpi/bin:$PATH ; exp
> 17802 ?Sl 0:00  \_ 
> /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-13.local
>  PATH=/opt/openmpi/bin:$PATH ; exp
> 17803 ?Sl 0:00  \_ 
> /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-14.local
>  PATH=/opt/openmpi/bin:$PATH ; exp
> 17804 ?Sl 0:00  \_ 
> /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-10.local
>  PATH=/opt/openmpi/bin:$PATH ; exp
> 17805 ?Sl 0:00  \_ 
> /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-15.local
>  PATH=/opt/openmpi/bin:$PATH ; exp
> 17806 ?Sl 0:00  \_ 
> /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-8.local 
> PATH=/opt/openmpi/bin:$PATH ; expo
> 17807 ?Sl 0:00  \_ 
> /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-4.local 
> PATH=/opt/openmpi/bin:$PATH ; expo
> 17826 ? 

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-19 Thread Reuti
Hi,

Am 19.08.2014 um 19:06 schrieb Oscar Mojica:

> I discovered what the error was. I forgot to include '-fopenmp' when I 
> compiled the objects in the Makefile, so the program worked but it didn't 
> divide the job in threads. Now the program is working and I can use until 15 
> cores for machine in the queue one.q.
> 
> Anyway i would like to try implement your advice. Well I'm not alone in the 
> cluster so i must implement your second suggestion. The steps are
> 
> a) Use '$ qconf -mp orte' to change the allocation rule to 8

The number of slots defined in your used one.q was also increased to 8 (`qconf 
-sq one.q`)?


> b) Set '#$ -pe orte 80' in the script

Fine.


> c) I'm not sure how to do this step. I'd appreciate your help here. I can add 
> some lines to the script to determine the PE_HOSTFILE path and contents, but 
> i don't know how alter it 

For now you can put in your jobscript (just after OMP_NUM_THREADS is exported):

awk -v omp_num_threads=$OMP_NUM_THREADS '{ $2/=omp_num_threads; print }' $PE_HOSTFILE > $TMPDIR/machines
export PE_HOSTFILE=$TMPDIR/machines

=

Unfortunately no one stepped into this discussion, as in my opinion it's a much 
broader issue which targets all users who want to combine MPI with OpenMP. The 
queuingsystem should get a proper request for the overall amount of slots the 
user needs. For now this will be forwarded to Open MPI and it will use this 
information to start the appropriate number of processes (which was an 
achievement for the Tight Integration out-of-the-box of course) and will ignore 
any setting of OMP_NUM_THREADS. So, where should the generated list of machines 
be adjusted? There are several options:

a) The PE of the queuingsystem should do it:

+ a one time setup for the admin
+ in SGE the "start_proc_args" of the PE could alter the $PE_HOSTFILE
- the "start_proc_args" would need to know the number of threads, i.e. 
OMP_NUM_THREADS must be defined by "qsub -v ..." outside of the jobscript 
(tricky scanning of the submitted jobscript for OMP_NUM_THREADS would be too 
nasty)
- limits to use inside the jobscript calls to libraries behaving in the same 
way as Open MPI only


b) The particular queue should do it in a queue prolog:

same as a) I think


c) The user should do it

+ no change in the SGE installation
- each and every user must include it in all the jobscripts to adjust the list 
and export the pointer to the $PE_HOSTFILE, but he could change it forth and 
back for different steps of the jobscript though


d) Open MPI should do it

+ no change in the SGE installation
+ no change to the jobscript
+ OMP_NUM_THREADS can be altered for different steps of the jobscript while 
staying inside the granted allocation automatically
o should MKL_NUM_THREADS be covered too (does it use OMP_NUM_THREADS already)?

-- Reuti


> echo "PE_HOSTFILE:"
> echo $PE_HOSTFILE
> echo
> echo "cat PE_HOSTFILE:"
> cat $PE_HOSTFILE 
> 
> Thanks for take a time for answer this emails, your advices had been very 
> useful
> 
> PS: The version of SGE is   OGS/GE 2011.11p1
> 
> 
> Oscar Fabian Mojica Ladino
> Geologist M.S. in  Geophysics
> 
> 
> > From: re...@staff.uni-marburg.de
> > Date: Fri, 15 Aug 2014 20:38:12 +0200
> > To: us...@open-mpi.org
> > Subject: Re: [OMPI users] Running a hybrid MPI+openMP program
> > 
> > Hi,
> > 
> > Am 15.08.2014 um 19:56 schrieb Oscar Mojica:
> > 
> > > Yes, my installation of Open MPI is SGE-aware. I got the following
> > > 
> > > [oscar@compute-1-2 ~]$ ompi_info | grep grid
> > > MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.2)
> > 
> > Fine.
> > 
> > 
> > > I'm a bit slow and I didn't understand the las part of your message. So i 
> > > made a test trying to solve my doubts.
> > > This is the cluster configuration: There are some machines turned off but 
> > > that is no problem
> > > 
> > > [oscar@aguia free-noise]$ qhost
> > > HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
> > > ---
> > > global - - - - - - -
> > > compute-1-10 linux-x64 16 0.97 23.6G 558.6M 996.2M 0.0
> > > compute-1-11 linux-x64 16 - 23.6G - 996.2M -
> > > compute-1-12 linux-x64 16 0.97 23.6G 561.1M 996.2M 0.0
> > > compute-1-13 linux-x64 16 0.99 23.6G 558.7M 996.2M 0.0
> > > compute-1-14 linux-x64 16 1.00 23.6G 555.1M 996.2M 0.0
> > > compute-1-15 linux-x64 16 0.97 23.6G 555.5M 996.2M 0.0
> > > compute-1-16 linux-x64 8 0.00 15.7G 296.9M 1000.0M 0.0
> > > compute-1-17 linux-x64 

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-20 Thread Reuti
Hi,

Am 20.08.2014 um 06:26 schrieb Tetsuya Mishima:

> Reuti and Oscar,
> 
> I'm a Torque user and I myself have never used SGE, so I hesitated to join 
> the discussion.
> 
> From my experience with the Torque, the openmpi 1.8 series has already 
> resolved the issue you pointed out in combining MPI with OpenMP. 
> 
> Please try to add --map-by slot:pe=8 option, if you want to use 8 threads. 
> Then, then openmpi 1.8 should allocate processes properly without any 
> modification 
> of the hostfile provided by the Torque.
> 
> In your case(8 threads and 10 procs):
> 
> # you have to request 80 slots using SGE command before mpirun 
> mpirun --map-by slot:pe=8 -np 10 ./inverse.exe

Thx for pointing me to this option; for now I can't get it working though (in 
fact, I essentially want to use it without binding). This allows telling Open 
MPI to bind more cores to each of the MPI processes - ok, but does it also 
lower the slot count granted by Torque? I mean, was your submission command 
like:

$ qsub -l nodes=10:ppn=8 ...

so that Torque knows that it should grant and remember this slot count of 80 in 
total for correct accounting?

-- Reuti


> where you can omit --bind-to option because --bind-to core is assumed
> as default when pe=N is provided by the user.
> Regards,
> Tetsuya
> 
>> Hi,
>> 
>> Am 19.08.2014 um 19:06 schrieb Oscar Mojica:
>> 
>>> I discovered what was the error. I forgot include the '-fopenmp' when I 
>>> compiled the objects in the Makefile, so the program worked but it didn't 
>>> divide the job 
> in threads. Now the program is working and I can use until 15 cores for 
> machine in the queue one.q.
>>> 
>>> Anyway i would like to try implement your advice. Well I'm not alone in the 
>>> cluster so i must implement your second suggestion. The steps are
>>> 
>>> a) Use '$ qconf -mp orte' to change the allocation rule to 8
>> 
>> The number of slots defined in your used one.q was also increased to 8 
>> (`qconf -sq one.q`)?
>> 
>> 
>>> b) Set '#$ -pe orte 80' in the script
>> 
>> Fine.
>> 
>> 
>>> c) I'm not sure how to do this step. I'd appreciate your help here. I can 
>>> add some lines to the script to determine the PE_HOSTFILE path and 
>>> contents, but i 
> don't know how alter it 
>> 
>> For now you can put in your jobscript (just after OMP_NUM_THREAD is 
>> exported):
>> 
>> awk -v omp_num_threads=$OMP_NUM_THREADS '{ $2/=omp_num_threads; print }' 
>> $PE_HOSTFILE > $TMPDIR/machines
>> export PE_HOSTFILE=$TMPDIR/machines
>> 
>> =
>> 
>> Unfortunately noone stepped into this discussion, as in my opinion it's a 
>> much broader issue which targets all users who want to combine MPI with 
>> OpenMP. The 
> queuingsystem should get a proper request for the overall amount of slots the 
> user needs. For now this will be forwarded to Open MPI and it will use this 
> information to start the appropriate number of processes (which was an 
> achievement for the Tight Integration out-of-the-box of course) and ignores 
> any setting of 
> OMP_NUM_THREADS. So, where should the generated list of machines be adjusted; 
> there are several options:
>> 
>> a) The PE of the queuingsystem should do it:
>> 
>> + a one time setup for the admin
>> + in SGE the "start_proc_args" of the PE could alter the $PE_HOSTFILE
>> - the "start_proc_args" would need to know the number of threads, i.e. 
>> OMP_NUM_THREADS must be defined by "qsub -v ..." outside of the jobscript 
>> (tricky scanning 
> of the submitted jobscript for OMP_NUM_THREADS would be too nasty)
>> - limits to use inside the jobscript calls to libraries behaving in the same 
>> way as Open MPI only
>> 
>> 
>> b) The particular queue should do it in a queue prolog:
>> 
>> same as a) I think
>> 
>> 
>> c) The user should do it
>> 
>> + no change in the SGE installation
>> - each and every user must include it in all the jobscripts to adjust the 
>> list and export the pointer to the $PE_HOSTFILE, but he could change it 
>> forth and back 
> for different steps of the jobscript though
>> 
>> 
>> d) Open MPI should do it
>> 
>> + no change in the SGE installation
>> + no change to the jobscript
>> + OMP_NUM_THREADS can be altered for different steps of the jobscript while 
>> staying inside the granted allocation automatically
>> o should MKL_NUM_THREADS be covered t

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-20 Thread Reuti
Hi,

Am 20.08.2014 um 13:26 schrieb tmish...@jcity.maeda.co.jp:

> Reuti,
> 
> If you want to allocate 10 procs with N threads, the Torque
> script below should work for you:
> 
> qsub -l nodes=10:ppn=N
> mpirun -map-by slot:pe=N -np 10 -x OMP_NUM_THREADS=N ./inverse.exe

I played around with giving -np 10 in addition to a Tight Integration. The slot 
count is not really divided I think, but only 10 out of the granted maximum are 
used (while on each of the listed machines an `orted` is started). Due to the 
fixed allocation this is of course the result we want to achieve, as it 
subtracts bunches of 8 from the given list of machines resp. slots. In SGE it's 
sufficient to use the following, and AFAICS it works (without touching the 
$PE_HOSTFILE any longer):

===
export OMP_NUM_THREADS=8
mpirun -map-by slot:pe=$OMP_NUM_THREADS -np $(bc <<<"$NSLOTS / 
$OMP_NUM_THREADS") ./inverse.exe
===

and submit with:

$ qsub -pe orte 80 job.sh

as the variables are distributed to the slave nodes by SGE already.
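
Putting it together, a complete job.sh could look like the sketch below (the 
"#$" lines besides -pe are only assumptions about a typical script; 
./inverse.exe is the binary from this thread):

===
#!/bin/bash
#$ -S /bin/bash
#$ -N inverse
#$ -cwd
#$ -j y
#$ -pe orte 80

export OMP_NUM_THREADS=8
mpirun -map-by slot:pe=$OMP_NUM_THREADS -np $(bc <<<"$NSLOTS / $OMP_NUM_THREADS") ./inverse.exe
===

With the -pe request inside the script, a plain `qsub job.sh` is then 
sufficient.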

Nevertheless, using -np in addition to the Tight Integration gives it a taste 
of a kind of half-tight integration in some way. And it would not work for us, 
because "--bind-to none" can't be used in such a command (see below) and throws 
an error.


> Then, the openmpi automatically reduces the logical slot count to 10
> by dividing real slot count 10N by binding width of N.
> 
> I don't know why you want to use pe=N without binding, but unfortunately
> the openmpi allocates successive cores to each process so far when you
> use pe option - it forcibly bind_to core.

In a shared cluster with many users and different MPI libraries in use, only 
the queuing system can know which cores were granted to which job. This avoids 
oversubscribing some cores while others sit idle.

-- Reuti


> Tetsuya
> 
> 
>> Hi,
>> 
>> Am 20.08.2014 um 06:26 schrieb Tetsuya Mishima:
>> 
>>> Reuti and Oscar,
>>> 
>>> I'm a Torque user and I myself have never used SGE, so I hesitated to
> join
>>> the discussion.
>>> 
>>> From my experience with the Torque, the openmpi 1.8 series has already
>>> resolved the issue you pointed out in combining MPI with OpenMP.
>>> 
>>> Please try to add --map-by slot:pe=8 option, if you want to use 8
> threads.
>>> Then, then openmpi 1.8 should allocate processes properly without any
> modification
>>> of the hostfile provided by the Torque.
>>> 
>>> In your case(8 threads and 10 procs):
>>> 
>>> # you have to request 80 slots using SGE command before mpirun
>>> mpirun --map-by slot:pe=8 -np 10 ./inverse.exe
>> 
>> Thx for pointing me to this option, for now I can't get it working though
> (in fact, I want to use it without binding essentially). This allows to
> tell Open MPI to bind more cores to each of the MPI
>> processes - ok, but does it lower the slot count granted by Torque too? I
> mean, was your submission command like:
>> 
>> $ qsub -l nodes=10:ppn=8 ...
>> 
>> so that Torque knows, that it should grant and remember this slot count
> of a total of 80 for the correct accounting?
>> 
>> -- Reuti
>> 
>> 
>>> where you can omit --bind-to option because --bind-to core is assumed
>>> as default when pe=N is provided by the user.
>>> Regards,
>>> Tetsuya
>>> 
>>>> Hi,
>>>> 
>>>> Am 19.08.2014 um 19:06 schrieb Oscar Mojica:
>>>> 
>>>>> I discovered what was the error. I forgot include the '-fopenmp' when
> I compiled the objects in the Makefile, so the program worked but it didn't
> divide the job
>>> in threads. Now the program is working and I can use until 15 cores for
> machine in the queue one.q.
>>>>> 
>>>>> Anyway i would like to try implement your advice. Well I'm not alone
> in the cluster so i must implement your second suggestion. The steps are
>>>>> 
>>>>> a) Use '$ qconf -mp orte' to change the allocation rule to 8
>>>> 
>>>> The number of slots defined in your used one.q was also increased to 8
> (`qconf -sq one.q`)?
>>>> 
>>>> 
>>>>> b) Set '#$ -pe orte 80' in the script
>>>> 
>>>> Fine.
>>>> 
>>>> 
>>>>> c) I'm not sure how to do this step. I'd appreciate your help here. I
> can add some lines to the script to determine the PE_HOSTFILE path and
> contents, but i
>>> don't know how alter it
>>>> 
>>>> For now you can put in your jobscript (jus

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-20 Thread Reuti
Am 20.08.2014 um 16:26 schrieb Ralph Castain:

> On Aug 20, 2014, at 6:58 AM, Reuti  wrote:
> 
>> Hi,
>> 
>> Am 20.08.2014 um 13:26 schrieb tmish...@jcity.maeda.co.jp:
>> 
>>> Reuti,
>>> 
>>> If you want to allocate 10 procs with N threads, the Torque
>>> script below should work for you:
>>> 
>>> qsub -l nodes=10:ppn=N
>>> mpirun -map-by slot:pe=N -np 10 -x OMP_NUM_THREADS=N ./inverse.exe
>> 
>> I played around with giving -np 10 in addition to a Tight Integration. The 
>> slot count is not really divided I think, but only 10 out of the granted 
>> maximum is used (while on each of the listed machines an `orted` is 
>> started). Due to the fixed allocation this is of course the result we want 
>> to achieve as it subtracts bunches of 8 from the given list of machines 
>> resp. slots. In SGE it's sufficient to use and AFAICS it works (without 
>> touching the $PE_HOSTFILE any longer):
>> 
>> ===
>> export OMP_NUM_THREADS=8
>> mpirun -map-by slot:pe=$OMP_NUM_THREADS -np $(bc <<<"$NSLOTS / 
>> $OMP_NUM_THREADS") ./inverse.exe
>> ===
>> 
>> and submit with:
>> 
>> $ qsub -pe orte 80 job.sh
>> 
>> as the variables are distributed to the slave nodes by SGE already.
>> 
>> Nevertheless, using -np in addition to the Tight Integration gives a taste 
>> of a kind of half-tight integration in some way. And would not work for us 
>> because "--bind-to none" can't be used in such a command (see below) and 
>> throws an error.
>> 
>> 
>>> Then, the openmpi automatically reduces the logical slot count to 10
>>> by dividing real slot count 10N by binding width of N.
>>> 
>>> I don't know why you want to use pe=N without binding, but unfortunately
>>> the openmpi allocates successive cores to each process so far when you
>>> use pe option - it forcibly bind_to core.
>> 
>> In a shared cluster with many users and different MPI libraries in use, only 
>> the queuingsystem could know which job got which cores granted. This avoids 
>> any oversubscription of cores, while others are idle.
> 
> FWIW: we detect the exterior binding constraint and work within it

Aha, this is quite interesting - how do you do this: scanning 
/proc/<pid>/status or alike? What happens if you don't find enough free cores, 
as they are already used up by other applications?

-- Reuti


>> -- Reuti
>> 
>> 
>>> Tetsuya
>>> 
>>> 
>>>> Hi,
>>>> 
>>>> Am 20.08.2014 um 06:26 schrieb Tetsuya Mishima:
>>>> 
>>>>> Reuti and Oscar,
>>>>> 
>>>>> I'm a Torque user and I myself have never used SGE, so I hesitated to
>>> join
>>>>> the discussion.
>>>>> 
>>>>> From my experience with the Torque, the openmpi 1.8 series has already
>>>>> resolved the issue you pointed out in combining MPI with OpenMP.
>>>>> 
>>>>> Please try to add --map-by slot:pe=8 option, if you want to use 8
>>> threads.
>>>>> Then, then openmpi 1.8 should allocate processes properly without any
>>> modification
>>>>> of the hostfile provided by the Torque.
>>>>> 
>>>>> In your case(8 threads and 10 procs):
>>>>> 
>>>>> # you have to request 80 slots using SGE command before mpirun
>>>>> mpirun --map-by slot:pe=8 -np 10 ./inverse.exe
>>>> 
>>>> Thx for pointing me to this option, for now I can't get it working though
>>> (in fact, I want to use it without binding essentially). This allows to
>>> tell Open MPI to bind more cores to each of the MPI
>>>> processes - ok, but does it lower the slot count granted by Torque too? I
>>> mean, was your submission command like:
>>>> 
>>>> $ qsub -l nodes=10:ppn=8 ...
>>>> 
>>>> so that Torque knows, that it should grant and remember this slot count
>>> of a total of 80 for the correct accounting?
>>>> 
>>>> -- Reuti
>>>> 
>>>> 
>>>>> where you can omit --bind-to option because --bind-to core is assumed
>>>>> as default when pe=N is provided by the user.
>>>>> Regards,
>>>>> Tetsuya
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> Am 19.08.2014 um 19:06 schrieb Oscar Mojica:
>>>>>> 
>>>>>>&g

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-20 Thread Reuti
Am 20.08.2014 um 19:05 schrieb Ralph Castain:

>> 
>> Aha, this is quite interesting - how do you do this: scanning the 
>> /proc//status or alike? What happens if you don't find enough free 
>> cores as they are used up by other applications already?
>> 
> 
> Remember, when you use mpirun to launch, we launch our own daemons using the 
> native launcher (e.g., qsub). So the external RM will bind our daemons to the 
> specified cores on each node. We use hwloc to determine what cores our 
> daemons are bound to, and then bind our own child processes to cores within 
> that range.

Thx for reminding me of this. Indeed, I mixed up two different aspects in this 
discussion.

a) What will happen in case no binding was done by the RM (hence Open MPI could 
use all cores) and two Open MPI jobs (or something completely different besides 
one Open MPI job) are running on the same node (due to the Tight Integration 
with two different Open MPI directories in /tmp and two `orted`, unique for 
each job)? Will the second Open MPI job know what the first Open MPI job used 
up already? Or will both use the same set of cores, as "-bind-to none" can't be 
set in the given `mpiexec` command because "-map-by slot:pe=$OMP_NUM_THREADS" 
was used - which makes "-bind-to core" mandatory and can't be switched off? I 
see the same cores being used for both jobs.

Altering the machinefile instead: the processes are not bound to any core, and 
the OS takes care of a proper assignment.


> If the cores we are bound to are the same on each node, then we will do this 
> with no further instruction. However, if the cores are different on the 
> individual nodes, then you need to add --hetero-nodes to your command line 
> (as the nodes appear to be heterogeneous to us).

b) Aha, so it's not only about different CPU types, but also about the same CPU 
type with different core allocations between the nodes? It's not in the 
`mpiexec` man page of 1.8.1 though. I'll have a look at it.


> So it is up to the RM to set the constraint - we just live within it.

Fine.

-- Reuti

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-21 Thread Reuti
Am 20.08.2014 um 23:16 schrieb Ralph Castain:

> 
> On Aug 20, 2014, at 11:16 AM, Reuti  wrote:
> 
>> Am 20.08.2014 um 19:05 schrieb Ralph Castain:
>> 
>>>> 
>>>> Aha, this is quite interesting - how do you do this: scanning the 
>>>> /proc//status or alike? What happens if you don't find enough free 
>>>> cores as they are used up by other applications already?
>>>> 
>>> 
>>> Remember, when you use mpirun to launch, we launch our own daemons using 
>>> the native launcher (e.g., qsub). So the external RM will bind our daemons 
>>> to the specified cores on each node. We use hwloc to determine what cores 
>>> our daemons are bound to, and then bind our own child processes to cores 
>>> within that range.
>> 
>> Thx for reminding me of this. Indeed, I mixed up two different aspects in 
>> this discussion.
>> 
>> a) What will happen in case no binding was done by the RM (hence Open MPI 
>> could use all cores) and two Open MPI jobs (or something completely 
>> different besides one Open MPI job) are running on the same node (due to the 
>> Tight Integration with two different Open MPI directories in /tmp and two 
>> `orted`, unique for each job)? Will the second Open MPI job know what the 
>> first Open MPI job used up already? Or will both use the same set of cores 
>> as "-bind-to none" can't be set in the given `mpiexec` command because of 
>> "-map-by slot:pe=$OMP_NUM_THREADS" was used - which triggers "-bind-to core" 
>> indispensable and can't be switched off? I see the same cores being used for 
>> both jobs.
> 
> Yeah, each mpirun executes completely independently of the other, so they 
> have no idea what the other is doing. So the cores will be overloaded. 
> Multi-pe's requires bind-to-core otherwise there is no way to implement the 
> request

Yep, and so it's not an option in a mixed cluster. Why would it hurt to allow 
"-bind-to none" here?


>> Altering the machinefile instead: the processes are not bound to any core, 
>> and the OS takes care of a proper assignment.

Here the ordinary user has to mangle the hostfile, which is not good (but it 
allows several jobs per node as the OS shifts the processes around). 
Could/should this be put into the "gridengine" module in Open MPI, to divide 
the slot count per node automatically when $OMP_NUM_THREADS is found, or to 
generate an error if it's not divisible?
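
Until something along those lines exists, the user-side adjustment can at least 
carry the divisibility check itself - only a sketch of the idea, to be placed 
in the jobscript after OMP_NUM_THREADS is exported:

awk -v n=$OMP_NUM_THREADS '
  $2 % n != 0 { print "slots on " $1 " not divisible by " n > "/dev/stderr"; exit 1 }
              { $2 /= n; print }' $PE_HOSTFILE > $TMPDIR/machines || exit 1
export PE_HOSTFILE=$TMPDIR/machines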

===

>>> If the cores we are bound to are the same on each node, then we will do 
>>> this with no further instruction. However, if the cores are different on 
>>> the individual nodes, then you need to add --hetero-nodes to your command 
>>> line (as the nodes appear to be heterogeneous to us).
>> 
>> b) Aha, it's not about different type CPU types, but also same CPU type but 
>> different allocations between the nodes? It's not in the `mpiexec` man-page 
>> of 1.8.1 though. I'll have a look at it.

I tried:

$ qsub -binding linear:2:0 -pe smp2 8 -masterq parallel@node01 -q 
parallel@node0[1-4] test_openmpi.sh 
Your job 247109 ("test_openmpi.sh") has been submitted
$ qsub -binding linear:2:1 -pe smp2 8 -masterq parallel@node01 -q 
parallel@node0[1-4] test_openmpi.sh 
Your job 247110 ("test_openmpi.sh") has been submitted


Getting on node03:


 6733 ?Sl 0:00  \_ sge_shepherd-247109 -bg
 6734 ?SNs0:00  |   \_ /usr/sge/utilbin/lx24-amd64/qrsh_starter 
/var/spool/sge/node03/active_jobs/247109.1/1.node03
 6741 ?SN 0:00  |   \_ orted -mca orte_hetero_nodes 1 -mca ess 
env -mca orte_ess_jobid 1493303296 -mca orte_ess_vpid
 6742 ?    RNl0:31  |   \_ ./mpihello
 6745 ?Sl 0:00  \_ sge_shepherd-247110 -bg
 6746 ?SNs0:00  \_ /usr/sge/utilbin/lx24-amd64/qrsh_starter 
/var/spool/sge/node03/active_jobs/247110.1/1.node03
 6753 ?    SN 0:00  \_ orted -mca orte_hetero_nodes 1 -mca ess 
env -mca orte_ess_jobid 1506607104 -mca orte_ess_vpid
 6754 ?RNl0:25  \_ ./mpihello


reuti@node03:~> cat /proc/6741/status | grep Cpus_
Cpus_allowed:   
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Cpus_allowed_list:  0-1
reuti@node03:~> cat /proc/6753/status | grep Cpus_
Cpus_allowed:   
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000030
Cpus_allowed_list:  4-5

Hence, "orted" got two cores assigned for each of them. But:


reuti@node03:~> cat /proc/6742/status | grep Cpus_
Cpus_allowed:

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-21 Thread Reuti
Hi,

Am 21.08.2014 um 01:56 schrieb tmish...@jcity.maeda.co.jp:

> Reuti,
> 
> Sorry for confusing you. Under the managed condition, actually
> -np option is not necessary. So, this cmd line also works for me
> with Torque.
> 
> $ qsub -l nodes=10:ppn=N
> $ mpirun -map-by slot:pe=N ./inverse.exe

Aha, yes. Works in SGE too.

To make the notation of threads generic, what about an extension to use:

-map-by slot:pe=omp

where the literal "omp" triggers using $OMP_NUM_THREADS instead?

-- Reuti


> At least, Ralph confirmed it worked with Slurm and I comfirmed
> with Torque as shown below:
> 
> [mishima@manage ~]$ qsub -I -l nodes=4:ppn=8
> qsub: waiting for job 8798.manage.cluster to start
> qsub: job 8798.manage.cluster ready
> 
> [mishima@node09 ~]$ cat $PBS_NODEFILE
> node09
> node09
> node09
> node09
> node09
> node09
> node09
> node09
> node10
> node10
> node10
> node10
> node10
> node10
> node10
> node10
> node11
> node11
> node11
> node11
> node11
> node11
> node11
> node11
> node12
> node12
> node12
> node12
> node12
> node12
> node12
> node12
> [mishima@node09 ~]$ mpirun -map-by slot:pe=8 -display-map
> ~/mis/openmpi/demos/myprog
> Data for JOB [8050,1] offset 0
> 
>    JOB MAP   
> 
> Data for node: node09  Num slots: 8Max slots: 0Num procs: 1
>Process OMPI jobid: [8050,1] App: 0 Process rank: 0
> 
> Data for node: node10  Num slots: 8Max slots: 0Num procs: 1
>Process OMPI jobid: [8050,1] App: 0 Process rank: 1
> 
> Data for node: node11  Num slots: 8Max slots: 0Num procs: 1
>Process OMPI jobid: [8050,1] App: 0 Process rank: 2
> 
> Data for node: node12  Num slots: 8Max slots: 0Num procs: 1
>Process OMPI jobid: [8050,1] App: 0 Process rank: 3
> 
> =
> Hello world from process 0 of 4
> Hello world from process 2 of 4
> Hello world from process 3 of 4
> Hello world from process 1 of 4
> [mishima@node09 ~]$ mpirun -map-by slot:pe=4 -display-map
> ~/mis/openmpi/demos/myprog
> Data for JOB [8056,1] offset 0
> 
>    JOB MAP   
> 
> Data for node: node09  Num slots: 8Max slots: 0Num procs: 2
>Process OMPI jobid: [8056,1] App: 0 Process rank: 0
>Process OMPI jobid: [8056,1] App: 0 Process rank: 1
> 
> Data for node: node10  Num slots: 8Max slots: 0Num procs: 2
>Process OMPI jobid: [8056,1] App: 0 Process rank: 2
>Process OMPI jobid: [8056,1] App: 0 Process rank: 3
> 
> Data for node: node11  Num slots: 8Max slots: 0Num procs: 2
>Process OMPI jobid: [8056,1] App: 0 Process rank: 4
>Process OMPI jobid: [8056,1] App: 0 Process rank: 5
> 
> Data for node: node12  Num slots: 8Max slots: 0Num procs: 2
>Process OMPI jobid: [8056,1] App: 0 Process rank: 6
>Process OMPI jobid: [8056,1] App: 0 Process rank: 7
> 
> =
> Hello world from process 1 of 8
> Hello world from process 0 of 8
> Hello world from process 2 of 8
> Hello world from process 3 of 8
> Hello world from process 4 of 8
> Hello world from process 5 of 8
> Hello world from process 6 of 8
> Hello world from process 7 of 8
> 
> I don't know why it dosen't work with SGE. Could you show me
> your output adding -display-map and -mca rmaps_base_verbose 5 options?
> 
> By the way, the option -map-by ppr:N:node or ppr:N:socket might be
> useful for your purpose. The ppr can reduce the slot counts given
> by RM without binding and allocate N procs by the specified resource.
> 
> [mishima@node09 ~]$ mpirun -map-by ppr:1:node -display-map
> ~/mis/openmpi/demos/myprog
> Data for JOB [7913,1] offset 0
> 
>    JOB MAP   
> 
> Data for node: node09  Num slots: 8Max slots: 0Num procs: 1
>Process OMPI jobid: [7913,1] App: 0 Process rank: 0
> 
> Data for node: node10  Num slots: 8Max slots: 0Num procs: 1
>Process OMPI jobid: [7913,1] App: 0 Process rank: 1
> 
> Data for node: node11  Num slots: 8Max slots: 0Num procs: 1
>Process OMPI jobid: [7913,1] App: 0 Process rank: 2
> 
> Data for node: node12  Num slots: 8Max slots: 0Num procs: 1
>Process OMPI jobid: [7913,1] App: 0 Process rank: 3
> 
> =
> Hello world from process 0 of 4
> Hello world from process 2 of 4
> Hello world from process 1 of 4
> Hello world from p

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-21 Thread Reuti
Hi,

Am 20.08.2014 um 20:08 schrieb Oscar Mojica:

> Well, with qconf -sq one.q I got the following:
> 
> [oscar@aguia free-noise]$ qconf -sq one.q
> qname one.q
> hostlist compute-1-30.local compute-1-2.local 
> compute-1-3.local \
>   compute-1-4.local compute-1-5.local compute-1-6.local \
>   compute-1-7.local compute-1-8.local compute-1-9.local \
>   compute-1-10.local compute-1-11.local 
> compute-1-12.local \
>   compute-1-13.local compute-1-14.local compute-1-15.local
> seq_no  0
> load_thresholds np_load_avg=1.75
> suspend_thresholds  NONE
> nsuspend  1
> suspend_interval  00:05:00
> priority  0
> min_cpu_interval  00:05:00
> processors UNDEFINED
> qtype BATCH INTERACTIVE
> ckpt_list   NONE
> pe_list make mpich mpi orte
> rerun FALSE
> slots  1,[compute-1-30.local=1],[compute-1-2.local=1], \
>   [compute-1-3.local=1],[compute-1-5.local=1], \
>   [compute-1-8.local=1],[compute-1-6.local=1], \
>   [compute-1-4.local=1],[compute-1-9.local=1], \
>   [compute-1-11.local=1],[compute-1-7.local=1], \
>   [compute-1-13.local=1],[compute-1-10.local=1], \
>   [compute-1-15.local=1],[compute-1-12.local=1], \
>   [compute-1-14.local=1]
> 
> the admin was who created this queue, so I have to speak to him to change the 
> number of slots to number of threads that i wish to use. 

Yep. I think it was his intention to allow exclusive use of each node by this 
(this can be done in SGE by other means too). While one could do it this way, 
it doesn't tell SGE the proper number of cores the user wants to use (it's more 
like the number of machines), so any accounting won't work, nor will `qacct` 
show the correct information about what the job requested at the time it was 
submitted.


> Then I could make use of: 
> ===
> export OMP_NUM_THREADS=N 
> mpirun -map-by slot:pe=$OMP_NUM_THREADS -np $(bc <<<"$NSLOTS / 
> $OMP_NUM_THREADS") ./inverse.exe
> ==

As mentioned by tmishima, it's sufficient to use:

$ qsub -pe orte 80 ...

export OMP_NUM_THREADS=8
mpirun -map-by slot:pe=$OMP_NUM_THREADS ./yourapp.exe


=> you get proper binding here, either if you are alone on each machine, or if 
all jobs get proper binding and Open MPI stays inside it (not all versions of 
SGE support this though)


> For now in my case this command line just would work for 10 processes and the 
> work wouldn't be divided in threads, is it right?

It works for 10 machines which you get exclusively; hence oversubscribing the 
granted single slot on each machine with "-bind-to none", as Ralph mentioned in 
the beginning, is up to you (unless other users would get hurt because they 
have their jobs there too).

$ qsub -pe orte 10 ...

export OMP_NUM_THREADS=8
mpirun -bind-to none ./yourapp.exe


=> The OS will shift the processes around, while SGE doesn't know anything 
about the final number of slots/cores you want to use on each machine (or to 
leave free for others).

===

Both ways above work right now, but IMO it's not the optimum in a shared 
cluster for the SGE versions w/o hard binding. In the second case Open MPI 
starts 1 process per node, as we need it. If you requested `qsub -pe orte 80 
...` here too, Open MPI would start 80 processes. To avoid this I came up with 
altering the machinefile to give Open MPI different information about the 
granted slots on each machine.

$ qsub -pe orte 80 ...

export OMP_NUM_THREADS=8
awk -v omp_num_threads=$OMP_NUM_THREADS '{ $2/=omp_num_threads; print }' 
$PE_HOSTFILE > $TMPDIR/machines
export PE_HOSTFILE=$TMPDIR/machines
mpirun -bind-to none ./yourapp.exe

===

I hope having all three versions in one email sheds some light on it.

-- Reuti


> can I set a maximum number of threads in the queue one.q (e.g. 15 ) and 
> change the number in the 'export' for my convenience
> 
> I feel like a child hearing the adults speaking
> Thanks I'm learning a lot   
>   
> 
> Oscar Fabian Mojica Ladino
> Geologist M.S. in  Geophysics
> 
> 
> > From: re...@staff.uni-marburg.de
> > Date: Tue, 19 Aug 2014 19:51:46 +0200
> > To: us...@open-mpi.org
> > Subject: Re: [OMPI users] Running a hybrid MPI+openMP program
> > 
> > Hi,
> > 
> > Am 19.08.2014 um 19:06 schrieb Oscar Mojica:
> > 
> > > I discovered what was the error. I forgot include the '-fopenmp' when I 
> > > compiled the objects 

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-21 Thread Reuti
Am 21.08.2014 um 15:45 schrieb Ralph Castain:

> On Aug 21, 2014, at 2:51 AM, Reuti  wrote:
> 
>> Am 20.08.2014 um 23:16 schrieb Ralph Castain:
>> 
>>> 
>>> On Aug 20, 2014, at 11:16 AM, Reuti  wrote:
>>> 
>>>> Am 20.08.2014 um 19:05 schrieb Ralph Castain:
>>>> 
>>>>>> 
>>>>>> Aha, this is quite interesting - how do you do this: scanning the 
>>>>>> /proc//status or alike? What happens if you don't find enough free 
>>>>>> cores as they are used up by other applications already?
>>>>>> 
>>>>> 
>>>>> Remember, when you use mpirun to launch, we launch our own daemons using 
>>>>> the native launcher (e.g., qsub). So the external RM will bind our 
>>>>> daemons to the specified cores on each node. We use hwloc to determine 
>>>>> what cores our daemons are bound to, and then bind our own child 
>>>>> processes to cores within that range.
>>>> 
>>>> Thx for reminding me of this. Indeed, I mixed up two different aspects in 
>>>> this discussion.
>>>> 
>>>> a) What will happen in case no binding was done by the RM (hence Open MPI 
>>>> could use all cores) and two Open MPI jobs (or something completely 
>>>> different besides one Open MPI job) are running on the same node (due to 
>>>> the Tight Integration with two different Open MPI directories in /tmp and 
>>>> two `orted`, unique for each job)? Will the second Open MPI job know what 
>>>> the first Open MPI job used up already? Or will both use the same set of 
>>>> cores as "-bind-to none" can't be set in the given `mpiexec` command 
>>>> because of "-map-by slot:pe=$OMP_NUM_THREADS" was used - which triggers 
>>>> "-bind-to core" indispensable and can't be switched off? I see the same 
>>>> cores being used for both jobs.
>>> 
>>> Yeah, each mpirun executes completely independently of the other, so they 
>>> have no idea what the other is doing. So the cores will be overloaded. 
>>> Multi-pe's requires bind-to-core otherwise there is no way to implement the 
>>> request
>> 
>> Yep, and so it's no option in a mixed cluster. Why would it hurt to allow 
>> "-bind-to none" here?
> 
> Guess I'm confused here - what does pe=N mean if we bind-to none?? If you are 
> running on a mixed cluster and don't want binding, then just say bind-to none 
> and leave the pe argument out entirely as it wouldn't mean anything unless 
> you are bound

It would mean: divide the overall number of slots/cores in the machinefile by N 
(i.e. $OMP_NUM_THREADS).

- Request made to the queuing system: I need 80 cores in total.
- The machinefile will contain 80 cores
- Open MPI will divide it by N, i.e. 8 here
- Open MPI will start only 10 processes, one on each node
- The application will use 8 threads per started MPI process

-- Reuti


>> 
>> 
>>>> Altering the machinefile instead: the processes are not bound to any core, 
>>>> and the OS takes care of a proper assignment.
>> 
>> Here the ordinary user has to mangle the hostfile, this is not good (but 
>> allows several jobs per node as the OS shift the processes around). 
>> Could/should it be put into the "gridengine" module in OpenMPI, to divide 
>> the slot count per node automatically when $OMP_NUM_THREADS is found, or 
>> generate an error if it's not divisible?
> 
> Sure, that could be done - but it will only have if OMP_NUM_THREADS is set 
> when someone spins off threads. So far as I know, that's only used for OpenMP 
> - so we'd get a little help, but it wouldn't be full coverage.
> 
> 
>> 
>> ===
>> 
>>>>> If the cores we are bound to are the same on each node, then we will do 
>>>>> this with no further instruction. However, if the cores are different on 
>>>>> the individual nodes, then you need to add --hetero-nodes to your command 
>>>>> line (as the nodes appear to be heterogeneous to us).
>>>> 
>>>> b) Aha, it's not about different type CPU types, but also same CPU type 
>>>> but different allocations between the nodes? It's not in the `mpiexec` 
>>>> man-page of 1.8.1 though. I'll have a look at it.
>> 
>> I tried:
>> 
>> $ qsub -binding linear:2:0 -pe smp2 8 -masterq parallel@node01 -q 
>> parallel@node0[1-4] test_openmpi.sh 
>> Your job 247109 (

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-21 Thread Reuti
Am 21.08.2014 um 16:00 schrieb Ralph Castain:

> 
> On Aug 21, 2014, at 6:54 AM, Reuti  wrote:
> 
>> Am 21.08.2014 um 15:45 schrieb Ralph Castain:
>> 
>>> On Aug 21, 2014, at 2:51 AM, Reuti  wrote:
>>> 
>>>> Am 20.08.2014 um 23:16 schrieb Ralph Castain:
>>>> 
>>>>> 
>>>>> On Aug 20, 2014, at 11:16 AM, Reuti  wrote:
>>>>> 
>>>>>> Am 20.08.2014 um 19:05 schrieb Ralph Castain:
>>>>>> 
>>>>>>>> 
>>>>>>>> Aha, this is quite interesting - how do you do this: scanning the 
>>>>>>>> /proc//status or alike? What happens if you don't find enough 
>>>>>>>> free cores as they are used up by other applications already?
>>>>>>>> 
>>>>>>> 
>>>>>>> Remember, when you use mpirun to launch, we launch our own daemons 
>>>>>>> using the native launcher (e.g., qsub). So the external RM will bind 
>>>>>>> our daemons to the specified cores on each node. We use hwloc to 
>>>>>>> determine what cores our daemons are bound to, and then bind our own 
>>>>>>> child processes to cores within that range.
>>>>>> 
>>>>>> Thx for reminding me of this. Indeed, I mixed up two different aspects 
>>>>>> in this discussion.
>>>>>> 
>>>>>> a) What will happen in case no binding was done by the RM (hence Open 
>>>>>> MPI could use all cores) and two Open MPI jobs (or something completely 
>>>>>> different besides one Open MPI job) are running on the same node (due to 
>>>>>> the Tight Integration with two different Open MPI directories in /tmp 
>>>>>> and two `orted`, unique for each job)? Will the second Open MPI job know 
>>>>>> what the first Open MPI job used up already? Or will both use the same 
>>>>>> set of cores as "-bind-to none" can't be set in the given `mpiexec` 
>>>>>> command because of "-map-by slot:pe=$OMP_NUM_THREADS" was used - which 
>>>>>> triggers "-bind-to core" indispensable and can't be switched off? I see 
>>>>>> the same cores being used for both jobs.
>>>>> 
>>>>> Yeah, each mpirun executes completely independently of the other, so they 
>>>>> have no idea what the other is doing. So the cores will be overloaded. 
>>>>> Multi-pe's requires bind-to-core otherwise there is no way to implement 
>>>>> the request
>>>> 
>>>> Yep, and so it's no option in a mixed cluster. Why would it hurt to allow 
>>>> "-bind-to none" here?
>>> 
>>> Guess I'm confused here - what does pe=N mean if we bind-to none?? If you 
>>> are running on a mixed cluster and don't want binding, then just say 
>>> bind-to none and leave the pe argument out entirely as it wouldn't mean 
>>> anything unless you are bound
>> 
>> I would mean: divide the overall number of slots/cores in the machinefile by 
>> N (i.e. $OMP_NUM_THREADS).
>> 
>> - Request made to the queuing system: I need 80 cores in total.
>> - The machinefile will contain 80 cores
>> - Open MPI will divide it by N, i.e. 8 here
>> - Open MPI will start only 10 processes, one on each node
>> - The application will use 8 threads per started MPI process
> 
> I see - so you were talking about the case where the user doesn't provide the 
> -np N option

Yes. Even if -np is specified: AFAICS Open MPI fills up the given slots in the 
machinefile from the beginning (the first nodes get all the processes, the 
remaining nodes stay free). Doing it in a round-robin way would work better for 
this case.


> and we need to compute the number of procs to start. Okay, the change you 
> requested below will fix that one too. I can make that easily enough.

Therefore I wanted to start a discussion about it (at that time I wasn't aware 
of the "-map-by slot:pe=N" option), as I have no final syntax which would cover 
all cases. Someone may want the binding given by "-map-by slot:pe=N". How can 
this be specified while keeping an easy tight integration for users who don't 
want any binding at all?

The boundary conditions are:

- the job is running inside a queuingsystem
- the user requests the overall amount of slots to the queuingsystem
- hence the machinefile has entries for all slots
- the user sets OMP_NUM_THREADS

case 1) no interest in any bindi

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-21 Thread Reuti
Am 21.08.2014 um 16:50 schrieb Reuti:

> Am 21.08.2014 um 16:00 schrieb Ralph Castain:
> 
>> 
>> On Aug 21, 2014, at 6:54 AM, Reuti  wrote:
>> 
>>> Am 21.08.2014 um 15:45 schrieb Ralph Castain:
>>> 
>>>> On Aug 21, 2014, at 2:51 AM, Reuti  wrote:
>>>> 
>>>>> Am 20.08.2014 um 23:16 schrieb Ralph Castain:
>>>>> 
>>>>>> 
>>>>>> On Aug 20, 2014, at 11:16 AM, Reuti  wrote:
>>>>>> 
>>>>>>> Am 20.08.2014 um 19:05 schrieb Ralph Castain:
>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Aha, this is quite interesting - how do you do this: scanning the 
>>>>>>>>> /proc//status or alike? What happens if you don't find enough 
>>>>>>>>> free cores as they are used up by other applications already?
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> Remember, when you use mpirun to launch, we launch our own daemons 
>>>>>>>> using the native launcher (e.g., qsub). So the external RM will bind 
>>>>>>>> our daemons to the specified cores on each node. We use hwloc to 
>>>>>>>> determine what cores our daemons are bound to, and then bind our own 
>>>>>>>> child processes to cores within that range.
>>>>>>> 
>>>>>>> Thx for reminding me of this. Indeed, I mixed up two different aspects 
>>>>>>> in this discussion.
>>>>>>> 
>>>>>>> a) What will happen in case no binding was done by the RM (hence Open 
>>>>>>> MPI could use all cores) and two Open MPI jobs (or something completely 
>>>>>>> different besides one Open MPI job) are running on the same node (due 
>>>>>>> to the Tight Integration with two different Open MPI directories in 
>>>>>>> /tmp and two `orted`, unique for each job)? Will the second Open MPI 
>>>>>>> job know what the first Open MPI job used up already? Or will both use 
>>>>>>> the same set of cores as "-bind-to none" can't be set in the given 
>>>>>>> `mpiexec` command because of "-map-by slot:pe=$OMP_NUM_THREADS" was 
>>>>>>> used - which triggers "-bind-to core" indispensable and can't be 
>>>>>>> switched off? I see the same cores being used for both jobs.
>>>>>> 
>>>>>> Yeah, each mpirun executes completely independently of the other, so 
>>>>>> they have no idea what the other is doing. So the cores will be 
>>>>>> overloaded. Multi-pe's requires bind-to-core otherwise there is no way 
>>>>>> to implement the request
>>>>> 
>>>>> Yep, and so it's no option in a mixed cluster. Why would it hurt to allow 
>>>>> "-bind-to none" here?
>>>> 
>>>> Guess I'm confused here - what does pe=N mean if we bind-to none?? If you 
>>>> are running on a mixed cluster and don't want binding, then just say 
>>>> bind-to none and leave the pe argument out entirely as it wouldn't mean 
>>>> anything unless you are bound
>>> 
>>> I would mean: divide the overall number of slots/cores in the machinefile 
>>> by N (i.e. $OMP_NUM_THREADS).
>>> 
>>> - Request made to the queuing system: I need 80 cores in total.
>>> - The machinefile will contain 80 cores
>>> - Open MPI will divide it by N, i.e. 8 here
>>> - Open MPI will start only 10 processes, one on each node
>>> - The application will use 8 threads per started MPI process
>> 
>> I see - so you were talking about the case where the user doesn't provide 
>> the -np N option
> 
> Yes. Even if -np is specified: AFAICS Open MPI fills up the given slots in 
> the machinefile from the beginning (first nodes get all the processes, 
> remaining nodes are free). Making it in a round-robin way would work better 
> for this case.
> 
> 
>> and we need to compute the number of procs to start. Okay, the change you 
>> requested below will fix that one too. I can make that easily enough.
> 
> Therefore I wanted to start a discussion about it (at that time I wasn't 
> aware of the "-map-by slot:pe=N" option), as I have no final syntax which 
> would cover all cases. Someone may want the binding by the "-map-by 
> slot:pe=N". How c

Re: [OMPI users] A daemon on node cl231 failed to start as expected

2014-08-23 Thread Reuti
Hi,

Am 23.08.2014 um 16:09 schrieb Pengcheng Wang:

> I need to run a single driver program that only require one proc with the 
> command mpirun -np 1 ./app or ./app. But it will schedule the launch of other 
> executable files including parallel and sequential computing. So I require 
> more than one proc to run it. It can be run smoothly as an interactive job 
> with the command below.
> 
> qrsh -cwd -pe "ompi*" 6 -l h_rt=00:30:00,test=true ./app
> 
> But after I submitted the job, a strange error occurred and it stopped... 
> Please find the job script and error message below:
> 
> • job submission script:
> #$ -S /bin/bash
> #$ -N couple
> #$ -cwd
> #$ -j y
> #$ -l h_rt=05:00:00
> #$ -l h_vmem=2G

Is a simple hello_world program listing the threads working? Does it work 
without the h_vmem limit?


> #$ -o couple.out
> #$ -pe ompi*  6

Which PEs can be addressed here? What are their allocation rules (looks like 
you need "$pe_slots").

What version of SGE?
What version of Open MPI?
Compiled with --with-sge?

For me it's working either way.

-- Reuti


> ./app
> 
> error message:
> error: executing task of job 6777095 failed:
> [cl231:23777] ERROR: A daemon on node cl231 failed to start as expected.
> [cl231:23777] ERROR: There may be more information available from
> [cl231:23777] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> [cl231:23777] ERROR: If the problem persists, please restart the
> [cl231:23777] ERROR: Grid Engine PE job
> [cl231:23777] ERROR: The daemon exited unexpectedly with status 1.
> 
> Thanks for any help!
> 
> Pengcheng
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/08/25141.php



Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-25 Thread Reuti
Am 21.08.2014 um 16:50 schrieb Reuti:

> Am 21.08.2014 um 16:00 schrieb Ralph Castain:
> 
>> 
>> On Aug 21, 2014, at 6:54 AM, Reuti  wrote:
>> 
>>> Am 21.08.2014 um 15:45 schrieb Ralph Castain:
>>> 
>>>> On Aug 21, 2014, at 2:51 AM, Reuti  wrote:
>>>> 
>>>>> Am 20.08.2014 um 23:16 schrieb Ralph Castain:
>>>>> 
>>>>>> 
>>>>>> On Aug 20, 2014, at 11:16 AM, Reuti  wrote:
>>>>>> 
>>>>>>> Am 20.08.2014 um 19:05 schrieb Ralph Castain:
>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Aha, this is quite interesting - how do you do this: scanning the 
>>>>>>>>> /proc//status or alike? What happens if you don't find enough 
>>>>>>>>> free cores as they are used up by other applications already?
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> Remember, when you use mpirun to launch, we launch our own daemons 
>>>>>>>> using the native launcher (e.g., qsub). So the external RM will bind 
>>>>>>>> our daemons to the specified cores on each node. We use hwloc to 
>>>>>>>> determine what cores our daemons are bound to, and then bind our own 
>>>>>>>> child processes to cores within that range.
>>>>>>> 
>>>>>>> Thx for reminding me of this. Indeed, I mixed up two different aspects 
>>>>>>> in this discussion.
>>>>>>> 
>>>>>>> a) What will happen in case no binding was done by the RM (hence Open 
>>>>>>> MPI could use all cores) and two Open MPI jobs (or something completely 
>>>>>>> different besides one Open MPI job) are running on the same node (due 
>>>>>>> to the Tight Integration with two different Open MPI directories in 
>>>>>>> /tmp and two `orted`, unique for each job)? Will the second Open MPI 
>>>>>>> job know what the first Open MPI job used up already? Or will both use 
>>>>>>> the same set of cores as "-bind-to none" can't be set in the given 
>>>>>>> `mpiexec` command because of "-map-by slot:pe=$OMP_NUM_THREADS" was 
>>>>>>> used - which triggers "-bind-to core" indispensable and can't be 
>>>>>>> switched off? I see the same cores being used for both jobs.
>>>>>> 
>>>>>> Yeah, each mpirun executes completely independently of the other, so 
>>>>>> they have no idea what the other is doing. So the cores will be 
>>>>>> overloaded. Multi-pe's requires bind-to-core otherwise there is no way 
>>>>>> to implement the request
>>>>> 
>>>>> Yep, and so it's no option in a mixed cluster. Why would it hurt to allow 
>>>>> "-bind-to none" here?
>>>> 
>>>> Guess I'm confused here - what does pe=N mean if we bind-to none?? If you 
>>>> are running on a mixed cluster and don't want binding, then just say 
>>>> bind-to none and leave the pe argument out entirely as it wouldn't mean 
>>>> anything unless you are bound
>>> 
>>> I would mean: divide the overall number of slots/cores in the machinefile 
>>> by N (i.e. $OMP_NUM_THREADS).
>>> 
>>> - Request made to the queuing system: I need 80 cores in total.
>>> - The machinefile will contain 80 cores
>>> - Open MPI will divide it by N, i.e. 8 here
>>> - Open MPI will start only 10 processes, one on each node
>>> - The application will use 8 threads per started MPI process
>> 
>> I see - so you were talking about the case where the user doesn't provide 
>> the -np N option
> 
> Yes. Even if -np is specified: AFAICS Open MPI fills up the given slots in 
> the machinefile from the beginning (first nodes get all the processes, 
> remaining nodes are free). Making it in a round-robin way would work better 
> for this case.

Could this be an option which includes all cases:

>> and we need to compute the number of procs to start. Okay, the change you 
>> requested below will fix that one too. I can make that easily enough.
> 
> Therefore I wanted to start a discussion about it (at that time I wasn't 
> aware of the "-map-by slot:pe=N" option), as I have no final syntax which 
> would cover all cases. Someone may want the binding by the &q

Re: [OMPI users] A daemon on node cl231 failed to start as expected (Pengcheng)

2014-08-25 Thread Reuti
Am 25.08.2014 um 13:23 schrieb Pengcheng Wang:

> Hi Reuti,
> 
> A simple hello_world program works without the h_vmem limit. Honestly, I am 
> not familiar with Open MPI. The command qconf -spl and qconf -sp ompi give 
> the information below.

Thx.


> But strangely, it begins to work after I insert unset SGE_ROOT in my job 
> script. I don't know why. 

Unsetting this variable will make Open MPI unaware that it runs under SGE. 
Hence it will use `ssh` to reach other machines. These `ssh` calls will have no 
memory or time limit set then.

As you run a singleton this shouldn't matter though. But: when you want to 
start additional threads (according to your "#$ -pe ompi*  6") you should use a 
PE with allocation rule "$pe_slots" so that all slots which SGE grants to your 
task are on one and the same machine.
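
Such a PE could look like the sketch below (the name "smp" and the slot count 
are made up here; an admin would add it with `qconf -ap smp` and put it into 
the pe_list of the queue):

pe_name            smp
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min

The jobscript would then request e.g. "#$ -pe smp 6" instead of the wildcard 
"ompi*".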

SGE will multiply the limit by the number of slots, but only by the count 
granted on the master node of the parallel job (resp. for each slave). How the 
other threads or tasks are started is something you might look at.


> However, it still cannot work smoothly through 60hrs I setup. After running 
> for about two hours, it stops without any error messages. Is this related to 
> the h_vemem limit?

You can have a look in $SGE_ROOT/spool/<node>/messages (resp. the actual 
location of your spool directories) to check whether any limit was exceeded and 
triggered an abortion of the job (for all machines granted to this job). Also 
`qacct -j <jobid>` might give some hint whether there was an exit code of 137 
due to a kill -9.
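
For example (<jobid> and <node> are placeholders; the spool location depends on 
the installation):

grep <jobid> $SGE_ROOT/spool/<node>/messages
qacct -j <jobid> | egrep 'failed|exit_status|maxvmem'

An exit_status of 137 is 128 + 9, i.e. the job was killed by SIGKILL, e.g. 
after exceeding h_vmem or h_rt.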


> $ qconf -spl
> 16per
> 1per
> 2per
> 4per
> hadoop
> make
> ompi
> openmp
> 
> $ qconf -sp ompi
> pe_name           ompi
> slots             
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /bin/true
> stop_proc_args    /bin/true
> allocation_rule   $fill_up

This allows the slots to be collected from several machines; not necessarily 
will all of them be on one and the same machine where the jobscript runs.

> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots min
> 
> SGE version: 6.1u6
> Open MPI version: 1.2.9

Both are really old versions. I fear I can't help much here, as many things 
have changed compared to the current version 1.8.1 of Open MPI, while SGE's 
latest version is 6.2u5 and SoGE is now at 8.1.7.

-- Reuti


> Job script updated:
> #$ -S /bin/bash
> #$ -N couple
> #$ -cwd
> #$ -j y
> #$ -R y
> #$ -l h_rt=62:00:00
> #$ -l h_vmem=2G
> #$ -o couple.out
> #$ -e couple.err
> #$ -pe ompi* 8
> unset SGE_ROOT
>./app
> 
> Thanks,
> Pengcheng
> 
> On Sun, Aug 24, 2014 at 1:00 PM,  wrote:
> Send users mailing list submissions to
> us...@open-mpi.org
> 
> To subscribe or unsubscribe via the World Wide Web, visit
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> or, via email, send a message with subject or body 'help' to
> users-requ...@open-mpi.org
> 
> You can reach the person managing the list at
>     users-ow...@open-mpi.org
> 
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of users digest..."
> 
> 
> Today's Topics:
> 
>1. Re: A daemon on node cl231 failed to start as expected (Reuti)
> 
> 
> --
> 
> Message: 1
> Date: Sat, 23 Aug 2014 18:49:38 +0200
> From: Reuti 
> To: Open MPI Users 
> Subject: Re: [OMPI users] A daemon on node cl231 failed to start as
> expected
> Message-ID:
> <8f21a4d9-9e8d-4e20-9ae6-04a495a33...@staff.uni-marburg.de>
> Content-Type: text/plain; charset=windows-1252
> 
> Hi,
> 
> Am 23.08.2014 um 16:09 schrieb Pengcheng Wang:
> 
> > I need to run a single driver program that only require one proc with the 
> > command mpirun -np 1 ./app or ./app. But it will schedule the launch of 
> > other executable files including parallel and sequential computing. So I 
> > require more than one proc to run it. It can be run smoothly as an 
> > interactive job with the command below.
> >
> > qrsh -cwd -pe "ompi*" 6 -l h_rt=00:30:00,test=true ./app
> >
> > But after I submitted the job, a strange error occurred and it stopped... 
> > Please find the job script and error message below:
> >
> > ? job submission script:
> > #$ -S /bin/bash
> > #$ -N couple
> > #$ -cwd
> > #$ -j y
> > #$ -l h_rt=05:00:00
> > #$ -l h_vmem=2G
> 
> Is a simple hello_world program listing the threads working? Does it work 
> without the h_vmem limit?
> 
> 
> > #$ -o couple.out
> > #$ -pe ompi*  6
> 
> Which PEs can b

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-27 Thread Reuti
Hi,

Am 27.08.2014 um 09:57 schrieb Tetsuya Mishima:

> Hi Reuti and Ralph,
> 
> How do you think if we accept bind-to none option even when the pe=N option 
> is provided?
> 
> just like:
> mpirun -map-by slot:pe=N -bind-to none ./inverse

Yes, this would be ok to cover all cases.

-- Reuti


> If yes, it's easy for me to make a patch.
> 
> Tetsuya
> 
> 
> Tetsuya Mishima  tmish...@jcity.maeda.co.jp
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/08/25161.php



Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots (updated findings)

2014-08-28 Thread Reuti
Am 28.08.2014 um 10:09 schrieb Lane, William:

> I have some updates on these issues and some test results as well.
> 
> We upgraded OpenMPI to the latest version 1.8.2, but when submitting jobs via 
> the SGE orte parallel environment received
> errors whenever more slots are requested than there are actual cores on the 
> first node allocated to the job.

Does "-bind-to none" help? The binding is switched on by default in Open MPI 
1.8 onwards.
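
I.e., as a quick test with everything else unchanged (./xhpl stands for 
whatever binary you actually start):

mpirun --bind-to none -np $NSLOTS ./xhpl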


> The btl tcp,self switch passed to mpirun made significant differences in 
> performance as per the below:
> 
> Even with the oversubscribe option, the memory mapping errors still persist. 
> On 32 core nodes and with HPL run compiled for openmpi/1.8.2,  it reliably 
> starts failing at 20 cores allocated. Note that I tested with 'btl tcp,self' 
> defined and it does slow down the solve by 2 on a quick solve. The results on 
> a larger solve would probably be more dramatic:
> - Quick HPL 16 core with SM: ~19GFlops
> - Quick HPL 16 core without SM: ~10GFlops
> 
> Unfortunately, a recompiled HPL did not work, but it did give us more 
> information (error below). Still trying a couple things.
> 
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
> 
>   Bind to: CORE
>   Node:csclprd3-0-7
>   #processes:  2
>   #cpus:   1
> 
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> 
> When using the SGE make parallel environment to submit jobs everything worked 
> perfectly.
> I noticed when using the make PE, the number of slots allocated from each 
> node to the job
> corresponded to the number of CPU's and disregarded any additional cores 
> within a CPU and
> any hyperthreading cores.

For SGE the hyperthreading cores count as normal cores. In principle it's 
possible to define an RQS in SGE (`qconf -srqsl`) which limits the number of 
slots for the "make" PE, or (better) to limit the slots in each exechost 
definition to the physically installed cores (this is what I usually set up - 
leaving hyperthreading switched on then gives some headroom for the kernel 
processes).
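As an illustration only (untested sketch; 16 stands for the number of physical cores 
per host), such an RQS rule could look like:

{
   name         limit_make_to_real_cores
   enabled      TRUE
   limit        pes make hosts {*} to slots=16
}

For the per-exechost approach, `qconf -me <hostname>` and setting e.g. 
"complex_values slots=16" there should have a similar effect.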


> Here are the definitions of the two parallel environments tested (with orte 
> always failing when
> more slots are requested than there are CPU cores on the first node allocated 
> to the job by
> SGE):
> 
> [root@csclprd3 ~]# qconf -sp orte
> pe_nameorte
> slots  
> user_lists NONE
> xuser_listsNONE
> start_proc_args/bin/true
> stop_proc_args /bin/true
> allocation_rule$fill_up
> control_slaves TRUE
> job_is_first_task  FALSE
> urgency_slots  min
> accounting_summary TRUE
> qsort_args NONE
> 
> [root@csclprd3 ~]# qconf -sp make
> pe_namemake
> slots  999
> user_lists NONE
> xuser_listsNONE
> start_proc_argsNONE
> stop_proc_args NONE
> allocation_rule$round_robin
> control_slaves TRUE
> job_is_first_task  FALSE
> urgency_slots  min
> accounting_summary TRUE
> qsort_args NONE
> 
> Although everything seems to work with the make PE, I'd still like
> to know why? Because on a much older version of openMPI loaded
> on an older version of CentOS, SGE and ROCKS, using all physical
> cores, as well as all hyperthreads was never a problem (even on NUMA
> nodes).
> 
> What is the recommended SGE parallel environment definition for
> OpenMPI 1.8.2?

Whether you prefer $fill_up or $round_robin is up to you - do you want all 
your processes on the fewest possible machines, or spread around the cluster? 
If there is a lot of communication it may be better to use fewer machines, but 
if each process does heavy I/O to the local scratch disk, spreading them around 
may be the preferred choice. It makes no difference to Open MPI, as the 
generated $PE_HOSTFILE contains just the list of granted slots. A $fill_up 
allocation will of course fill the first node, including the hyperthreading 
cores, before moving on to the next machine (`man sge_pe`).
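For illustration only (hypothetical host names, a 4-slot job on machines offering 
4 slots each), the granted $PE_HOSTFILE might look like:

allocation_rule $fill_up:
node01 4 all.q@node01 UNDEFINED

allocation_rule $round_robin:
node01 2 all.q@node01 UNDEFINED
node02 2 all.q@node02 UNDEFINED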

-- Reuti


> I apologize for the length of this, but I thought it best to provide more
> information than less.
> 
> Thank you in advance,
> 
> -Bill Lane
> 
> 
> From: users [users-boun...@open-mpi.org] on behalf of Jeff Squyres (jsquyres) 
> [jsquy...@cisco.com]
> Sent: Friday, August 08, 2014 5:25 AM
> To: Open MPI User's List
> Subject: Re: [OMPI users] Mpirun 1.5.4  problems when request > 28 slots
> 
> On Aug 8, 2014, at 1:24 AM, Lane, William  wrote:
> 
>> Using the "--mca btl tcp,sel

Re: [OMPI users] How does binding option affect network traffic?

2014-08-29 Thread Reuti
Hi,

Am 28.08.2014 um 20:50 schrieb McGrattan, Kevin B. Dr.:

> My institute recently purchased a linux cluster with 20 nodes; 2 sockets per 
> node; 6 cores per socket. OpenMPI v 1.8.1 is installed. I want to run 15 
> jobs. Each job requires 16 MPI processes.  For each job, I want to use two 
> cores on each node, mapping by socket. If I use these options:
>  
> #PBS -l nodes=8:ppn=2
> mpirun --report-bindings --bind-to core --map-by socket:PE=1 -np 16 
> 
>  
> The reported bindings are:
>  
> [burn001:09186] MCW rank 0 bound to socket 0[core 0[hwt 0]]: 
> [B/././././.][./././././.]
> [burn001:09186] MCW rank 1 bound to socket 1[core 6[hwt 0]]: 
> [./././././.][B/././././.]
> [burn004:07113] MCW rank 6 bound to socket 0[core 0[hwt 0]]: 
> [B/././././.][./././././.]
> [burn004:07113] MCW rank 7 bound to socket 1[core 6[hwt 0]]: 
> [./././././.][B/././././.]
> and so on…
>  
> These bindings appear to be OK, but when I do a “top –H” on each node, I see 
> that all 15 jobs use core 0 and core 6 on each node. This means, I believe, 
> that I am only using 1/6 or my resources. I want to use 100%. So I try this:
>  
> #PBS -l nodes=8:ppn=2
> mpirun --report-bindings --bind-to socket --map-by socket:PE=1 -np 16 
> 
>  
> Now it appears that I am getting 100% usage of all cores on all nodes. The 
> bindings are:
>  
> [burn004:07244] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 
> 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 
> 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
> [burn004:07244] MCW rank 1 bound to socket 1[core 6[hwt 0]], socket 1[core 
> 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 
> 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
> [burn008:07256] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 
> 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 
> 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
> [burn008:07256] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 
> 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 
> 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
> and so on…
>  
> The problem now is that some of my jobs are hanging. They all start running 
> fine, and produce output. But at some point I lose about 4 out of 15 jobs due 
> to hanging. I suspect that an MPI message is passed and not received. The 
> number of jobs that hang and the time when they hang varies from test to 
> test. We have run these cases successfully on our old cluster dozens of times 
> – they are part of our benchmark suite.
>  
> When I run these jobs using a map by core strategy (that is, the MPI 
> processes are just mapped by core, and each job only uses 16 cores on two 
> nodes), I do not see as much hanging. It still occurs, but less often. This 
> leads me to suspect that there is something about the increased network 
> traffic due to the map-by-socket approach that is the cause of the problem. 
> But I do not know what to do about it. I think that the map-by-socket 
> approach is the right one, but I do not know if I have my OpenMPI options 
> just right.
>  
> Can you tell me what OpenMPI options to use, and can you tell me how I might 
> debug the hanging issue.

BTW: In modern systems the NIC(s) can be attached directly to one CPU, so the 
other CPU first has to forward its data to that CPU to reach the NIC (and 
integrated NICs may be connected to the chipset instead).

Did anyone ever run benchmarks on whether it makes a difference which CPU in 
the system is used, i.e. the one the network adapter is attached to versus the 
other CPU - or even a NIC hanging off the chipset?
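A sketch of how to check on a node where a NIC is attached (eth0 is a placeholder; 
hwloc's lstopo ships with Open MPI, and the sysfs value is -1 if no NUMA affinity 
is reported):

lstopo-no-graphics | grep -i -B2 eth0        # position of the NIC in the topology
cat /sys/class/net/eth0/device/numa_node     # NUMA node of the underlying PCI device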

-- Reuti


> Kevin McGrattan
> National Institute of Standards and Technology
> 100 Bureau Drive, Mail Stop 8664
> Gaithersburg, Maryland 20899
>  
> 301 975 2712
>  
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/08/25181.php



Re: [OMPI users] SGE and openMPI

2014-09-03 Thread Reuti
Hi,

Am 03.09.2014 um 12:17 schrieb Donato Pera:

> I'm using Rocks 5.4.3 with SGE 6.1 I installed
> a new version of openMPI 1.6.5 when I run
> a script using SGE+openMPI (1.6.5) in a single node
> I don't have any problems but when I try to use more nodes
> I get this error:
> 
> 
> A hostfile was provided that contains at least one node not
> present in the allocation:
> 
>  hostfile:  /tmp/21202.1.parallel.q/machines
>  node:  compute-2-4
> 
> If you are operating in a resource-managed environment, then only
> nodes that are in the allocation can be used in the hostfile. You
> may find relative node syntax to be a useful alternative to
> specifying absolute node names see the orte_hosts man page for
> further information.

Was Open MPI compiled with SGE support?

$ ompi_info | grep grid
 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.5)

In this case you don't need to provide any -machinefile option at all, as Open 
MPI will use the SGE-generated one automatically.

(Nevertheless the $TMPDIR/machines should be correct - it could be an issue 
between the short hostname and the FQDN. Out of curiosity, can you `cat` 
$TMPDIR/machines in a job script, and also print the output of `hostname` on a 
node there?)
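If the gridengine component were not listed, a rebuild with SGE support might look 
like this (a sketch, reusing the prefix from your paths):

./configure --with-sge --prefix=/home/SWcbbc/openmpi-1.6.5
make
make install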


> --
> rm: cannot remove `/tmp/21202.1.parallel.q/rsh': No such file or directory
> --

The above line comes from the "stop_proc_args" defined in the "mpi" PE and can be 
ignored. In fact, you don't need any "stop_proc_args" at all. Maybe you can 
define a new PE solely for Open MPI, often called "orte":

https://www.open-mpi.org/faq/?category=sge
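A sketch of such a PE (example values - adjust the slot count to your cluster, 
create it with `qconf -ap orte` and add it to the pe_list of your queue afterwards):

pe_name            orte
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min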

-- Reuti


> I send also my SGE script:
> 
> #!/bin/bash
> #$ -S /bin/bash
> #$ -pe mpi 64
> #$ -cwd
> #$ -o ./file.out
> #$ -e ./file.err
> 
> export LD_LIBRARY_PATH=/home/SWcbbc/openmpi-1.6.5/lib:$LD_LIBRARY_PATH
> export OMP_NUM_THREADS=1
> 
> CPMD_PATH=/home/tanzi/myroot/X86_66intel-mpi/
> PP_PATH=/home/tanzi
> 
> /home/SWcbbc/openmpi-1.6.5/bin/mpirun -np 64 -machinefile 
> $TMPDIR/machines  
> ${CPMD_PATH}cpmd.x  input ${PP_PATH}/PP/ > out
> 
> 
> I don't understand my mistake
> 
> Regards D.
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/09/25238.php



Re: [OMPI users] SGE and openMPI

2014-09-03 Thread Reuti
Am 03.09.2014 um 13:11 schrieb Donato Pera:

> I get
> 
> ompi_info | grep grid
> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.5)

Good.


> and using this script
> 
> #!/bin/bash
> #$ -S /bin/bash
> #$ -pe orte 64
> #$ -cwd
> #$ -o ./file.out
> #$ -e ./file.err
> 
> export LD_LIBRARY_PATH=/home/SWcbbc/openmpi-1.6.5/lib:$LD_LIBRARY_PATH
> export OMP_NUM_THREADS=1
> 
> CPMD_PATH=/home/tanzi/myroot/X86_66intel-mpi/
> PP_PATH=/home/tanzi
> /home/SWcbbc/openmpi-1.6.5/bin/mpirun -mca btl openib -np 64
> -machinefile $TMPDIR/machines  ${CPMD_PATH}cpmd.x  input ${PP_PATH}/PP/

In the PE "orte" is no "start_proc_args" defined which could generate the 
machinefile. Please try to start the application with:

/home/SWcbbc/openmpi-1.6.5/bin/mpirun -mca btl openib ${CPMD_PATH}cpmd.x  input 
${PP_PATH}/PP/

-- Reuti


>> out
> 
> 
> I get this error
> 
> Open RTE was unable to open the hostfile:
>/tmp/21213.1.debug.q/machines
> Check to make sure the path and filename are correct.
> --
> [compute-2-6.local:22452] [[5218,0],0] ORTE_ERROR_LOG: Not found in file
> base/rmaps_base_support_fns.c at line 207
> [compute-2-6.local:22452] [[5218,0],0] ORTE_ERROR_LOG: Not found in file
> rmaps_rr.c at line 82
> [compute-2-6.local:22452] [[5218,0],0] ORTE_ERROR_LOG: Not found in file
> base/rmaps_base_map_job.c at line 88
> [compute-2-6.local:22452] [[5218,0],0] ORTE_ERROR_LOG: Not found in file
> base/plm_base_launch_support.c at line 105
> [compute-2-6.local:22452] [[5218,0],0] ORTE_ERROR_LOG: Not found in file
> plm_rsh_module.c at line 1173
> 
> 
> 
> 
> 
> Instead using this script
> 
> 
> #!/bin/bash
> #$ -S /bin/bash
> #$ -pe orte 64
> #$ -cwd
> #$ -o ./file.out
> #$ -e ./file.err
> 
> export LD_LIBRARY_PATH=/home/SWcbbc/openmpi-1.6.5/lib:$LD_LIBRARY_PATH
> export OMP_NUM_THREADS=1
> 
> CPMD_PATH=/home/tanzi/myroot/X86_66intel-mpi/
> PP_PATH=/home/tanzi
> /home/SWcbbc/openmpi-1.6.5/bin/mpirun -mca btl openib -np 64
> $TMPDIR/machines  ${CPMD_PATH}cpmd.x  input ${PP_PATH}/PP/ > out
> 
> 
> I get
> Executable: /tmp/21214.1.debug.q/machines
> Node: compute-2-0.local
> 
> while attempting to start process rank 0.
> --
> 
> can you help me
> 
> 
> Thanks and Regards Donato
> 
> 
> 
> 
> On 03/09/2014 12:28, Reuti wrote:
>> ompi_info | grep grid
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/09/25240.php



Re: [OMPI users] SGE and openMPI

2014-09-04 Thread Reuti
Hi,

Am 04.09.2014 um 14:43 schrieb Donato Pera:

> using this script :
> 
> #!/bin/bash
> #$ -S /bin/bash
> #$ -pe orte 64
> #$ -cwd
> #$ -o ./file.out
> #$ -e ./file.err
> 
> export LD_LIBRARY_PATH=/home/SWcbbc/openmpi-1.6.5/lib:$LD_LIBRARY_PATH
> export OMP_NUM_THREADS=1
> 
> CPMD_PATH=/home/tanzi/myroot/X86_66intel-mpi/
> PP_PATH=/home/tanzi
> /home/SWcbbc/openmpi-1.6.5/bin/mpirun ${CPMD_PATH}cpmd.x  input
> ${PP_PATH}/PP/ > out

Is the text below in out, file.out or file.err - and is there any hint in the other files?

-- Reuti


> 
> The program run for about 2 minutes and after I get this error
> 
> WARNING: A process refused to die!
> 
> Host: compute-2-2.local
> PID:  24897
> 
> This process may still be running and/or consuming resources.
> 
> --
> [compute-2-2.local:24889] 25 more processes have sent help message
> help-odls-default.txt / odls-default:could-not-kill
> [compute-2-2.local:24889] Set MCA parameter "orte_base_help_aggregate"
> to 0 to see all help / error messages
> [compute-2-2.local:24889] 27 more processes have sent help message
> help-odls-default.txt / odls-default:could-not-kill
> --
> mpirun has exited due to process rank 0 with PID 24896 on
> node compute-2-2.local exiting improperly. There are two reasons this
> could occur:
> 
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
> 
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
> 
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> ------
> [compute-2-2.local:24889] 1 more process has sent help message
> help-odls-default.txt / odls-default:could-not-kill
> 
> 
> Thanks and Regards Donato
> 
> 
> 
> 
> On 03/09/2014 13:19, Reuti wrote:
>> Am 03.09.2014 um 13:11 schrieb Donato Pera:
>> 
>>> I get
>>> 
>>> ompi_info | grep grid
>>>MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.5)
>> Good.
>> 
>> 
>>> and using this script
>>> 
>>> #!/bin/bash
>>> #$ -S /bin/bash
>>> #$ -pe orte 64
>>> #$ -cwd
>>> #$ -o ./file.out
>>> #$ -e ./file.err
>>> 
>>> export LD_LIBRARY_PATH=/home/SWcbbc/openmpi-1.6.5/lib:$LD_LIBRARY_PATH
>>> export OMP_NUM_THREADS=1
>>> 
>>> CPMD_PATH=/home/tanzi/myroot/X86_66intel-mpi/
>>> PP_PATH=/home/tanzi
>>> /home/SWcbbc/openmpi-1.6.5/bin/mpirun -mca btl openib -np 64
>>> -machinefile $TMPDIR/machines  ${CPMD_PATH}cpmd.x  input ${PP_PATH}/PP/
>> In the PE "orte" is no "start_proc_args" defined which could generate the 
>> machinefile. Please try to start the application with:
>> 
>> /home/SWcbbc/openmpi-1.6.5/bin/mpirun -mca btl openib ${CPMD_PATH}cpmd.x  
>> input ${PP_PATH}/PP/
>> 
>> -- Reuti
>> 
>> 
>>>> out
>>> 
>>> I get this error
>>> 
>>> Open RTE was unable to open the hostfile:
>>>   /tmp/21213.1.debug.q/machines
>>> Check to make sure the path and filename are correct.
>>> --
>>> [compute-2-6.local:22452] [[5218,0],0] ORTE_ERROR_LOG: Not found in file
>>> base/rmaps_base_support_fns.c at line 207
>>> [compute-2-6.local:22452] [[5218,0],0] ORTE_ERROR_LOG: Not found in file
>>> rmaps_rr.c at line 82
>>> [compute-2-6.local:22452] [[5218,0],0] ORTE_ERROR_LOG: Not found in file
>>> base/rmaps_base_map_job.c at line 88
>>> [compute-2-6.local:22452] [[5218,0],0] ORTE_ERROR_LOG: Not found in file
>>> base/plm_base_launch_support.c at line 105
>>> [compute-2-6.local:22452] [[5218,0],0] ORTE_ERROR_LOG: Not found in file
>>> plm_rsh_module.c at line 1173
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Instead using this script
>>> 
>>> 
>>> #!/bin/bash
>>> #$ -S /bin/bash
>>> #$ 

Re: [OMPI users] help

2014-09-08 Thread Reuti
Hi,

Am 08.09.2014 um 10:09 schrieb Ahmed Salama:

> i new in open mpi, I installed openmpi1.6.5 in linux redhat , and i have code 
> in java and i want to use mpi with it, so i configured mpi as follow
> 
> ./configure --enable-mpi-java  --with-jdk-bindir=/usr/jdk6/bin  
> --with-jdk-headers=/usr/jdk6/include  --prefix=/usr/local/openmpi
> 
> but i have the following warning message: 
> Warning  : unrocognoized option -enable-mpi-java --with..

The Java bindings appeared sometime in the 1.7 series. When you start from 
scratch right now, it's best to use 1.8.1, where they should work.

Some notes on it: 
https://blogs.cisco.com/performance/java-bindings-for-open-mpi/
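With a 1.8.x tarball the configure line from your mail should then be accepted as 
it is - a sketch reusing your paths (the prefix is just an example):

./configure --enable-mpi-java \
    --with-jdk-bindir=/usr/jdk6/bin \
    --with-jdk-headers=/usr/jdk6/include \
    --prefix=/usr/local/openmpi
make
make install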

-- Reuti

> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/09/25294.php



Re: [OMPI users] Strange affinity messages with 1.8 and torque 5

2014-09-23 Thread Reuti
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Am 23.09.2014 um 19:53 schrieb Brock Palen:

> I found a fun head scratcher, with openmpi 1.8.2  with torque 5 built with TM 
> support, on hereto core layouts  I get the fun thing:
> mpirun -report-bindings hostname< Works

And you get 64 lines of output?


> mpirun -report-bindings -np 64 hostname   <- Wat?
> --
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
> 
>   Bind to: CORE
>   Node:nyx5518
>   #processes:  2
>   #cpus:   1
> 
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --

How many cores are physically installed on this machine - two as mentioned 
above?

- -- Reuti


> I ran with --oversubscribed and got the expected host list, which matched 
> $PBS_NODEFILE and was 64 entires long:
> 
> mpirun -overload-allowed -report-bindings -np 64 --oversubscribe hostname
> 
> What did I do wrong?  I'm stumped why one works one doesn't but the one that 
> doesn't if your force it appears correct.
> 
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> XSEDE Campus Champion
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/09/25375.php

-BEGIN PGP SIGNATURE-
Version: GnuPG/MacGPG2 v2.0.20 (Darwin)
Comment: GPGTools - http://gpgtools.org

iEYEARECAAYFAlQhv7IACgkQo/GbGkBRnRr3HgCgjZoD9l9a+WThl5CDaGF1jawx
PWIAmwWnZwQdytNgAJgbir6V7yCyBt5D
=NG0H
-END PGP SIGNATURE-


Re: [OMPI users] oversubscription of slots with GridEngine

2014-11-09 Thread Reuti
Hi,

> Am 09.11.2014 um 18:20 schrieb SLIM H.A. :
> 
> We switched on hyper threading on our cluster with two eight core sockets per 
> node (32 threads per node).
> 
> We configured  gridengine with 16 slots per node to allow the 16 extra 
> threads for kernel process use but this apparently does not work. Printout of 
> the gridengine hostfile shows that for a 32 slots job, 16 slots are placed on 
> each of two nodes as expected. Including the openmpi --display-map option 
> shows that all 32 processes are incorrectly  placed on the head node.

You mean the master node of the parallel job I assume.

> Here is part of the output
> 
> master=cn6083
> PE=orte

What allocation rule was defined for this PE - and is "control_slaves TRUE" set?

> JOB_ID=2481793
> Got 32 slots.
> slots:
> cn6083 16 par6.q@cn6083 
> cn6085 16 par6.q@cn6085 
> Sun Nov  9 16:50:59 GMT 2014
> Data for JOB [44767,1] offset 0
> 
>    JOB MAP   
> 
> Data for node: cn6083  Num slots: 16   Max slots: 0Num procs: 32
>Process OMPI jobid: [44767,1] App: 0 Process rank: 0
>Process OMPI jobid: [44767,1] App: 0 Process rank: 1
> ...
>Process OMPI jobid: [44767,1] App: 0 Process rank: 31
> 
> =
> 
> I found some related mailings about a new warning in 1.8.2 about 
> oversubscription and  I tried a few options to avoid the use of the extra 
> threads for MPI tasks by openmpi without success, e.g. variants of
> 
> --cpus-per-proc 1 
> --bind-to-core 
> 
> and some others. Gridengine treats hw threads as cores==slots (?) but the 
> content of $PE_HOSTFILE suggests it distributes the slots sensibly  so it 
> seems there is an option for openmpi required to get 16 cores per node?

Was Open MPI configured with --with-sge?

-- Reuti

> I tried both 1.8.2, 1.8.3 and also 1.6.5.
> 
> Thanks for some clarification that anyone can give.
> 
> Henk
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/11/25718.php


Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-10 Thread Reuti
Hi,

Am 09.11.2014 um 05:38 schrieb Ralph Castain:

> FWIW: during MPI_Init, each process “publishes” all of its interfaces. Each 
> process receives a complete map of that info for every process in the job. So 
> when the TCP btl sets itself up, it attempts to connect across -all- the 
> interfaces published by the other end.
> 
> So it doesn’t matter what hostname is provided by the RM. We discover and 
> “share” all of the interface info for every node, and then use them for 
> loadbalancing.

does this lead to any time delay when starting up? I stayed with Open MPI 1.6.5 
for some time and have now tried Open MPI 1.8.3. As there was a delay when the 
application starts with my first compilation of 1.8.3, I dropped all my extra 
options and ran it outside of any queuing system - the delay remains - on two 
different clusters.

I tracked it down: up to 1.8.1 it is working fine, but 1.8.2 already introduces 
this delay when starting up a simple mpihello. I assume it may lie in the way 
other machines are reached, as with one single machine there is no delay. But 
using one (and only one - no tree spawn involved) additional machine already 
triggers the delay.

Did anyone else notice it?

-- Reuti


> HTH
> Ralph
> 
> 
>> On Nov 8, 2014, at 8:13 PM, Brock Palen  wrote:
>> 
>> Ok I figured, i'm going to have to read some more for my own curiosity. The 
>> reason I mention the Resource Manager we use, and that the hostnames given 
>> but PBS/Torque match the 1gig-e interfaces, i'm curious what path it would 
>> take to get to a peer node when the node list given all match the 1gig 
>> interfaces but yet data is being sent out the 10gig eoib0/ib0 interfaces.  
>> 
>> I'll go do some measurements and see.
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> XSEDE Campus Champion
>> bro...@umich.edu
>> (734)936-1985
>> 
>> 
>> 
>>> On Nov 8, 2014, at 8:30 AM, Jeff Squyres (jsquyres)  
>>> wrote:
>>> 
>>> Ralph is right: OMPI aggressively uses all Ethernet interfaces by default.  
>>> 
>>> This short FAQ has links to 2 other FAQs that provide detailed information 
>>> about reachability:
>>> 
>>>  http://www.open-mpi.org/faq/?category=tcp#tcp-multi-network
>>> 
>>> The usNIC BTL uses UDP for its wire transport and actually does a much more 
>>> standards-conformant peer reachability determination (i.e., it actually 
>>> checks routing tables to see if it can reach a given peer which has all 
>>> kinds of caching benefits, kernel controls if you want them, etc.).  We 
>>> haven't back-ported this to the TCP BTL because a) most people who use TCP 
>>> for MPI still use a single L2 address space, and b) no one has asked for 
>>> it.  :-)
>>> 
>>> As for the round robin scheduling, there's no indication from the Linux TCP 
>>> stack what the bandwidth is on a given IP interface.  So unless you use the 
>>> btl_tcp_bandwidth_ (e.g., btl_tcp_bandwidth_eth0) MCA 
>>> params, OMPI will round-robin across them equally.
>>> 
>>> If you have multiple IP interfaces sharing a single physical link, there 
>>> will likely be no benefit from having Open MPI use more than one of them.  
>>> You should probably use btl_tcp_if_include / btl_tcp_if_exclude to select 
>>> just one.
>>> 
>>> 
>>> 
>>> 
>>> On Nov 7, 2014, at 2:53 PM, Brock Palen  wrote:
>>> 
>>>> I was doing a test on our IB based cluster, where I was diabling IB
>>>> 
>>>> --mca btl ^openib --mca mtl ^mxm
>>>> 
>>>> I was sending very large messages >1GB  and I was surppised by the speed.
>>>> 
>>>> I noticed then that of all our ethernet interfaces
>>>> 
>>>> eth0  (1gig-e)
>>>> ib0  (ip over ib, for lustre configuration at vendor request)
>>>> eoib0  (ethernet over IB interface for IB -> Ethernet gateway for some 
>>>> extrnal storage support at >1Gig speed
>>>> 
>>>> I saw all three were getting traffic.
>>>> 
>>>> We use torque for our Resource Manager and use TM support, the hostnames 
>>>> given by torque match the eth0 interfaces.
>>>> 
>>>> How does OMPI figure out that it can also talk over the others?  How does 
>>>> it chose to load balance?
>>>> 
>>>> BTW that is fine, but we will use if_exclude on one of the IB ones as ib0 
>>>> and eoib0  are the same physical device and

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-10 Thread Reuti
Am 10.11.2014 um 12:24 schrieb Reuti:

> Hi,
> 
> Am 09.11.2014 um 05:38 schrieb Ralph Castain:
> 
>> FWIW: during MPI_Init, each process “publishes” all of its interfaces. Each 
>> process receives a complete map of that info for every process in the job. 
>> So when the TCP btl sets itself up, it attempts to connect across -all- the 
>> interfaces published by the other end.
>> 
>> So it doesn’t matter what hostname is provided by the RM. We discover and 
>> “share” all of the interface info for every node, and then use them for 
>> loadbalancing.
> 
> does this lead to any time delay when starting up? I stayed with Open MPI 
> 1.6.5 for some time and tried to use Open MPI 1.8.3 now. As there is a delay 
> when the applications starts in my first compilation of 1.8.3 I disregarded 
> even all my extra options and run it outside of any queuingsystem - the delay 
> remains - on two different clusters.

I forgot to mention: the delay is more or less exactly 2 minutes from the time 
I issued `mpiexec` until the `mpihello` starts up (there is no delay for the 
initial `ssh` to reach the other node though).

-- Reuti


> I tracked it down, that up to 1.8.1 it is working fine, but 1.8.2 already 
> creates this delay when starting up a simple mpihello. I assume it may lay in 
> the way how to reach other machines, as with one single machine there is no 
> delay. But using one (and only one - no tree spawn involved) additional 
> machine already triggers this delay.
> 
> Did anyone else notice it?
> 
> -- Reuti
> 
> 
>> HTH
>> Ralph
>> 
>> 
>>> On Nov 8, 2014, at 8:13 PM, Brock Palen  wrote:
>>> 
>>> Ok I figured, i'm going to have to read some more for my own curiosity. The 
>>> reason I mention the Resource Manager we use, and that the hostnames given 
>>> but PBS/Torque match the 1gig-e interfaces, i'm curious what path it would 
>>> take to get to a peer node when the node list given all match the 1gig 
>>> interfaces but yet data is being sent out the 10gig eoib0/ib0 interfaces.  
>>> 
>>> I'll go do some measurements and see.
>>> 
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> CAEN Advanced Computing
>>> XSEDE Campus Champion
>>> bro...@umich.edu
>>> (734)936-1985
>>> 
>>> 
>>> 
>>>> On Nov 8, 2014, at 8:30 AM, Jeff Squyres (jsquyres)  
>>>> wrote:
>>>> 
>>>> Ralph is right: OMPI aggressively uses all Ethernet interfaces by default. 
>>>>  
>>>> 
>>>> This short FAQ has links to 2 other FAQs that provide detailed information 
>>>> about reachability:
>>>> 
>>>> http://www.open-mpi.org/faq/?category=tcp#tcp-multi-network
>>>> 
>>>> The usNIC BTL uses UDP for its wire transport and actually does a much 
>>>> more standards-conformant peer reachability determination (i.e., it 
>>>> actually checks routing tables to see if it can reach a given peer which 
>>>> has all kinds of caching benefits, kernel controls if you want them, 
>>>> etc.).  We haven't back-ported this to the TCP BTL because a) most people 
>>>> who use TCP for MPI still use a single L2 address space, and b) no one has 
>>>> asked for it.  :-)
>>>> 
>>>> As for the round robin scheduling, there's no indication from the Linux 
>>>> TCP stack what the bandwidth is on a given IP interface.  So unless you 
>>>> use the btl_tcp_bandwidth_ (e.g., 
>>>> btl_tcp_bandwidth_eth0) MCA params, OMPI will round-robin across them 
>>>> equally.
>>>> 
>>>> If you have multiple IP interfaces sharing a single physical link, there 
>>>> will likely be no benefit from having Open MPI use more than one of them.  
>>>> You should probably use btl_tcp_if_include / btl_tcp_if_exclude to select 
>>>> just one.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Nov 7, 2014, at 2:53 PM, Brock Palen  wrote:
>>>> 
>>>>> I was doing a test on our IB based cluster, where I was diabling IB
>>>>> 
>>>>> --mca btl ^openib --mca mtl ^mxm
>>>>> 
>>>>> I was sending very large messages >1GB  and I was surppised by the speed.
>>>>> 
>>>>> I noticed then that of all our ethernet interfaces
>>>>> 
>>>>> eth0  (1gig-e)
>>>>> ib0  (ip over ib, for lustre configuration at vendor request)
>>>>> eoib0  (et

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-10 Thread Reuti
Am 10.11.2014 um 12:50 schrieb Jeff Squyres (jsquyres):

> Wow, that's pretty terrible!  :(
> 
> Is the behavior BTL-specific, perchance?  E.G., if you only use certain BTLs, 
> does the delay disappear?

You mean something like:

reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile machines 
./mpihello; date
Mon Nov 10 13:44:34 CET 2014
Hello World from Node 1.
Total: 4
Universe: 4
Hello World from Node 0.
Hello World from Node 3.
Hello World from Node 2.
Mon Nov 10 13:46:42 CET 2014

(the above was even the latest v1.8.3-186-g978f61d)

Falling back to 1.8.1 gives (as expected):

reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile machines 
./mpihello; date
Mon Nov 10 13:49:51 CET 2014
Hello World from Node 1.
Total: 4
Universe: 4
Hello World from Node 0.
Hello World from Node 2.
Hello World from Node 3.
Mon Nov 10 13:49:53 CET 2014


-- Reuti

> FWIW: the use-all-IP interfaces approach has been in OMPI forever. 
> 
> Sent from my phone. No type good. 
> 
>> On Nov 10, 2014, at 6:42 AM, Reuti  wrote:
>> 
>>> Am 10.11.2014 um 12:24 schrieb Reuti:
>>> 
>>> Hi,
>>> 
>>>> Am 09.11.2014 um 05:38 schrieb Ralph Castain:
>>>> 
>>>> FWIW: during MPI_Init, each process “publishes” all of its interfaces. 
>>>> Each process receives a complete map of that info for every process in the 
>>>> job. So when the TCP btl sets itself up, it attempts to connect across 
>>>> -all- the interfaces published by the other end.
>>>> 
>>>> So it doesn’t matter what hostname is provided by the RM. We discover and 
>>>> “share” all of the interface info for every node, and then use them for 
>>>> loadbalancing.
>>> 
>>> does this lead to any time delay when starting up? I stayed with Open MPI 
>>> 1.6.5 for some time and tried to use Open MPI 1.8.3 now. As there is a 
>>> delay when the applications starts in my first compilation of 1.8.3 I 
>>> disregarded even all my extra options and run it outside of any 
>>> queuingsystem - the delay remains - on two different clusters.
>> 
>> I forgot to mention: the delay is more or less exactly 2 minutes from the 
>> time I issued `mpiexec` until the `mpihello` starts up (there is no delay 
>> for the initial `ssh` to reach the other node though).
>> 
>> -- Reuti
>> 
>> 
>>> I tracked it down, that up to 1.8.1 it is working fine, but 1.8.2 already 
>>> creates this delay when starting up a simple mpihello. I assume it may lay 
>>> in the way how to reach other machines, as with one single machine there is 
>>> no delay. But using one (and only one - no tree spawn involved) additional 
>>> machine already triggers this delay.
>>> 
>>> Did anyone else notice it?
>>> 
>>> -- Reuti
>>> 
>>> 
>>>> HTH
>>>> Ralph
>>>> 
>>>> 
>>>>> On Nov 8, 2014, at 8:13 PM, Brock Palen  wrote:
>>>>> 
>>>>> Ok I figured, i'm going to have to read some more for my own curiosity. 
>>>>> The reason I mention the Resource Manager we use, and that the hostnames 
>>>>> given but PBS/Torque match the 1gig-e interfaces, i'm curious what path 
>>>>> it would take to get to a peer node when the node list given all match 
>>>>> the 1gig interfaces but yet data is being sent out the 10gig eoib0/ib0 
>>>>> interfaces.  
>>>>> 
>>>>> I'll go do some measurements and see.
>>>>> 
>>>>> Brock Palen
>>>>> www.umich.edu/~brockp
>>>>> CAEN Advanced Computing
>>>>> XSEDE Campus Champion
>>>>> bro...@umich.edu
>>>>> (734)936-1985
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Nov 8, 2014, at 8:30 AM, Jeff Squyres (jsquyres)  
>>>>>> wrote:
>>>>>> 
>>>>>> Ralph is right: OMPI aggressively uses all Ethernet interfaces by 
>>>>>> default.  
>>>>>> 
>>>>>> This short FAQ has links to 2 other FAQs that provide detailed 
>>>>>> information about reachability:
>>>>>> 
>>>>>> http://www.open-mpi.org/faq/?category=tcp#tcp-multi-network
>>>>>> 
>>>>>> The usNIC BTL uses UDP for its wire transport and actually does a much 
>>>>>> more standards-conformant peer reachability determination (i.e., it 
>>>>>> actually checks routing tabl

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-10 Thread Reuti
Hi,

Am 10.11.2014 um 16:39 schrieb Ralph Castain:

> That is indeed bizarre - we haven’t heard of anything similar from other 
> users. What is your network configuration? If you use oob_tcp_if_include or 
> exclude, can you resolve the problem?

Thx - this option helped to get it working.

For the sake of simplicity these tests were made between the headnode of the 
cluster and one (idle) compute node. I then tried it between the (identical) 
compute nodes and this worked fine. The headnode and the compute node are 
slightly different though (e.g. number of cores), and they use eth1 and eth0 
respectively for the internal network of the cluster.

I tried --hetero-nodes with no change.

Then I turned to:

reuti@annemarie:~> date; mpiexec -mca btl self,tcp --mca oob_tcp_if_include 
192.168.154.0/26 -n 4 --hetero-nodes --hostfile machines ./mpihello; date

and the application started instantly. On another cluster, where the headnode 
is identical to the compute nodes but the network setup is the same as above, I 
observed a delay of "only" 30 seconds. Nevertheless, on this cluster too the 
correct "oob_tcp_if_include" was the addition that solved the issue.

The questions which remain: a) is this the intended behavior, and b) what 
changed in this respect between 1.8.1 and 1.8.2?

-- Reuti


> 
>> On Nov 10, 2014, at 4:50 AM, Reuti  wrote:
>> 
>> Am 10.11.2014 um 12:50 schrieb Jeff Squyres (jsquyres):
>> 
>>> Wow, that's pretty terrible!  :(
>>> 
>>> Is the behavior BTL-specific, perchance?  E.G., if you only use certain 
>>> BTLs, does the delay disappear?
>> 
>> You mean something like:
>> 
>> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile machines 
>> ./mpihello; date
>> Mon Nov 10 13:44:34 CET 2014
>> Hello World from Node 1.
>> Total: 4
>> Universe: 4
>> Hello World from Node 0.
>> Hello World from Node 3.
>> Hello World from Node 2.
>> Mon Nov 10 13:46:42 CET 2014
>> 
>> (the above was even the latest v1.8.3-186-g978f61d)
>> 
>> Falling back to 1.8.1 gives (as expected):
>> 
>> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile machines 
>> ./mpihello; date
>> Mon Nov 10 13:49:51 CET 2014
>> Hello World from Node 1.
>> Total: 4
>> Universe: 4
>> Hello World from Node 0.
>> Hello World from Node 2.
>> Hello World from Node 3.
>> Mon Nov 10 13:49:53 CET 2014
>> 
>> 
>> -- Reuti
>> 
>>> FWIW: the use-all-IP interfaces approach has been in OMPI forever. 
>>> 
>>> Sent from my phone. No type good. 
>>> 
>>>> On Nov 10, 2014, at 6:42 AM, Reuti  wrote:
>>>> 
>>>>> Am 10.11.2014 um 12:24 schrieb Reuti:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>>> Am 09.11.2014 um 05:38 schrieb Ralph Castain:
>>>>>> 
>>>>>> FWIW: during MPI_Init, each process “publishes” all of its interfaces. 
>>>>>> Each process receives a complete map of that info for every process in 
>>>>>> the job. So when the TCP btl sets itself up, it attempts to connect 
>>>>>> across -all- the interfaces published by the other end.
>>>>>> 
>>>>>> So it doesn’t matter what hostname is provided by the RM. We discover 
>>>>>> and “share” all of the interface info for every node, and then use them 
>>>>>> for loadbalancing.
>>>>> 
>>>>> does this lead to any time delay when starting up? I stayed with Open MPI 
>>>>> 1.6.5 for some time and tried to use Open MPI 1.8.3 now. As there is a 
>>>>> delay when the applications starts in my first compilation of 1.8.3 I 
>>>>> disregarded even all my extra options and run it outside of any 
>>>>> queuingsystem - the delay remains - on two different clusters.
>>>> 
>>>> I forgot to mention: the delay is more or less exactly 2 minutes from the 
>>>> time I issued `mpiexec` until the `mpihello` starts up (there is no delay 
>>>> for the initial `ssh` to reach the other node though).
>>>> 
>>>> -- Reuti
>>>> 
>>>> 
>>>>> I tracked it down, that up to 1.8.1 it is working fine, but 1.8.2 already 
>>>>> creates this delay when starting up a simple mpihello. I assume it may 
>>>>> lay in the way how to reach other machines, as with one single machine 
>>>>> there is no delay. But using one (and only one - no tree spawn involved) 
>>>>> additional machine

Re: [OMPI users] oversubscription of slots with GridEngine

2014-11-11 Thread Reuti
Am 11.11.2014 um 16:13 schrieb Ralph Castain:

> This clearly displays the problem - if you look at the reported “allocated 
> nodes”, you see that we only got one node (cn6050). This is why we mapped all 
> your procs onto that node.
> 
> So the real question is - why? Can you show us the content of PE_HOSTFILE?
> 
> 
>> On Nov 11, 2014, at 4:51 AM, SLIM H.A.  wrote:
>> 
>> Dear Reuti and Ralph
>>  
>> Below is the output of the run for openmpi 1.8.3 with this line
>>  
>> mpirun -np $NSLOTS --display-map --display-allocation --cpus-per-proc 1 $exe
>>  
>>  
>> master=cn6050
>> PE=orte
>> JOB_ID=2482923
>> Got 32 slots.
>> slots:
>> cn6050 16 par6.q@cn6050 
>> cn6045 16 par6.q@cn6045 

The above looks like the PE_HOSTFILE. So it should be 16 slots per node.

I wonder whether one of the environment variables which normally allow Open MPI 
to discover that it's running inside SGE was reset.

I.e. SGE_ROOT, JOB_ID, ARC and PE_HOSTFILE are untouched before the job starts?

Supplying "-np $NSLOTS" shouldn't be necessary though.

-- Reuti



>> Tue Nov 11 12:37:37 GMT 2014
>>  
>> ==   ALLOCATED NODES   ==
>> cn6050: slots=16 max_slots=0 slots_inuse=0 state=UP
>> =
>> Data for JOB [57374,1] offset 0
>>  
>>    JOB MAP   
>>  
>> Data for node: cn6050  Num slots: 16   Max slots: 0Num procs: 32
>> Process OMPI jobid: [57374,1] App: 0 Process rank: 0
>> Process OMPI jobid: [57374,1] App: 0 Process rank: 1
>>  
>> …
>> Process OMPI jobid: [57374,1] App: 0 Process rank: 31
>>  
>>  
>> Also
>> ompi_info | grep grid
>> gives MCA ras: gridengine (MCA v2.0, API v2.0, Component 
>> v1.8.3)
>> and
>> ompi_info | grep psm
>> gives MCA mtl: psm (MCA v2.0, API v2.0, Component v1.8.3)
>> because the intercoonect is TrueScale/QLogic
>>  
>> and
>>  
>> setenv OMPI_MCA_mtl "psm"
>>  
>> is set in the script. This is the PE
>>  
>> pe_name   orte
>> slots 4000
>> user_listsNONE
>> xuser_lists   NONE
>> start_proc_args   /bin/true
>> stop_proc_args/bin/true
>> allocation_rule   $fill_up
>> control_slavesTRUE
>> job_is_first_task FALSE
>> urgency_slots min
>>  
>> Many thanks
>>  
>> Henk
>>  
>>  
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
>> Sent: 10 November 2014 05:07
>> To: Open MPI Users
>> Subject: Re: [OMPI users] oversubscription of slots with GridEngine
>>  
>> You might also add the —display-allocation flag to mpirun so we can see what 
>> it thinks the allocation looks like. If there are only 16 slots on the node, 
>> it seems odd that OMPI would assign 32 procs to it unless it thinks there is 
>> only 1 node in the job, and oversubscription is allowed (which it won’t be 
>> by default if it read the GE allocation)
>>  
>>  
>> On Nov 9, 2014, at 9:56 AM, Reuti  wrote:
>>  
>> Hi,
>> 
>> 
>> Am 09.11.2014 um 18:20 schrieb SLIM H.A. :
>> 
>> We switched on hyper threading on our cluster with two eight core sockets 
>> per node (32 threads per node).
>> 
>> We configured  gridengine with 16 slots per node to allow the 16 extra 
>> threads for kernel process use but this apparently does not work. Printout 
>> of the gridengine hostfile shows that for a 32 slots job, 16 slots are 
>> placed on each of two nodes as expected. Including the openmpi --display-map 
>> option shows that all 32 processes are incorrectly  placed on the head node.
>> 
>> You mean the master node of the parallel job I assume.
>> 
>> 
>> Here is part of the output
>> 
>> master=cn6083
>> PE=orte
>> 
>> What allocation rule was defined for this PE - "control_slave yes" is set?
>> 
>> 
>> JOB_ID=2481793
>> Got 32 slots.
>> slots:
>> cn6083 16 par6.q@cn6083 
>> cn6085 16 par6.q@cn6085 
>> Sun Nov  9 16:50:59 GMT 2014
>> Data for JOB [44767,1] offset 0
>> 
>>    JOB MAP   
>> 
>> Data for node: cn6083  Num slots: 16   Max slots: 0Num procs: 32
>>   Process OMPI jobid: [44767,1] App: 0 Process rank: 0
>>   Process OMPI jobid: [44767,1] App: 0 Process rank: 

Re: [OMPI users] oversubscription of slots with GridEngine

2014-11-11 Thread Reuti
Am 11.11.2014 um 17:52 schrieb Ralph Castain:

> 
>> On Nov 11, 2014, at 7:57 AM, Reuti  wrote:
>> 
>> Am 11.11.2014 um 16:13 schrieb Ralph Castain:
>> 
>>> This clearly displays the problem - if you look at the reported “allocated 
>>> nodes”, you see that we only got one node (cn6050). This is why we mapped 
>>> all your procs onto that node.
>>> 
>>> So the real question is - why? Can you show us the content of PE_HOSTFILE?
>>> 
>>> 
>>>> On Nov 11, 2014, at 4:51 AM, SLIM H.A.  wrote:
>>>> 
>>>> Dear Reuti and Ralph
>>>> 
>>>> Below is the output of the run for openmpi 1.8.3 with this line
>>>> 
>>>> mpirun -np $NSLOTS --display-map --display-allocation --cpus-per-proc 1 
>>>> $exe
>>>> 
>>>> 
>>>> master=cn6050
>>>> PE=orte
>>>> JOB_ID=2482923
>>>> Got 32 slots.
>>>> slots:
>>>> cn6050 16 par6.q@cn6050 
>>>> cn6045 16 par6.q@cn6045 
>> 
>> The above looks like the PE_HOSTFILE. So it should be 16 slots per node.
> 
> Hey Reuti
> 
> Is that the standard PE_HOSTFILE format? I’m looking at the ras/gridengine 
> module, and it looks like it is expecting a different format. I suspect that 
> is the problem

Well, the fourth column can be a processor range in older versions of SGE and 
the binding in newer ones, but the first three columns were always this way.
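For reference, a sketch of the corresponding PE_HOSTFILE content for the two hosts 
above (on newer installations with core binding switched on, the fourth column 
would be a tuple like 0,0 instead of UNDEFINED):

cn6050 16 par6.q@cn6050 UNDEFINED
cn6045 16 par6.q@cn6045 UNDEFINED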

-- Reuti


> Ralph
> 
>> 
>> I wonder whether any environment variable was reset, which normally allows 
>> Open MPI to discover that it's running inside SGE.
>> 
>> I.e. SGE_ROOT, JOB_ID, ARC and PE_HOSTFILE are untouched before the job 
>> starts?
>> 
>> Supplying "-np $NSLOTS" shouldn't be necessary though.
>> 
>> -- Reuti
>> 
>> 
>> 
>>>> Tue Nov 11 12:37:37 GMT 2014
>>>> 
>>>> ==   ALLOCATED NODES   ==
>>>>   cn6050: slots=16 max_slots=0 slots_inuse=0 state=UP
>>>> =
>>>> Data for JOB [57374,1] offset 0
>>>> 
>>>>    JOB MAP   
>>>> 
>>>> Data for node: cn6050  Num slots: 16   Max slots: 0Num procs: 32
>>>>   Process OMPI jobid: [57374,1] App: 0 Process rank: 0
>>>>   Process OMPI jobid: [57374,1] App: 0 Process rank: 1
>>>> 
>>>> …
>>>>   Process OMPI jobid: [57374,1] App: 0 Process rank: 31
>>>> 
>>>> 
>>>> Also
>>>> ompi_info | grep grid
>>>> gives MCA ras: gridengine (MCA v2.0, API v2.0, Component 
>>>> v1.8.3)
>>>> and
>>>> ompi_info | grep psm
>>>> gives MCA mtl: psm (MCA v2.0, API v2.0, Component v1.8.3)
>>>> because the intercoonect is TrueScale/QLogic
>>>> 
>>>> and
>>>> 
>>>> setenv OMPI_MCA_mtl "psm"
>>>> 
>>>> is set in the script. This is the PE
>>>> 
>>>> pe_name   orte
>>>> slots 4000
>>>> user_listsNONE
>>>> xuser_lists   NONE
>>>> start_proc_args   /bin/true
>>>> stop_proc_args/bin/true
>>>> allocation_rule   $fill_up
>>>> control_slavesTRUE
>>>> job_is_first_task FALSE
>>>> urgency_slots min
>>>> 
>>>> Many thanks
>>>> 
>>>> Henk
>>>> 
>>>> 
>>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
>>>> Sent: 10 November 2014 05:07
>>>> To: Open MPI Users
>>>> Subject: Re: [OMPI users] oversubscription of slots with GridEngine
>>>> 
>>>> You might also add the —display-allocation flag to mpirun so we can see 
>>>> what it thinks the allocation looks like. If there are only 16 slots on 
>>>> the node, it seems odd that OMPI would assign 32 procs to it unless it 
>>>> thinks there is only 1 node in the job, and oversubscription is allowed 
>>>> (which it won’t be by default if it read the GE allocation)
>>>> 
>>>> 
>>>> On Nov 9, 2014, at 9:56 AM, Reuti  wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> 
>>>> Am 09.11.2014 um 18:20 schrieb SLIM H.A. :
>>&

Re: [OMPI users] oversubscription of slots with GridEngine

2014-11-11 Thread Reuti

Am 11.11.2014 um 19:29 schrieb Ralph Castain:

> 
>> On Nov 11, 2014, at 10:06 AM, Reuti  wrote:
>> 
>> Am 11.11.2014 um 17:52 schrieb Ralph Castain:
>> 
>>> 
>>>> On Nov 11, 2014, at 7:57 AM, Reuti  wrote:
>>>> 
>>>> Am 11.11.2014 um 16:13 schrieb Ralph Castain:
>>>> 
>>>>> This clearly displays the problem - if you look at the reported 
>>>>> “allocated nodes”, you see that we only got one node (cn6050). This is 
>>>>> why we mapped all your procs onto that node.
>>>>> 
>>>>> So the real question is - why? Can you show us the content of PE_HOSTFILE?
>>>>> 
>>>>> 
>>>>>> On Nov 11, 2014, at 4:51 AM, SLIM H.A.  wrote:
>>>>>> 
>>>>>> Dear Reuti and Ralph
>>>>>> 
>>>>>> Below is the output of the run for openmpi 1.8.3 with this line
>>>>>> 
>>>>>> mpirun -np $NSLOTS --display-map --display-allocation --cpus-per-proc 1 
>>>>>> $exe
>>>>>> 
>>>>>> 
>>>>>> master=cn6050
>>>>>> PE=orte
>>>>>> JOB_ID=2482923
>>>>>> Got 32 slots.
>>>>>> slots:
>>>>>> cn6050 16 par6.q@cn6050 
>>>>>> cn6045 16 par6.q@cn6045 
>>>> 
>>>> The above looks like the PE_HOSTFILE. So it should be 16 slots per node.
>>> 
>>> Hey Reuti
>>> 
>>> Is that the standard PE_HOSTFILE format? I’m looking at the ras/gridengine 
>>> module, and it looks like it is expecting a different format. I suspect 
>>> that is the problem
>> 
>> Well, the fourth column can be a processer range in older versions of SGE 
>> and the binding in newer ones, but the first three columns were always this 
>> way.
> 
> Hmmm…perhaps I’m confused here. I guess you’re saying that just the last two 
> lines of his output contain the PE_HOSTFILE, as opposed to the entire thing?

Yes. The entire thing looks like output from the OP's job script. Only the last 
two lines should be the content of the PE_HOSTFILE.


> If so, I’m wondering if that NULL he shows in there is the source of the 
> trouble. The parser doesn’t look like it would handle that very well, though 
> I’d need to test it. Is that NULL expected? Or is the NULL not really in the 
> file?

I must admit here: for me the fourth column is either literally UNDEFINED or, 
with binding turned on, the tuple cpu,core like 0,0. But it's never NULL - 
neither literally nor the byte 0x00. Maybe the OP can tell us which GE version 
he uses.

-- Reuti


>> -- Reuti
>> 
>> 
>>> Ralph
>>> 
>>>> 
>>>> I wonder whether any environment variable was reset, which normally allows 
>>>> Open MPI to discover that it's running inside SGE.
>>>> 
>>>> I.e. SGE_ROOT, JOB_ID, ARC and PE_HOSTFILE are untouched before the job 
>>>> starts?
>>>> 
>>>> Supplying "-np $NSLOTS" shouldn't be necessary though.
>>>> 
>>>> -- Reuti
>>>> 
>>>> 
>>>> 
>>>>>> Tue Nov 11 12:37:37 GMT 2014
>>>>>> 
>>>>>> ==   ALLOCATED NODES   ==
>>>>>>  cn6050: slots=16 max_slots=0 slots_inuse=0 state=UP
>>>>>> =
>>>>>> Data for JOB [57374,1] offset 0
>>>>>> 
>>>>>>    JOB MAP   
>>>>>> 
>>>>>> Data for node: cn6050  Num slots: 16   Max slots: 0Num procs: 32
>>>>>>  Process OMPI jobid: [57374,1] App: 0 Process rank: 0
>>>>>>  Process OMPI jobid: [57374,1] App: 0 Process rank: 1
>>>>>> 
>>>>>> …
>>>>>>  Process OMPI jobid: [57374,1] App: 0 Process rank: 31
>>>>>> 
>>>>>> 
>>>>>> Also
>>>>>> ompi_info | grep grid
>>>>>> gives MCA ras: gridengine (MCA v2.0, API v2.0, Component 
>>>>>> v1.8.3)
>>>>>> and
>>>>>> ompi_info | grep psm
>>>>>> gives MCA mtl: psm (MCA v2.0, API v2.0, Component v1.8.3)
>>>>>> because the intercoonect is TrueScale/QLogic
>>>>>> 
>>>>>> and
>>

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-12 Thread Reuti

Am 11.11.2014 um 02:12 schrieb Gilles Gouaillardet:

> Hi,
> 
> IIRC there were some bug fixes between 1.8.1 and 1.8.2 in order to really use 
> all the published interfaces.
> 
> by any change, are you running a firewall on your head node ?

Yes, but only on the interface to the outside world. Nevertheless I switched it 
off and the result was the same 2-minute delay during startup.


> one possible explanation is the compute node tries to access the public 
> interface of the head node, and packets get dropped by the firewall.
> 
> if you are running a firewall, can you make a test without it ?
> /* if you do need NAT, then just remove the DROP and REJECT rules "/
> 
> an other possible explanation is the compute node is doing (reverse) dns 
> requests with the public name and/or ip of the head node and that takes some 
> time to complete (success or failure, this does not really matter here)

In the machinefile I tried both the internal and the external name of the 
headnode, i.e. different names for different interfaces. The result is the same.


> /* a simple test is to make sure all the hosts/ip of the head node are in the 
> /etc/hosts of the compute node */
> 
> could you check your network config (firewall and dns) ?
> 
> can you reproduce the delay when running mpirun on the head node and with one 
> mpi task on the compute node ?

You mean one on the head node and one on the compute node, opposed to two + two 
in my initial test?

Sure, but with 1+1 I get the same result.


> if yes, then the hard way to trace the delay issue would be to strace -ttt 
> both orted and mpi task that are launched on the compute node and see where 
> the time is lost.
> /* at this stage, i would suspect orted ... */

As the `ssh` on the headnode hangs for a while, I suspect it's something on the 
compute node. I see there during the startup:

orted -mca ess env -mca orte_ess_jobid 2412773376 -mca orte_ess_vpid 1 -mca 
orte_ess_num_procs 2 -mca orte_hnp_uri 
2412773376.0;tcp://137.248.x.y,192.168.154.30,192.168.154.187:58782 
--tree-spawn -mca plm rsh

===
Only the subnet 192.168.154.0/26 (yes, 26) is used to access the nodes from the 
master, i.e. the login machine. As additional information: the nodes have two 
network interfaces, one in 192.168.154.0/26 and one in 192.168.154.64/26 to 
reach a file server.
==


Falling back to 1.8.1 I see:

bash -c  orted -mca ess env -mca orte_ess_jobid 3182034944 -mca orte_ess_vpid 1 
-mca orte_ess_num_procs 2 -mca orte_hnp_uri 
"3182034944.0;tcp://137.248.x.y,192.168.154.30,192.168.154.187:54436" 
--tree-spawn -mca plm rsh -mca hwloc_base_binding_policy none

So the `bash -c` wrapper was removed in the newer version. But I don't think this causes anything.

-- Reuti


> Cheers,
> 
> Gilles
> 
> On Mon, Nov 10, 2014 at 5:56 PM, Reuti  wrote:
> Hi,
> 
> Am 10.11.2014 um 16:39 schrieb Ralph Castain:
> 
> > That is indeed bizarre - we haven’t heard of anything similar from other 
> > users. What is your network configuration? If you use oob_tcp_if_include or 
> > exclude, can you resolve the problem?
> 
> Thx - this option helped to get it working.
> 
> These tests were made for sake of simplicity between the headnode of the 
> cluster and one (idle) compute node. I tried then between the (identical) 
> compute nodes and this worked fine. The headnode of the cluster and the 
> compute node are slightly different though (i.e. number of cores), and using 
> eth1 resp. eth0 for the internal network of the cluster.
> 
> I tried --hetero-nodes with no change.
> 
> Then I turned to:
> 
> reuti@annemarie:~> date; mpiexec -mca btl self,tcp --mca oob_tcp_if_include 
> 192.168.154.0/26 -n 4 --hetero-nodes --hostfile machines ./mpihello; date
> 
> and the application started instantly. On another cluster, where the headnode 
> is identical to the compute nodes but with the same network setup as above, I 
> observed a delay of "only" 30 seconds. Nevertheless, also on this cluster the 
> working addition was the correct "oob_tcp_if_include" to solve the issue.
> 
> The questions which remain: a) is this a targeted behavior, b) what changed 
> in this scope between 1.8.1 and 1.8.2?
> 
> -- Reuti
> 
> 
> >
> >> On Nov 10, 2014, at 4:50 AM, Reuti  wrote:
> >>
> >> Am 10.11.2014 um 12:50 schrieb Jeff Squyres (jsquyres):
> >>
> >>> Wow, that's pretty terrible!  :(
> >>>
> >>> Is the behavior BTL-specific, perchance?  E.G., if you only use certain 
> >>> BTLs, does the delay disappear?
> >>
> >> You mean something like:
> >>
> >> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile 
> >> machines ./mpihello; date
> >> M

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-12 Thread Reuti
Am 11.11.2014 um 02:25 schrieb Ralph Castain:

> Another thing you can do is (a) ensure you built with —enable-debug, and then 
> (b) run it with -mca oob_base_verbose 100  (without the tcp_if_include 
> option) so we can watch the connection handshake and see what it is doing. 
> The —hetero-nodes will have not affect here and can be ignored.

Done. It really tries to connect to the outside interface of the headnode. But 
firewall or not: the nodes have no clue how to reach 137.248.0.0 - they have no 
gateway to this network at all.

It does so regardless of whether the internal or the external name of the 
headnode is given in the machinefile - I hit ^C then. I attached the output of 
Open MPI 1.8.1 for this setup too.

-- Reuti

Wed Nov 12 16:43:12 CET 2014
[annemarie:01246] mca: base: components_register: registering oob components
[annemarie:01246] mca: base: components_register: found loaded component tcp
[annemarie:01246] mca: base: components_register: component tcp register 
function successful
[annemarie:01246] mca: base: components_open: opening oob components
[annemarie:01246] mca: base: components_open: found loaded component tcp
[annemarie:01246] mca: base: components_open: component tcp open function 
successful
[annemarie:01246] mca:oob:select: checking available component tcp
[annemarie:01246] mca:oob:select: Querying component [tcp]
[annemarie:01246] oob:tcp: component_available called
[annemarie:01246] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[annemarie:01246] [[37241,0],0] oob:tcp:init rejecting loopback interface lo
[annemarie:01246] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[annemarie:01246] [[37241,0],0] oob:tcp:init adding 137.248.x.y to our list of 
V4 connections
[annemarie:01246] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
[annemarie:01246] [[37241,0],0] oob:tcp:init adding 192.168.154.30 to our list 
of V4 connections
[annemarie:01246] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
[annemarie:01246] [[37241,0],0] oob:tcp:init adding 192.168.154.187 to our list 
of V4 connections
[annemarie:01246] [[37241,0],0] TCP STARTUP
[annemarie:01246] [[37241,0],0] attempting to bind to IPv4 port 0
[annemarie:01246] [[37241,0],0] assigned IPv4 port 53661
[annemarie:01246] mca:oob:select: Adding component to end
[annemarie:01246] mca:oob:select: Found 1 active transports
[node28:05663] mca: base: components_register: registering oob components
[node28:05663] mca: base: components_register: found loaded component tcp
[node28:05663] mca: base: components_register: component tcp register function 
successful
[node28:05663] mca: base: components_open: opening oob components
[node28:05663] mca: base: components_open: found loaded component tcp
[node28:05663] mca: base: components_open: component tcp open function 
successful
[node28:05663] mca:oob:select: checking available component tcp
[node28:05663] mca:oob:select: Querying component [tcp]
[node28:05663] oob:tcp: component_available called
[node28:05663] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[node28:05663] [[37241,0],1] oob:tcp:init rejecting loopback interface lo
[node28:05663] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[node28:05663] [[37241,0],1] oob:tcp:init adding 192.168.154.28 to our list of 
V4 connections
[node28:05663] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
[node28:05663] [[37241,0],1] oob:tcp:init adding 192.168.154.98 to our list of 
V4 connections
[node28:05663] [[37241,0],1] TCP STARTUP
[node28:05663] [[37241,0],1] attempting to bind to IPv4 port 0
[node28:05663] [[37241,0],1] assigned IPv4 port 45802
[node28:05663] mca:oob:select: Adding component to end
[node28:05663] mca:oob:select: Found 1 active transports
[node28:05663] [[37241,0],1]: set_addr to uri 
2440626176.0;tcp://137.248.x.y,192.168.154.30,192.168.154.187:53661
[node28:05663] [[37241,0],1]:set_addr checking if peer [[37241,0],0] is 
reachable via component tcp
[node28:05663] [[37241,0],1] oob:tcp: working peer [[37241,0],0] address 
tcp://137.248.x.y,192.168.154.30,192.168.154.187:53661
[node28:05663] [[37241,0],1] PASSING ADDR 137.248.x.y TO MODULE
[node28:05663] [[37241,0],1]:tcp set addr for peer [[37241,0],0]
[node28:05663] [[37241,0],1] PASSING ADDR 192.168.154.30 TO MODULE
[node28:05663] [[37241,0],1]:tcp set addr for peer [[37241,0],0]
[node28:05663] [[37241,0],1] PASSING ADDR 192.168.154.187 TO MODULE
[node28:05663] [[37241,0],1]:tcp set addr for peer [[37241,0],0]
[node28:05663] [[37241,0],1]: peer [[37241,0],0] is reachable via component tcp
[node28:05663] [[37241,0],1] OOB_SEND: rml_oob_send.c:199
[node28:05663] [[37241,0],1]:tcp:processing set_peer cmd
[node28:05663] [[37241,0],1] SET_PEER ADDING PEER [[37241,0],0]
[node28:05663] [[37241,0],1] set_peer: peer [[37241,0],0] is listening on net 
137.248.x.y port 53661
[node28:05663] [[37241,0],1]:tcp:processing set_peer cmd
[node28:05663] [[37241,0],1] set_peer: peer [[37241,0],0] is listening on net 
192.168.154.30 port 53661
[node28:05663] [[37241,0],1]:tc

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-12 Thread Reuti
Am 12.11.2014 um 17:27 schrieb Reuti:

> Am 11.11.2014 um 02:25 schrieb Ralph Castain:
> 
>> Another thing you can do is (a) ensure you built with —enable-debug, and 
>> then (b) run it with -mca oob_base_verbose 100  (without the tcp_if_include 
>> option) so we can watch the connection handshake and see what it is doing. 
>> The —hetero-nodes will have not affect here and can be ignored.
> 
> Done. It really tries to connect to the outside interface of the headnode. 
> But being there a firewall or not: the nodes have no clue how to reach 
> 137.248.0.0 - they have no gateway to this network at all.

I have to revert this: they think that there is a gateway although there isn't 
one. When I remove the gateway entry from the routing table by hand, it starts 
up instantly too.

While I can do this on my own cluster, I still have the 30-second delay on a 
cluster where I'm not root, though that may be due to the firewall there. The 
gateway on that cluster does indeed lead to the outside world.

Personally I find this behavior of using all interfaces a little too aggressive. 
If you don't check this carefully beforehand and start a long-running 
application, you might not even notice the delay during startup.

-- Reuti


> It tries so independent from the internal or external name of the headnode 
> given in the machinefile - I hit ^C then. I attached the output of Open MPI 
> 1.8.1 for this setup too.
> 
> -- Reuti
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/11/25777.php



Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-13 Thread Reuti
Am 13.11.2014 um 00:55 schrieb Gilles Gouaillardet:

> Could you please send the output of netstat -nr on both head and compute node 
> ?

Head node:

annemarie:~ # netstat -nr
Kernel IP routing table
Destination Gateway Genmask Flags   MSS Window  irtt Iface
0.0.0.0 137.248.x.y 0.0.0.0 UG0 0  0 eth0
127.0.0.0   0.0.0.0 255.0.0.0   U 0 0  0 lo
137.248.x.0   0.0.0.0 255.255.255.0   U 0 0  0 eth0
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0  0 eth0
192.168.151.80  0.0.0.0 255.255.255.255 UH0 0  0 eth1
192.168.154.0   0.0.0.0 255.255.255.192 U 0 0  0 eth1
192.168.154.128 0.0.0.0 255.255.255.192 U 0 0  0 eth3

Compute node with the (wrong) entry for the non-existent GW:

node28:~ # netstat -nr 
Kernel IP routing table
Destination Gateway Genmask Flags   MSS Window  irtt Iface
0.0.0.0 192.168.154.60  0.0.0.0 UG0 0  0 eth0
127.0.0.0   0.0.0.0 255.0.0.0   U 0 0  0 lo
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0  0 eth0
192.168.154.0   0.0.0.0 255.255.255.192 U 0 0  0 eth0
192.168.154.64  0.0.0.0 255.255.255.192 U 0 0  0 eth1

As said: when I remove the "default" entry for the GW it starts up instantly.
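For the record, this is roughly what I did on a node to drop the bogus entry (a 
sketch only; the gateway address is the one from the table above, and the 
change is not persistent across a reboot):

node28:~ # route del default gw 192.168.154.60    # or: ip route del default
node28:~ # netstat -nr                            # the "UG" default line should be gone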

-- Reuti



> no problem obfuscating the ip of the head node, i am only interested in 
> netmasks and routes.
> 
> Ralph Castain  wrote:
>> 
>>> On Nov 12, 2014, at 2:45 PM, Reuti  wrote:
>>> 
>>> Am 12.11.2014 um 17:27 schrieb Reuti:
>>> 
>>>> Am 11.11.2014 um 02:25 schrieb Ralph Castain:
>>>> 
>>>>> Another thing you can do is (a) ensure you built with —enable-debug, and 
>>>>> then (b) run it with -mca oob_base_verbose 100  (without the 
>>>>> tcp_if_include option) so we can watch the connection handshake and see 
>>>>> what it is doing. The —hetero-nodes will have not affect here and can be 
>>>>> ignored.
>>>> 
>>>> Done. It really tries to connect to the outside interface of the headnode. 
>>>> But being there a firewall or not: the nodes have no clue how to reach 
>>>> 137.248.0.0 - they have no gateway to this network at all.
>>> 
>>> I have to revert this. They think that there is a gateway although it 
>>> isn't. When I remove the entry by hand for the gateway in the routing table 
>>> it starts up instantly too.
>>> 
>>> While I can do this on my own cluster I still have the 30 seconds delay on 
>>> a cluster where I'm not root, while this can be because of the firewall 
>>> there. The gateway on this cluster is indeed going to the outside world.
>>> 
>>> Personally I find this behavior a little bit too aggressive to use all 
>>> interfaces. If you don't check this carefully beforehand and start a long 
>>> running application one might even not notice the delay during the startup.
>> 
>> Agreed - do you have any suggestions on how we should choose the order in 
>> which to try them? I haven’t been able to come up with anything yet. Jeff 
>> has some fancy algo in his usnic BTL that we are going to discuss after SC 
>> that I’m hoping will help, but I’d be open to doing something better in the 
>> interim for 1.8.4
>> 
>>> 
>>> -- Reuti
>>> 
>>> 
>>>> It tries so independent from the internal or external name of the headnode 
>>>> given in the machinefile - I hit ^C then. I attached the output of Open 
>>>> MPI 1.8.1 for this setup too.
>>>> 
>>>> -- Reuti
>>>> 
>>>> ___
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post: 
>>>> http://www.open-mpi.org/community/lists/users/2014/11/25777.php
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/users/2014/11/25781.php
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2014/11/25782.php
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/11/25783.php
> 



Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-13 Thread Reuti
Gus,

Am 13.11.2014 um 02:59 schrieb Gus Correa:

> On 11/12/2014 05:45 PM, Reuti wrote:
>> Am 12.11.2014 um 17:27 schrieb Reuti:
>> 
>>> Am 11.11.2014 um 02:25 schrieb Ralph Castain:
>>> 
>>>> Another thing you can do is (a) ensure you built with —enable-debug,
>> and then (b) run it with -mca oob_base_verbose 100
>> (without the tcp_if_include option) so we can watch
>> the connection handshake and see what it is doing.
>> The —hetero-nodes will have not affect here and can be ignored.
>>> 
>>> Done. It really tries to connect to the outside
>> interface of the headnode. But being there a firewall or not:
>> the nodes have no clue how to reach 137.248.0.0 -
>> they have no gateway to this network at all.
>> 
>> I have to revert this.
>> They think that there is a gateway although it isn't.
>> When I remove the entry by hand for the gateway in the
>> routing table it starts up instantly too.
>> 
>> While I can do this on my own cluster I still have the
>> 30 seconds delay on a cluster where I'm not root,
>> while this can be because of the firewall there.
>> The gateway on this cluster is indeed going
>> to the outside world.
>> 
>> Personally I find this behavior a little bit too aggressive
>> to use all interfaces. If you don't check this carefully
>> beforehand and start a long running application one might
>> even not notice the delay during the startup.
>> 
>> -- Reuti
>> 
> 
> Hi Reuti
> 
> You could use the mca parameter file
> (say, $prefix/etc/openmpi-mca-params.conf) to configure cluster-wide
> the oob (and btl) interfaces to be used.
> The users can still override your choices if they want.
> 
> Just put a line like this in openmpi-mca-params.conf :
> oob_tcp_if_include=192.168.154.0/26
> 
> (and similar for btl_tcp_if_include, btl_openib_if_include).
> 
> Get a full list from "ompi_info --all --all |grep if_include".
> 
> See these FAQ:
> 
> http://www.open-mpi.org/faq/?category=tcp#tcp-selection
> http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
> 
> Compute nodes tend to be multi-homed, so what criterion would OMPI use
> to select one interface among many,

My compute nodes have two interfaces: one for MPI (and the light ssh/SGE traffic 
to start processes somewhere) and one for NFS to transfer files from/to the file 
server. So Open MPI may use both interfaces without telling me anything about 
it? How will it split the traffic - 50/50? When there is a heavy file transfer 
on the NFS interface: might it hurt Open MPI's communication, or will it balance 
the usage on the fly?

When I prepare a machinefile with the names of the interfaces (or get the names 
from SGE's PE_HOSTFILE), it should use just these (except for native IB) and not 
look around for other paths to the other machine(s) (IMO). Therefore different 
interfaces have different names in my setup: "node01" is just eth0 and differs 
from "node01-nfs" for eth1.


> not knowing beforehand what exists in a particular computer?
> There would be a risk to make a bad choice.
> The current approach gives you everything, and you
> pick/select/restrict what you want to fit your needs,
> with mca parameters (which can be set in several
> ways and with various scopes).
> 
> I don't think this bad.
> However, I am biased about this.
> I like and use the openmpi-mca-params.conf file
> to setup sensible defaults.
> At least I think they are sensible. :)

I see that this can be prepared for all users this way. Whenever they use my 
installed version it will work - maybe I'll have to investigate what to enter 
there on some other clusters where I'm not root, but it can certainly be done.
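E.g. something like this in $prefix/etc/openmpi-mca-params.conf (just a sketch; 
the /26 subnet is the MPI network taken from my routing tables earlier in this 
thread):

oob_tcp_if_include = 192.168.154.0/26
btl_tcp_if_include = 192.168.154.0/26

or per job on the command line:

$ mpirun --mca oob_tcp_if_include 192.168.154.0/26 --mca btl_tcp_if_include 192.168.154.0/26 ./app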

BUT: it is probably a rare situation for a quantum chemistry group to have a 
sysadmin of their own taking care of the clusters and of the well-behaved 
operation of the installed software, be it applications or libraries. Often some 
PhD student in another group will get a side project: please install software XY 
for the group. They are chemists and want to get the software running - they are 
not experts in Open MPI*. They don't care about a tight integration or using the 
correct interfaces as long as the application delivers the results in the end. 
For example ORCA**: the users of this software need to install a shared-library 
build of Open MPI in a specific version. I see in the ORCA*** forum that many 
struggle to compile a shared-library version of Open MPI and to have access to 
it during execution, i.e. how to set LD_LIBRARY_PATH so that it's known on the 
slaves. The cluster admins are in another department and refuse to make any

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-13 Thread Reuti
Am 13.11.2014 um 00:34 schrieb Ralph Castain:

>> On Nov 12, 2014, at 2:45 PM, Reuti  wrote:
>> 
>> Am 12.11.2014 um 17:27 schrieb Reuti:
>> 
>>> Am 11.11.2014 um 02:25 schrieb Ralph Castain:
>>> 
>>>> Another thing you can do is (a) ensure you built with —enable-debug, and 
>>>> then (b) run it with -mca oob_base_verbose 100  (without the 
>>>> tcp_if_include option) so we can watch the connection handshake and see 
>>>> what it is doing. The —hetero-nodes will have not affect here and can be 
>>>> ignored.
>>> 
>>> Done. It really tries to connect to the outside interface of the headnode. 
>>> But being there a firewall or not: the nodes have no clue how to reach 
>>> 137.248.0.0 - they have no gateway to this network at all.
>> 
>> I have to revert this. They think that there is a gateway although it isn't. 
>> When I remove the entry by hand for the gateway in the routing table it 
>> starts up instantly too.
>> 
>> While I can do this on my own cluster I still have the 30 seconds delay on a 
>> cluster where I'm not root, while this can be because of the firewall there. 
>> The gateway on this cluster is indeed going to the outside world.
>> 
>> Personally I find this behavior a little bit too aggressive to use all 
>> interfaces. If you don't check this carefully beforehand and start a long 
>> running application one might even not notice the delay during the startup.
> 
> Agreed - do you have any suggestions on how we should choose the order in 
> which to try them? I haven’t been able to come up with anything yet. Jeff has 
> some fancy algo in his usnic BTL that we are going to discuss after SC that 
> I’m hoping will help, but I’d be open to doing something better in the 
> interim for 1.8.4

The plain `mpiexec` should just use the interfaces specified in the hostfile, be 
it hand-crafted or prepared by a queuing system.


Option: could a single entry for a machine in the hostfile contain a list of 
interfaces? I mean something like:

node01,node01-extra-eth1,node01-extra-eth2 slots=4

or

node01* slots=4

Means: use exactly these interfaces or even try to find all available 
interfaces on/between the machines.

In case all interfaces have the same name, then it's up to the admin to correct 
this.

-- Reuti


>> -- Reuti
>> 
>> 
>>> It tries so independent from the internal or external name of the headnode 
>>> given in the machinefile - I hit ^C then. I attached the output of Open MPI 
>>> 1.8.1 for this setup too.
>>> 
>>> -- Reuti
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/users/2014/11/25777.php
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2014/11/25781.php
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/11/25782.php
> 



Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-13 Thread Reuti
Am 13.11.2014 um 17:14 schrieb Ralph Castain:

> Hmmm…I’m beginning to grok the issue. It is a tad unusual for people to 
> assign different hostnames to their interfaces - I’ve seen it in the Hadoop 
> world, but not in HPC. Still, no law against it.

Maybe it depends on the background of doing it this way. At one point in the 
past I read this Howto:

https://arc.liv.ac.uk/SGE/howto/multi_intrfcs.html

and appreciated the idea of routing different services to different interfaces - 
a large file copy won't hurt the MPI communication this way. As SGE handles 
contacting the qmaster or execds on the correct interface of the machines well 
(which might be eth0, eth1 or any other one), I have been doing it this way for 
a decade now, and according to the mails on the SGE lists others are doing it 
too. Hence I don't find it that unusual.


> This will take a little thought to figure out a solution. One problem that 
> immediately occurs is if someone includes a hostfile that has lines which 
> refer to the same physical server, but using different interface names.

Yes, I see this point too. Therefore I had the idea of listing all the 
interfaces one wants to use on one line. If they were put on different lines, it 
would be done the wrong way - their fault. One line = one machine, unless the 
list of interfaces is exactly the same on multiple lines; then they could be 
added up as they are now.

(Under SGE there is the [now correctly working] setup where the same machine 
appears a couple of times in case the slots originate from several queues. But 
this would still fit the above interpretation: as the interface name is the same 
even though the entries come from different queues, they can just be added up as 
now in the GridEngine MCA.)


> We’ll think those are completely distinct servers, and so the process 
> placement will be totally messed up.
> 
> We’ll also encounter issues with the daemon when it reports back, as the 
> hostname it gets will almost certainly differ from the hostname we were 
> expecting. Not as critical, but need to check to see where that will impact 
> the code base

Hence I prefer to use eth0 for Open MPI (for now). But I remember that there was 
a time when the MPI traffic could be set up to be routed dedicatedly to eth1, 
although that was for MPICH(1):

https://arc.liv.ac.uk/SGE/howto/mpich-integration.html => Wrong interface 
selected for the back channel of the MPICH-tasks with the ch_p4-device


> We can look at the hostfile changes at that time - no real objection to them, 
> but would need to figure out how to pass that info to the appropriate 
> subsystems. I assume you want this to apply to both the oob and tcp/btl?

Yes.


> Obviously, this won’t make it for 1.8 as it is going to be fairly intrusive, 
> but we can probably do something for 1.9
> 
>> On Nov 13, 2014, at 4:23 AM, Reuti  wrote:
>> 
>> Am 13.11.2014 um 00:34 schrieb Ralph Castain:
>> 
>>>> On Nov 12, 2014, at 2:45 PM, Reuti  wrote:
>>>> 
>>>> Am 12.11.2014 um 17:27 schrieb Reuti:
>>>> 
>>>>> Am 11.11.2014 um 02:25 schrieb Ralph Castain:
>>>>> 
>>>>>> Another thing you can do is (a) ensure you built with —enable-debug, and 
>>>>>> then (b) run it with -mca oob_base_verbose 100  (without the 
>>>>>> tcp_if_include option) so we can watch the connection handshake and see 
>>>>>> what it is doing. The —hetero-nodes will have not affect here and can be 
>>>>>> ignored.
>>>>> 
>>>>> Done. It really tries to connect to the outside interface of the 
>>>>> headnode. But being there a firewall or not: the nodes have no clue how 
>>>>> to reach 137.248.0.0 - they have no gateway to this network at all.
>>>> 
>>>> I have to revert this. They think that there is a gateway although it 
>>>> isn't. When I remove the entry by hand for the gateway in the routing 
>>>> table it starts up instantly too.
>>>> 
>>>> While I can do this on my own cluster I still have the 30 seconds delay on 
>>>> a cluster where I'm not root, while this can be because of the firewall 
>>>> there. The gateway on this cluster is indeed going to the outside world.
>>>> 
>>>> Personally I find this behavior a little bit too aggressive to use all 
>>>> interfaces. If you don't check this carefully beforehand and start a long 
>>>> running application one might even not notice the delay during the startup.
>>> 
>>> Agreed - do you have any suggestions on how we should choose the order in 
>>> which to try them? I haven’t been able to come up with anything yet. Jeff 
>>> has 

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-14 Thread Reuti
Jeff, Gus, Gilles,

Am 14.11.2014 um 15:56 schrieb Jeff Squyres (jsquyres):

> I lurked on this thread for a while, but I have some thoughts on the many 
> issues that were discussed on this thread (sorry, I'm still pretty under 
> water trying to get ready for SC next week...).

I appreciate your replies and will read them thoroughly. I think it's best to 
continue with the discussion after SC14. I don't want to put any burden on 
anyone when time is tight.

-- Reuti


>  These points are in no particular order...
> 
> 0. Two fundamental points have been missed in this thread:
> 
>   - A hostname technically has nothing to do with the resolvable name of an 
> IP interface.  By convention, many people set the hostname to be the same as 
> some "primary" IP interface (for some definition of "primary", e.g., eth0).  
> But they are actually unrelated concepts.
> 
>   - Open MPI uses host specifications only to specify a remote server, *NOT* 
> an interface.  E.g., when you list names in a hostile or the --host CLI 
> option, those only specify the server -- not the interface(s).  This was an 
> intentional design choice because there tends to be confusion and different 
> schools of thought about the question "What's the [resolvable] name of that 
> remote server?"  Hence, OMPI will take any old name you throw at it to 
> identify that remote server, but then we have separate controls for 
> specifying which interface(s) to use to communicate with that server.
> 
> 1. Remember that there is at least one, and possibly two, uses of TCP 
> communications in Open MPI -- and they are used differently:
> 
>   - Command/control (sometimes referred to as "oob"): used for things like 
> mpirun control messages, shuttling IO from remote processes back to mpirun, 
> etc.  Generally, unless you have a mountain of stdout/stderr from your 
> launched processes, this isn't a huge amount of traffic.
> 
>   - MPI messages: kernel-based TCP is the fallback if you don't have some 
> kind of faster off-server network -- i.e., the TCP BTL.  Like all BTLs, the 
> TCP BTL carries all MPI traffic when it is used.  How much traffic is 
> sent/received depends on your application.
> 
> 2. For OOB, I believe that the current ORTE mechanism is that it will try all 
> available IP interfaces and use the *first* one that succeeds.  Meaning: 
> after some negotiation, only one IP interface will be used to communicate 
> with a given peer.
> 
> 3. The TCP BTL will examine all local IP interfaces and determine all that 
> can be used to reach each peer according to the algorithm described here: 
> http://www.open-mpi.org/faq/?category=tcp#tcp-routability-1.3.  It will use 
> *all* IP interfaces to reach a given peer in order to maximize the available 
> bandwidth.
> 
> 4. The usNIC BTL uses UDP as its wire transport, and therefore has the same 
> reachability issues as both the TCP OOB and BTL.  However, we use a different 
> mechanism than the algorithm described in the above-cited FAQ item: we simply 
> query the Linux routing table.  This can cause ARP requests, but the kernel 
> caches them (e.g., for multiple MPI procs on the same server making the 
> same/similar requests), and for a properly-segmented L3 network, each MPI 
> process will effectively end up querying about its local gateway (vs. the 
> actual peer), and therefore the chances of having that ARP already cached are 
> quite high.
> 
> --> I want to make this clear: there's nothing magic our the 
> usNIC/check-the-routing-table approach.  It's actually a very standard 
> IP/datacenter method.  With a proper routing table, you can know fairly 
> quickly whether local IP interface X can reach remote IP interface Y.
> 
> 5. The original problem cited in this thread was about the TCP OOB, not the 
> TCP BTL.  It's important to keep straight that the OOB, with no guidance from 
> the user, was trying to probe the different IP interfaces and find one that 
> would reach a peer.  Using the check-the-routing-table approach cited in #4, 
> we might be able to make this better (that's what Ralph and I are going to 
> talk about in December / post-SC / post-US Thanksgiving holiday).
> 
> 6. As a sidenote to #5, the TCP OOB and TCP BTL determine reachability in 
> different ways.  Remember that the TCP BTL has the benefit of having all the 
> ORTE infrastructure up and running.  Meaning: MPI processes can exchange IP 
> interface information and then use that information to compute which peer IP 
> interfaces can be reached.  The TCP OOB doesn't have this benefit -- it's 
> being used to establish initial connectivity.  Hence, it probes each IP 
> interface to 

Re: [OMPI users] Cannot open configuration file - openmpi/mpic++-wrapper-data.txt

2014-12-09 Thread Reuti
Hi,

please have a look here:

http://www.open-mpi.org/faq/?category=building#installdirs
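In short, a relocated installation can be pointed to its new location via 
OPAL_PREFIX (a sketch using your new paths):

$ export OPAL_PREFIX=$HOME/mpi/openmpi/1.8.3/linux_x64
$ export PATH=$OPAL_PREFIX/bin:$PATH
$ export LD_LIBRARY_PATH=$OPAL_PREFIX/lib:$LD_LIBRARY_PATH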

-- Reuti


Am 09.12.2014 um 07:26 schrieb Manoj Vaghela:

> Hi OpenMPI Users,
> 
> I am trying to build OpenMPI libraries using standard configuration and 
> compile procedure. It is just the one thing that I want to install all in a 
> user specified path like following:
> 
> OMPI_DIR is something like $HOME/Shared_Build/openmpi-1.8.3
> 
> [OMPI_DIR] $ ./configure --prefix=$PWD/linux_x64
> 
> It all went successfully and it installed all in the path above.
> 
> I then moved the linux_x64 folder to location $HOME/mpi/openmpi/1.8.3. Now 
> the path of installation is $HOME/mpi/openmpi/1.8.3/linux_x64 
> 
> I added PATH and LD_LIBRARY_PATH as below:
> 
> export PATH=$HOME/mpi/openmpi/1.8.3/linux_x64/bin:$PATH
> export LD_LIBRARY_PATH=$HOME/mpi/openmpi/1.8.3/linux_x64/lib
> 
> which when using mpic++ command gives following:
> 
> Cannot open configuration file 
> /home/manoj//linux_x64/share/openmpi/mpic++-wrapper-data.txt
> Error parsing data file mpic++: Not found
> 
> This shows the OLD installation path for which --prefix was specified. Now 
> the installation folder moved to NEW path. But still searches the same OLD 
> location.
> 
> I searched on the web, but with that info (./configure --with-devel-headers 
> --enable-binaries did not work and gave the same issue)
> 
> This question may be a repeat but please experts guide me. I also will need 
> to copy linux_x64 folder to other similar machine from which these libraries 
> can be used to compile and run application without compiling the whole source 
> code.
> 
> Thanks.
> 
> --
> regards,
> Manoj
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/12/25931.php



Re: [OMPI users] MPI inside MPI (still)

2014-12-13 Thread Reuti
Hi,

Am 13.12.2014 um 02:43 schrieb Alex A. Schmidt:

> MPI_comm_disconnect seem to work but not quite.
> The call to it returns almost immediatly while
> the spawn processes keep piling up in the background
> until they are all done...
> 
> I think system('env -i qsub...') to launch the third party apps
> would take the execution of every call back to the scheduler 
> queue. How would I track each one for their completion?

So your goal is to implement some kind of workflow supervisor by submitting jobs 
to the queuing system and acting after their completion depending on the results 
of the QC software? Does this workflow software need MPI to do this?

Nevertheless, submitting from an application can be done via DRMAA instead of 
plain system calls. There you can also check the states of the jobs in the 
queuing system and/or wait for them and control them more easily (i.e. 
terminate, suspend, ...).

-- Reuti

http://www.drmaa.org/
https://arc.liv.ac.uk/SGE/howto/howto.html#DRMAA
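(If you stay with plain system() calls instead, SGE's `qsub -sync y` makes the 
submission block until the job has finished and returns its exit status - a 
sketch only, with a placeholder application name:)

$ qsub -sync y -b y -cwd ./third_party_app input_file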


> Alex
> 
> 2014-12-12 22:35 GMT-02:00 Gilles Gouaillardet 
> :
> Alex,
> 
> You need MPI_Comm_disconnect at least.
> I am not sure if this is 100% correct nor working.
> 
> If you are using third party apps, why dont you do something like
> system("env -i qsub ...")
> with the right options to make qsub blocking or you manually wait for the end 
> of the job ?
> 
> That looks like a much cleaner and simpler approach to me.
> 
> Cheers,
> 
> Gilles
> 
> "Alex A. Schmidt"  wrote:
> Hello Gilles,
> 
> Ok, I believe I have a simple toy app running as I think it should:
> 'n' parent processes running under mpi_comm_world, each one
> spawning its own 'm' child processes (each child group work 
> together nicely, returning the expected result for an mpi_allreduce call).
> 
> Now, as I mentioned before, the apps I want to run in the spawned 
> processes are third party mpi apps and I don't think it will be possible 
> to exchange messages with them from my app. So, I do I tell 
> when the spawned processes have finnished running? All I have to work
> with is the intercommunicator returned from the mpi_comm_spawn call...
> 
> Alex
> 
> 
> 
> 
> 2014-12-12 2:42 GMT-02:00 Alex A. Schmidt :
> Gilles,
> 
> Well, yes, I guess
> 
> I'll do tests with the real third party apps and let you know.
> These are huge quantum chemistry codes (dftb+, siesta and Gaussian)
> which greatly benefits from a parallel environment. My code is just
> a front end to use those, but since we have a lot of data to process
> it also benefits from a parallel environment. 
> 
> Alex
>  
> 
> 2014-12-12 2:30 GMT-02:00 Gilles Gouaillardet :
> Alex,
> 
> just to make sure ...
> this is the behavior you expected, right ?
> 
> Cheers,
> 
> Gilles
> 
> 
> On 2014/12/12 13:27, Alex A. Schmidt wrote:
>> Gilles,
>> 
>> Ok, very nice!
>> 
>> When I excute
>> 
>> do rank=1,3
>> call  MPI_Comm_spawn('hello_world','
>> ',5,MPI_INFO_NULL,rank,MPI_COMM_WORLD,my_intercomm,MPI_ERRCODES_IGNORE,status)
>> enddo
>> 
>> I do get 15 instances of the 'hello_world' app running: 5 for each parent
>> rank 1, 2 and 3.
>> 
>> Thanks a lot, Gilles.
>> 
>> Best regargs,
>> 
>> Alex
>> 
>> 
>> 
>> 
>> 2014-12-12 1:32 GMT-02:00 Gilles Gouaillardet >> :
>>> 
>>>  Alex,
>>> 
>>> just ask MPI_Comm_spawn to start (up to) 5 tasks via the maxprocs
>>> parameter :
>>> 
>>>int MPI_Comm_spawn(char *command, char *argv[], int maxprocs,
>>> MPI_Info info,
>>>  int root, MPI_Comm comm, MPI_Comm *intercomm,
>>>  int array_of_errcodes[])
>>> 
>>> INPUT PARAMETERS
>>>maxprocs
>>>   - maximum number of processes to start (integer, significant
>>> only at root)
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> 
>>> On 2014/12/12 12:23, Alex A. Schmidt wrote:
>>> 
>>> Hello Gilles,
>>> 
>>> Thanks for your reply. The "env -i PATH=..." stuff seems to work!!!
>>> 
>>> call system("sh -c 'env -i PATH=/usr/lib64/openmpi/bin:/bin mpirun -n 2
>>> hello_world' ")
>>> 
>>> did produce the expected result with a simple openmi "hello_world" code I
>>> wrote.
>>> 
>>> I might be harder though with the real third party app I have in mind. And
>>> I realize

Re: [OMPI users] MPI inside MPI (still)

2014-12-18 Thread Reuti
Am 18.12.2014 um 04:24 schrieb Alex A. Schmidt :
> 
> The option system("env -i ...") has been tested earlier by me and it does
> work. There is doubt though it would work along with a job scheduler.
> I will reserve this as a last resort solution.

You could also redirect stdin before the `system()` call in your application 
with dup2(). But I'm somewhat lost in the complete setup: you have a batch 
scheduler, your MPI application and third-party apps (using MPI, Linda, 
OpenMP, ...), with both of the latter working in parallel (how do you forward 
the granted machines/cores to siesta in case it should spread further to other 
machines?).

What will call what? Do you submit your application or the third-party apps as 
mentioned earlier? Will the batch scheduler honor any redirection you discussed 
recently, and/or can you tell the batch scheduler to use a different 
stdin/-out/-err in DRMAA by setting 
drmaa_input_path/drmaa_output_path/drmaa_error_path, for example?

-- Reuti


> mpi_comm_spawn("/bin/sh","-c","siesta < infile",..) definitely  does not work.
> 
> Patching siesta to start as "siesta -in infile" sounds very promissing.
> That option as well as a "-o outfile" option should be there to start with. 
> I can bring that to the attention of the siesta developers. But then we have
> this other app "dftb+", and this other one "Gaussian" (just to mention a few).
> Some of them might already act like that, others don't. So, having
> mpi_comm_spawn working with i/o redirection would be in fact more
> simpler.
> 
> Perhaps a shell script "siesta_alt" could be written to convert 
> "siesta_alt infile outfile" into "siesta < infile > outfile" and then
> do mpi_comm_spawn("siesta_alt","infile","outfile",...). But I am not
> sure if the spawn on siesta_alt would actually lead to a spawn on siesta
> as expected...
> 
> Alex
> 
> 2014-12-17 22:35 GMT-02:00 Gilles Gouaillardet 
> :
> Alex,
> 
> You do not want to spawn mpirun.
> Or if this is really what you want, then just use system("env -i ...")
> 
> I think what you need is spawn a shell that do the redirection and then 
> invoke your app.
> This is something like
> MPI_Comm_spawn("/bin/sh", "-c", "siesta < infile")
> 
> That being said, i strongly recommend you patch siesta so it can be invoked 
> like this
> siesta -in infile
> (plus the MPI_Comm_disconnect call explained by George)
> That would make everything so much easier
> 
> Cheers,
> 
> Gilles
> 
> "Alex A. Schmidt"  wrote:
> Let me rephrase the previous message:
> 
> Putting "/bin/sh" in command with info key "ompi_non_mpi"  set to  ".true." 
> (if command is empty, mpi_comm_spawn tries to execute ' ') of 
> mpi_comm_spawn and "-c" "mpirun -n 1 myapp" in args results in 
> this message:
> 
> **
> 
> Open MPI does not support recursive calls of mpirun
> 
> **
> 
> Putting a single string in args as "-c mpirun -n 1 myapp" or  "-c 'mpirun -n 
> 1 myapp' "
> returns
> 
> /usr/bin/sh: - : invalid option
> 
> Alex
> 
> 
> 
> 2014-12-17 21:47 GMT-02:00 Alex A. Schmidt :
> Putting "/bin/sh" in command with info key "ompi_non_mpi"  set to  ".true." 
> (if command is empty, mpi_comm_spawn tries to execute ' ') of 
> mpi_comm_spawn and "-c" "mpirun -n 1 myapp" in args results in 
> this message:
> 
> /usr/bin/sh: -c: option requires an argument
> 
> Putting a single string in args as "-c mpirun -n 1 myapp" or  "-c 'mpirun -n 
> 1 myapp' "
> returns
> 
> /usr/bin/sh: - : invalid option
> 
> Alex
> 
> 2014-12-17 20:17 GMT-02:00 George Bosilca :
> I don't think this has any chance of working. The redirection is something 
> interpreted by the shell, and when Open MPI "fork-exec" a process it does not 
> behave as the shell.
> 
> Thus a potentially non-portable solution would be to instead of launching the 
> mpirun directly to launch it through a shell. Maybe something like "/bin/sh", 
> "-c", "mpirun -n 1 myapp". 
> 
>   George.
> 
> 
> On Wed, Dec 17, 2014 at 5:02 PM, Alex A. Schmidt  wrote:
> Ralph,
> 
> Sorry, "<" as an element of argv to mpi_comm_spawn is interpreted just the
> same, as another parameter by the spawnee process.
> 
> But I am confuse

[OMPI users] SGE integration broken in 2.0.0

2016-08-11 Thread Reuti
Hi,

In the file orte/mca/plm/rsh/plm_rsh_component I see an if-statement which seems 
to prevent the tight integration with SGE from starting:

if (NULL == mca_plm_rsh_component.agent) {

Why is it there (it wasn't in 1.10.3)?

If I just remove it I get:

[node17:25001] [[27678,0],0] plm:rsh: final template argv:
qrsh   orted --hnp-topo-sig ...

instead of the former:

/usr/sge/bin/lx24-amd64/qrsh -inherit -nostdin -V -verbose   orted 
--hnp-topo-sig ...

So, just removing the if-statement is not a perfect cure either, as 
$SGE_ROOT/$ARC does not prefix `qrsh`.
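(For reference, a quick way to check which launcher gets selected when running 
inside an SGE job - a sketch; the same verbose flag is used again later in this 
thread:)

$ mpirun --mca plm_base_verbose 10 -np 2 hostname

The "plm:rsh: final template argv" line should then show the `qrsh` call.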

==

BTW: why is there a blank before " orted" in the assembled command line - and it 
really is in the argument when I check on the slave nodes what is to be started 
by `qrsh_starter`? As long as there is a wrapping shell it will be removed 
anyway, but in a special setup we noticed this additional blank.

==

I also notice that I have to supply "-ldl" to `mpicc` for the compilation of an 
application to succeed in 2.0.0.

-- Reuti
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] SGE integration broken in 2.0.0

2016-08-11 Thread Reuti

> Am 11.08.2016 um 13:28 schrieb Reuti :
> 
> Hi,
> 
> In the file orte/mca/plm/rsh/plm_rsh_component I see an if-statement, which 
> seems to prevent the tight integration with SGE to start:
> 
>if (NULL == mca_plm_rsh_component.agent) {
> 
> Why is it there (it wasn't in 1.10.3)?
> 
> If I just remove it I get:
> 
> [node17:25001] [[27678,0],0] plm:rsh: final template argv:
>qrsh   orted --hnp-topo-sig ...
> 
> instead of the former:
> 
> /usr/sge/bin/lx24-amd64/qrsh -inherit -nostdin -V -verbose   orted 
> --hnp-topo-sig ...
> 
> So, just removing the if-statement is not a perfect cure as the 
> $SGE_ROOT/$ARC does not prefix `qrsh`.

I forgot to mention: the original effect is that it always tries to use `ssh` to 
contact the slave nodes, despite the fact that it's running under SGE.

-- Reuti


> ==
> 
> BTW: why is there blank before " orted" in the assembled command line - and 
> it's really in the argument when I check this on the slave nodes what should 
> be started by the `qrsh_starter`? As long as there is a wrapping shell, it 
> will be removed anyway. But in a special setup we noticed this additional 
> blank.
> 
> ==
> 
> I also notice, that I have to supply "-ldl" to `mpicc` to allow the 
> compilation of an application to succeed in 2.0.0.
> 
> -- Reuti
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] mpirun won't find programs from the PATH environment variable that are in directories that are relative paths

2016-08-12 Thread Reuti

Am 12.08.2016 um 20:34 schrieb r...@open-mpi.org:

> Sorry for the delay - I had to catchup on some other things before I could 
> come back to checking this one. Took me awhile to track this down, but the 
> change is in test for master:
> 
> https://github.com/open-mpi/ompi/pull/1958
> 
> Once complete, I’ll set it up for inclusion in v2.0.1
> 
> Thanks for reporting it!
> Ralph
> 
> 
>> On Jul 29, 2016, at 5:47 PM, Phil Regier  
>> wrote:
>> 
>> If I'm reading you right, you're presently unable to do the equivalent 
>> (albeit probably with PATH set on a different line somewhere above) of
>> 
>> PATH=arch/x86_64-rhel7-gcc48-opt/bin mpirun -n 1 psana
>> 
>> I'm mildly curious whether it would help to add a leading "./" to get the 
>> equivalent of
>> 
>> PATH=./arch/x86_64-rhel7-gcc48-opt/bin mpirun -n 1 psana
>> 
>> But to be clear, I'm advocating
>> 
>> PATH=$PWD/arch/x86_64-rhel7-gcc48-opt/bin mpirun -n 1 psana
>> 
>> as opposed to
>> 
>> mpirun -n 1 $PWD/arch/x86_64-rhel7-gcc48-opt/bin/psana
>> 
>> mostly because you still get to set the path once and use it many times 
>> without duplicating code.
>> 
>> 
>> For what it's worth, I've seen Ralph's suggestion generalized to something 
>> like
>> 
>> PREFIX=$PWD/arch/x86_64-rhel7-gcc48-opt/bin mpirun -n 1 $PREFIX/psana

AFAICS $PREFIX is evaluated too early.

$ PREFIX=old_value
$ PREFIX=foobar /bin/echo $PREFIX
old_value

Unless exactly this is the desired effect.
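What was probably intended is to assign the variable on its own line (or export 
it) first, so the current shell expands it - a sketch with the path from this 
thread:

$ PREFIX=$PWD/arch/x86_64-rhel7-gcc48-opt/bin
$ mpirun -n 1 $PREFIX/psana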

-- Reuti


>> 
>> where PREFIX might be set above in the same script, or sourced from a common 
>> config script or a custom environment module.  I think this style appeals to 
>> many users on many levels.
>> 
>> 
>> In any event, though, if this really is a bug that gets fixed, you've got 
>> lots of options.
>> 
>> 
>> 
>> 
>> On Fri, Jul 29, 2016 at 5:24 PM, Schneider, David A. 
>>  wrote:
>> Hi, Thanks for the reply! It does look like mpirun runs from the same 
>> directory as where I launch it, and that the environment has the same value 
>> for PATH that I had before (with the relative directory in front), but of 
>> course, there are lots of other MPI based environment variables defined - 
>> maybe one of those means don't use the relative paths?
>> 
>> Explicitly setting the path with $PWD like you say, yes, I agree that is a 
>> good defensive practice, but it is more cumbersome, the actually path looks
>> 
>>  mpirun -n 1 $PWD/arch/x86_64-rhel7-gcc48-opt/bin/psana
>> 
>> best,
>> 
>> David Schneider
>> SLAC/LCLS
>> 
>> From: users [users-boun...@lists.open-mpi.org] on behalf of Phil Regier 
>> [preg...@penguincomputing.com]
>> Sent: Friday, July 29, 2016 5:12 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] mpirun won't find programs from the PATH 
>> environment variable that are in directories that are relative paths
>> 
>> I might be three steps behind you here, but does "mpirun  pwd" show 
>> that all your launched processes are running in the same directory as the 
>> mpirun command?  I assume that "mpirun  env" would show that your PATH 
>> variable is being passed along correctly, since you don't have any problems 
>> with absolute paths.  In any event, is PATH=$PWD/dir/bin not an option?
>> 
>> Seems to me that this last would be good practice for location-sensitive 
>> launches in general, though I do tend to miss things.
>> 
>> On Fri, Jul 29, 2016 at 4:34 PM, Schneider, David A. 
>> mailto:david...@slac.stanford.edu>> wrote:
>> I am finding, on linux, rhel7, with openmpi 1.8.8 and 1.10.3, that mpirun 
>> won't find apps that are specified on a relative path, i.e, if I have
>> 
>> PATH=dir/bin
>> 
>> and I am in a directory which has dir/bin as a subdirectory, and an 
>> executable bir/bin/myprogram, I can't do
>> 
>> mpirun myprogram
>> 
>> I get the error message that
>> 
>> mpirun was unable to find the specified executable file, and therefore
>> did not launch the job.
>> 
>> whereas if I put an absolute path, something like
>> 
>> PATH=/home/me/dir/bin
>> 
>> then it works.
>> 
>> This causes some problematic silent failure, sometimes we use relative 
>> directories to override a 'base' release, so if I had
>> 
>> PATH=dir/bin:/central/install/dir/bin
>> 
>> and m

Re: [OMPI users] SGE integration broken in 2.0.0

2016-08-12 Thread Reuti

Am 12.08.2016 um 16:52 schrieb r...@open-mpi.org:

> IIRC, the rationale behind adding the check was that someone using SGE wanted 
> to specify a custom launch agent, and we were overriding it with qrsh. 
> However, the check is incorrect as that MCA param cannot be NULL.
> 
> I have updated this on master - can you see if this fixes the problem for you?
> 
> https://github.com/open-mpi/ompi/pull/1957

As written initially, I now get this verbose output with "--mca 
plm_base_verbose 10":

[node22:02220] mca: base: close: component isolated closed
[node22:02220] mca: base: close: unloading component isolated
[node22:02220] mca: base: close: component slurm closed
[node22:02220] mca: base: close: unloading component slurm
[node22:02220] [[28119,0],0] plm:rsh: final template argv:
qrsh   orted --hnp-topo-sig 2N:2S:2L3:8L2:8L1:8C:8H:x86_64 
-mca ess "env" -mca ess_base_jobid "1842806784" -mca es
s_base_vpid "" -mca ess_base_num_procs "9" -mca orte_hnp_uri 
"1842806784.0;usock;tcp://192.168.154.22,192.168.154.92:46186
" --mca plm_base_verbose "10" -mca plm "rsh" -mca pmix "^s1,s2,cray"
bash: node13: command not found
bash: node20: command not found
bash: node12: command not found
bash: node16: command not found
bash: node17: command not found
bash: node14: command not found
bash: node15: command not found
Your "qrsh" request could not be scheduled, try again later.

Sure, the name of the machine is allowed only after the additional "-inherit" to 
`qrsh`. Please see below for the complete command line in 1.10.3; hence the 
assembly also seems not to be done in the correct way.

-- Reuti


> On Aug 11, 2016, at 4:28 AM, Reuti  wrote:
> ...
> instead of the former:
> 
> /usr/sge/bin/lx24-amd64/qrsh -inherit -nostdin -V -verbose   orted 
> --hnp-topo-sig ...
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] SGE integration broken in 2.0.0

2016-08-12 Thread Reuti

> Am 12.08.2016 um 16:52 schrieb r...@open-mpi.org:
> 
> IIRC, the rationale behind adding the check was that someone using SGE wanted 
> to specify a custom launch agent, and we were overriding it with qrsh. 
> However, the check is incorrect as that MCA param cannot be NULL.
> 
> I have updated this on master - can you see if this fixes the problem for you?
> 
> https://github.com/open-mpi/ompi/pull/1957

I updated my tools to:

autoconf-2.69
automake-1.15
libtool-2.4.6

but I am facing this with Open MPI's ./autogen.pl:

configure.ac:152: error: possibly undefined macro: AC_PROG_LIBTOOL

I recall having seen this before - how do I get rid of it? For now I fixed the 
single source file just by hand.

-- Reuti


> As for the blank in the cmd line - that is likely due to a space reserved for 
> some entry that you aren’t using (e.g., when someone manually specifies the 
> prefix). It shouldn’t cause any harm as the cmd line parser is required to 
> ignore spaces
> 
> The -ldl problem sounds like a configuration issue - you might want to file a 
> separate issue about it
> 
>> On Aug 11, 2016, at 4:28 AM, Reuti  wrote:
>> 
>> Hi,
>> 
>> In the file orte/mca/plm/rsh/plm_rsh_component I see an if-statement, which 
>> seems to prevent the tight integration with SGE to start:
>> 
>>   if (NULL == mca_plm_rsh_component.agent) {
>> 
>> Why is it there (it wasn't in 1.10.3)?
>> 
>> If I just remove it I get:
>> 
>> [node17:25001] [[27678,0],0] plm:rsh: final template argv:
>>   qrsh   orted --hnp-topo-sig ...
>> 
>> instead of the former:
>> 
>> /usr/sge/bin/lx24-amd64/qrsh -inherit -nostdin -V -verbose   orted 
>> --hnp-topo-sig ...
>> 
>> So, just removing the if-statement is not a perfect cure as the 
>> $SGE_ROOT/$ARC does not prefix `qrsh`.
>> 
>> ==
>> 
>> BTW: why is there blank before " orted" in the assembled command line - and 
>> it's really in the argument when I check this on the slave nodes what should 
>> be started by the `qrsh_starter`? As long as there is a wrapping shell, it 
>> will be removed anyway. But in a special setup we noticed this additional 
>> blank.
>> 
>> ==
>> 
>> I also notice, that I have to supply "-ldl" to `mpicc` to allow the 
>> compilation of an application to succeed in 2.0.0.
>> 
>> -- Reuti
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> 

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] SGE integration broken in 2.0.0

2016-08-12 Thread Reuti

Am 12.08.2016 um 21:44 schrieb r...@open-mpi.org:

> Don’t know about the toolchain issue - I use those same versions, and don’t 
> have a problem. I’m on CentOS-7, so that might be the difference?
> 
> Anyway, I found the missing code to assemble the cmd line for qrsh - not sure 
> how/why it got deleted.
> 
> https://github.com/open-mpi/ompi/pull/1960

Yep, it's working again - thx.

But surely there was a reason behind the removal, which might be worth 
discussing within the Open MPI team to avoid any side effects from fixing this 
issue.

-- Reuti

PS: The other items I'll investigate on Monday.


>> On Aug 12, 2016, at 12:15 PM, Reuti  wrote:
>> 
>>> 
>>> Am 12.08.2016 um 16:52 schrieb r...@open-mpi.org:
>>> 
>>> IIRC, the rationale behind adding the check was that someone using SGE 
>>> wanted to specify a custom launch agent, and we were overriding it with 
>>> qrsh. However, the check is incorrect as that MCA param cannot be NULL.
>>> 
>>> I have updated this on master - can you see if this fixes the problem for 
>>> you?
>>> 
>>> https://github.com/open-mpi/ompi/pull/1957
>> 
>> I updated my tools to:
>> 
>> autoconf-2.69
>> automake-1.15
>> libtool-2.4.6
>> 
>> but I face with Open MPI's ./autogen.pl:
>> 
>> configure.ac:152: error: possibly undefined macro: AC_PROG_LIBTOOL
>> 
>> I recall seeing in already before, how to get rid of it? For now I fixed the 
>> single source file just by hand.
>> 
>> -- Reuti
>> 
>> 
>>> As for the blank in the cmd line - that is likely due to a space reserved 
>>> for some entry that you aren’t using (e.g., when someone manually specifies 
>>> the prefix). It shouldn’t cause any harm as the cmd line parser is required 
>>> to ignore spaces
>>> 
>>> The -ldl problem sounds like a configuration issue - you might want to file 
>>> a separate issue about it
>>> 
>>>> On Aug 11, 2016, at 4:28 AM, Reuti  wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> In the file orte/mca/plm/rsh/plm_rsh_component I see an if-statement, 
>>>> which seems to prevent the tight integration with SGE to start:
>>>> 
>>>>  if (NULL == mca_plm_rsh_component.agent) {
>>>> 
>>>> Why is it there (it wasn't in 1.10.3)?
>>>> 
>>>> If I just remove it I get:
>>>> 
>>>> [node17:25001] [[27678,0],0] plm:rsh: final template argv:
>>>>  qrsh   orted --hnp-topo-sig ...
>>>> 
>>>> instead of the former:
>>>> 
>>>> /usr/sge/bin/lx24-amd64/qrsh -inherit -nostdin -V -verbose   
>>>> orted --hnp-topo-sig ...
>>>> 
>>>> So, just removing the if-statement is not a perfect cure as the 
>>>> $SGE_ROOT/$ARC does not prefix `qrsh`.
>>>> 
>>>> ==
>>>> 
>>>> BTW: why is there blank before " orted" in the assembled command line - 
>>>> and it's really in the argument when I check this on the slave nodes what 
>>>> should be started by the `qrsh_starter`? As long as there is a wrapping 
>>>> shell, it will be removed anyway. But in a special setup we noticed this 
>>>> additional blank.
>>>> 
>>>> ==
>>>> 
>>>> I also notice, that I have to supply "-ldl" to `mpicc` to allow the 
>>>> compilation of an application to succeed in 2.0.0.
>>>> 
>>>> -- Reuti
>>>> ___
>>>> users mailing list
>>>> users@lists.open-mpi.org
>>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>>> 
>>> ___
>>> users mailing list
>>> users@lists.open-mpi.org
>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>>> 
>> 
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] SGE integration broken in 2.0.0

2016-08-16 Thread Reuti

Am 16.08.2016 um 13:26 schrieb Jeff Squyres (jsquyres):

> On Aug 12, 2016, at 2:15 PM, Reuti  wrote:
>> 
>> I updated my tools to:
>> 
>> autoconf-2.69
>> automake-1.15
>> libtool-2.4.6
>> 
>> but I face with Open MPI's ./autogen.pl:
>> 
>> configure.ac:152: error: possibly undefined macro: AC_PROG_LIBTOOL
>> 
>> I recall seeing in already before, how to get rid of it? For now I fixed the 
>> single source file just by hand.
> 
> This means your Autotools installation isn't correct.  A common mistake that 
> I've seen people do is install Autoconf, Automake, and Libtool in separate 
> prefixes (vs. installing all 3 into a single prefix).

Thx a bunch - that was it. Despite searching for a solution I found only hints 
that didn't solve the issue.

-- Reuti


>  Another common mistake is accidentally using the wrong autoconf, automake, 
> and/or libtool (e.g., using 2 out of the 3 from your new/correct install, but 
> accidentally using a system-level install for the 3rd).
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] OS X El Capitan 10.11.6 ld: symbol(s) not found for architecture x86_64

2016-08-23 Thread Reuti
Hi,

Am 23.08.2016 um 21:43 schrieb Richard G French:

> Hi, all -
> I'm trying to build the SPH code Gadget2 
> (http://wwwmpa.mpa-garching.mpg.de/gadget/) under OS X 10.11.6 and I am 
> getting the following type of error:
> 
> 222 rfrench@cosmos> make
> 
> mpicc main.o  run.o  predict.o begrun.o endrun.o global.o timestep.o  init.o 
> restart.o  io.o accel.o   read_ic.o  ngb.o system.o  allocate.o  density.o 
> gravtree.o hydra.o  driftfac.o domain.o  allvars.o potential.o forcetree.o   
> peano.o gravtree_forcetest.o pm_periodic.o pm_nonperiodic.o longrange.o   -g  
> -L/opt/local/lib/mpich-mp  -L/usr/local/lib -lgsl -lgslcblas -lm 
> -L/usr/local/lib -lrfftw_mpi -lfftw_mpi -lrfftw -lfftw-o  Gadget2

By default `mpicc` should add all libraries necessary to compile and link the 
application. But I wonder why you speak of mpich-mp, as MPICH is a different MPI 
implementation, not Open MPI. Also, the default install location of Open MPI 
isn't mpich-mp.

- what does:

$ mpicc -show
$ which mpicc

output?

- which MPI library was used to build the parallel FFTW?

-- Reuti


> Undefined symbols for architecture x86_64:
> 
>   "_ompi_mpi_byte", referenced from:
> 
>   _read_parameter_file in begrun.o
> 
>   _compute_global_quantities_of_system in global.o
> 
>   _restart in restart.o
> 
>   _write_file in io.o
> 
>   _read_file in read_ic.o
> 
>   _find_files in read_ic.o
> 
>   _density in density.o
> 
> ..
> 
> I built the mpich library using 
> 
> cd openmpi-2.0.0/
> 
> 
> ./configure
> 
> 
> sudo make all install
> 
> which installed the libraries in
> 
> 
> /opt/local/lib/mpich-mp
> 
> 
> 
> I can't seem to track down the library that contains ompi_mpi_byte.
> 
> 
> 
> Any suggestions would be welcome. Thanks!
> 
> Dick French
> 
> 
> 
> 
> -- 
> Richard G. French
> McDowell and Whiting Professor of Astrophysics
> Chair of the Astronomy Department, Wellesley College
> Director of the Whitin Observatory
> Cassini Mission to Saturn Radio Science Team Leader
> Wellesley, MA 02481-8203
> (781) 283-3747
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] static linking MPI libraries with applications

2016-09-14 Thread Reuti

> Am 14.09.2016 um 15:05 schrieb Gilles Gouaillardet 
> :
> 
> in this case, you should configure OpenMPI with
> --disable-shared --enable-static --disable-dlopen
> 
> then you can manually run the mpifort link command with --showme,
> so you get the fully expanded gfortran link command.
> then you can edit this command, and non system libs (e.g. lapack, openmpi, 
> siesta) with -static, but system libs (e.g. m, ibverbs, ...) with -dynamic
> you might have to manually append -ldl, though that should not be needed

For me it's necessary to use -ldl; I haven't found the time to look further into 
it yet. See my post from Aug 11, 2016. With older versions of Open MPI it wasn't 
necessary to supply it in addition.
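For reference, the --showme route looks roughly like this (a sketch using the 
paths from this thread; adjust the object files and libraries):

$ /export/apps/siesta/openmpi-1.8.8/bin/mpifort -o transiesta *.o \
      ../libscalapack.a ../libopenblas.a --showme

This only prints the fully expanded gfortran link line without executing it; 
that line can then be edited to wrap the non-system libraries in 
-Wl,-Bstatic ... -Wl,-Bdynamic while keeping the system ones dynamic.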

-- Reuti


> 
> Cheers,
> 
> Gilles
> 
> 
> 
> On Wednesday, September 14, 2016, Mahmood Naderan  
> wrote:
> Well I want to omit LD_LIBRARY_PATH. For that reason I am building the binary 
> statically.
> 
> > note this is not required when Open MPI is configure'd with
> >--enable-mpirun-prefix-by-default
> I really did that. Using Rocks-6, I installed the application and openmpi on 
> the shared file system (/export).
> Yes it is possible to add the library paths to LD_LIBRARY_PATH, but I want to 
> statically put the required libraries in the binary.
> 
> 
> 
> Regards,
> Mahmood
> 
> 
> 
> On Wed, Sep 14, 2016 at 4:44 PM, Gilles Gouaillardet 
>  wrote:
> Mahmood,
> 
> try to prepend /export/apps/siesta/openmpi-1.8.8/lib to your $LD_LIBRARY_PATH
> 
>  note this is not required when Open MPI is configure'd with
> --enable-mpirun-prefix-by-default
> 
> 
> Cheers,
> 
> Gilles
> 
> On Wednesday, September 14, 2016, Mahmood Naderan  
> wrote:
> Hi,
> Here is the problem with statically linking an application with a program.
> 
> by specifying the library names:
> 
> FC=/export/apps/siesta/openmpi-1.8.8/bin/mpifort
> FFLAGS=-g -Os
> FPPFLAGS= -DMPI -DFC_HAVE_FLUSH -DFC_HAVE_ABORT
> LDFLAGS=-static
> MPI1=/export/apps/siesta/openmpi-1.8.8/lib/libmpi_mpifh.a
> MPI2=/export/apps/siesta/openmpi-1.8.8/lib/libmpi_usempi.a
> BLAS_LIBS=../libopenblas.a
> SCALAPACK_LIBS=../libscalapack.a
> LIBS=$(SCALAPACK_LIBS) $(BLAS_LIBS) $(MPI1) $(MPI2)
> 
> 
> 
> 
> The output of "make" is:
> 
> /export/apps/siesta/openmpi-1.8.8/bin/mpifort -o transiesta \
>-static automatic_cell.o  
> libmpi_f90.a 
>   `FoX/FoX-config --libs --wcml` ../libscalapack.a   
> ../libopenblas.a  /export/apps/siesta/openmpi-1.8.8/lib/libmpi_mpifh.a 
> /export/apps/siesta/openmpi-1.8.8/lib/libmpi_usempi.a
> /export/apps/siesta/openmpi-1.8.8/lib/libopen-pal.a(dl_dlopen_module.o): In 
> function `dlopen_open':
> dl_dlopen_module.c:(.text+0x473): warning: Using 'dlopen' in statically 
> linked applications requires at runtime the shared libraries from the glibc 
> version used for linking
> /usr/bin/ld: cannot find -libverbs
> collect2: ld returned 1 exit status
> 
> 
> 
> 
> If I drop -static, the error is gone... However, the ldd command shows that the 
> binary cannot access those two MPI libraries.
> 
> 
> Regards,
> Mahmood
> 
> 
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] static linking MPI libraries with applications

2016-09-14 Thread Reuti

> Am 14.09.2016 um 19:12 schrieb Mahmood Naderan :
> 
> So, I used
> 
> ./configure --prefix=/export/apps/siesta/openmpi-1.8.8 
> --enable-mpirun-prefix-by-default --enable-static --disable-shared  
> --disable-dlopen
> 
> and added -static to LDFLAGS, but I get:
> 
> /export/apps/siesta/openmpi-1.8.8/bin/mpifort -o transiesta -static 
> libfdf.a libSiestaXC.a \
>libmpi_f90.a  \
> `FoX/FoX-config --libs --wcml` ../libscalapack.a   
> ../libopenblas.a  /export/apps/siesta/openmpi-1.8.8/lib/libmpi_mpifh.a 
> /export/apps/siesta/openmpi-1.8.8/lib/libmpi_usempi.a
> /usr/bin/ld: cannot find -libverbs

What do you have on the system for "libibverbs.*"?


> collect2: ld returned 1 exit status
> 
> 
> removing -static will eliminate the error, but that is not what I want. Should 
> I build libverbs from source first? Am I going in the right direction?

The "-l" includes already the "lib" prefix when it tries to find the library. 
Hence "-libverbs" might be misleading due to the "lib" in the word, as it looks 
for "libibverbs.{a|so}". Like "-lm" will look for "libm.a" resp. "libm.so".

-- Reuti
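
A quick way to check this on the el6 system used here (the package name matches
what gets installed in the follow-up message below):

$ find /usr/lib64 -name 'libibverbs.*'
# if only libibverbs.so* shows up, the static archive is missing; it is
# shipped in a separate package:
$ yum install libibverbs-devel-static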


> Regards,
> Mahmood
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] static linking MPI libraries with applications

2016-09-14 Thread Reuti

Am 14.09.2016 um 20:09 schrieb Mahmood Naderan:

> ​I installed libibverb-devel-static.x86_64 via yum
> 
> 
> root@cluster:tpar# yum list libibverb*
> Installed Packages
> libibverbs.x86_64 1.1.8-4.el6 
>@base
> libibverbs-devel.x86_64   1.1.8-4.el6 
>@base
> libibverbs-devel-static.x86_641.1.8-4.el6 
>@base
> Available Packages
> libibverbs.i686   1.1.8-4.el6 
>base
> libibverbs-devel.i686 1.1.8-4.el6 
>base
> libibverbs-utils.x86_64   1.1.8-4.el6 
>base
> root@cluster:tpar# find /usr -name libibverb*
> /usr/lib64/libibverbs.so.1.0.0
> /usr/lib64/libibverbs.so
> /usr/lib64/libibverbs.a
> /usr/lib64/libibverbs.so.1
> /usr/share/doc/libibverbs-1.1.8
> 
> 
> and added /usr/lib64/libibverbs.a similar to the scalapack I added... Just 
> gave the full path.
> 
> 
> 
> However, this is what I get:
> 
> libmpi_f90.a  \
> `FoX/FoX-config --libs --wcml` ../libscalapack.a   
> ../libopenblas.a  /export/apps/siesta/openmpi-1.8.8/lib/libmpi_mpifh.a 
> /export/apps/siesta/openmpi-1.8.8/lib/libmpi_usempi.a /usr/lib64/libibverbs.a
> /export/apps/siesta/openmpi-1.8.8/lib/libopen-rte.a(session_dir.o): In 
> function `orte_session_dir_get_name':
> session_dir.c:(.text+0x751): warning: Using 'getpwuid' in statically linked 
> applications requires at runtime the shared libraries from the glibc version 
> used for linking
> sockets.o: In function `open_socket':
> sockets.c:(.text+0xb5): warning: Using 'getaddrinfo' in statically linked 
> applications requires at runtime the shared libraries from the glibc version 
> used for linking
> /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../lib64/libpthread.a(libpthread.o):
>  In function `sem_open':
> (.text+0x764d): warning: the use of `mktemp' is dangerous, better use 
> `mkstemp'
> /export/apps/siesta/openmpi-1.8.8/lib/libopen-rte.a(ras_slurm_module.o): In 
> function `init':
> ras_slurm_module.c:(.text+0x6d5): warning: Using 'gethostbyname' in 
> statically linked applications requires at runtime the shared libraries from 
> the glibc version used for linking
> /export/apps/siesta/openmpi-1.8.8/lib/libopen-pal.a(evutil.o): In function 
> `evutil_unparse_protoname':
> /export/apps/siesta/openmpi-1.8.8/opal/mca/event/libevent2021/libevent/evutil.c:758:
>  warning: Using 'getprotobynumber' in statically linked applications requires 
> at runtime the shared libraries from the glibc version used for linking
> /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../lib64/libnl.a(utils.o): In 
> function `nl_str2ip_proto':
> (.text+0x599): warning: Using 'getprotobyname' in statically linked 
> applications requires at runtime the shared libraries from the glibc version 
> used for linking
> /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../lib64/libibverbs.a(src_libibverbs_la-init.o):
>  In function `load_driver':
> (.text+0x2ec): undefined reference to `dlopen'
> /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../lib64/libibverbs.a(src_libibverbs_la-init.o):
>  In function `load_driver':
> (.text+0x331): undefined reference to `dlerror'
> /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../lib64/libibverbs.a(src_libibverbs_la-init.o):
>  In function `ibverbs_init':
> (.text+0xd25): undefined reference to `dlopen'
> /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../lib64/libibverbs.a(src_libibverbs_la-init.o):
>  In function `ibverbs_init':
> (.text+0xd36): undefined reference to `dlclose'

Now you may need: -ldl

-- Reuti


> collect2: ld returned 1 exit status
> make: *** [transiesta] Error 1
> 
> 
> ​
> 
> Regards,
> Mahmood
> 
> 
> 
> On Wed, Sep 14, 2016 at 9:54 PM, Reuti  wrote:
> 
> The "-l" includes already the "lib" prefix when it tries to find the library. 
> Hence "-libverbs" might be misleading due to the "lib" in the word, as it 
> looks for "libibverbs.{a|so}". Like "-lm" will look for "libm.a" resp. 
> "libm.so".
> 
> -- Reuti
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] OMPI users] Still "illegal instruction"

2016-09-15 Thread Reuti

Am 15.09.2016 um 19:54 schrieb Mahmood Naderan:

> The differences are very very minor
> 
> root@cluster:tpar# echo | gcc -v -E - 2>&1 | grep cc1
>  /usr/libexec/gcc/x86_64-redhat-linux/4.4.7/cc1 -E -quiet -v - -mtune=generic
> 
> [root@compute-0-1 ~]# echo | gcc -v -E - 2>&1 | grep cc1
>  /usr/libexec/gcc/x86_64-redhat-linux/4.4.6/cc1 -E -quiet -v - -mtune=generic
> 
> 
> Even I tried to compile the program with -march=amdfam10. Something like these
> 
> /export/apps/siesta/openmpi-2.0.0/bin/mpifort -c -g -Os -march=amdfam10   
> `FoX/FoX-config --fcflags`  -DMPI -DFC_HAVE_FLUSH -DFC_HAVE_ABORT 
> -DTRANSIESTA/export/apps/siesta/siesta-4.0/Src/pspltm1.F
> 
> But got the same error.
> 
> /proc/cpuinfo on the frontend shows (family 21, model 2) and on the compute 
> node it shows (family 21, model 1).

Just out of curiosity: what are their model names?


> >That being said, my best bet is you compile on a compute node ...
> gcc is there on the computes, but the NFS permission is another issue. It 
> seems that nodes are not able to write on /share (the one which is shared 
> between frontend and computes).

Would it work to compile into a local directory on the node and then copy the 
result to /share on the frontend?

-- Reuti


> An important question is that, how can I find out what is the name of the 
> illegal instruction. Then, I hope to find the document that points which 
> instruction set (avx, sse4, ...) contains that instruction.
> 
> Is there any option in mpirun to turn on the verbosity to see more 
> information?
> 
> Regards,
> Mahmood
> 
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
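
A minimal sketch of how such an illegal instruction can be identified; this is
essentially the GDB approach described in the follow-up message below (the
binary path and the input file name are placeholders):

$ gdb /path/to/serial/siesta
(gdb) run < input.fdf
# once GDB stops with "Program received signal SIGILL, Illegal instruction.",
# print the faulting instruction:
(gdb) x/i $pc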


Re: [OMPI users] OMPI users] Still "illegal instruction"

2016-09-22 Thread Reuti

> Am 22.09.2016 um 17:20 schrieb Mahmood Naderan :
> 
> Although this problem is not related to OMPI *at all*, I think it is good to 
> tell the others what was going on. Finally, I caught the illegal instruction 
> :)
> 
> Briefly, I built the serial version of Siesta on the frontend and ran it 
> directly on the compute node. Fortunately, "x/i $pc" from GDB showed that the 
> illegal instruction was a FMA3 instruction. More detail is available at 
> https://gcc.gnu.org/ml/gcc-help/2016-09/msg00084.html
> 
> According to the Wikipedia,
> 
>   • FMA4 is supported in AMD processors starting with the Bulldozer 
> architecture. FMA4 was realized in hardware before FMA3.
>   • FMA3 is supported in AMD processors starting with the Piledriver 
> architecture and Intel starting with Haswell processors and Broadwell 
> processors since 2014.
> Therefore, the frontend (piledriver) inserts a FMA3 instruction while the 
> compute node (Bulldozer) doesn't recognize it.

Thanks for sharing, quite interesting. But does this mean that there is no 
working command-line flag for gcc to switch this off (like the -march=bdver1 that 
Gilles mentioned), or one to tell me what it thinks it should compile for?

For pgcc there is -show and I can spot the target it discovered in the 
USETPVAL= line.

-- Reuti
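
For completeness, a hedged sketch of flags one might try when compiling on the
Piledriver frontend for the Bulldozer nodes (not verified in this thread;
-march=bdver1 targets Bulldozer, and -mno-fma is gcc's switch to suppress FMA3
code generation):

$ mpifort -c -g -Os -march=bdver1 pspltm1.F
# or keep the other settings and only disable FMA3 explicitly:
$ mpifort -c -g -Os -mno-fma pspltm1.F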

> 
> The solution was (as stated by guys) building Siesta on the compute node. I 
> have to say that I tested all related programs (OMPI​,​ Scalapack, OpenBLAS​) 
> sequentially on the compute node in order to find who generate the illegal 
> instruction.
> 
> Anyway... thanks a lot for your comments. Hope this helps others in the 
> future.
> ​
> 
> 
> Regards,
> Mahmood
> 
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] MPI Behaviour Question

2016-10-11 Thread Reuti
Hi,

> Am 11.10.2016 um 14:56 schrieb Mark Potter :
> 
> This question is related to OpenMPI 2.0.1 compiled with GCC 4.8.2 on
> RHEL 6.8 using Torque 6.0.2 with Moab 9.0.2. To be clear, I am an
> administrator and not a coder and I suspect this is expected behavior
> but I have been asked by a client to explain why this is happening.
> 
> Using Torque, the following command returns the hostname of the first
> node only, regardless of how the nodes/cores are split up:
> 
> mpirun -np 20 echo "Hello from $HOSTNAME"

The $HOSTNAME will be expanded and used as an argument before `mpirun` even 
starts. Instead it has to be evaluated on the nodes:

$ mpirun bash -c "echo \$HOSTNAME"
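
To make the difference visible, a small sketch using the command from this
thread:

# expanded by the submitting shell before mpirun starts, so every rank prints
# the name of the host where mpirun was launched:
$ mpirun -np 20 echo "Hello from $HOSTNAME"
# evaluated by a shell on each node, so every rank prints the name of the node
# it actually runs on:
$ mpirun -np 20 bash -c 'echo "Hello from $HOSTNAME"'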


> (the behaviour is the same with "echo $(hostname))
> 
> The Torque script looks like this:
> 
> #PBS -V
> #PBS -N test-job
> #PBS -l nodes=2:ppn=16
> #PBS -e ERROR
> #PBS -o OUTPUT
> 
> 
> cd $PBS_O_WORKDIR
> date
> cat $PBS_NODEFILE
> 
> mpirun -np32 echo "Hello from $HOSTNAME"
> 
> If the echo statement is replaced with "hostname" then a proper
> response is received from all nodes.
> 
> While I know there are better ways to test OpenMPI's functionality,
> like compiling and using the programs in examples/, this is the method
> a specific client chose.

There are small "Hello world" programs like here:

http://mpitutorial.com/tutorials/mpi-hello-world/

to test whether e.g. the libraries are found at runtime by the application(s).

-- Reuti


> I was using both the examples and a Torque job
> script calling just "hostname" as a command and not using echo and the
> client was using the script above. It took some doing to figure out why
> he thought it wasn't working and all my tests were successful and when
> I figured it, he wanted an explanation that's beyond my current
> knowledge. Any help towards explaining the behaviour would be greatly
> appreciated.
> 
> -- 
> Regards,
> 
> Mark L. Potter
> Senior Consultant
> PCPC Direct, Ltd.
> O: 713-344-0952 
> M: 713-965-4133
> S: mpot...@pcpcdirect.com
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> 

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] Running a computer on multiple computers

2016-10-14 Thread Reuti

Am 14.10.2016 um 23:10 schrieb Mahdi, Sam:

> Hello everyone,
> 
> I am attempting to run a single program on 32 cores split across 4 computers 
> (So each computer has 8 cores). I am attempting to use mpich for this. I 
> currently am just testing on 2 computers, I have the program installed on 
> both, as well as mpich installed on both. I have created a register key and 
> can login in using ssh into the other computer without a password. I have 
> come across 2 problems. One, when I attempt to connect using the mpirun -np 3 
> --host a (the IP of the computer I am attempting to connect to) hostname 
> I recieve the error 
>  unable to connect from "localhost.localdomain" to "localhost.localdomain"
> 
> This is indicating my computers "localhost.localdomain" is attempting to 
> connect to another "localhost.localdomain". How can I change this so that it 
> connects via my IP to the other computers IP?
> 
> Secondly, I attempted to use a host file instead using the hydra process 
> wiki. I created a hosts file with just the IP of the computer I am attempting 
> to connect to. When I type in the command mpiexec -f hosts -n 4 ./applic 
> 
> I get this error 
> [mpiexec@localhost.localdomain] HYDU_parse_hostfile 
> (./utils/args/args.c:323): unable to open host file: hosts

As you mentioned MPICH and their Hydra startup, you better ask at their list:

http://www.mpich.org/support/mailing-lists/

This list is for the Open MPI implementation.

-- Reuti

> 
> along with other errors of unable to parse hostfile, match handler etc. I 
> assume this is all due to it being unable to read the host file. Is there any 
> specific place I should save my hosts file? I have it saved directly on my 
> Desktop. I have attempted to indicate the full path where it is located, but 
> I still get the same error.  
> 
> For the first problem, I have read that I need to change /etc/hosts manually 
> by using the sudo command to manually enter the IP of the computer I am 
> attempting to connect to in the /etc/hosts file. Thank you in advance.
> 
> Sincerely,
> Sam
> 
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] Low CPU utilization

2016-10-16 Thread Reuti
Hi,

Am 16.10.2016 um 20:34 schrieb Mahmood Naderan:

> Hi,
> I am running two softwares that use OMPI-2.0.1. Problem is that the CPU 
> utilization is low on the nodes.
> 
> 
> For example, see the process information below
> 
> [root@compute-0-1 ~]# ps aux | grep siesta
> mahmood  14635  0.0  0.0 108156  1300 ?S21:58   0:00 /bin/bash 
> /share/apps/chemistry/siesta-4.0-mpi201/spar/siesta.p1 A.fdf
> mahmood  14636  0.0  0.0 108156  1300 ?S21:58   0:00 /bin/bash 
> /share/apps/chemistry/siesta-4.0-mpi201/spar/siesta.p1 A.fdf
> mahmood  14637 61.6  0.2 372076 158220 ?   Rl   21:58   0:38 
> /share/apps/chemistry/siesta-4.0-mpi201/spar/siesta
> mahmood  14639 59.6  0.2 365992 154228 ?   Rl   21:58   0:37 
> /share/apps/chemistry/siesta-4.0-mpi201/spar/siesta
> 
> 
> Note that the cpu utilization is the third column. The "siesta.pl" script is
> 
> #!/bin/bash
> BENCH=$1
> export OMP_NUM_THREADS=1
> /share/apps/chemistry/siesta-4.0-mpi201/spar/siesta < $BENCH
> 
> 
> 
> 
> I also saw a similar behavior from Gromacs which has been discussed at 
> https://mailman-1.sys.kth.se/pipermail/gromacs.org_gmx-users/2016-October/108939.html
> 
> It seems that there is a tricky thing with OMPI. Any idea is welcomed.

Sounds like the two jobs are using the same cores due to automatic core binding, 
as one instance doesn't know anything about the other. For a first test you can 
start both with "mpiexec --bind-to none ..." and check whether you see a 
different behavior.

`man mpiexec` mentions some hints about threads in applications.

-- Reuti
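
Assuming the two runs are started roughly like this (the outer command line is
not shown in the thread), the test would be:

$ mpirun --bind-to none -np 2 /share/apps/chemistry/siesta-4.0-mpi201/spar/siesta.pl A.fdf
# and likewise for the second job; adding --report-bindings shows which cores
# each run would otherwise be pinned to.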

> 
> 
> Regards,
> Mahmood
> 
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] Is gridengine integration broken in openmpi 2.0.2?

2017-02-03 Thread Reuti
Hi,

> Am 03.02.2017 um 17:10 schrieb Mark Dixon :
> 
> Hi,
> 
> Just tried upgrading from 2.0.1 to 2.0.2 and I'm getting error messages that 
> look like openmpi is using ssh to login to remote nodes instead of qrsh (see 
> below). Has anyone else noticed gridengine integration being broken, or am I 
> being dumb?
> 
> I built with "./configure 
> --prefix=/apps/developers/libraries/openmpi/2.0.2/1/intel-17.0.1 --with-sge 
> --with-io-romio-flags=--with-file-system=lustre+ufs --enable-mpi-cxx 
> --with-cma"

SGE itself isn't configured to use SSH, is it? (I mean the entries in `qconf 
-sconf` for rsh_command and rsh_daemon.)

-- Reuti
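
A quick check on the SGE side would be (a sketch; typical values are "builtin"
or an ssh wrapper):

$ qconf -sconf | grep -E 'rsh_command|rsh_daemon'
# the global configuration can be adjusted with `qconf -mconf` if these entries
# unintentionally point to ssh.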


> Can see the gridengine component via:
> 
> $ ompi_info -a | grep gridengine
> MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v2.0.2)
>  MCA ras gridengine: ---
>  MCA ras gridengine: parameter "ras_gridengine_priority" (current value: 
> "100", data source: default, level: 9 dev/all, type: int)
>  Priority of the gridengine ras component
>  MCA ras gridengine: parameter "ras_gridengine_verbose" (current value: 
> "0", data source: default, level: 9 dev/all, type: int)
>  Enable verbose output for the gridengine ras 
> component
>  MCA ras gridengine: parameter "ras_gridengine_show_jobid" (current 
> value: "false", data source: default, level: 9 dev/all, type: bool)
> 
> Cheers,
> 
> Mark
> 
> ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
> Permission denied, please try again.
> ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
> Permission denied, please try again.
> ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
> Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password,hostbased).
> --
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
> 
> * not finding the required libraries and/or binaries on
>  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>  settings, or configure OMPI with --enable-orterun-prefix-by-default
> 
> * lack of authority to execute on one or more specified nodes.
>  Please verify your allocation and authorities.
> 
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>  Please check with your sys admin to determine the correct location to use.
> 
> *  compilation of the orted with dynamic libraries when static are required
>  (e.g., on Cray). Please check your configure cmd line and consider using
>  one of the contrib/platform definitions for your system type.
> 
> * an inability to create a connection back to mpirun due to a
>  lack of common network interfaces and/or no route found between
>  them. Please check network connectivity (including firewalls
>  and network routing requirements).
> --
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users



___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Using OpenMPI / ORTE as cluster aware GNU Parallel

2017-02-27 Thread Reuti
Hi,

> Am 27.02.2017 um 14:33 schrieb Angel de Vicente :
> 
> Hi,
> 
> "r...@open-mpi.org"  writes:
>>> With the DVM, is it possible to keep these jobs in some sort of queue,
>>> so that they will be executed when the cores get free?
>> 
>> It wouldn’t be hard to do so - as long as it was just a simple FIFO 
>> scheduler. I wouldn’t want it to get too complex.
> 
> a simple FIFO should probably be enough. This can be useful as a simple
> way to make a multi-core machine accessible to a small group of (friendly)
> users, making sure that they don't oversubscribe the machine, but
> without going the full route of installing/maintaining a full resource
> manager.

At first I thought you wanted to run a queuing system inside a queuing system, 
but it looks like you want to replace the resource manager.

Under which user account the DVM daemons will run? Are all users using the same 
account?

-- Reuti


___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Using OpenMPI / ORTE as cluster aware GNU Parallel

2017-02-27 Thread Reuti

> Am 27.02.2017 um 18:24 schrieb Angel de Vicente :
> 
> […]
> 
> For a small group of users if the DVM can run with my user and there is
> no restriction on who can use it or if I somehow can authorize others to
> use it (via an authority file or similar) that should be enough.

AFAICS there is no user authorization at all. Everyone can hijack a running DVM 
once he knows the URI. The only problem might be that all processes run under 
the account of the user who started the DVM, i.e. output files have to go to a 
directory of that user, as jobs submitted by any other user can no longer write 
to their own directories this way.

Running the DVM under root might help, but this would be a high risk that any 
faulty script might write to a place where sensible system information is 
stored and may leave the machine unusable afterwards.

My first attempts at using the DVM often led to a terminated DVM once a process 
returned with a non-zero exit code. But once the DVM is gone, the queued jobs 
might be lost too, I fear. I wish the DVM were more forgiving (or that what 
happens in case of a non-zero exit code were adjustable).

-- Reuti
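
For reference, the basic DVM usage pattern discussed here looks roughly like
this (the URI file name is just an example; orte-submit's --hnp option appears
with a real path further down in this archive):

$ orte-dvm --report-uri /home/user/dvmuri &
# any orte-submit that can read the URI file may now inject work into the
# running DVM:
$ orte-submit --hnp file:/home/user/dvmuri -n 4 ./app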


___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

[OMPI users] State of the DVM in Open MPI

2017-02-28 Thread Reuti
Hi,

Only by reading recent posts did I become aware of the DVM. This would be a 
welcome feature for our setup*. But not all options seem to work as expected - 
is it still a work in progress, or should everything work as advertised?

1)

$ soft@server:~> orte-submit -cf foo --hnp file:/home/reuti/dvmuri -n 1 touch 
/home/reuti/hacked

Open MPI has detected that a parameter given to a command line
option does not match the expected format:

  Option: np
  Param:  foo

==> The given option is -cf, not -np

2)

According to `man orte-dvm` there are -H, -host, --host, -machinefile and -hostfile, 
but none of them seem operational (Open MPI 2.0.2). A hostlist provided by 
SGE is honored though.

-- Reuti


*) We run Open MPI jobs inside SGE. This works fine. Some applications invoke 
several `mpiexec`-calls during their execution and rely on temporary files they 
created in the last step(s). While this is working fine on one and the same 
machine, it fails in case SGE granted slots on several machines as the scratch 
directories created by `qrsh -inherit …` vanish once the `mpiexec`-call on this 
particular node finishes (and not at the end of the complete job). I can mimic 
persistent scratch directories in SGE for a complete job, but invoking the DVM 
before and shutting it down later on (either by hand in the job script or by 
SGE killing all remains at the end of the job) might be more straight forward 
(looks like `orte-dvm` is started by `qrsh -inherit …` too).


___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] OpenMPI-2.1.0 problem with executing orted when using SGE

2017-03-22 Thread Reuti
Hi,

> Am 22.03.2017 um 10:44 schrieb Heinz-Ado Arnolds 
> :
> 
> Dear users and developers,
> 
> first of all many thanks for all the great work you have done for OpenMPI!
> 
> Up to OpenMPI-1.10.6 the mechanism for starting orted was to use SGE/qrsh:
>  mpirun -np 8 --map-by ppr:4:node ./myid
>  /opt/sge-8.1.8/bin/lx-amd64/qrsh -inherit -nostdin -V  Machine> orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess 
> "env" -mca orte_ess_jobid "1621884928" -mca orte_ess_vpid 1 -mca 
> orte_ess_num_procs "2" -mca orte_hnp_uri "1621884928.0;tcp:// Master>:41031" -mca plm "rsh" -mca rmaps_base_mapping_policy "ppr:4:node" 
> --tree-spawn
> 
> Now with OpenMPI-2.1.0 (and the release candidates) "ssh" is used to start 
> orted:
>  mpirun -np 8 --map-by ppr:4:node -mca mca_base_env_list OMP_NUM_THREADS=5 
> ./myid
>  /usr/bin/ssh -x  
> PATH=/afs/./openmpi-2.1.0/bin:$PATH ; export PATH ; 
> LD_LIBRARY_PATH=/afs/./openmpi-2.1.0/lib:$LD_LIBRARY_PATH ; export 
> LD_LIBRARY_PATH ; 
> DYLD_LIBRARY_PATH=/afs/./openmpi-2.1.0/lib:$DYLD_LIBRARY_PATH ; export 
> DYLD_LIBRARY_PATH ;   /afs/./openmpi-2.1.0/bin/orted --hnp-topo-sig 
> 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess "env" -mca ess_base_jobid 
> "1626013696" -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca 
> orte_hnp_uri "1626013696.0;usock;tcp://:43019" -mca 
> plm_rsh_args "-x" -mca plm "rsh" -mca rmaps_base_mapping_policy "ppr:4:node" 
> -mca pmix "^s1,s2,cray"
> 
> qrsh set the environment properly on the remote side, so that environment 
> variables from job scripts are properly transferred. With the ssh variant the 
> environment is not set properly on the remote side, and it seems that there 
> are handling problems with Kerberos tickets and/or AFS tokens.
> 
> Is there any way to revert the 2.1.0 behavior to the 1.10.6 (use SGE/qrsh) 
> one? Are there mca params to set this?
> 
> If you need more info, please let me know. (Job submitting machine and target 
> cluster are the same with all tests. SW is residing in AFS directories 
> visible on all machines. Parameter "plm_rsh_disable_qrsh" current value: 
> "false")

It looks like `mpirun` still needs:

-mca plm_rsh_agent foo

to allow SGE to be detected.

-- Reuti
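
Applied to the command line quoted above this would be (a sketch; "foo" is just
a dummy value, as discussed in the follow-up messages):

$ mpirun -mca plm_rsh_agent foo -np 8 --map-by ppr:4:node ./myid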



___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] OpenMPI-2.1.0 problem with executing orted when using SGE

2017-03-22 Thread Reuti

> Am 22.03.2017 um 15:31 schrieb Heinz-Ado Arnolds 
> :
> 
> Dear Reuti,
> 
> thanks a lot, you're right! But why did the default behavior change but not 
> the value of this parameter:
> 
> 2.1.0: MCA plm rsh: parameter "plm_rsh_agent" (current value: "ssh : rsh", 
> data source: default, level: 2 user/detail, type: string, synonyms: 
> pls_rsh_agent, orte_rsh_agent)
>  The command used to launch executables on remote 
> nodes (typically either "ssh" or "rsh")
> 
> 1.10.6:  MCA plm: parameter "plm_rsh_agent" (current value: "ssh : rsh", data 
> source: default, level: 2 user/detail, type: string, synonyms: pls_rsh_agent, 
> orte_rsh_agent)
>  The command used to launch executables on remote 
> nodes (typically either "ssh" or "rsh")
> 
> That means there must have been changes in the code regarding that, perhaps 
> for detecting SGE? Do you know of a way to revert to the old style (e.g. 
> configure option)? Otherwise all my users have to add this option.

There was a discussion in https://github.com/open-mpi/ompi/issues/2947

For now you can make use of 
https://www.open-mpi.org/faq/?category=tuning#setting-mca-params

Essentially to have it set for all users automatically, put:

plm_rsh_agent=foo

in $prefix/etc/openmpi-mca-params.conf of your central Open MPI 2.1.0 
installation.

-- Reuti


> Thanks again, and have a nice day
> 
> Ado Arnolds
> 
> On 22.03.2017 13:58, Reuti wrote:
>> Hi,
>> 
>>> Am 22.03.2017 um 10:44 schrieb Heinz-Ado Arnolds 
>>> :
>>> 
>>> Dear users and developers,
>>> 
>>> first of all many thanks for all the great work you have done for OpenMPI!
>>> 
>>> Up to OpenMPI-1.10.6 the mechanism for starting orted was to use SGE/qrsh:
>>> mpirun -np 8 --map-by ppr:4:node ./myid
>>> /opt/sge-8.1.8/bin/lx-amd64/qrsh -inherit -nostdin -V >> Machine> orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess 
>>> "env" -mca orte_ess_jobid "1621884928" -mca orte_ess_vpid 1 -mca 
>>> orte_ess_num_procs "2" -mca orte_hnp_uri "1621884928.0;tcp://>> Master>:41031" -mca plm "rsh" -mca rmaps_base_mapping_policy "ppr:4:node" 
>>> --tree-spawn
>>> 
>>> Now with OpenMPI-2.1.0 (and the release candidates) "ssh" is used to start 
>>> orted:
>>> mpirun -np 8 --map-by ppr:4:node -mca mca_base_env_list OMP_NUM_THREADS=5 
>>> ./myid
>>> /usr/bin/ssh -x  
>>> PATH=/afs/./openmpi-2.1.0/bin:$PATH ; export PATH ; 
>>> LD_LIBRARY_PATH=/afs/./openmpi-2.1.0/lib:$LD_LIBRARY_PATH ; export 
>>> LD_LIBRARY_PATH ; 
>>> DYLD_LIBRARY_PATH=/afs/./openmpi-2.1.0/lib:$DYLD_LIBRARY_PATH ; export 
>>> DYLD_LIBRARY_PATH ;   /afs/./openmpi-2.1.0/bin/orted --hnp-topo-sig 
>>> 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess "env" -mca ess_base_jobid 
>>> "1626013696" -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca 
>>> orte_hnp_uri "1626013696.0;usock;tcp://:43019" -mca 
>>> plm_rsh_args "-x" -mca plm "rsh" -mca rmaps_base_mapping_policy 
>>> "ppr:4:node" -mca pmix "^s1,s2,cray"
>>> 
>>> qrsh set the environment properly on the remote side, so that environment 
>>> variables from job scripts are properly transferred. With the ssh variant 
>>> the environment is not set properly on the remote side, and it seems that 
>>> there are handling problems with Kerberos tickets and/or AFS tokens.
>>> 
>>> Is there any way to revert the 2.1.0 behavior to the 1.10.6 (use SGE/qrsh) 
>>> one? Are there mca params to set this?
>>> 
>>> If you need more info, please let me know. (Job submitting machine and 
>>> target cluster are the same with all tests. SW is residing in AFS 
>>> directories visible on all machines. Parameter "plm_rsh_disable_qrsh" 
>>> current value: "false")
>> 
>> It looks like `mpirun` still needs:
>> 
>> -mca plm_rsh_agent foo
>> 
>> to allow SGE to be detected.
>> 
>> -- Reuti
>> 
>> 
>> 
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users



___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Help with Open MPI 2.1.0 and PGI 16.10: Configure and C++

2017-03-23 Thread Reuti
Hi,

Am 22.03.2017 um 20:12 schrieb Matt Thompson:

> […]
> 
> Ah. PGI 16.9+ now use pgc++ to do C++ compiling, not pgcpp. So, I hacked 
> configure so that references to pgCC (nonexistent on macOS) are gone and all 
> pgcpp became pgc++, but:

This is not unique to macOS. pgCC used the STLPort STL and is no longer included 
with their compiler suite; pgc++ now uses a GCC-compatible library format and 
replaces the former one on Linux too.

There I get, ignoring the gnu output during `configure` and compiling anyway:

$ mpic++ --version

pgc++ 16.10-0 64-bit target on x86-64 Linux -tp bulldozer
The Portland Group - PGI Compilers and Tools
Copyright (c) 2016, NVIDIA CORPORATION.  All rights reserved.

Maybe some options for the `mpic++` wrapper were just set in a wrong way?

===

Nevertheless: did you see the error on the Mac at the end of the `configure` 
step too, or was it gone after the hints in the discussion link you posted? 
As I see it, there is still one about "libevent".

-- Reuti
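
To inspect what the wrapper actually passes on, a sketch (these options exist
for all Open MPI compiler wrappers):

$ mpic++ --showme:compile
$ mpic++ --showme:link
# both print the flags the wrapper adds around pgc++, which is where a wrongly
# set option would show up.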


> 
> *** C++ compiler and preprocessor
> checking whether we are using the GNU C++ compiler... yes
> checking whether pgc++ accepts -g... yes
> checking dependency style of pgc++... none
> checking how to run the C++ preprocessor... pgc++ -E
> checking for the C++ compiler vendor... gnu
> 
> Well, at this point, I think I'm stopping until I get help. Will this chunk 
> of configure always return gnu for PGI? I know the C part returns 'portland 
> group':
> 
> *** C compiler and preprocessor
> checking for gcc... (cached) pgcc
> checking whether we are using the GNU C compiler... (cached) no
> checking whether pgcc accepts -g... (cached) yes
> checking for pgcc option to accept ISO C89... (cached) none needed
> checking whether pgcc understands -c and -o together... (cached) yes
> checking for pgcc option to accept ISO C99... none needed
> checking for the C compiler vendor... portland group
> 
> so I thought the C++ section would as well. I also tried passing in 
> --enable-mpi-cxx, but that did nothing.
> 
> Is this just a red herring? My real concern is with pgfortran/mpifort, but I 
> thought I'd start with this. If this is okay, I'll move on and detail the 
> fortran issues I'm having.
> 
> Matt
> --
> Matt Thompson
> Man Among Men
> Fulcrum of History
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users



___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Performance degradation of OpenMPI 1.10.2 when oversubscribed?

2017-03-24 Thread Reuti
Hi,

Am 24.03.2017 um 20:39 schrieb Jeff Squyres (jsquyres):

> Limiting MPI processes to hyperthreads *helps*, but current generation Intel 
> hyperthreads are not as powerful as cores (they have roughly half the 
> resources of a core), so -- depending on your application and your exact 
> system setup -- you will almost certainly see performance degradation of 
> running N MPI processes across N cores vs. across N hyper threads.  You can 
> try it yourself by running the same size application over N cores on a single 
> machine, and then run the same application over N hyper threads (i.e., N/2 
> cores) on the same machine.
> 
> […]
> 
> - Disabling HT in the BIOS means that the one hardware thread left in each 
> core will get all the cores resources (buffers, queues, processor units, 
> etc.).
> - Enabling HT in the BIOS means that each of the 2 hardware threads will 
> statically be allocated roughly half the core's resources (buffers, queues, 
> processor units, etc.).

Do you have a reference for the two topics above (sure, I will try it next week on 
my own)? My understanding was that there is no dedicated HT core, and that using 
all hardware threads will not give the result that the real cores get N x 100% 
plus the HT ones N x 50% (or the like); instead, the scheduler inside the CPU 
balances the resources between the two faces of a single core, and both are equal.


> […]
> Spoiler alert: many people have looked at this.  In *most* (but not all) 
> cases, using HT is not a performance win for MPI/HPC codes that are designed 
> to run processors at 100%.

I think it was also on this mailing list that someone mentioned that the 
pipelines in the CPU are reorganized in case you switch HT off, as only half of 
them would be needed and these resources are then bound to the real cores too, 
extending their performance. Similar, but not exactly what Jeff mentions above.

Another aspect is that even if they do not really double the performance, 
one might get 150%. And if you pay per CPU hour, it can be worth having it 
switched on.

My personal experience is that it depends not only on the application, but also on 
the way you oversubscribe. Using all cores for a single MPI application 
leads to the effect that all processes are doing the same stuff at the same 
time (at least often) and fight for the same part of the CPU, essentially 
becoming a bottleneck. But using each half of a CPU for two (or even more) 
applications will allow a better interleaving in the demand for resources. To 
allow this in the best way: no taskset and no binding to cores, let the Linux 
kernel and the CPU do their best - YMMV.

-- Reuti
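
A sketch of the comparison Jeff suggests, assuming a node with 16 physical
cores and 32 hardware threads (adapt the counts to the actual machine):

# N ranks, one per physical core:
$ mpirun -np 16 --map-by core --bind-to core ./app
# the same N ranks packed onto hardware threads, i.e. onto N/2 cores:
$ mpirun -np 16 --use-hwthread-cpus --map-by hwthread --bind-to hwthread ./app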
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] Compiler error with PGI: pgcc-Error-Unknown switch: -pthread

2017-04-03 Thread Reuti


Am 03.04.2017 um 22:01 schrieb Prentice Bisbal:

> Nevermind. A coworker helped me figure this one out. Echo is treating the 
> '-E' as an argument to echo and interpreting it instead of passing it to sed. 
> Since that's used by the configure tests, that's a bit of a problem, Just 
> adding another -E before $@, should fix the problem.

It's often suggested to use printf instead of the non-portable echo.

-- Reuti


> 
> Prentice
> 
> On 04/03/2017 03:54 PM, Prentice Bisbal wrote:
>> I've decided to work around this problem by creating a wrapper script for 
>> pgcc that strips away the -pthread argument, but my sed expression works on 
>> the command-line, but not in the script. I'm essentially reproducing the 
>> workaround from 
>> https://www.open-mpi.org/community/lists/users/2009/04/8724.php.
>> 
>> Can anyone see what's wrong with my implementation the workaround? It's a 
>> very simple sed expression. Here's my script:
>> 
>> #!/bin/bash
>> 
>> realcmd=/path/to/pgcc
>> echo "original args: $@"
>> newargs=$(echo "$@" | sed s/-pthread//)
>> echo "new args: $newargs"
>> #$realcmd $newargs
>> exit
>> 
>> And here's what happens when I run it:
>> 
>> /path/to/pgcc -E conftest.c
>> original args: -E conftest.c
>> new args: conftest.c
>> 
>> As you can see, the -E argument is getting lost in translation. If I add 
>> more arguments, it works fine:
>> 
>> /path/to/pgcc -A -B -C -D -E conftest.c
>> original args: -A -B -C -D -E conftest.c
>> new args: -A -B -C -D -E conftest.c
>> 
>> It only seems to be a problem when -E is the first argument:
>> 
>> $ /path/to/pgcc -E -D -C -B -A conftest.c
>> original args: -E -D -C -B -A conftest.c
>> new args: -D -C -B -A conftest.c
>> 
>> Prentice
>> 
>> On 04/03/2017 02:24 PM, Aaron Knister wrote:
>>> To be thorough couldn't one replace -pthread in the slurm .la files with 
>>> -lpthread? I ran into this last week and this was the solution I was 
>>> thinking about implementing. Having said that, I can't think of a situation 
>>> in which the -pthread/-lpthread argument would be required other than 
>>> linking against statically compiled SLURM libraries and even then I'm not 
>>> so sure about that.
>>> 
>>> -Aaron
>>> 
>>> On 4/3/17 1:46 PM, Åke Sandgren wrote:
>>>> We build slurm with GCC, drop the -pthread arg in the .la files, and
>>>> have never seen any problems related to that. And we do build quite a
>>>> lot of code. And lots of versions of OpenMPI with multiple different
>>>> compilers (and versions).
>>>> 
>>>> On 04/03/2017 04:51 PM, Prentice Bisbal wrote:
>>>>> This is the second suggestion to rebuild Slurm
>>>>> 
>>>>> The  other from Åke Sandgren, who recommended this:
>>>>> 
>>>>>> This usually comes from slurm, so we always do
>>>>>> 
>>>>>> perl -pi -e 's/-pthread//' /lap/slurm/${version}/lib/libpmi.la
>>>>>> /lap/slurm/${version}/lib/libslurm.la
>>>>>> 
>>>>>> when installing a new slurm version. Thus no need for a fakepg wrapper.
>>>>> 
>>>>> I don't really have the luxury to rebuild Slurm at the moment. How would
>>>>> I rebuild Slurm to change this behavior? Is rebuilding Slurm with PGI
>>>>> the only option to fix this in slurm, or use Åke's suggestion above?
>>>>> 
>>>>> If I did use Åke's suggestion above, how would that affect the operation
>>>>> of Slurm, or future builds of OpenMPI and any other software that might
>>>>> rely on Slurm, particulary with regards to building those apps with
>>>>> non-PGI compilers?
>>>>> 
>>>>> Prentice
>>>>> 
>>>>> On 04/03/2017 10:31 AM, Gilles Gouaillardet wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> The -pthread flag is likely pulled by libtool from the slurm libmpi.la
>>>>>> <http://libmpi.la> and/or libslurm.la <http://libslurm.la>
>>>>>> Workarounds are
>>>>>> - rebuild slurm with PGI
>>>>>> - remove the .la files (*.so and/or *.a are enough)
>>>>>> - wrap the PGI compiler to ignore the -pthread option
>>

Re: [OMPI users] Compiler error with PGI: pgcc-Error-Unknown switch: -pthread

2017-04-03 Thread Reuti


Am 03.04.2017 um 23:07 schrieb Prentice Bisbal:

> FYI - the proposed 'here-doc' solution below didn't work for me, it produced 
> an error. Neither did printf. When I used printf, only the first arg was 
> passed along:
> 
> #!/bin/bash
> 
> realcmd=/usr/pppl/pgi/17.3/linux86-64/17.3/bin/pgcc.real
> echo "original args: $@"
> newargs=$(printf -- "$@" | sed s/-pthread//g)

The format string is missing:

printf "%s " "$@"


> echo "new args: $newargs"
> #$realcmd $newargs
> exit
> 
> $ pgcc -tp=x64 -fast conftest.c
> original args: -tp=x64 -fast conftest.c
> new args: -tp=x64
> 
> Any ideas what I might be doing wrong here?
> 
> So, my original echo "" "$@" solution works, and another colleague also 
> suggested this expression, which appears to work, too:
> 
> newargs=${@/-pthread/}
> 
> Although I don't know how portable that is. I'm guessing that's very 
> bash-specific syntax.
> 
> Prentice
> 
> On 04/03/2017 04:26 PM, Prentice Bisbal wrote:
>> A coworker came up with another idea that works, too:
>> 
>> newargs=`sed s/-pthread//g <<EOF
>> $@
>> EOF`
>> 
>> That should work, too, but I haven't tested it.
>> 
>> Prentice
>> 
>> On 04/03/2017 04:11 PM, Andy Riebs wrote:
>>> Try
>>> $ printf -- "-E" ...
>>> 
>>> On 04/03/2017 04:03 PM, Prentice Bisbal wrote:
 Okay. the additional -E doesn't work,either. :(
 
 Prentice Bisbal Lead Software Engineer Princeton Plasma Physics Laboratory 
 http://www.pppl.gov
 On 04/03/2017 04:01 PM, Prentice Bisbal wrote:
> Nevermind. A coworker helped me figure this one out. Echo is treating the 
> '-E' as an argument to echo and interpreting it instead of passing it to 
> sed. Since that's used by the configure tests, that's a bit of a problem, 
> Just adding another -E before $@, should fix the problem.
> 
> Prentice
> 
> On 04/03/2017 03:54 PM, Prentice Bisbal wrote:
>> I've decided to work around this problem by creating a wrapper script 
>> for pgcc that strips away the -pthread argument, but my sed expression 
>> works on the command-line, but not in the script. I'm essentially 
>> reproducing the workaround from 
>> https://www.open-mpi.org/community/lists/users/2009/04/8724.php.
>> 
>> Can anyone see what's wrong with my implementation the workaround? It's 
>> a very simple sed expression. Here's my script:
>> 
>> #!/bin/bash
>> 
>> realcmd=/path/to/pgcc
>> echo "original args: $@"
>> newargs=$(echo "$@" | sed s/-pthread//)
>> echo "new args: $newargs"
>> #$realcmd $newargs
>> exit
>> 
>> And here's what happens when I run it:
>> 
>> /path/to/pgcc -E conftest.c
>> original args: -E conftest.c
>> new args: conftest.c
>> 
>> As you can see, the -E argument is getting lost in translation. If I add 
>> more arguments, it works fine:
>> 
>> /path/to/pgcc -A -B -C -D -E conftest.c
>> original args: -A -B -C -D -E conftest.c
>> new args: -A -B -C -D -E conftest.c
>> 
>> It only seems to be a problem when -E is the first argument:
>> 
>> $ /path/to/pgcc -E -D -C -B -A conftest.c
>> original args: -E -D -C -B -A conftest.c
>> new args: -D -C -B -A conftest.c
>> 
>> Prentice
>> 
>> On 04/03/2017 02:24 PM, Aaron Knister wrote:
>>> To be thorough couldn't one replace -pthread in the slurm .la files 
>>> with -lpthread? I ran into this last week and this was the solution I 
>>> was thinking about implementing. Having said that, I can't think of a 
>>> situation in which the -pthread/-lpthread argument would be required 
>>> other than linking against statically compiled SLURM libraries and even 
>>> then I'm not so sure about that.
>>> 
>>> -Aaron
>>> 
 On 4/3/17 1:46 PM, Åke Sandgren wrote:
 We build slurm with GCC, drop the -pthread arg in the .la files, and
 have never seen any problems related to that. And we do build quite a
 lot of code. And lots of versions of OpenMPI with multiple different
 compilers (and versions).
 
 On 04/03/2017 04:51 PM, Prentice Bisbal wrote:
> This is the second suggestion to rebuild Slurm
> 
> The other from Åke Sandgren, who recommended this:
> 
>> This usually comes from slurm, so we always do
>> 
>> perl -pi -e 's/-pthread//' /lap/slurm/${version}/lib/libpmi.la
>> /lap/slurm/${version}/lib/libslurm.la
>> 
>> when installing a new slurm version. Thus no need for a fakepg 
>> wrapper.
> 
> I don't really have the luxury to rebuild Slurm at the moment. How 
> would
> I rebuild Slurm to change this behavior? Is rebuilding Slurm with PGI
> the only option to fix this in slurm, or use Åke's suggestion above?
> 
> If I did use Åke's suggestion above, how would that affect t
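
Putting Reuti's printf fix into the wrapper from this thread, the whole script
might look like this (a sketch, untested here; the path to the real pgcc must
be adjusted, and arguments containing spaces are not handled, just as in the
original script):

#!/bin/bash
# wrapper that strips -pthread before calling the real PGI compiler
realcmd=/path/to/pgcc.real
newargs=$(printf '%s ' "$@" | sed 's/-pthread//g')
exec $realcmd $newargs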

Re: [OMPI users] mpicc and libstdc++, general building question

2017-04-07 Thread Reuti

Hi,

Am 07.04.2017 um 19:11 schrieb Christof Koehler:

> […]
> 
> On top, all OpenMPI libraries when checked with ldd (gcc 6.3 module
> still loaded) reference the /usr/lib64/libstdc++.so.6 and not
> /cluster/comp/gcc/6.3.0/lib64/libstdc++.so.6 which leads to the idea
> that the OpenMPI installation might be the reason we have the
> /usr/lib64/libstdc++.so.6 dependency in the mpi4py libraries as well.
> 
> What we would like to have is that libstdc++.so.6 resolves to the
> libstdc++ provided by the gcc 6.3 compiler for the mpi4py, which would be 
> available in its installation directory, i.e. 
> /cluster/comp/gcc/6.3.0/lib64/libstdc++.so.6.
> 
> So, am I missing options in my OpenMPI build ? Should I explicitely do a
> ./configure CC=/cluster/comp/gcc/6.3.0/bin/gcc 
> CXX=/cluster/comp/gcc/6.3.0/bin/g++ ...
> or similar ? Am I building it correctly with a gcc contained in a
> separate module anyway ? Or do we have a problem with our ld configuration ?

I have a default GCC 4.7.2 in the system, and I just prepend PATH and 
LD_LIBRARY_PATH with the paths to my private GCC 6.2.0 compilation in my home 
directory by a plain `export`. I can spot:

$ ldd libmpi_cxx.so.20
…
libstdc++.so.6 => 
/home/reuti/local/gcc-6.2.0/lib64/../lib64/libstdc++.so.6 (0x7f184d2e2000)

So this looks fine (although /lib64/../lib64/ looks nasty). In the library, the 
RPATH and RUNPATH are set:

$ readelf -a libmpi_cxx.so.20
…
 0x000f (RPATH)  Library rpath: 
[/home/reuti/local/openmpi-2.1.0_gcc-6.2.0_shared/lib64:/home/reuti/local/gcc-6.2.0/lib64/../lib64]
 0x0000001d (RUNPATH)Library runpath: 
[/home/reuti/local/openmpi-2.1.0_gcc-6.2.0_shared/lib64:/home/reuti/local/gcc-6.2.0/lib64/../lib64]

Can you check the order in your PATH and LD_LIBRARY_PATH – are they as expected 
when loading the module?

-- Reuti
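
The exports behind the setup described above would be something like this (the
bin path is an assumption, the lib64 path matches the ldd output):

$ export PATH=/home/reuti/local/gcc-6.2.0/bin:$PATH
$ export LD_LIBRARY_PATH=/home/reuti/local/gcc-6.2.0/lib64:$LD_LIBRARY_PATH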
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

[OMPI users] No more default core binding since 2.0.2?

2017-04-09 Thread Reuti
Hi,

While I noticed an automatic core binding in Open MPI 1.8 (which in a shared 
cluster may lead to oversubscribing of cores), I can't spot this any longer in 
the 2.x series. So the question arises:

- Was this a general decision to no longer enable automatic core binding?

First I thought it might be because of:

- We define plm_rsh_agent=foo in $OMPI_ROOT/etc/openmpi-mca-params.conf
- We compiled with --with-sge

But even when started on the command line via `ssh` to the nodes, no automatic 
core binding seems to take place any longer.

-- Reuti
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] No more default core binding since 2.0.2?

2017-04-09 Thread Reuti

Hi,

Am 09.04.2017 um 16:35 schrieb r...@open-mpi.org:

> There has been no change in the policy - however, if you are oversubscribed, 
> we did fix a bug to ensure that we don’t auto-bind in that situation
> 
> Can you pass along your cmd line? So far as I can tell, it still seems to be 
> working.

I'm not sure whether it was the case with 1.8, but according to the man page it 
now binds to sockets for numbers of processes > 2. And this can lead to the effect 
that one sometimes notices a drop in performance when just this socket happens 
to have other jobs running.

So, this is solved - I wasn't aware of the binding by socket.

But I can't see a binding by core for number of processes <= 2. Does it mean 2 
per node or 2 overall for the `mpiexec`? 

-- Reuti


> 
>> On Apr 9, 2017, at 3:40 AM, Reuti  wrote:
>> 
>> Hi,
>> 
>> While I noticed an automatic core binding in Open MPI 1.8 (which in a shared 
>> cluster may lead to oversubscribing of cores), I can't spot this any longer 
>> in the 2.x series. So the question arises:
>> 
>> - Was this a general decision to no longer enable automatic core binding?
>> 
>> First I thought it might be because of:
>> 
>> - We define plm_rsh_agent=foo in $OMPI_ROOT/etc/openmpi-mca-params.conf
>> - We compiled with --with-sge
>> 
>> But also started on the command line by `ssh` to the nodes, there seems no 
>> automatic core binding to take place any longer.
>> 
>> -- Reuti
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> 

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] No more default core binding since 2.0.2?

2017-04-09 Thread Reuti
>> But I can't see a binding by core for number of processes <= 2. Does it mean 
>> 2 per node or 2 overall for the `mpiexec`? 
> 
> It’s 2 processes overall

Having a round-robin allocation in the cluster, this might not be what was 
intended (to bind only one or two cores per exechost)?

Obviously the default changes (from --bind-to core to --bind-to socket), whether 
I compiled Open MPI with or w/o libnuma (I wanted to get rid of the warning in 
the output only – now it works). But "--bind-to core" I could also use w/o 
libnuma and it worked; I only got that additional warning that the memory 
couldn't be bound.

BTW: I always had to use -ldl when using `mpicc`. Now that I compiled in 
libnuma, this necessity is gone.

-- Reuti
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] No more default core binding since 2.0.2?

2017-04-10 Thread Reuti

> Am 10.04.2017 um 01:58 schrieb r...@open-mpi.org:
> 
> Let me try to clarify. If you launch a job that has only 1 or 2 processes in 
> it (total), then we bind to core by default. This is done because a job that 
> small is almost always some kind of benchmark.

Yes, I see. But only if libnuma was compiled in AFAICS.


> If there are more than 2 processes in the job (total), then we default to 
> binding to NUMA (if NUMA’s are present - otherwise, to socket) across the 
> entire job.

Mmh - can I spot a difference in --report-bindings between these two? To me 
both look like being bound to the socket.

-- Reuti
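
For the record, a sketch of how to check and override the default (both options
appear elsewhere in this thread):

$ mpirun -np 4 --report-bindings ./app
# force the old behavior explicitly:
$ mpirun -np 4 --bind-to core --report-bindings ./app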


> You can always override these behaviors.
> 
>> On Apr 9, 2017, at 3:45 PM, Reuti  wrote:
>> 
>>>> But I can't see a binding by core for number of processes <= 2. Does it 
>>>> mean 2 per node or 2 overall for the `mpiexec`?
>>> 
>>> It’s 2 processes overall
>> 
>> Having a round-robin allocation in the cluster, this might not be what was 
>> intended (to bind only one or two cores per exechost)?
>> 
>> Obviously the default changes (from --bind-to core to --bin-to socket), 
>> whether I compiled Open MPI with or w/o libnuma (I wanted to get rid of the 
>> warning in the output only – now it works). But "--bind-to core" I could 
>> also use w/o libnuma and it worked, I got only that warning in addition 
>> about the memory couldn't be bound.
>> 
>> BTW: I always had to use -ldl when using `mpicc`. Now, that I compiled in 
>> libnuma, this necessity is gone.
>> 
>> -- Reuti
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> 



___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] No more default core binding since 2.0.2?

2017-04-10 Thread Reuti

> Am 10.04.2017 um 00:45 schrieb Reuti :
> […]BTW: I always had to use -ldl when using `mpicc`. Now, that I compiled in 
> libnuma, this necessity is gone.

Looks like I compiled too many versions in the last couple of days. The -ldl is 
necessary in case --disable-shared --enable-static was given to have a plain 
static version.

-- Reuti


___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] No more default core binding since 2.0.2?

2017-04-10 Thread Reuti

> Am 10.04.2017 um 17:27 schrieb r...@open-mpi.org:
> 
> 
>> On Apr 10, 2017, at 1:37 AM, Reuti  wrote:
>> 
>>> 
>>> Am 10.04.2017 um 01:58 schrieb r...@open-mpi.org:
>>> 
>>> Let me try to clarify. If you launch a job that has only 1 or 2 processes 
>>> in it (total), then we bind to core by default. This is done because a job 
>>> that small is almost always some kind of benchmark.
>> 
>> Yes, I see. But only if libnuma was compiled in AFAICS.
>> 
>> 
>>> If there are more than 2 processes in the job (total), then we default to 
>>> binding to NUMA (if NUMA’s are present - otherwise, to socket) across the 
>>> entire job.
>> 
>> Mmh - can I spot a difference in --report-bindings between these two? To me 
>> both looks like being bound to socket.
> 
> You won’t see a difference if the NUMA and socket are identical in terms of 
> the cores they cover.

Ok, thx.


>> 
>> -- Reuti
>> 
>> 
>>> You can always override these behaviors.
>>> 
>>>> On Apr 9, 2017, at 3:45 PM, Reuti  wrote:
>>>> 
>>>>>> But I can't see a binding by core for number of processes <= 2. Does it 
>>>>>> mean 2 per node or 2 overall for the `mpiexec`?
>>>>> 
>>>>> It’s 2 processes overall
>>>> 
>>>> Having a round-robin allocation in the cluster, this might not be what was 
>>>> intended (to bind only one or two cores per exechost)?
>>>> 
>>>> Obviously the default changes (from --bind-to core to --bin-to socket), 
>>>> whether I compiled Open MPI with or w/o libnuma (I wanted to get rid of 
>>>> the warning in the output only – now it works). But "--bind-to core" I 
>>>> could also use w/o libnuma and it worked, I got only that warning in 
>>>> addition about the memory couldn't be bound.
>>>> 
>>>> BTW: I always had to use -ldl when using `mpicc`. Now, that I compiled in 
>>>> libnuma, this necessity is gone.
>>>> 
>>>> -- Reuti
>>>> ___
>>>> users mailing list
>>>> users@lists.open-mpi.org
>>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>>> 
>>> ___
>>> users mailing list
>>> users@lists.open-mpi.org
>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>>> 
>> 
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users



___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Run-time issues with openmpi-2.0.2 and gcc

2017-04-13 Thread Reuti
Hi,

> Am 13.04.2017 um 11:00 schrieb Vincent Drach :
> 
> 
> Dear mailing list,
> 
> We are experiencing run-time failures on a small cluster with openmpi-2.0.2 
> and gcc 6.3 and gcc 5.4.
> The job starts normally and lots of communication is performed. After 5-10 
> minutes the connection to the hosts is closed and
> the following error message is reported:
> --
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
> 
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
> 
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
> 
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
> 
> *  compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
> 
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> 
> 
> 
> The issue does not seem to be due to the InfiniBand configuration, because 
> the job also crashes when using the TCP protocol.
> 
> Do you have any clue what the issue could be?

Is it a single MPI run, or is the application issuing many `mpiexec` calls 
during its runtime?

Is there any limit on how often `ssh` may access a node within a certain 
timeframe? Do you use any queuing system?
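
Just a guess: if many `ssh` connections are opened to a node in a short time, 
sshd's own throttling may drop some of them. Assuming OpenSSH, the relevant 
settings on the nodes could be checked with:

$ grep -iE 'maxstartups|maxsessions' /etc/ssh/sshd_config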

-- Reuti




Re: [OMPI users] openmpi-2.0.2

2017-04-20 Thread Reuti
Judging by the last post in this thread, the copying I suggested seems not to 
be possible, but I also want to test whether this post goes through to the list 
now.

-- Reuti

===

Hi,

> Am 19.04.2017 um 19:53 schrieb Jim Edwards :
> 
> Hi,
> 
> I have openmpi-2.0.2 builds on two different machines and I have a test code 
> which works on one machine and does not on the other machine.  I'm struggling 
> to understand why and I hope that by posting here someone may have some 
> insight.
> 
> The test uses MPI derived data types and MPI_Alltoallw on 4 tasks. On the 
> machine that fails it appears to ignore the displacement in the derived 
> datatype defined on task 0 and just sends 0-3 to all tasks. The failing 
> machine is built against gcc 5.4.0; the working machine has both intel 16.0.3 
> and gcc 6.3.0 builds.

And what happens when you copy one build from one machine to the other 
(including all referenced shared libraries)?
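
I.e. something along these lines (only a sketch – the file name is made up, and 
`ldd` assumes a Linux system):

$ mpicc alltoallw_test.c -o alltoallw_test
$ ldd ./alltoallw_test      # lists the shared libraries which would have to be copied along
$ mpirun -np 4 ./alltoallw_test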

-- Reuti


> #include "mpi.h"
> 
> #include 
> 
> 
> 
> int main(int argc, char *argv[])
> 
> {
> 
>  int rank, size;
> 
>  MPI_Datatype type[4], type2[4];
> 
>  int displacement[1];
> 
>  int sbuffer[16];
> 
>  int rbuffer[4];
> 
>  MPI_Status status;
> 
>  int scnts[4], sdispls[4], rcnts[4], rdispls[4];
> 
> 
>  MPI_Init(&argc, &argv);
> 
>  MPI_Comm_size(MPI_COMM_WORLD, &size);
> 
>  if (size < 4)
> 
>  {
> 
>  printf("Please run with 4 processes.\n");
> 
>  MPI_Finalize();
> 
>  return 1;
> 
>  }
> 
>  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> 
> 
> 
>  /* task 0 has sbuffer of size 16 and we are going to send 4 values to each 
> of tasks 0-3, offsetting in each
> 
> case so that the expected result is
> 
> task[0] 0-3
> 
> task[1] 4-7
> 
> task[2] 8-11
> 
> task[3] 12-15
> 
>  */
> 
> 
> 
> 
> 
>  for( int i=0; i 
>if (rank == 0){
> 
> scnts[i] = 1;
> 
>}else{
> 
> scnts[i] = 0;
> 
>}
> 
>sdispls[i] = 0;
> 
>rcnts[i] = 0;
> 
>rdispls[i] = 0;
> 
>  }
> 
>  rcnts[0] = 1;
> 
> 
>  for (int i=0; i 
>type[i] = MPI_INT;
> 
>type2[i] = MPI_INT;
> 
>rbuffer[i] = -1;
> 
>  }
> 
>/* on the recv side we create a data type which is a single block of 4 
> integers for the recv from 0
> 
> otherwise we use MPI_INT as a placeholder for the type
> 
> (openmpi does not want us to use MPI_DATATYPE_NULL a stupid misinterpretation 
> of the standard imho )*/
> 
> 
> 
>  displacement[0] = 0;
> 
>  MPI_Type_create_indexed_block(1, 4, displacement, MPI_INT, type2);
> 
>  MPI_Type_commit(type2);
> 
> 
> 
>  if (rank == 0)
> 
>  {
> 
>for( int i=0; i 
> displacement[0] = i*4;
> 
> /* we create a datatype which is a single block of 4 integers with offset 4 
> from the start of sbuffer */
> 
> MPI_Type_create_indexed_block(1, 4, displacement, MPI_INT, type + i);
> 
> MPI_Type_commit(type+i);
> 
>}
> 
>for (int i=0; i<16; i++)
> 
> sbuffer[i] = i;
> 
>  }
> 
> 
> 
>  for (int i=0; i 
>printf("rank %d i=%d: scnts %d sdispls %d stype %d rcnts %d rdispls %d 
> rtype %d\n", rank, i, scnts[i], sdispls[i], type[i], rcnts[i], rdispls[i], 
> type2[i]);
> 
> 
>  MPI_Alltoallw(sbuffer, scnts, sdispls, type, rbuffer, rcnts, rdispls, type2, 
> MPI_COMM_WORLD);
> 
> 
> 
>  for (int i=0; i<4; i++)
> 
>printf("rbuffer[%d] = %d\n", i, rbuffer[i]);
> 
>  fflush(stdout);
> 
> 
> 
>  MPI_Finalize();
> 
>  return 0;
> 
> }
> 
> 
> --
> Jim Edwards
> 
> CESM Software Engineer
> National Center for Atmospheric Research
> Boulder, CO


[OMPI users] Behavior of `ompi_info`

2017-04-25 Thread Reuti
Hi,

In case Open MPI is moved to a different location than it was installed into 
initially, one has to export OPAL_PREFIX. While checking for the availability 
of the GridEngine integration, I exported OPAL_PREFIX but obviously with a typo 
and came to the conclusion that it's not available, as I did a:

$ export PATH=…
$ export LD_LIBRARY_PATH=…
$ export OPAL_PREFIX=anything_with_a_typo_or_nothing_at_all
$ ompi_info | grep grid

There was no indication that `ompi_info` listed only part of the usual output 
because of the faulty OPAL_PREFIX. When I recheck now, even the exit code of 
`ompi_info` is still 0 in this case.

I suggest including some tests in `ompi_info` to check whether the set 
OPAL_PREFIX makes sense. For now it just lists the set value in the "Prefix:" 
line and that's all.

Expected behavior: if `ompi_info` can't find any modules in the specified 
place, an appropriate message should go to stderr and the exit code should be 
set to 1.
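
For illustration, this is essentially what I did (the value of OPAL_PREFIX is 
deliberately bogus here):

$ export OPAL_PREFIX=/nonexistent/path
$ ompi_info > /dev/null
$ echo $?
0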

-- Reuti


Re: [OMPI users] Behavior of `ompi_info`

2017-05-01 Thread Reuti


Am 25.04.2017 um 17:27 schrieb Reuti:

> Hi,
> 
> In case Open MPI is moved to a different location than it was installed into 
> initially, one has to export OPAL_PREFIX. While checking for the availability 
> of the GridEngine integration, I exported OPAL_PREFIX but obviously with a 
> typo and came to the conclusion that it's not available, as I did a:
> 
> $ export PATH=…
> $ export LD_LIBRARY_PATH=…
> $ export OPAL_PREFIX=anything_with_a_typo_or_nothing_at_all
> $ ompi_info | grep grid
> 
> There was no indication that `ompi_info` listed only part of the usual 
> output because of the faulty OPAL_PREFIX. When I recheck now, even the exit 
> code of `ompi_info` is still 0 in this case.
> 
> I suggest including some tests in `ompi_info` to check whether the set 
> OPAL_PREFIX makes sense. For now it just lists the set value in the "Prefix:" 
> line and that's all.
> 
> Expected behavior: if `ompi_info` can't find any modules in the specified 
> place, an appropriate message should go to stderr and the exit code should be 
> set to 1.

No one commented on this. As the purpose of `ompi_info` is to check the 
installation (according to its man page), something should really be improved 
here.

It's even possible to have both a 2.0.2 and a 2.1.0 installed, but of course 
only one of them in the $PATH, which may lead to:

$ ~/local/openmpi-2.0.2_gcc-4.7.2_shared/bin/ompi_info -V
Open MPI v2.0.2

Looks fine, but getting the full output:

$ ~/local/openmpi-2.0.2_gcc-4.7.2_shared/bin/ompi_info
 Package: Open MPI reuti Distribution
Open MPI: 2.1.0


shows a different Open MPI version, as `mpiexec` and the like in the currently 
set $PATH belong to 2.1.0, not 2.0.2. The version printed by -V seems to be a 
compiled-in constant of `ompi_info`, while the one in the full output comes 
from the loaded shared library. With a static build it's different:

$ ~/local/openmpi-2.0.2_gcc-4.7.2_static/bin/ompi_info -V
Open MPI v2.0.2

$ ~/local/openmpi-2.0.2_gcc-4.7.2_static/bin/ompi_info
 Package: Open MPI reuti Distribution
Open MPI: 2.0.2
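
To see where the "Open MPI:" line of the shared build actually comes from, it 
may help to check which libraries get loaded at run time (assuming Linux; the 
grep pattern is only approximate):

$ ldd ~/local/openmpi-2.0.2_gcc-4.7.2_shared/bin/ompi_info | grep libopen
$ echo $LD_LIBRARY_PATH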


-- Reuti

Re: [OMPI users] MPI the correct solution?

2017-05-08 Thread Reuti

Hi,

Am 08.05.2017 um 23:25 schrieb David Niklas:

> Hello,
> I originally posted this question at LQ, but the answer I got back shows
> rather poor insight on the subject of MPI, so I'm taking the liberty of
> posting here also.
> 
> https://www.linuxquestions.org/questions/showthread.php?p=5707962
> 
> What I'm trying to do is figure out how/what to use to update an osm file
> (open street map), in a cross system manner. I know the correct program
> osmosis and for de/re-compression lbzip2 but how to do this across
> computers is confusing me, even after a few hours of searching online.

lbzip2 is only thread-parallel on a single machine. With pbzip2, which you 
mention, it's the same, but there exists an MPI version, MPIBZIP2 - 
unfortunately it looks unmaintained since 2007. Maybe you can contact the 
author about its state. Without an MPI application like this, the MPI library 
is nothing on its own that would divide and distribute one task to several 
machines automatically.

osmosis itself seems to run in serial only (they don't say a word about 
whether it uses any parallelism).

For the intended task the only option is to use a single machine with as many 
cores as possible AFAICS.

-- Reuti


Re: [OMPI users] IBM Spectrum MPI problem

2017-05-18 Thread Reuti
Hi,

> Am 18.05.2017 um 14:02 schrieb Gabriele Fatigati :
> 
> Dear OpenMPI users and developers, I'm using IBM Spectrum MPI 10.1.0

I noticed this on IBM's website too. Is this freely available? Up to now I was 
always bounced back to their former Platform MPI when trying to download the 
community edition (even the evaluation link on the Spectrum MPI page does the 
same).
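
Independent of that: to check whether the pami component is present at all in 
your installation, something generic like this should work (assuming the 
Spectrum `ompi_info` is first in $PATH):

$ ompi_info | grep -i pami
$ ompi_info | grep 'MCA pml'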

-- Reuti


>  based on OpenMPI, so I hope some MPI experts can help me solve 
> the problem.
> 
> When I run a simple Hello World MPI program, I get the following error message:
> 
> A requested component was not found, or was unable to be opened.  This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded).  Note that
> Open MPI stopped checking at the first component that it did not find.
> 
> Host:  openpower
> Framework: pml
> Component: pami
> --
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
> 
>   mca_pml_base_open() failed
>   --> Returned "Not found" (-13) instead of "Success" (0)
> --
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***and potentially your MPI job)
> 
> My sysadmin used official IBM Spectrum packages to install MPI, so it's quite 
> strange that some components (pami) are missing. Any help? Thanks
> 
> -- 
> Ing. Gabriele Fatigati
> 
> HPC specialist
> 
> SuperComputing Applications and Innovation Department
> 
> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
> 
> www.cineca.itTel:   +39 051 6171722
> 
> g.fatigati [AT] cineca.it  


Re: [OMPI users] MPI the correct solution?

2017-05-19 Thread Reuti
As I think it's not relevant to Open MPI itself, I answered in PM only.

-- Reuti


> Am 18.05.2017 um 18:55 schrieb do...@mail.com:
> 
> On Tue, 9 May 2017 00:30:38 +0200
> Reuti  wrote:
>> Hi,
>> 
>> Am 08.05.2017 um 23:25 schrieb David Niklas:
>> 
>>> Hello,
>>> I originally posted this question at LQ, but the answer I got back
>>> shows rather poor insight on the subject of MPI, so I'm taking the
>>> liberty of posting here also.
>>> 
>>> https://www.linuxquestions.org/questions/showthread.php?p=5707962
>>> 
>>> What I'm trying to do is figure out how/what to use to update an osm
>>> file (open street map), in a cross system manner. I know the correct
>>> program osmosis and for de/re-compression lbzip2 but how to do this
>>> across computers is confusing me, even after a few hours of searching
>>> online.  
>> 
>> lbzip2 is only thread-parallel on a single machine. With pbzip2, which you
>> mention, it's the same, but there exists an MPI version, MPIBZIP2 -
> I can't find the project, do you have a link?
> 
>> unfortunately it looks unmaintained since 2007. Maybe you can contact
>> the author about its state. Without an MPI application like this, the
>> MPI library is nothing on its own that would divide and distribute one
>> task to several machines automatically.
> Well, there might be other ways to cause a program to run on multiple
> computers. Perhaps a virtual machine made up of multiple physical
> machines?
> 
>> osmosis itself seems to run in serial only (they don't say a word about
>> whether it uses any parallelism).
> Yes, it does run multiple threads, you just start another task (and add a
> buffer). I tested this on my machine, I think it is --read-xml
> --write-xml and --read-xml-change that start new threads. The question is
> whether or not Java is natively MPI-aware, or whether the app needs special
> coding?
> 
>> For the intended task the only option is to use a single machine with
>> as many cores as possible AFAICS.
> I thought about that, and it is doable with respect to memory and disk
> constraints, but the problem is that it would take a *long* time, esp. with the
> amount of updates I must do, hence my inquiry.
> 
> Thanks,
> David

