Re: [OMPI users] random IB failures when running medium core counts

2010-08-30 Thread Joshua Bernstein

Hello Brock,

While it doesn't solve the underlying problem, have you tried increasing the btl 
timeouts as the message suggests? With 1884 cores in use, perhaps there 
is some oversubscription in the fabric?
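For what it's worth, the two parameters named in the error text can be set on the 
mpirun command line (note that the retry count already defaults to its maximum of 7). 
A minimal sketch; the binary name and values below are only placeholders to tune for 
your fabric:

  mpirun --mca btl_openib_ib_timeout 20 --mca btl_openib_ib_retry_count 7 -np 1884 ./xhpl

Using the formula quoted in the error text, the default btl_openib_ib_timeout of 10 
works out to about 4.096 us * 2^10, roughly 4.2 ms per retry; a value of 20 stretches 
that to roughly 4.3 s.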


-Joshua Bernstein
Penguin Computing


Brock Palen wrote:
We recently installed a modest IB network on our cluster. When running a 1884-core HPL job over IB, we will get an error about IB after a run; it does not always happen in the same place, some iterations will pass and others will fail. The error is below. We are using openmpi/1.4.2 with the Intel 11 compilers.

Note that 1000-core jobs and other sizes also work well, but this larger one 
does not.  Thanks!

[[62713,1],1867][btl_openib_component.c:3224:handle_wc] from 
nyx5011.engin.umich.edu to: nyx5120 error polling LP CQ with status RETRY 
EXCEEDED ERROR status number 12 for wr_id 413569408 opcode 2  vendor error 
129 qp_idx 0
--
The InfiniBand retry count between two MPI processes has been
exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):

The total number of times that the sender wishes the receiver to
retry timeout, packet sequence, etc. errors before posting a
completion error.

This error typically means that there is something awry within the
InfiniBand fabric itself.  You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.  


Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:

* btl_openib_ib_retry_count - The number of times the sender will
  attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
  to 10).  The actual timeout value used is calculated as:

 4.096 microseconds * (2^btl_openib_ib_timeout)

  See the InfiniBand spec 1.2 (section 12.7.34) for more details.

Below is some information about the host that raised the error and the
peer to which it was connected:

  Local host:   nyx5011.engin.umich.edu
  Local device: mlx4_0
  Peer host:nyx5120

You may need to consult with your system administrator to get this
problem fixed.
--
--
mpirun has exited due to process rank 1867 with PID 3474 on
node nyx5011 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--
[nyx5049.engin.umich.edu:07901] [[62713,0],32] ORTED_CMD_PROCESSOR: STUCK IN 
INFINITE LOOP - ABORTING
[nyx5049:07901] *** Process received signal ***
[nyx5049:07901] Signal: Aborted (6)
[nyx5049:07901] Signal code:  (-6)
[nyx5049:07901] [ 0] /lib64/libpthread.so.0 [0x2b5dcbc70b10]
[nyx5049:07901] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x2b5dcbeae265]
[nyx5049:07901] [ 2] /lib64/libc.so.6(abort+0x110) [0x2b5dcbeafd10]
[nyx5049:07901] [ 3] 
/home/software/rhel5/openmpi-1.4.2/intel-11.0/lib/libopen-rte.so.0(orte_daemon_cmd_processor+0x216)
 [0x2b5dcacdb7e6]
[nyx5049:07901] [ 4] 
/home/software/rhel5/openmpi-1.4.2/intel-11.0/lib/libopen-pal.so.0(opal_event_loop+0x2ca)
 [0x2b5dcaf3a9aa]
[nyx5049:07901] [ 5] 
/home/software/rhel5/openmpi-1.4.2/intel-11.0/lib/libopen-pal.so.0(opal_progress+0x5e)
 [0x2b5dcaf2d26e]
[nyx5049:07901] [ 6] 
/home/software/rhel5/openmpi-1.4.2/intel-11.0/lib/openmpi/mca_rml_oob.so 
[0x2b5dcce37e5c]
[nyx5049:07901] [ 7] 
/home/software/rhel5/openmpi-1.4.2/intel-11.0/lib/libopen-rte.so.0(orte_daemon_cmd_processor+0x3ae)
 [0x2b5dcacdb97e]
[nyx5049:07901] [ 8] 
/home/software/rhel5/openmpi-1.4.2/intel-11.0/lib/libopen-pal.so.0(opal_event_loop+0x2ca)
 [0x2b5dcaf3a9aa]
[nyx5049:07901] [ 9] 
/home/software/rhel5/openmpi-1.4.2/intel-11.0/lib/libopen-pal.so.0(opal_progress+0x5e)
 [0x2b5dcaf2d26e]
[nyx5049:07901] [10] 
/home/software/rhel5/openmpi-1.4.2/intel-11.0/lib/openmpi/mca_rml_oob.so 
[0x2b5dcce37e5c]
[nyx5049:07901] [11] 
/home/software/rhel5/openmpi-1.4.2/intel-11.0/lib/libopen-rte.so.0(orte_daemon_cmd_processor+0x3ae)
 [0x2b5dcacdb97e]
[nyx5049:07901] [12] 
/home/software/rhel5/openmpi-1.4.2/intel-11.0/lib/libopen-pal.so.0(opal_event_loop+0x2ca)
 [0x2b5dcaf3a9aa]
[nyx5049:07901] [13] 
/home/software/rhel5/openmpi-1.4.2/intel-11.0/lib/libopen-pal.so.0(opal_event_dispatch+0x8)
 [0x2b5dcaf3a6d8]
[nyx5049:07901] [14] 
/home/software/rhel5/openmpi-1.4.2/intel-11.0/lib/libopen-rte.so.0(orte_daemon+0xaaf)
 [0x2b5dcacdb15f]
[nyx5049:07901] [15] orted [0x401ad6]
[nyx5049:07901] [16] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b5dcbe9b994]
[nyx5049:07901] [17] orted [0x401999]
[nyx5049:07901] *** End of error message ***



Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing

[OMPI users] Deprecated parameter: plm_rsh_agent

2010-11-05 Thread Joshua Bernstein

Hello All,

When building the examples included with OpenMPI version 1.5 I see a 
message printed as follows:


--
A deprecated MCA parameter value was specified in an MCA parameter
file.  Deprecated MCA parameters should be avoided; they may disappear
in future releases.

  Deprecated parameter: plm_rsh_agent
--

While I know that in pre-1.3.x releases the variable was pls_rsh_agent, 
plm_rsh_agent worked all the way through at least 1.4.3. What is the new 
keyword name? I can't seem to find it in the FAQ located here:


http://www.open-mpi.org/faq/?category=rsh

-Josh


Re: [OMPI users] Deprecated parameter: plm_rsh_agent

2010-11-05 Thread Joshua Bernstein

Thanks Samuel,

I should have checked ompi_info myself.

The FAQ on the website should probably be updated to reflect this 
parameter name change.
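For anyone finding this thread later, a minimal sketch of switching to the new name 
(assuming it keeps the same semantics as plm_rsh_agent, which the ompi_info output 
quoted below suggests):

  mpirun --mca orte_rsh_agent ssh -np 4 ./a.out

or, persistently, in an MCA parameter file such as $HOME/.openmpi/mca-params.conf:

  orte_rsh_agent = ssh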


-Joshua Bernstein
Software Development Manager
Penguin Computing

Samuel K. Gutierrez wrote:

Hi Josh,

I -think- the new name is orte_rsh_agent.  At least according to ompi_info.

$ ompi_info -a --parsable | grep orte_rsh_agent
mca:orte:base:param:orte_rsh_agent:value:ssh : rsh
mca:orte:base:param:orte_rsh_agent:data_source:default value
mca:orte:base:param:orte_rsh_agent:status:writable
mca:orte:base:param:orte_rsh_agent:help:The command used to launch 
executables on remote nodes (typically either "ssh" or "rsh")

mca:orte:base:param:orte_rsh_agent:deprecated:no
mca:orte:base:param:orte_rsh_agent:synonym:name:pls_rsh_agent
mca:orte:base:param:orte_rsh_agent:synonym:name:plm_rsh_agent
mca:plm:base:param:plm_rsh_agent:synonym_of:name:orte_rsh_agent

--
Samuel K. Gutierrez
Los Alamos National Laboratory


On Nov 5, 2010, at 12:41 PM, Joshua Bernstein wrote:


Hello All,

When building the examples included with OpenMPI version 1.5 I see a 
message printed as follows:


-- 


A deprecated MCA parameter value was specified in an MCA parameter
file.  Deprecated MCA parameters should be avoided; they may disappear
in future releases.

 Deprecated parameter: plm_rsh_agent
-- 



While I know that in pre 1.3.x releases the variable was 
pls_rsh_agent, plm_rsh_agent worked all the way through at least 
1.4.3. What is the new keyword name? I can't seem to find it in the 
FAQ located here:


http://www.open-mpi.org/faq/?category=rsh

-Josh


[OMPI users] Displaying Selected MCA Modules

2008-06-19 Thread Joshua Bernstein

Hi There,

	I'm attempting to debug some configuration issues with a recent 
version of OMPI, version 1.2.6. I'm able to build all of the MCA 
modules, and I've figured out how to display the list of AVAILABLE 
modules using ompi_info, but is there a way to display the list of 
modules that was selected at runtime? I've tried the -v option to 
mpirun, and read through the FAQs, but I can't seem to figure out how to 
have OMPI display the selected MCAs when a job starts. Any help or 
guidance would be appreciated.


-Josh


Re: [OMPI users] Displaying Selected MCA Modules

2008-06-19 Thread Joshua Bernstein

Well to answer my own question,

	If I use the -display-map option, a nice bit of information is printed out 
that includes a list of the modules in use during the run, as 
shown below:


---SNIP---
Argv[0]: ./cpi
Env[0]: OMPI_MCA_pls=proxy
Env[1]: OMPI_MCA_rmaps_base_display_map=1
Env[2]: 
OMPI_MCA_orte_precondition_transports=ad81e32181314110-4aea4dd5040c2593

Env[3]: OMPI_MCA_rds=proxy
Env[4]: OMPI_MCA_ras=proxy
Env[5]: OMPI_MCA_rmaps=proxy
Env[6]: OMPI_MCA_rmgr=proxy
Working dir: /home/ats (user: 0)
---END SNIP--
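
For completeness, a sketch of the invocation that produced this (the same map can 
also be requested through the MCA parameter visible in the environment above):

  mpirun -display-map -np 1 ./cpi
  mpirun --mca rmaps_base_display_map 1 -np 1 ./cpi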

-Josh

Joshua Bernstein wrote:

Hi There,

I'm attempting to debug some configuration issue with the recent 
version of OMPI, version 1.2.6. I'm able to build all of the MCA 
modules, and I've figured out how to display the list of AVAILABLE 
modules using ompi_info, but is there a way to display the list of 
modules that was selected at runtime? I've tried the -v option to 
mpirun, and read through the FAQs, but I can't seem to figure out how to 
 have OMPI display the selected MCAs when a job starts. Any help or 
guidance would be appreciated.


-Josh


Re: [OMPI users] Displaying Selected MCA Modules

2008-06-20 Thread Joshua Bernstein

Thanks for the response Jeff,

Jeff Squyres wrote:

Greetings Josh.

No, we don't have an easy way to show which plugins were loaded and 
may/will be used during the run.  The modules you found below in 
--display-map are only a few of the plugins (all dealing with the 
run-time environment, and only used on the back-end nodes, so it may not 
be what you're looking for -- e.g., it doesn't show the plugins used by 
mpirun).

What do you need to know?


Well, basically I want to know what MTA's are being used to start up a 
job. I'm confused as to what the difference is between "used by mpirun" 
versus used on the back-end nodes. Doesn't --display-map show which MTA 
modules will be used to start the backend processes?


The overarching issue is that I'm attempting to just begin testing my 
build and when I attempt to startup a job, it just hangs:


[ats@nt147 ~]$ mpirun --mca pls rsh -np 1 ./cpi
[nt147.penguincomputing.com:04640] [0,0,0] ORTE_ERROR_LOG: Not available 
in file ras_bjs.c at line 247


The same thing happens if I just disable the bjs RAS MTA, since bjs 
really isn't used with Scyld anymore:


[ats@nt147 ~]$ mpirun --mca ras ^bjs --mca pls rsh -np 1 ./cpi


The interesting thing here is that orted starts up, but I'm not sure 
what is supposed to happen next:


[root@nt147 ~]# ps -auxwww | grep orte
Warning: bad syntax, perhaps a bogus '-'? See 
/usr/share/doc/procps-3.2.3/FAQ
ats   4647  0.0  0.0 48204 2136 ?Ss   12:45   0:00 orted 
--bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename 
nt147.penguincomputing.com --universe 
a...@nt147.penguincomputing.com:default-universe-4645 --nsreplica 
"0.0.0;tcp://192.168.5.211:59110;tcp://10.10.10.1:59110;tcp://10.11.10.1:59110" 
--gprreplica 
"0.0.0;tcp://192.168.5.211:59110;tcp://10.10.10.1:59110;tcp://10.11.10.1:59110" 
--set-sid


Finally, it should be noted that the upcoming release of Scyld will now 
include OpenMPI. That is how all of this got started.


-Joshua Bernstein
Software Engineer
Penguin Computing


Re: [OMPI users] Displaying Selected MCA Modules

2008-06-23 Thread Joshua Bernstein

Wow,

	Seems like I've fallen behind in replying. I'll try to make 
sure I answer everybody's questions about what I am trying to accomplish.


Jeff Squyres wrote:

On Jun 20, 2008, at 3:50 PM, Joshua Bernstein wrote:

No, we don't have an easy way to show which plugins were loaded and 
may/will be used during the run.  The modules you found below in 
--display-map are only a few of the plugins (all dealing with the 
run-time environment, and only used on the back-end nodes, so it may 
not be what you're looking for -- e.g., it doesn't show the plugins 
used by mpirun).

What do you need to know?


Well, basically I want to know what MTA's are being used to start up a job.


MTA?


Sorry, I should have said MCA

I'm confused as to what the difference is between "used by mpirun" 
versus used on the back-end nodes. Doesn't --display-map show which 
MTA modules will be used to start the backend processes?


Yes.  But OMPI's run-time design usually has mpirun load one plugin of a 
given type, and then have the MPI processes load another plugin of the 
same type.  For example, for I/O forwarding - mpirun will load the "svc" 
plugin, while MPI processes will load the "proxy" plugin.  In this case, 
mpirun is actually providing all the smarts for I/O forwarding, and all 
the MPI processes simply proxy requests up to mpirun.  This is a common 
model throughout our run-time support, for example.


Ah, okay. So then --display-map will show what modules the backend 
processes are using, not mpirun itself.


The overarching issue is that I'm attempting to just begin testing my 
build and when I attempt to startup a job, it just hangs:


[ats@nt147 ~]$ mpirun --mca pls rsh -np 1 ./cpi
[nt147.penguincomputing.com:04640] [0,0,0] ORTE_ERROR_LOG: Not 
available in file ras_bjs.c at line 247


The same thing happens if I just disable the bjs RAS MTA, since bjs 
really isn't used with Scyld anymore:


[ats@nt147 ~]$ mpirun --mca ras ^bjs --mca pls rsh -np 1 ./cpi



I know very, very little about the bproc support in OMPI -- I know that 
it evolved over time and is disappearing in v1.3 due to lack of 
interest.  If you want it to stay, I think you've missed the v1.3 boat 
(we're in feature freeze for v1.3), but possibilities exist for future 
versions if you're willing to get involved in Open MPI.


Bummer! I would absolutely support (along with Penguin) further 
contributions to, and development of, BProc support.


Note, though, that Scyld BProc and LANL BProc forked long ago. We 
believe our BProc functionality has been developed beyond what was 
running at LANL (for example, we have support for threads...). I 
understand it is probably too late to add BProc in for 1.3, but perhaps 
for subsequent releases, combined with contributions from Penguin, BProc 
support could be resurrected in some capacity.


The interesting thing here is that orted starts up, but I'm not sure 
what is supposed to happen next:


[root@nt147 ~]# ps -auxwww | grep orte
Warning: bad syntax, perhaps a bogus '-'? See 
/usr/share/doc/procps-3.2.3/FAQ
ats   4647  0.0  0.0 48204 2136 ?Ss   12:45   0:00 orted 
--bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename 
nt147.penguincomputing.com --universe 
a...@nt147.penguincomputing.com:default-universe-4645 --nsreplica 
"0.0.0;tcp://192.168.5.211:59110;tcp://10.10.10.1:59110;tcp://10.11.10.1:59110" 
--gprreplica 
"0.0.0;tcp://192.168.5.211:59110;tcp://10.10.10.1:59110;tcp://10.11.10.1:59110" 
--set-sid


I'm not sure that just asking for the rsh pls is the Right thing to do 
-- I'll have to defer to Ralph on this one...

Can you successfully run non-MPI apps, like hostname?


Yes. Absolutely.

Finally, it should be noted that the upcoming release of Scyld will 
now include OpenMPI. That is how all of this got started.



Great!  It sounds like you need to get involved, though, to preserve 
bproc support going forward.  LANL was the only proponent of bproc-like 
support; they have been moving away from bproc-like clusters, however, 
and so support faded.  We made the decision to axe bproc support in v1.3 
because there was no one to maintain it.  :-(


This is what I'm in the process of doing right now. I'd like to be able 
to take the existing BProc functionality and modify it as needed to support 
our BProc. I have buy-in from the higher-ups around here, and I will 
proceed with the membership forms, likely at the "Contributor" level, 
considering we hope to be contributing code. Signing the third-party 
contribution agreement shouldn't be an issue.


-Joshua Bernstein
Software Engineer
Penguin Computing


Re: [OMPI users] Displaying Selected MCA Modules

2008-06-23 Thread Joshua Bernstein



Ralph Castain wrote:

Hi Joshua

Again, forwarded by the friendly elf - so include me directly in any reply.

I gather from Jeff that you are attempting to do something with bproc -
true? If so, I will echo what Jeff said: bproc support in OMPI is being
dropped with the 1.3 release due to lack of interest/support. Just a "heads
up".


Understood.


If you are operating in a bproc environment, then I'm not sure why you are
specifying that the system use the rsh launcher. Bproc requires some very
special handling which is only present in the bproc launcher. You can run
both MPI and non-MPI apps with it, but bproc is weird, and so OMPI has some
-very- different logic in it to make it all work.


Well, I'm trying to determine how broken, if at all, the bproc support 
is in OpenMPI. So, considering that it wasn't working out of the gate, I 
thought I'd try to disable the built-in BProc stuff and fall back to RSH.



I suspect the problem you are having is that all of the frameworks are
detecting bproc and trying to run accordingly. This means that the orted is
executing process startup procedures for bproc - which are totally different
than for any other environment (e.g., rsh). If mpirun is attempting to
execute an rsh launch, and the orted is expecting a bproc launch, then I can
guarantee that no processes will be launched and you will hang.


Exactly what I'm seeing now...


I'm not sure there is a way in 1.2 to tell the orteds to ignore the fact
that they see bproc and do something else. I can look, but would rather wait
to hear if that is truly what you are trying to do, and why.


I would really appreciate it if you wouldn't mind looking. From reading 
the documentation I didn't realize that mpirun and the orted were doing 
two different things. I thought the --mca parameter applied to both.


-Joshua Bernstein
Software Engineer
Penguin Computing


Re: [OMPI users] Displaying Selected MCA Modules

2008-06-24 Thread Joshua Bernstein



Jeff Squyres wrote:

On Jun 23, 2008, at 2:52 PM, Joshua Bernstein wrote:

Excellent.  I'll let Ralph chime in with the relevant technical 
details.  AFAIK, bproc works just fine in the v1.2 series (they use it 
at LANL every day).  But note that we changed a *LOT* in ORTE between 
v1.2 and v1.3; the BPROC support will likely need to be re-written.  
This is likely worth some phone calls to describe what will be needed.


Excellent! I'll be sure to initiate this when the time comes.

-Joshua Bernstein
Software Engineer
Penguin Computing


Re: [OMPI users] Displaying Selected MCA Modules

2008-06-24 Thread Joshua Bernstein

Ralph,

I really appreciate all of your help and guidance on this.

Ralph H Castain wrote:

Of more interest would be understanding why your build isn't working in
bproc. Could you send me the error you are getting? I'm betting that the
problem lies in determining the node allocation as that is the usual place
we hit problems - not much is "standard" about how allocations are
communicated in the bproc world, though we did try to support a few of the
more common methods.


Alright, I've been playing around a bit more, and I think I'm 
understanding what is going on. It seems that, for whatever reason, 
the ORTE daemon is failing to launch on a remote node, and I'm left with:


[ats@goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 1 ./cpi
[goldstar.penguincomputing.com:04207] [0,0,0] ORTE_ERROR_LOG: Not 
available in file ras_bjs.c at line 247

--
A daemon (pid 4208) launched by the bproc PLS component on node 0 died
unexpectedly so we are aborting.

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
[goldstar.penguincomputing.com:04207] [0,0,0] ORTE_ERROR_LOG: Error in 
file pls_bproc.c at line 717
[goldstar.penguincomputing.com:04207] [0,0,0] ORTE_ERROR_LOG: Error in 
file pls_bproc.c at line 1164
[goldstar.penguincomputing.com:04207] [0,0,0] ORTE_ERROR_LOG: Error in 
file rmgr_urm.c at line 462

[goldstar.penguincomputing.com:04207] mpirun: spawn failed with errno=-1

So I take the advice suggested in the note and double-check to make 
sure our library caching is working. It nicely picks up the libraries, 
but once they are staged on the compute nodes, mpirun just dies:


[ats@goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 1 ./cpi
[goldstar.penguincomputing.com:09335] [0,0,0] ORTE_ERROR_LOG: Not 
available in file ras_bjs.c at line 247

[ats@goldstar mpi]$

I thought maybe it was actually working but I/O forwarding wasn't set up 
properly, though checking the exit code shows that it in fact crashed:


[ats@goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 1 ./cpi
[ats@goldstar mpi]$ echo $?
1

Any ideas here?

If I use the NODES envar, I can run a job on the head node though:

[ats@goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 1 ./cpi
Process 0 on goldstar.penguincomputing.com
pi is approximately 3.1416009869231254, Error is 0.0823
wall clock time = 0.97

What is also interesting, and you suspected correctly, is that only the NODES 
envar is being honored; things like BEOWULF_JOB_MAP are not. This is 
probably correct, as I imagine the BEOWULF_JOB_MAP envar 
is Scyld-specific and likely not implemented. This isn't a big issue 
though; it's something I'll likely add later on.


-Joshua Bernstein
Software Engineer
Penguin Computing




Re: [OMPI users] Displaying Selected MCA Modules

2008-06-24 Thread Joshua Bernstein



Ralph Castain wrote:

Hmmm... well, the problem is as I suspected. The system doesn't see any
allocation of nodes to your job, and so it aborts with a crummy error
message that doesn't really tell you the problem. We are working on
improving them.

How are you allocating nodes to the job? Does this BEOWULF_JOB_MAP contain
info on the nodes that are to be used?


BEOWULF_JOB_MAP is an array of integers, separated by colons, that 
contains node-mapping information. The easiest way to explain it is just 
by example:


BEOWULF_JOB_MAP=0:0

This is a two process job, with each process running on node 0.

BEOWULF_JOB_MAP=0:1:1

A three process job with the first process on node 0, and the next two 
on node 1.


All said, this is of little consequence right now, and we/I can worry 
about adding support for this later.



One of the biggest headaches with bproc is that there is no adhered-to
standard for describing the node allocation. What we implemented will
support LSF+Bproc (since that is what was being used here) and BJS. It
sounds like you are using something different - true?


Understood. We aren't using BJS, and have long deprecated it in favor 
of bundling TORQUE with Scyld instead, though legacy functionality for 
envars like NP, NO_LOCAL, and BEOWULF_JOB_MAP is present in 
the MPICH extensions we've put together.



If so, we can work around it by just mapping enviro variables to what the
system is seeking. Or, IIRC, we could use the hostfile option (have to check
on that one).


Exactly. But for now, if I make sure the NODES envar is set up correctly, 
make sure OpenMPI is NFS-mounted, and actually copy 
out the MCA libraries (libcache doesn't seem to work), I actually end up 
with something running!


[ats@goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 2 ./cpi
Process 0 on n0
pi is approximately 3.1416009869231241, Error is 0.0809
wall clock time = 0.005377
Process 1 on n0
Hangup

It seems the -H option and using a hostfile with BProc aren't honored, 
correct? So the only thing I can use to derive the host mapping 
with BProc support is the BJS RAS MCA (via the NODES envar?)


-Josh




Re: [OMPI users] Setting up Open MPI to run on multiple servers

2008-08-11 Thread Joshua Bernstein



Rayne wrote:

Hi all,
I am trying to set up Open MPI to run on multiple servers, but as I
have very little experience in networking, I'm getting confused by the
info on open-mpi.org, with the .rhosts, rsh, ssh etc.

Basically what I have now is a PC with Open MPI installed. I want to
connect it to, say, 10 servers, so I can run MPI programs on all 11
nodes. From what I've read, I think I need to install Open MPI on the
10 servers too, and there must be a shared directory where I keep all
the MPI programs I've written, so all nodes can access them.

Then I need to create a machine file on my local PC (I found a default
hostfile "openmpi-default-hostfile" in {prefix}/etc/. Can I use that
instead so I need not have "-machinefile machine" with every mpiexec?)
with the list of the 10 servers. I'm assuming I need to put down the
IP addresses of the 10 servers in this file. I've also read that the
10 servers also need to each have a .rhosts file that tells them the
machine (i.e. my local PC) and user from which the programs may be
launched from. Is this right?

There is also the rsh/ssh configuration, which I find the most
confusing. How do I know whether I'm using rsh or ssh? Is following
the instructions on http://www.open-mpi.org/faq/?category=rsh under
"3: How can I make ssh not ask me for a password?" sufficient? Does
this mean that when I'm using the 10 servers to run the MPI program,
I'm login to them via ssh? Is this necessary in every case?

Is doing all of the above all it takes to run MPI programs on all 11
nodes, or is there something else I missed?


More or less. The first step is to set up password-less SSH 
between all 11 machines. I'd completely skip the use of RSH, as it's very 
insecure and shouldn't be used in a non-dedicated cluster, and even 
then... You should basically set up SSH so a user can SSH from one node 
to another without specifying a password or entering any other information.
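
A minimal sketch of that setup, assuming OpenSSH and that the account name is the 
same everywhere (repeat the copy step for each of the 10 servers; hostnames below 
are only examples):

  ssh-keygen -t rsa            # accept the defaults, empty passphrase
  ssh-copy-id user@server01    # appends your public key to the server's authorized_keys
  ssh user@server01 hostname   # should return the remote hostname without prompting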


Then, the next step is to set up NFS. NFS provides you with a way to share a 
directory on one computer with many other computers, avoiding the hassle 
of having to copy all your MPI programs to all of the nodes. This is 
generally as easy as configuring /etc/exports and then mounting 
the directory on the other computers. Be sure you mount the directory 
in the same place on every node, though.
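
A rough sketch of the NFS side, assuming the PC exports /home/you/mpi and the 
servers sit on 192.168.1.0/24 (adjust paths and addresses to your network):

  # on the PC, add to /etc/exports and re-export:
  /home/you/mpi  192.168.1.0/255.255.255.0(rw,sync)
  exportfs -ra

  # on each server:
  mkdir -p /home/you/mpi
  mount pc-hostname:/home/you/mpi /home/you/mpi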


Lastly, give your MPI programs a shot. You don't strictly need a 
hostfile, because you can specify the hostnames (or IPs) on the mpirun 
command line, but in your case it's likely a good idea; a sketch follows.
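
A sketch of the hostfile route (names and slot counts are only examples):

  # contents of a file called "machines":
  server01 slots=2
  server02 slots=2

  mpirun -np 4 -machinefile machines ./my_mpi_prog

Entries added to {prefix}/etc/openmpi-default-hostfile use the same format, which 
saves passing -machinefile on every run.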


Hope that gets you started...

-Joshua Bernstein
Software Engineer
Penguin Computing


Re: [OMPI users] Question on open-mpi not working over wireless between Ubuntu and Mac OS-X

2009-09-18 Thread Joshua Bernstein

Hello Pallab,

	Is there a chance it's something simple, like having the Mac's  
firewall turned on? On my 10.4 system this is in System Preferences > 
Sharing, and then the Firewall tab.


-Joshua Bernstein
Senior Software Engineer
Penguin Computing

On Sep 18, 2009, at 3:56 PM, Pallab Datta wrote:


Hello,

I am running open-mpi between a Mac OS X (v10.5) box and an Ubuntu Server 9.04 
(Linux) box. I have configured OMPI v1.3.3 on both of them with the 
--enable-heterogeneous --disable-shared --enable-static options. The Linux 
box is connected via a wireless USB adapter to the same sub-network in 
which the Macintosh is sitting.

When I tried to run mpirun with the following options between the Linux 
box with the wireless card and another Linux machine on the network, 
everything works fine. I ran:

/usr/local/bin/mpirun --mca OMPI_mca_mpi_preconnect_all 1 -np 2 -H localhost,10.11.14.205 ./app

and it works.

When I tried to run mpirun with the -hetero option from the Macintosh, it 
invokes the processes on both ends and then hangs at the MPI_Send/MPI_Recv 
functions. I ran:

/usr/local/bin/mpirun --mca btl_base_verbose 30 --mca OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H localhost,10.11.14.205 ./app

and it hangs. I saw that the Linux box is trying to connect() to the Mac 
using port 4/260, so I purposely forced MPI to look for higher-numbered 
ports.

I ran:

/usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H localhost,10.11.14.205 ./app

and it still hangs, giving the following message:
btl: tcp: attempting to connect() to address 10.11.14.203 on port 9360

10.11.14.203 == localhost.

Can anybody explain what I am missing and how I can make the Macintosh and 
Linux boxes talk to each other over wireless?
regards,
pallab








Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless

2009-09-24 Thread Joshua Bernstein

Hmm,

	On another angle, could this be a name resolution issue? Perhaps apex-backpack 
isn't able to resolve fuji.local and vice versa. Can you ping between the two of 
them using their hostnames rather than their IPs?


-Joshua Bernstein
Senior Software Engineer
Penguin Computing

Pallab Datta wrote:

Yes, it came up when I turned verbose mode on, i.e., in the debug output.
Yes, I knew it's privileged; that's why I explicitly asked it to connect to
a higher port, but it still blocks there. :(


On Sep 24, 2009, at 12:54 PM, Pallab Datta wrote:


Yes, I had tried that. Initially it (apex-backpack) was trying to
connect to the Mac (10.11.14.203) at port number 4, which is too low. So that's
why I made the port range higher.

Port 4?  OMPI should never connect at port 4; it's privileged.  Was
that in the debug output?

--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] Problem building OpenMPI with PGI compilers

2009-12-10 Thread Joshua Bernstein



Jeff Squyres wrote:

On Dec 9, 2009, at 4:36 PM, Jeff Squyres wrote:
Given that we haven't moved this patch to the v1.4 branch yet (i.e., it's not
yet in a nightly v1.4 tarball), probably the easiest thing to do is to apply
the attached patch to a v1.4 tarball.  I tried it with my PGI 10.0 install
and it seems to work.  So -- forget everything about autogen.sh and just
apply the attached patch.


Is there a reason why it hasn't moved into 1.4 yet or wasn't included with the 
1.4 release?


Can I toss my two cents in here and request that it be made available in a mainline 
release, or at least in a snapshot, sooner rather than later? I'd like to get it 
included in our build in time for our next release.


-Joshua Bernstein
Senior Software Engineer
Penguin Computing


Re: [OMPI users] Problem building OpenMPI with PGI compilers

2009-12-28 Thread Joshua Bernstein



Jeff Squyres wrote:

Sorry -- I neglected to update the list yesterday: I got the RM approval and
committed the fix to the v1.4 branch.  So the PGI fix should be in last
night's 1.4 snapshot.

Could someone out in the wild give it a whirl and let me know if it works for
you?  (it works for *me*)


Jeff, The Dec 17th Snapshot, posted here:

http://www.open-mpi.org/nightly/v1.4/openmpi-1.4a1r22335.tar.gz

Builds nicely with PGI v10.0 on both Red Hat 5u4 and 4u8. Is there a plan to roll 
this up into a 1.4.1 release? I'd prefer not to have to ship a snapshot version.


-Joshua Bernstein
Senior Software Engineer
Penguin Computing


Re: [OMPI users] Problem building OpenMPI with PGI compilers

2009-12-28 Thread Joshua Bernstein



Ralph Castain wrote:

You definitely shouldn't ship that one - it may build, but it doesn't work. We 
are looking at a bug in that code branch prior to releasing.


I have no plans to ship that one. Any idea when we'll see a 1.4.1 release?

-Josh


Re: [OMPI users] Problem building OpenMPI with PGI compilers

2009-12-28 Thread Joshua Bernstein

Thanks Ralph,

I'll keep an eye out...

-Josh

Ralph Castain wrote:

Sometime in January would be my best guess...but I am not in charge of it, so don't 
take that as any kind of commitment.


On Dec 28, 2009, at 5:59 PM, Joshua Bernstein wrote:



Ralph Castain wrote:

You definitely shouldn't ship that one - it may build, but it doesn't work. We 
are looking at a bug in that code branch prior to releasing.

I have no plans to ship that one. Any idea when we'll see a 1.4.1 release?

-Josh


Re: [OMPI users] Problem compiling 1.4.0 snap with PGI 10.0-1 and openib flags turned on ...

2009-12-29 Thread Joshua Bernstein

Hi Richard,

	I've built our OpenMPI with PGI 10.0 and included OpenIB support. I've verified 
it works. You'll notice we build with UDAPL and OpenIB, but generally only 
OpenIB is used.


Our complete configure flags (this OpenMPI is included with Scyld ClusterWare) are 
shown here:


./configure --prefix=%{_prefix} \
--bindir="%{_prefix}/${compiler}/bin" \
--datarootdir="%{_prefix}/${compiler}/share" \
--mandir="%{_prefix}/man" \
--sysconfdir="%{_sysconfdir}" \
--libdir="%{_libdir}/${compiler}" \
--includedir="%{_includedir}" \
--with-mx=/opt/open-mx \
--with-udapl \
--without-bproc \
--with-tm \
--with-openib \
--disable-dlopen \
${EXTRA_CONFIG_OPTIONS} \
--without-xgrid --without-slurm --without-loadleveler --without-gm 
--without-lsf \


-Joshua Bernstein
Senior Software Engineer
Penguin Computing

Richard Walsh wrote:

All,

Not overwhelmed with responses here ... ;-) ...  No one using PGI 10.0 yet?
We need it to make use of the GPU compiler directives they are supporting.
Can someone perhaps comment on whether this is the correct way to configure
for an IB system?  Everything works with Intel and/or if I compile without the
IB flags.

Sent the same report to PGI, but seems like the support team there is on
break for the Holidays.

Someone else must have seen this as well ... No ... ??

rbw

   Richard Walsh
   Parallel Applications and Systems Manager
   CUNY HPC Center, Staten Island, NY
   718-982-3319
   612-382-4620

   Mighty the Wizard
   Who found me at sunrise
   Sleeping, and woke me
   And learn'd me Magic!

From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] On Behalf Of 
Richard Walsh [richard.wa...@csi.cuny.edu]
Sent: Saturday, December 19, 2009 12:18 PM
To: us...@open-mpi.org
Subject: [OMPI users] Problem compiling 1.4.0 snap with PGI 10.0-1 and openib 
flags turned on ...

All,

Succeeded in overcoming the 'libtool' failure with PGI using
the patched snap (thanks Jeff), but now I am running
into a downstream problem compiling for our IB clusters.
I am using the latest PGI compiler (10.0-1) and the 12-14-09
snap of OpenMPI of version 1.4.0.

My configure line looks like this:

$ ./configure CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90 --enable-openib-ibcm 
--with-openib \
--prefix=/share/apps/openmpi-pgi/1.4.0 --with-tm=/share/apps/pbs/10.1.0.91350

The error I get during the make at about line 8078 is:

libtool: compile:  pgcc -DHAVE_CONFIG_H -I. -I../../../../opal/include 
-I../../../../orte/include -I../../../../ompi/include 
-I../../../../opal/mca/paffinity/linux/plpa/src/libplpa -I../../../.. 
-D_REENTRANT -O -DNDEBUG -c connect/btl_openib_connect_xoob.c  -fpic -DPIC -o 
connect/.libs/btl_openib_connect_xoob.o
source='connect/btl_openib_connect_ibcm.c' 
object='connect/btl_openib_connect_ibcm.lo' libtool=yes \
DEPDIR=.deps depmode=none /bin/sh ../../../../config/depcomp \
/bin/sh ../../../../libtool --tag=CC   --mode=compile pgcc -DHAVE_CONFIG_H -I. 
-I../../../../opal/include -I../../../../orte/include 
-I../../../../ompi/include 
-I../../../../opal/mca/paffinity/linux/plpa/src/libplpa   -I../../../..  
-D_REENTRANT  -O -DNDEBUG   -c -o connect/btl_openib_connect_ibcm.lo 
connect/btl_openib_connect_ibcm.c
libtool: compile:  pgcc -DHAVE_CONFIG_H -I. -I../../../../opal/include 
-I../../../../orte/include -I../../../../ompi/include 
-I../../../../opal/mca/paffinity/linux/plpa/src/libplpa -I../../../.. 
-D_REENTRANT -O -DNDEBUG -c connect/btl_openib_connect_ibcm.c  -fpic -DPIC -o 
connect/.libs/btl_openib_connect_ibcm.o
PGC-S-0040-Illegal use of symbol, __le64 
(/usr/include/linux/byteorder/little_endian.h: 43)
PGC-W-0156-Type not specified, 'int' assumed 
(/usr/include/linux/byteorder/little_endian.h: 43)
PGC-S-0039-Use of undeclared variable __le64 
(/usr/include/linux/byteorder/little_endian.h: 45)
PGC-S-0104-Non-numeric operand for multiplicative operator 
(/usr/include/linux/byteorder/little_endian.h: 45)
PGC-S-0040-Illegal use of symbol, __le64 
(/usr/include/linux/byteorder/little_endian.h: 47)
PGC-S-0040-Illegal use of symbol, __be64 
(/usr/include/linux/byteorder/little_endian.h: 67)
PGC-W-0156-Type not specified, 'int' assumed 
(/usr/include/linux/byteorder/little_endian.h: 67)
PGC-S-0040-Illegal use of symbol, __be64 
(/usr/include/linux/byteorder/little_endian.h: 69)
PGC-W-0156-Type not specified, 'int' assumed 
(/usr/include/linux/byteorder/little_endian.h: 69)
PGC-S-0040-Illegal use of symbol, __be64 (/usr/include/linux/byteorder/little_endian.h: 71)
PGC-W-0156-Type not specified, 'int' assumed 
(/usr/include/linux/byteorder/little_endian.h: 71)
PGC/x86-64 Linux 10.0-1: compilation completed with severe errors
make[2]: *** [connect/btl_openib_connect_ibcm.lo] Error 1

Re: [OMPI users] Seg fault with PBS Pro 10.2

2010-02-15 Thread Joshua Bernstein

Well,

	We all wish the Altair guys would at least try to maintain backwards  
compatibility with the community, but they have a big habit of  
breaking things. This isn't the first time they've broken a more  
customer-facing function like tm_spawn. (They also like breaking  
pbs_statjob too!)


	I have access to PBS Pro and I can raise the issue with Altair if it  
would help. Just let me know how I can be helpful.


-Joshua Bernstein
Senior Software Engineer
Penguin Computing

On Feb 15, 2010, at 8:23 AM, Jeff Squyres wrote:


Bummer!

If it helps, could you put us in touch with the PBS Pro people?  We  
usually only have access to Torque when developing the TM-launching  
stuff (PBS Pro and Torque supposedly share the same TM interface,  
but we don't have access to PBS Pro, so we don't know if it has  
diverged over time).



On Feb 15, 2010, at 8:13 AM, Repsher, Stephen J wrote:


Ralph,

This is my first build of OpenMPI so I haven't had this working  
before.  I'm pretty confident that PATH and LD_LIBRARY_PATH issues  
are not the cause, otherwise launches outside of PBS would fail  
too.  Also, I tried compiling everything statically with the same  
result.


Some additional info...  (1) I did a diff on tm.h for PBS 10.2 and  
from version 8.0 that we had - they are identical, and (2) I've  
tried this with both the Intel 11.1 and GCC compilers and gotten  
the exact same run-time errors.


For now, I've got a work-around setup that launches over ssh and  
still attaches the processes to PBS.


Thanks for your help.

Steve


From: users-boun...@open-mpi.org [mailto:users-bounces@open-mpi.org] On Behalf Of Ralph Castain

Sent: Friday, February 12, 2010 8:29 PM
To: Open MPI Users
Subject: Re: [OMPI users] Seg fault with PBS Pro 10.2

Afraid compilers don't help when the param is a void*...

It looks like this is consistent, but I've never tried it under  
that particular environment. Did prior versions of OMPI work, or  
are you trying this for the first time?


One thing you might check is that you have the correct PATH and  
LD_LIBRARY_PATH set to point to this version of OMPI and the  
corresponding PBS Pro libs you used to build it. Most Linux distros  
come with OMPI installed, and that can cause surprises.


We run under Torque at major installations every day, so it -should- 
work...unless PBS Pro has done something unusual.



On Feb 12, 2010, at 1:41 PM, Repsher, Stephen J wrote:

Yes, the failure seems to be in mpirun, it never even gets to my  
application.


The proto for tm_init looks like this:
int tm_init(void *info, struct tm_roots *roots);

where the struct has 6 elements: 2 x tm_task_id + 3 x int + 1 x  
tm_task_id


If the API was different, wouldn't the compiler most likely  
generate an error at compile-time?


Thanks!

Steve


From: users-boun...@open-mpi.org [mailto:users-bounces@open-mpi.org] On Behalf Of Ralph Castain

Sent: Friday, February 12, 2010 3:21 PM
To: Open MPI Users
Subject: Re: [OMPI users] Seg fault with PBS Pro 10.2

I'm a tad confused - this trace would appear to indicate that  
mpirun is failing, yes? Not your application?


The reason it works for local procs is that tm_init isn't called  
for that case - mpirun just fork/exec's the procs directly. When  
remote nodes are required, mpirun must connect to Torque. This is  
done with a call to:


   ret = tm_init(NULL, &tm_root);

My guess is that something changed in that API in PBS Pro 10.2.  
Can you check the tm header file and see? I have no access to  
PBS anymore, so I'll have to rely on your eyes to see a diff.


Thanks
Ralph

On Feb 12, 2010, at 8:50 AM, Repsher, Stephen J wrote:


Hello,

I'm having problems running Open MPI jobs under PBS Pro 10.2.  
I've configured and built OpenMPI 1.4.1 with the Intel 11.1  
compiler on Linux with --with-tm support, and the build runs  
fine.  I've also built with static libraries per the FAQ  
suggestion, since libpbs is static.  However, my test application  
keeps failing with a segmentation fault, but ONLY when trying to  
select more than 1 node.  Running on a single node within PBS  
works fine.  Also, running outside of PBS via ssh runs fine as  
well, even across multiple nodes.  OpenIB support is also  
enabled, but that doesn't seem to affect the error, because I've  
also tried running with the --mca btl tcp,self flag and it still  
doesn't work.  Here is the error I'm getting:


[n34:26892] *** Process received signal ***
[n34:26892] Signal: Segmentation fault (11)
[n34:26892] Signal code: Address not mapped (1)
[n34:26892] Failing at address: 0x3f
[n34:26892] [ 0] /lib64/libpthread.so.0 [0x7fc0309d6a90]
[n34:26892] [ 1] /part0/apps/MPI/intel/openmpi-1.4.1/bin/ 
pbs_mpirun(discui_+0x84) [0x476a50]
[n34:26892] [ 2] /part0/apps/MPI/intel/openmpi-1.4.1/bin/ 
pbs_mpirun(diswsi+0xc3) [0x474063]
[n34:26892] [ 3] /part0

Re: [OMPI users] Open MPI performance on Amazon Cloud

2010-03-19 Thread Joshua Bernstein

Hi Hammad,

	Before we launched the Penguin Computing On-Demand service we  
conducted several tests that compared the latencies of EC2 with a  
traditional HPC-type setup (much like we have with our POD service). I  
have a whole suite of tests that I'd be happy to share with you, but  
to sum it up, the EC2 latencies were absolutely terrible. For starters,  
the EC2 PingPong latency for a zero-byte message was around ~150 microseconds,  
compared to ~32 microseconds on a completely untuned Gigabit Ethernet link. For  
something actually useful, say a 4K packet, EC2 was roughly ~265 microseconds,  
whereas a standard GigE link was a more reasonable (but still high)  
71 microseconds. One "real-world" application that was very sensitive to latency  
took almost 30 times longer to run on EC2 than on a real cluster  
configuration such as POD.


	I have benchmarks from several complete IMB runs, as well as other  
types of benchmarks such as STREAM and some iobench. If you are  
interested in any particular type, please let me know as I'd be happy  
to share.


	If you really need an on-demand type system where latency is  
an issue, you should look towards our POD offering. We even offer  
InfiniBand! On the compute side nothing is virtualized, so your  
application runs on the hardware without the overhead of a VM.


-Joshua Bernstein
Senior Software Engineer
Penguin Computing


On Mar 19, 2010, at 11:19 AM, Jeff Squyres wrote:

Yes, it is -- sometimes we get so caught up in other issues that  
user emails slip through the cracks.  Sorry about that!


I actually have little experience with EC2 -- other than knowing  
that it works, I don't know much about the performance that you can  
extract from it.  I have heard issues about non-uniform latency  
between MPI processes because you really don't know where the  
individual MPI processes may land (network- / VM-wise).  It suggests  
to me that EC2 might be best suited for compute-bound jobs (vs.  
latency-bound jobs).


Amusingly enough, the first time someone reported an issue with Open  
MPI on EC2, I tried to submit a help ticket to EC2 support saying,  
"I'm one of the Open MPI developers ... blah blah blah ... is there  
anything I can do to help?" The answer I got back was along the  
lines of, "You need to have a paid EC2 support account before we can  
help you." I think they missed the point, but oh well.  :-)




On Mar 12, 2010, at 12:10 AM, Hammad Siddiqi wrote:


Dear All,
Is this the correct forum for sending these kinds of emails? Please  
let me know if there is some other mailing list.

Thanks
Best Regards,
Hammad Siddiqi
System Administrator,
Centre for High Performance Scientific Computing,
School of Electrical Engineering and Computer Science,
National University of Sciences and Technology,
H-12, Islamabad.
Office : +92 (51) 90852207
Web: http://hpc.seecs.nust.edu.pk/~hammad/


On Sat, Feb 27, 2010 at 10:07 PM, Hammad Siddiqi wrote:

Dear All,

I am facing very weird results with OpenMPI 1.4.1 on Amazon EC2. I have
used a Small instance and a High-CPU Medium instance for benchmarking
latency and bandwidth. OpenMPI was configured with the default
options. When the code is run in cluster mode, the latency and
bandwidth of the Amazon EC2 Small instance are really less than those of
the Amazon EC2 High-CPU Medium instance. To my understanding the
difference should not be that much. The following are the links to
the graphs and their data:

Data: http://hpc.seecs.nust.edu.pk/~hammad/OpenMPI,Latency-BandwidthData.jpg
Graphs: http://hpc.seecs.nust.edu.pk/~hammad/OpenMPI,Latency-Bandwidth.jpg


Please have a look at them.

Is anyone else facing the same problem? Any guidance in this regard
will be highly appreciated.

Thank you.


--
Best Regards,
Hammad Siddiqi
System Administrator,
Centre for High Performance Scientific Computing,
School of Electrical Engineering and Computer Science,
National University of Sciences and Technology,
H-12, Islamabad.
Office : +92 (51) 90852207




--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/






Re: [OMPI users] libnuma under ompi 1.3

2009-03-04 Thread Joshua Bernstein



Terry Frankcombe wrote:

Having just downloaded and installed Open MPI 1.3 with ifort and gcc, I
merrily went off to compile my application.

In my final link with mpif90 I get the error:

/usr/bin/ld: cannot find -lnuma

Adding --showme reveals that

-I/home/terry/bin/Local/include -pthread -I/home/terry/bin/Local/lib

is added to the compile early in the aggregated ifort command, and 


-L/home/terry/bin/Local/lib -lmpi_f90 -lmpi_f77 -lmpi -lopen-rte
-lopen-pal -lpbs -lnuma -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl

is added to the end.

I note that when compiling Open MPI -lnuma was visible in the gcc
arguments, with no added -L.

On this system libnuma.so exists in /usr/lib64.  My (somewhat long!)
configure command was


You shouldn't have to. The runtime loader should look inside /usr/lib64 by 
itself. Unless, of course, you've built either your application or OpenMPI using 
a 32-bit Intel compiler instead (say fc instead of fce). In that case the 
runtime loader would look inside /usr/lib to find libnuma, rather than 
/usr/lib64.

Are you sure you are using the 64-bit version of the Intel compiler? If you 
intend to use the 32-bit version of the compiler, and OpenMPI is 32-bit, you 
may just need to install the numactl.i386 and numactl.x86_64 RPMs.
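
A quick sketch of how to check, assuming a stock RHEL-style layout (package names 
can vary by distribution, and the object name below is just a placeholder):

  file your_object.o      # any object built by the ifort you are using; "ELF 64-bit" vs "ELF 32-bit" shows the target
  ls -l /usr/lib64/libnuma.so* /usr/lib/libnuma.so*
  yum install numactl.x86_64 numactl.i386   # as suggested above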


-Joshua Bernstein
Senior Software Engineer
Penguin Computing


Re: [OMPI users] Installation Problems with Openmpi-1.2.9

2009-03-12 Thread Joshua Bernstein

Hi Amos,

	It looks like you do not have permission to create the directory 
/usr/local/etc. Either you need to run "make all install" as root, so 
that you have permission to write to that directory, or you need to pass 
the --prefix= option to configure so that the installation goes 
into a path where you do have permission.
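
A minimal sketch of the per-user route (the prefix path is just an example):

  ./configure --prefix=$HOME/openmpi-1.2.9
  make all install

Then put $HOME/openmpi-1.2.9/bin on your PATH (and the matching lib directory on 
LD_LIBRARY_PATH) so that mpicc and mpirun are found; that should also take care of 
the "mpicc not found" message below.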


-Joshua Bernstein
Senior Software Engineer
Penguin Computing

On Mar 12, 2009, at 12:13 PM, Amos Leffler wrote:


Hello Forum,
  Attached is a file of my installation and my attempts at the examples
for openmpi-1.2.9, which were not successful. Hopefully the problem is
a simple one and obvious to a more experienced user.

I am trying to install and test openmpi-1.2.9. I found that I
could not use the Intel 11.0/.081 C++ and Fortran compilers, although I think
the problem is with these compilers, not openmpi.  The openmpi-1.2.9 did
compile successfully with the internal compilers of SuSE 10.2.  However,
at the end of the "make all install" command output I noted that some of
the make commands did not run properly, as shown below.
I tried to run some of the simple examples and was not successful.
For hello_c.c I received the message "mpicc not found".  Is there a
simple workaround?

make[3]: Leaving directory `/home/amos/Desktop/openmpi-1.2.9/opal/ 
libltdl'
make[2]: Leaving directory `/home/amos/Desktop/openmpi-1.2.9/opal/ 
libltdl'

Making install in asm
make[2]: Entering directory `/home/amos/Desktop/openmpi-1.2.9/opal/ 
asm'
make[3]: Entering directory `/home/amos/Desktop/openmpi-1.2.9/opal/ 
asm'

make[3]: Nothing to be done for `install-exec-am'.
make[3]: Nothing to be done for `install-data-am'.
make[3]: Leaving directory `/home/amos/Desktop/openmpi-1.2.9/opal/asm'
make[2]: Leaving directory `/home/amos/Desktop/openmpi-1.2.9/opal/asm'
Making install in etc
make[2]: Entering directory `/home/amos/Desktop/openmpi-1.2.9/opal/ 
etc'
make[3]: Entering directory `/home/amos/Desktop/openmpi-1.2.9/opal/ 
etc'

test -z "/usr/local/etc" || /bin/mkdir -p "/usr/local/etc"
/bin/mkdir: cannot create directory `/usr/local/etc': Permission  
denied

make[3]: *** [install-sysconfDATA] Error 1
make[3]: Leaving directory `/home/amos/Desktop/openmpi-1.2.9/opal/etc'
make[2]: *** [install-am] Error 2
make[2]: Leaving directory `/home/amos/Desktop/openmpi-1.2.9/opal/etc'
make[1]: *** [install-recursive] Error 1
make[1]: Leaving directory `/home/amos/Desktop/openmpi-1.2.9/opal'
make: *** [install-recursive] Error 1

Any help would be appreciated.
   Amos Leffler




[OMPI users] OpenMPI 1.3.2 with PathScale 3.2

2009-05-14 Thread Joshua Bernstein

Greetings All,

	I'm trying to build OpenMPI 1.3.2 with the PathScale compiler, version 3.2. A 
bit of the way through the build the compiler dies with what it thinks is a bad 
optimization. Has anybody else seen this, or does anyone know a workaround for it? I'm 
going to take it up with PathScale of course, but I thought I'd throw it out here:


---SNIP---
/opt/pathscale/bin/pathCC -DHAVE_CONFIG_H -I. -I../.. -I../../extlib/otf/otflib 
-I../../extlib/otf/otflib -I../../vtlib/ -I../../vtlib  -D_GNU_SOURCE -mp 
-DVT_OMP -O3 -DNDEBUG -finline-functions -pthread -MT vtfilter-vt_tracefilter.o 
-MD -MP -MF .deps/vtfilter-vt_tracefilter.Tpo -c -o vtfilter-vt_tracefilter.o 
`test -f 'vt_tracefilter.cc' || echo './'`vt_tracefilter.cc

Signal: Segmentation fault in Global Optimization -- Dead Store Elimination 
phase.
Error: Signal Segmentation fault in phase Global Optimization -- Dead Store 
Elimination -- processing aborted

*** Internal stack backtrace:
pathCC INTERNAL ERROR: /opt/pathscale/lib/3.2/be died due to signal 4
Please report this problem to .
Problem report saved as /root/.ekopath-bugs/pathCC_error_LvXsJk.ii
Please review the above file and, if possible, attach it to your problem report.

bash-3.00# /opt/pathscale/bin/pathCC -version
PathScale(TM) Compiler Suite: Version 3.2
Built on: 2008-06-16 16:45:36 -0700
Thread model: posix
GNU gcc version 3.3.1 (PathScale 3.2 driver)

Copyright 2000, 2001 Silicon Graphics, Inc.  All Rights Reserved.
Copyright 2002, 2003, 2004, 2005, 2006 PathScale, Inc.  All Rights Reserved.
Copyright 2006, 2007 QLogic Corporation.  All Rights Reserved.
Copyright 2007, 2008 PathScale LLC.  All Rights Reserved.
See complete copyright, patent and legal notices in the
/opt/pathscale/share/doc/pathscale-compilers-3.2/LEGAL.pdf file.
---END SNIP---

-Joshua Bernstein
Software Engineer
Penguin Computing


Re: [OMPI users] OpenMPI 1.3.2 with PathScale 3.2

2009-05-18 Thread Joshua Bernstein

Well,

	I spoke with Gautam Chakrabarti at PathScale. The long and short of it is 
that using OpenMP with C++ under a GNU 3.3 (RHEL4) frontend creates some 
limitations inside pathCC. On a RHEL4 system, the compiler activates the 
frontend matching GCC 3.3, and this is what creates the crash. As suggested, I 
forced the compiler to use the newer frontend with the -gnu4 option and the 
build completes without an issue. It's sad, though, that they aren't trying to be 
backwards compatible, or even testing on RHEL4 systems. I imagine there is still 
a large group of people using RHEL4.
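
For the archives, a sketch of one way the workaround can be threaded through an 
Open MPI build; passing -gnu4 via CXX is just one option, and the remaining flags 
are whatever your site normally uses:

  ./configure CC=pathcc CXX="pathCC -gnu4" F77=pathf90 FC=pathf90 ...
  make all install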


Perhaps this is an OMPI FAQ entry?

The full response from Pathscale appears below:

---SNIP---
It appears you are using the compiler on a relatively old linux distribution 
which has a default GCC compiler based on version 3.3. Our compiler has a 
front-end that is activated on such systems, and a different newer improved 
front-end which is activated on the newer GCC4-based systems. Our compiler is 
tested on GCC-based systems with versions up to 4.2. I see that you are using 
OpenMP (using -mp). C++ OpenMP has limitations when being used with the GNU3.3 
based front-end, and is only fully supported when on a GNU4 based system.


You can invoke the newer front-end by the option -gnu4 on a GNU3 based system. 
While compiling this particular file may work with -gnu4 on a GNU3 based system, 
it is generally not safe to use this option for C++ on a GNU3 based system due 
to incompatibility issues.


The ideal fix would be to try your compilation on a GNU4 based linux 
distribution.
---END SNIP---

-Joshua Bernstein
Software Engineer
Penguin Computing

Jeff Squyres wrote:
FWIW, I'm able to duplicate the error.  Looks definitely like a[nother] 
pathscale bug to me.


Perhaps David's suggestions to disable some of the optimizations may 
help; otherwise, you can disable that entire chunk of code with the 
following:


   --enable-contrib-no-build=vt

(as Ralph mentioned, this VampirTrace code is an add-on to Open MPI; 
it's not part of core OMPI itself)



On May 15, 2009, at 9:17 AM, David O. Gunter wrote:


Pathscale supports -O3 (at least as of the 3.1 line).  Here are some
suggestions from the 3.2 Users Manual you may also want to try.

-david


If there are numerical problems with -O3 -OPT:Ofast, then try either 
of the

following:

  -O3 -OPT:Ofast:ro=1
  -O3 -OPT:Ofast:div_split=OFF

Note that 'ro' is short for roundoff.

-Ofast is equivalent to -O3 -ipa -OPT:Ofast -fno-math-errno -ffast-math
so similar cautions apply to it as to -O3 -OPT:Ofast.

To use interprocedural analysis without the "Ofast-type" optimizations,
use either of the following:
  -O3 -ipa
  -O2 -ipa

Testing different optimizations can be automated by pathopt2. This 
program

compiles and runs your program with a variety of compiler options and
creates a sorted list of the execution times for each run.

--
David Gunter
Los Alamos National Laboratory

> Last I checked when we were building here, I'm not sure Pathscale
> supports -O3. IIRC, O2 is the max supported value, though it has been
> awhile since I played with it.
>
> Have you checked the man page for it?
>
> It could also be something in the VampirTrace code since that is where
> you are failing. That is a contributed code - not part of OMPI itself
> - so we would have to check with those developers.
>
>
> On May 14, 2009, at 2:49 PM, Åke Sandgren wrote:
>
>> On Thu, 2009-05-14 at 13:35 -0700, Joshua Bernstein wrote:
>>> Greetings All,
>>>
>>> I'm trying to build OpenMPI 1.3.2 with the Pathscale compiler,
>>> version 3.2. A
>>> bit of the way through the build the compiler dies with what it
>>> things is a bad
>>> optimization. Has anybody else seen this, or know a work around for
>>> it? I'm
>>> going to take it up with Pathscale of course, but I thought I'd
>>> throw it out here:
>>>
>>> ---SNIP---
>>> /opt/pathscale/bin/pathCC -DHAVE_CONFIG_H -I. -I../.. -I../../
>>> extlib/otf/otflib
>>> -I../../extlib/otf/otflib -I../../vtlib/ -I../../vtlib  -
>>> D_GNU_SOURCE -mp
>>> -DVT_OMP -O3 -DNDEBUG -finline-functions -pthread -MT vtfilter-
>>> vt_tracefilter.o
>>> -MD -MP -MF .deps/vtfilter-vt_tracefilter.Tpo -c -o vtfilter-
>>> vt_tracefilter.o
>>> `test -f 'vt_tracefilter.cc' || echo './'`vt_tracefilter.cc
>>> Signal: Segmentation fault in Global Optimization -- Dead Store
>>> Elimination phase.
>>> Error: Signal Segmentation fault in phase Global Optimization --
>>> Dead Store
>>> Elimination -- processing aborted
>>> *** Internal stack backtrace:
>>> pathCC INTERNAL ERROR: /opt/pathscale/lib/3.2/be died due to signal 4
>>
>> 

Re: [OMPI users] OpenMPI 1.3.2 with PathScale 3.2

2009-05-19 Thread Joshua Bernstein



Jeff Squyres wrote:

Hah; this is probably at least tangentially related to


http://www.open-mpi.org/faq/?category=building#pathscale-broken-with-mpi-c++-api 


This looks related; perhaps a note suggesting the -gnu4 option would be nice 
to add, if not there then maybe someplace else?


I'll be kind and say that Pathscale has been "unwilling to help on these 
kinds of issues" with me in the past as well.  :-)


They've been very responsive for me, and their suggestions generally do the 
trick. There's no doubt the compiler is smoking fast; it's just about 
compatibility. :-)


It's not entirely clear from the text, but I guess that sounds like 
Pathscale is unsupported on GCC 3.x systems...?  Is that what you parse 
his answer to mean?


From Pathscale: "PathScale does not support C++ OpenMP support on RHEL4. As I 
noted earlier, RHEL4 is otherwise supported for all other compiler features, and 
it's tested as well."


-Joshua Bernstein
Software Engineer
Penguin Computing