Re: [OMPI users] OPENMPI 1.2.7 & PGI compilers: configure option --disable-ptmalloc2-opt-sbrk

2008-10-17 Thread Francesco Iannone
Hi Jeff

Sorry to disturb you

I send you the Stack Frame captured with Totalview.

The example program "Callocrash" crashes with a segmentation violation in the
sYSMALLOc function, at this line:

set_head(remainder, remainder_size | PREV_INUSE);


The Stack frame is

Function "sYSMALLOc":
  nb:0x00025216d050 (9967161424)
  av:0x2a95c1ef00 (&main_arena) -> (struct
malloc_state)
Local variables:
  old_top:   0x0b8bc110 -> (struct malloc_chunk)
  old_size:  0x00020ef0 (134896)
  old_end:   0x0b8dd000 -> ""
  size:  0x00025218def0 (9967296240)
  correction:0x (0)
  brk:   0x0b8dd000 -> ""
  snd_brk:   0x -> 
  front_misalign:0x (0)
  end_misalign:  0x0b8dd000 (193843200)
  aligned_brk:   0x00507000 -> ""
  p: 0x0b8bc110 -> (struct malloc_chunk)
  remainder: 0x25da29160 -> 
(struct malloc_chunk)
  remainder_size:0x00020ea0 (134816)
  sum:   0x3828b000 (942190592)
  pagemask:  0x0fff (4095)


On 16/10/08 14:05, "Francesco Iannone" 
wrote:

> Hi Jeff
> I used the configure option:
> 
> --enable-ptmalloc2-opt-sbrk
> 
> To solve a segmentation fault in memory allocation with openmpi.1.2.x and
> PGI 7.1-4 and 7.2.
> 
> I have a simple source code (Callocrash.c) as an example of this (see below).
> 
> Could you test this code on a node with 8 Gbyte of RAM and RedHat enterprise
> 4+ openmpi 1.2.x, PGI 7.1-4.
> 
> I compiled it with:
> 
>  pgcc -o Callocrash Callocrash.c   (it's ok)
>  gnu4 -o Callocrash Callocrash.c   (it's ok)
>  mpicc -o Callocrash Callocrash.c   (Segmentation fault in sYSMALLOc when
> it has to allocate 622947588 bytes)
> 
> However thanks in advance
> 
> greetings
> 
> 
> Callocrash.c
> 
> 
> #include <stdio.h>
> #include <stdlib.h>
> 
> int main( int argc, char *argv[])
> {
> /*
>  *  memory allocations simulation for ~50M nonzeros:
>  *  nd=180 md=350 mdy=420
>  *
>  *  if this program crashes, there is a compiler problem
>  */
> printf("memory allocations simulation for ~50M nonzeros:  nd=180
> md=350 mdy=420\n");
> printf("if this program crashes, check your
> compiler/environment configuration\n");
> 
> printf("sizeof(int)    %zu\n", sizeof(int));
> printf("sizeof(int*)   %zu\n", sizeof(int*));
> printf("sizeof(size_t) %zu\n", sizeof(size_t));
> 
> if( sizeof(size_t)<8 || sizeof(int*)<8 )
> {
> printf("please compile this program for a 64 bit
> environment!\n");
> return -1;
> }
> 
> int *p;
> 
> printf("allocation 1/4..\n");
> p = calloc(47109185,16);
> if(!p)printf("..failed.\n");
> printf("allocation 2/4..\n");
> p = calloc(47109185,4);
> if(!p)printf("..failed.\n");
> printf("allocation 3/4..\n");
> p = calloc(47109185,4);
> if(!p)printf("..failed.\n");
> printf("allocation 4/4..\n");
>   
> p = calloc(622947588,16);
> if(!p)printf("..failed.\n");
> if(!p) return -1;
> 
> printf("allocations test passed (no crash)\n");
> return 0;
> }
> 
> 
> On 15/10/08 19:42, "Jeff Squyres"  wrote:
> 
>> On Oct 15, 2008, at 9:35 AM, Francesco Iannone wrote:
>> 
>>> I have a cluster of 16 nodes DualCPU DualCore AMD  RAM 16 GB with
>>> InfiniBand
>>> CISCO HCA and switch InfiniBand.
>>> It uses Linux RH Enterprise 4  64 bit , OpenMPI 1.2.7, PGI 7.1-4 and
>>> openib-1.2-7.
>>> 
>>> Hence it means that the option --disable-ptmalloc2 is catastrophic in
>>> the
>>> above configuration.
>> 
>> Actually, I notice that in your original message, you said "--disable-
>> ptmalloc2-opt-sbrk", but here you said "--disable-ptmalloc2".  The
>> former is:
>> 
>>    Only trigger callbacks when sbrk is used for small
>>    allocations, rather than every call to malloc/free.
>>    (default: enabled)
>> 
>> So it should be fine to disable; it shouldn't affect overall MPI
>> performance too much.
>> 
>> The latter disables ptmalloc2 entirely (and you'll likely get lower
>> benchmark bandwidth for large messages).
>> 
>> I'm unaware of either of these options leading to problems with the
>> PGI compiler suite; I have tested OMPI v1.2.x with several versions of
>> the PGI compiler without problem (although my latest version is PGI
>> 7.1-4).
> 
> Dr. Francesco Iannone
> Associazione EURATOM-ENEA sulla Fusione
> C.R. ENEA Frascati
> Via E. Fermi 45
> 00044 Frascati (Roma) Italy
> phone 00-39-06-9400-5124
> fax 00-39-06-9400-5524
> mailto:francesco.iann...@frascati.enea.it
> http://www.afs.enea.it/iannone
> 
> 
> 
> 

[OMPI users] OPAL_PREFIX is not passed to remote node in pls_rsh_module.c

2008-10-17 Thread Teng Lin

Hi All,

We have bundled Open MPI with our product and shipped it to the
customer, relocating the installation with OPAL_PREFIX as described in
http://www.open-mpi.org/faq/?category=building#installdirs.


Below is the command we used to launch MPI program:
env OPAL_PREFIX=/path/to/openmpi \
/path/to/openmpi/bin//orterun --prefix /path/to/openmpi -x PATH -x  
LD_LIBRARY_PATH -x OPAL_PREFIX -np 2 --host host1,host2 ring_c


The interesting fact is that it always works under csh/tcsh, but quite a
few users told us that they run into the errors below:


[compute-28-1.local:11174] [NO-NAME] ORTE_ERROR_LOG: Not found in file
runtime/orte_init_stage1.c at line 182

--
Sorry!  You were supposed to get help about:
  orte_init:startup:internal-failure
from the file:
  help-orte-runtime
But I couldn't find any file matching that name.  Sorry!

--
[compute-28-1.local:11174] [NO-NAME] ORTE_ERROR_LOG: Not found in file
runtime/orte_system_init.c at line 42
[compute-28-1.local:11174] [NO-NAME] ORTE_ERROR_LOG: Not found in file
runtime/orte_init.c at line 52

--
Sorry!  You were supposed to get help about:
  orted:init-failure
from the file:
  help-orted.txt
But I couldn't find any file matching that name.  Sorry!


Jeff did mention in http://www.open-mpi.org/community/lists/users/2008/09/6582.php 
 that OPAL_PREFIX was propagated for him automatically. I bet Jeff  
uses csh/tcsh.

Anyway, it can be traced back to how the daemon is launched.

sh/bash:

[x:25369] pls:rsh: executing: (//usr/bin/ssh) /usr/bin/ssh x
OPAL_PREFIX=/opt/openmpi-1.2.4 ;
PATH=/opt/openmpi-1.2.4/bin:$PATH
; export PATH ;
LD_LIBRARY_PATH=/opt/openmpi-1.2.4/lib:$LD_LIBRARY_PATH ; export  
LD_LIBRARY_PATH ;


csh/tcsh:
[x:09886] pls:rsh: executing: (//usr/bin/ssh) /usr/bin/ssh x
setenv OPAL_PREFIX /opt/openmpi-1.2.4 ;


It seems to work after I patched pls_rsh_module.c


--- pls_rsh_module.c.orig   2008-10-16 17:15:32.0 -0400
+++ pls_rsh_module.c2008-10-16 17:15:51.0 -0400
@@ -989,7 +989,7 @@
  "%s/%s/%s",
  (opal_prefix != NULL ?  
"OPAL_PREFIX=" : ""),
  (opal_prefix != NULL ?  
opal_prefix : ""),

-  (opal_prefix != NULL ? " ;" : ""),
+  (opal_prefix != NULL ? " ; export  
OPAL_PREFIX ; " : ""),

  prefix_dir, bin_base,
  prefix_dir, lib_base,
  prefix_dir, bin_base,

Another workaround is to add
export OPAL_PREFIX
into $HOME/.bashrc.
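For completeness, the underlying sh semantics the patch works around: a plain assignment is visible only to the current shell, and only exported variables reach child processes such as orted. A minimal demonstration:

```shell
unset OPAL_PREFIX                       # start from a clean environment
OPAL_PREFIX=/opt/openmpi-1.2.4          # assignment only: not inherited
sh -c 'echo "child sees: [$OPAL_PREFIX]"'   # prints: child sees: []
export OPAL_PREFIX                      # now children inherit it
sh -c 'echo "child sees: [$OPAL_PREFIX]"'   # prints: child sees: [/opt/openmpi-1.2.4]
```

csh's `setenv` sets and exports in one step, which is why the csh/tcsh path works without the patch.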

Jeff, is this a bug in the code? Or is there a reason that
OPAL_PREFIX is not exported for sh/bash?


Teng


[OMPI users] Problems with OpenMPI running with Rmpi

2008-10-17 Thread Simone Giannerini
Dear all,

I managed to install successfully Rmpi 0.5-5 on a quad opteron machine (8
cores overall) running on OpenSUSE 11.0 and Open MPI 1.5.2.

this is what I get

> library(Rmpi)
[gauss:24207] mca: base: component_find: unable to open osc pt2pt: file not
found (ignored)
libibverbs: Fatal: couldn't read uverbs ABI version.
--
[0,0,0]: OpenIB on host gauss was unable to find any HCAs.
Another transport will be used instead, although this may result in
lower performance.
--
--

WARNING: Failed to open "OpenIB-cma"
[DAT_PROVIDER_NOT_FOUND:DAT_NAME_NOT_REGISTERED].
This may be a real error or it may be an invalid entry in the uDAPL
Registry which is contained in the dat.conf file. Contact your local
System Administrator to confirm the availability of the interfaces in
the dat.conf file.
--
--

WARNING: Failed to open "OpenIB-cma-1"
[DAT_PROVIDER_NOT_FOUND:DAT_NAME_NOT_REGISTERED].
This may be a real error or it may be an invalid entry in the uDAPL
Registry which is contained in the dat.conf file. Contact your local
System Administrator to confirm the availability of the interfaces in
the dat.conf file.
--
--

WARNING: Failed to open "OpenIB-cma-2"
[DAT_PROVIDER_NOT_FOUND:DAT_NAME_NOT_REGISTERED].
This may be a real error or it may be an invalid entry in the uDAPL
Registry which is contained in the dat.conf file. Contact your local
System Administrator to confirm the availability of the interfaces in
the dat.conf file.
--
--

WARNING: Failed to open "OpenIB-cma-3"
[DAT_PROVIDER_NOT_FOUND:DAT_NAME_NOT_REGISTERED].
This may be a real error or it may be an invalid entry in the uDAPL
Registry which is contained in the dat.conf file. Contact your local
System Administrator to confirm the availability of the interfaces in
the dat.conf file.
--
--

WARNING: Failed to open "OpenIB-bond"
[DAT_PROVIDER_NOT_FOUND:DAT_NAME_NOT_REGISTERED].
This may be a real error or it may be an invalid entry in the uDAPL
Registry which is contained in the dat.conf file. Contact your local
System Administrator to confirm the availability of the interfaces in
the dat.conf file.
--
--

WARNING: Failed to open "ofa-v2-ib0"
[DAT_PROVIDER_NOT_FOUND:DAT_NAME_NOT_REGISTERED].
This may be a real error or it may be an invalid entry in the uDAPL
Registry which is contained in the dat.conf file. Contact your local
System Administrator to confirm the availability of the interfaces in
the dat.conf file.
--
--

WARNING: Failed to open "ofa-v2-ib1"
[DAT_PROVIDER_NOT_FOUND:DAT_NAME_NOT_REGISTERED].
This may be a real error or it may be an invalid entry in the uDAPL
Registry which is contained in the dat.conf file. Contact your local
System Administrator to confirm the availability of the interfaces in
the dat.conf file.
--
--

WARNING: Failed to open "ofa-v2-ib2"
[DAT_PROVIDER_NOT_FOUND:DAT_NAME_NOT_REGISTERED].
This may be a real error or it may be an invalid entry in the uDAPL
Registry which is contained in the dat.conf file. Contact your local
System Administrator to confirm the availability of the interfaces in
the dat.conf file.
--
--
[0,0,0]: uDAPL on host gauss was unable to find any NICs.
Another transport will be used instead, although this may result in
lower performance.
--
> mpi.spawn.Rslaves()
1 slaves are spawned successfully. 0 failed.
master (rank 0, comm 1) of size 2 is running on: gauss
slave1 (rank 1, comm 1) of size 2 is running on: gauss

as you can see, just 1 cpu per session (2 cores) is recognized and used.

and this is the content of my etc/conf.dat

OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl

Re: [OMPI users] OpenMPI portability problems: debug info isn't helpful

2008-10-17 Thread Mike Hanby
Some further clarification: I read a post over on the SGE mailing list
that said --with-sge is part of OMPI 1.3, not 1.2.x.

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Aleksej Saushev
Sent: Thursday, October 16, 2008 12:39 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI portability problems: debug info
isn't helpful

Jeff Squyres  writes:

> On Oct 11, 2008, at 10:20 AM, Aleksej Saushev wrote:
>
>> $ ompi_info | grep oob
>> MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
>> MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.7)
>
> Good!
>
>>> $ mpirun --mca rml_base_debug 100 -np 2 skosfile
>> [asau.local:09060] mca: base: components_open: Looking for rml
>> components
>> [asau.local:09060] mca: base: components_open: distilling rml
>> components
>> [asau.local:09060] mca: base: components_open: accepting all
>> rml  components
>> [asau.local:09060] mca: base: components_open: opening rml components
>> [asau.local:09060] mca: base: components_open: found loaded
>> component oob
>> [asau.local:09060] mca: base: components_open: component oob
>> open  function successful
>> [asau.local:09060] orte_rml_base_select: initializing rml
>> component  oob
>> [asau.local:09060] orte_rml_base_select: init returned failure
>
> Ah ha -- this is progress.  For some reason, your "oob" RML
> plugin is  declining to run.  I see that its
> query/initialization function is  actually quite short:
>
> if(mca_oob_base_init() != ORTE_SUCCESS)
> return NULL;
> *priority = 1;
> return &orte_rml_oob_module;
>
> So it must be failing the mca_oob_base_init() function -- this
> is what  initializes the underling "OOB" (out of band)
> communications subsystem.
>
> Of course, this doesn't fail often, so we don't have any
> run-time  switches to enable the debugging output.  :-(  Edit
> orte/mca/oob/base/ oob_base_open.c line 43 and change the value
> of mca_oob_base_output  from -1 to 0.  Let's see that output --
> I'm particularly interested in  the output from querying the tcp
> oob component.  I suspect that it's  declining to run as well.
>
> I wonder if this is going to end up being an opal_if() issue --
> where  we are traversing all the IP network interfaces from the
> kernel...   I'll bet even money that it is.

[asau.local:04648] opal_ifinit: ioctl(SIOCGIFFLAGS) failed with errno=6
[asau.local:04648] [NO-NAME] ORTE_ERROR_LOG: Not found in file
runtime/orte_init_stage1.c at line 182

--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_rml_base_select failed
  --> Returned value -13 instead of ORTE_SUCCESS


--
[asau.local:04648] [NO-NAME] ORTE_ERROR_LOG: Not found in file
runtime/orte_system_init.c at line 42
[asau.local:04648] [NO-NAME] ORTE_ERROR_LOG: Not found in file
runtime/orte_init.c at line 52

--
Open RTE was unable to initialize properly.  The error occured while
attempting to orte_init().  Returned value -13 instead of ORTE_SUCCESS.

--

Why don't you use strerror(3) to print errno value explanation?

From <errno.h>:
#define ENXIO   6   /* Device not configured */

It seems that I have to debug network interface probing,
how should I use *_output subroutines so that they do print?
I tried these changes but in vain:

--- opal/util/if.c.orig 2008-08-25 23:16:50.0 +0400
+++ opal/util/if.c  2008-10-15 23:55:07.0 +0400
@@ -242,6 +242,8 @@
 if(ifr->ifr_addr.sa_family != AF_INET)
 continue;

+   opal_output(0, "opal_ifinit: checking netif %s", ifr->ifr_name);
+   /* HERE IT FAILS!! */
 if(ioctl(sd, SIOCGIFFLAGS, ifr) < 0) {
 opal_output(0, "opal_ifinit: ioctl(SIOCGIFFLAGS) failed
with errno=%d", errno);
 continue;
--- opal/util/if.c.orig 2008-08-25 23:16:50.0 +0400
+++ opal/util/if.c  2008-10-15 23:55:07.0 +0400
@@ -242,6 +242,8 @@
 if(ifr->ifr_addr.sa_family != AF_INET)
 continue;

+   fprintf(stderr, "opal_ifinit: checking netif %s\n",
ifr->ifr_name);
+   /* HERE IT FAILS!! */
 if(ioctl(sd, SIOCGIFFLAGS, ifr) < 0) {
 opal_output(0, "opal_ifinit: ioctl(SIOCGIFFLAGS) failed
with errno=%d", errno);
 continue;
--- opal/util/output.c.orig 2008-08-25 23:16:50.0 +0400
+++ opal/util/output.c  2008-10-16 19:58:49.

[OMPI users] Debian MPI -- mpirun missing

2008-10-17 Thread Raymond Wan


Hi all,

I'm very new to MPI and am trying to install it on to a Debian Etch 
system.  I did have mpich installed and I believe that is causing me 
problems.  I completely uninstalled it and then ran:


update-alternatives --remove-all mpicc

Then, I installed the following packages:

libibverbs1 openmpi-bin openmpi-common openmpi-libs0 openmpi-dbg openmpi-dev

And it now says:

>> update-alternatives --display mpicc
mpicc - status is auto.
link currently points to /usr/bin/mpicc.openmpi
/usr/bin/mpicc.openmpi - priority 40
slave mpif90: /usr/bin/mpif90.openmpi
slave mpiCC: /usr/bin/mpic++.openmpi
slave mpic++: /usr/bin/mpic++.openmpi
slave mpif77: /usr/bin/mpif77.openmpi
slave mpicxx: /usr/bin/mpic++.openmpi
Current `best' version is /usr/bin/mpicc.openmpi.

which seems ok to me...  So, I tried to compile something (I had sample 
code from a book I purchased a while back, but for mpich), however, I 
can run the program as-is, but I think I should be running it with 
mpirun -- the FAQ suggests there is one?  But, there is no mpirun 
anywhere.  It's not in /usr/bin.  I updated the filename database 
(updatedb) and tried a "locate mpirun", and I get only one hit:


/usr/include/openmpi/ompi/runtime/mpiruntime.h

Is there a package that I neglected to install?  I did an "aptitude 
search openmpi" and installed everything listed...  :-)  Or perhaps I 
haven't removed all trace of mpich?


Thank you in advance!

Ray




Re: [OMPI users] Debian MPI -- mpirun missing

2008-10-17 Thread Terry Frankcombe
Er, shouldn't this be in the Debian support list?  A correctly installed
OpenMPI will give you mpirun.  If their openmpi-bin package doesn't,
then surely it's broken?  (Or is there a straight openmpi package?)



On Sat, 2008-10-18 at 00:16 +0900, Raymond Wan wrote:
> Hi all,
> 
> I'm very new to MPI and am trying to install it on to a Debian Etch 
> system.  I did have mpich installed and I believe that is causing me 
> problems.  I completely uninstalled it and then ran:
> 
> update-alternatives --remove-all mpicc
> 
> Then, I installed the following packages:
> 
> libibverbs1 openmpi-bin openmpi-common openmpi-libs0 openmpi-dbg openmpi-dev
> 
> And it now says:
> 
>  >> update-alternatives --display mpicc
> mpicc - status is auto.
>  link currently points to /usr/bin/mpicc.openmpi
> /usr/bin/mpicc.openmpi - priority 40
>  slave mpif90: /usr/bin/mpif90.openmpi
>  slave mpiCC: /usr/bin/mpic++.openmpi
>  slave mpic++: /usr/bin/mpic++.openmpi
>  slave mpif77: /usr/bin/mpif77.openmpi
>  slave mpicxx: /usr/bin/mpic++.openmpi
> Current `best' version is /usr/bin/mpicc.openmpi.
> 
> which seems ok to me...  So, I tried to compile something (I had sample 
> code from a book I purchased a while back, but for mpich), however, I 
> can run the program as-is, but I think I should be running it with 
> mpirun -- the FAQ suggests there is one?  But, there is no mpirun 
> anywhere.  It's not in /usr/bin.  I updated the filename database 
> (updatedb) and tried a "locate mpirun", and I get only one hit:
> 
> /usr/include/openmpi/ompi/runtime/mpiruntime.h
> 
> Is there a package that I neglected to install?  I did an "aptitude 
> search openmpi" and installed everything listed...  :-)  Or perhaps I 
> haven't removed all trace of mpich?
> 
> Thank you in advance!
> 
> Ray
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] The --with-sge option

2008-10-17 Thread Jeff Squyres

On Oct 16, 2008, at 12:06 PM, Mike Hanby wrote:

I’m compiling 1.2.8 on a system with SGE 6.1u4 and came across the  
“--with-sge” option on a Grid Engine posting.


A couple questions:
1.  I don’t see --with-sge mentioned in the “./configure --help"  
output, nor can I find much reference to it on the open-mpi site, is  
this option really implemented? What does it do?


Sorry -- this is an option for OMPI v1.3 and later; it doesn't exist  
in the v1.2 series.


[8:31] svbu-mpi:~/svn/ompi4 % ./configure --help |& grep sge
  --with-sge  Build SGE or Grid Engine support (default:  
no)


So in the v1.3 series, using --without-sge will prevent OMPI from
understanding SGE host lists, etc.


2.  After compiling openmpi providing the --with-sge switch I ran  
the ompi_info binary and grep’d for sge in the output, there isn’t  
any reference, should there be if the option was successfully passed  
to configure?


From your second mail:


I did find the following in ompi_info:

MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.7)

However I see that in an ompi_info built without using the --with- 
sge switch.


Per above, that should be ok in the 1.2 series.

Also, since I'm building 1.2.8, shouldn't those versions after  
Component reflect 1.2.8?


Yes, actually, they should...  That's somewhat concerning.

I set the PATH and LD_LIBRARY_PATH to point to the temp location of  
my new build and it still reports 1.2.7.



You might want to double check your setup.  Since OMPI uses plugins,
it can be easy to accidentally mix versions by installing one over
another, etc.


Note that the output from configure will also indicate whether it's  
going to build SGE support, as well.  Look in the stdout of configure  
and search for "gridengine".


--
Jeff Squyres
Cisco Systems




[OMPI users] OpenMPI 1.2.8 on Solaris: configure problems

2008-10-17 Thread Paul Kapinos

Hi guys,

did you test OpenMPI 1.2.8 on Solaris at all?!

We tried to compile OpenMPI 1.2.8 on Solaris, on SPARC and on Opteron, with
both GCC and Sun Studio compilers, in 32-bit and 64-bit versions (2*2*2 = 8
combinations in all), in the very same manner in which we installed the
1.2.5 and 1.2.6 versions.



The configure process runs through, but when "gmake all" is called, the
configure stage seems to restart or resume:


..
orte/mca/smr/bproc/Makefile.am:47: Libtool library used but `LIBTOOL' is 
undefined
orte/mca/smr/bproc/Makefile.am:47:   The usual way to define `LIBTOOL' 
is to add `AC_PROG_LIBTOOL'
orte/mca/smr/bproc/Makefile.am:47:   to `configure.ac' and run `aclocal' 
and `autoconf' again.
orte/mca/smr/bproc/Makefile.am:47:   If `AC_PROG_LIBTOOL' is in 
`configure.ac', make sure
orte/mca/smr/bproc/Makefile.am:47:   its definition is in aclocal's 
search path.

test/support/Makefile.am:29: library used but `RANLIB' is undefined
test/support/Makefile.am:29:   The usual way to define `RANLIB' is to 
add `AC_PROG_RANLIB'

test/support/Makefile.am:29:   to `configure.ac' and run `autoconf' again.

. and breaks.


If we run "gmake all" again, we also see error messages like:



*** Fortran 77 compiler
checking for gfortran... gfortran
checking whether we are using the GNU Fortran 77 compiler... yes
checking whether gfortran accepts -g... yes
checking if Fortran 77 compiler works... yes
checking gfortran external symbol convention... ./configure: line 26340: 
./conftest.o: Permission denied

./configure: line 26342: ./conftest.o: Permission denied
./configure: line 26344: ./conftest.o: Permission denied
./configure: line 26346: ./conftest.o: Permission denied
./configure: line 26348: ./conftest.o: Permission denied
configure: error: Could not determine Fortran naming convention.





Looking at the configure script, we see these lines in ./configure:


if $NM conftest.o | grep foo_bar__ >/dev/null 2>&1 ; then
  ompi_cv_f77_external_symbol="double underscore"
  elif $NM conftest.o | grep foo_bar_ >/dev/null 2>&1 ; 
then

  ompi_cv_f77_external_symbol="single underscore"
  elif $NM conftest.o | grep FOO_bar >/dev/null 2>&1 ; then
  ompi_cv_f77_external_symbol="mixed case"
  elif $NM conftest.o | grep foo_bar >/dev/null 2>&1 ; then
  ompi_cv_f77_external_symbol="no underscore"
  elif $NM conftest.o | grep FOO_BAR >/dev/null 2>&1 ; then
  ompi_cv_f77_external_symbol="upper case"
  else
  $NM conftest.o >conftest.out 2>&1




and searching through ./configure shows that $NM is never set
(neither in ./configure nor in our environment).
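A possible mechanism for those "Permission denied" errors (an assumption on my part, based on the "." entry visible in the PATH dump from config.log below): with $NM empty, the symbol-convention test degenerates into trying to execute conftest.o itself, which reproduces the exact message:

```shell
# With NM empty, "$NM conftest.o" word-splits down to just "conftest.o",
# which the shell then looks up in PATH; the "." entry resolves it to
# the non-executable object file, giving "Permission denied".
NM=""
dir=$(mktemp -d) && cd "$dir"
touch conftest.o
PATH="$PATH:."
$NM conftest.o 2>&1 | head -1   # e.g. "sh: ./conftest.o: Permission denied"
```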



So we think something is not OK with the ./configure script.  Note that we
were able to install 1.2.5 and 1.2.6 some time ago on the same boxes
without problems.


Or maybe we are doing something wrong?

best regards,
Paul Kapinos
HPC Group RZ RWTH Aachen






P.S. Folks, has anybody compiled OpenMPI 1.2.8 on any Solaris
successfully?






This file contains any messages produced by compilers while
running configure, to aid debugging if configure makes a mistake.

It was created by Open MPI configure 1.2.8, which was
generated by GNU Autoconf 2.61.  Invocation command line was

  $ ./configure --with-devel-headers CFLAGS=-O2 -m64 CXXFLAGS=-O2 -m64 FFLAGS=-O2 -m64 FCFLAGS=-O2 -m64 LDFLAGS=-O2 -m64 --prefix=/rwthfs/rz/SW/MPI/openmpi-1.2.8/solx8664/gcc CC=gcc CXX=g++ FC=gfortran --enable-ltdl-convenience --no-create --no-recursion

## - ##
## Platform. ##
## - ##

hostname = sunoc63.rz.RWTH-Aachen.DE
uname -m = i86pc
uname -r = 5.10
uname -s = SunOS
uname -v = Generic_137112-06

/usr/bin/uname -p = i386
/bin/uname -X = System = SunOS
Node = sunoc63.rz.RWTH-Aachen.DE
Release = 5.10
KernelID = Generic_137112-06
Machine = i86pc
BusType = 
Serial = 
Users = 
OEM# = 0
Origin# = 1
NumCPU = 4

/bin/arch  = i86pc
/usr/bin/arch -k   = i86pc
/usr/convex/getsysinfo = unknown
/usr/bin/hostinfo  = unknown
/bin/machine   = unknown
/usr/bin/oslevel   = unknown
/bin/universe  = unknown

PATH: /home/pk224850/bin
PATH: /rwthfs/rz/SW/UTIL.common/gcc/4.2/i386-pc-solaris2.10/bin
PATH: /home/pk224850/bin
PATH: /home/pk224850/bin
PATH: /usr/local_host/sbin
PATH: /usr/local_host/bin
PATH: /usr/local_rwth/sbin
PATH: /usr/local_rwth/bin
PATH: /usr/bin
PATH: /usr/sbin
PATH: /sbin
PATH: /usr/dt/bin
PATH: /usr/X11/bin
PATH: /usr/java/bin
PATH: /usr/openwin/bin
PATH: /usr/ccs/bin
PATH: /usr/ucb
PATH: /opt/SUNWexplo/bin
PATH: /usr/sfw/bin
PATH: /opt/sfw/bin
PATH: /usr/local/bin
PATH: /usr/local/sbin
PATH: /opt/csw/bin
PATH: .


## --- ##
## Core tests. ##
## --- ##

configure:2817: checking for a BSD-compatible install
configure:2873: result: /usr/local_rwth/bin/ginstall -c
configure:2884: checking whet

Re: [OMPI users] Problem launching onto Bourne shell

2008-10-17 Thread Jeff Squyres
Doh; yes we did.  This was a minor glitch in porting the 1.2 series  
fix to the trunk/v1.3 (i.e., the fix in v1.2.8 is ok -- whew!).


Fixed on the trunk in r19758; thanks for noticing.  I'll file a CMR  
for v1.3.



On Oct 16, 2008, at 7:05 PM, Mostyn Lewis wrote:


Jeff,

You broke my ksh (and I expect something else)
Today's SVN 1.4a1r19757
orte/mca/plm/rsh/plm_rsh_module.c
line 471:
   tmp = opal_argv_split("( test ! -r ./.profile  
|| . ./.profile;", ' ');

  ^
  ARGHH
No (
   tmp = opal_argv_split(" test ! -r ./.profile  
|| . ./.profile;", ' ');

and all is well again :)
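For reference, the corrected prefix is an ordinary POSIX command list, which any sh/ksh accepts; a quick sanity check in an empty scratch directory (hypothetical, just to show the guard behaves):

```shell
# Run the .profile guard in a directory that has no .profile:
# "test ! -r ./.profile" succeeds, so the sourcing is skipped and
# execution continues to the next command.
cd "$(mktemp -d)"
sh -c 'test ! -r ./.profile || . ./.profile; echo ok'   # prints: ok
```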

Regards,
Mostyn

On Thu, 9 Oct 2008, Jeff Squyres wrote:

FWIW, the fix has been pushed into the trunk, 1.2.8, and 1.3 SVN  
branches. So I'll probably take down the hg tree (we use those as  
temporary branches).


On Oct 9, 2008, at 2:32 PM, Hahn Kim wrote:


Hi,
Thanks for providing a fix, sorry for the delay in response.  Once  
I found out about -x, I've been busy working on the rest of our  
code, so I haven't had the time to try out the fix.  I'll take a  
look at it soon as I can and will let you know how it works out.

Hahn
On Oct 7, 2008, at 5:41 PM, Jeff Squyres wrote:

On Oct 7, 2008, at 4:19 PM, Hahn Kim wrote:
you probably want to set the LD_LIBRARY_PATH (and PATH, likely,  
and

possibly others, such as that LICENSE key, etc.) regardless of
whether it's an interactive or non-interactive login.

Right, that's exactly what I want to do.  I was hoping that mpirun
would run .profile as the FAQ page stated, but the -x fix works  
for

now.
If you're using Bash, it should be running .bashrc.  But it looks  
like

you did identify a bug that we're *not* running .profile.  I have a
Mercurial branch up with a fix if you want to give it a spin:

 http://www.open-mpi.org/hg/hgwebdir.cgi/jsquyres/sh-profile-fixes/
I just realized that I'm using .bash_profile on the x86 and need  
to
move its contents into .bashrc and call .bashrc  
from .bash_profile,

since eventually I will also be launching MPI jobs onto other x86
processors.
Thanks to everyone for their help.
Hahn
On Oct 7, 2008, at 2:16 PM, Jeff Squyres wrote:

On Oct 7, 2008, at 12:48 PM, Hahn Kim wrote:
Regarding 1., we're actually using 1.2.5.  We started using  
Open MPI
last winter and just stuck with it.  For now, using the -x  
flag with
mpirun works.  If this really is a bug in 1.2.7, then I think  
we'll

stick with 1.2.5 for now, then upgrade later when it's fixed.
It looks like this behavior has been the same throughout the  
entire

1.2 series.
Regarding 2., are you saying I should run the commands you  
suggest
from the x86 node running bash, so that ssh logs into the Cell  
node

running Bourne?
I'm saying that if "ssh othernode env" gives different answers  
than
"ssh othernode"/"env", then your .bashrc or .profile or  
whatever is
dumping out early depending on whether you have an interactive  
login
or not.  This is the real cause of the error -- you probably  
want to
set the LD_LIBRARY_PATH (and PATH, likely, and possibly others,  
such
as that LICENSE key, etc.) regardless of whether it's an  
interactive

or non-interactive login.

When I run "ssh othernode env" from the x86 node, I get the
following vanilla environment:
USER=ha17646
HOME=/home/ha17646
LOGNAME=ha17646
SHELL=/bin/sh
PWD=/home/ha17646
When I run "ssh othernode" from the x86 node, then run "env"  
on the

Cell, I get the following:
USER=ha17646
LD_LIBRARY_PATH=/opt/cell/toolchain/lib/gcc/ppu/4.1.1/32
HOME=/home/ha17646
MCS_LICENSE_PATH=/opt/MultiCorePlus/mcf.key
LOGNAME=ha17646
TERM=xterm-color
PATH=/usr/local/bin:/usr/bin:/sbin:/bin:/tools/openmpi-1.2.5/ 
bin:/

tools/cmake-2.4.7/bin:/tools
SHELL=/bin/sh
PWD=/home/ha17646
TZ=EST5EDT
Hahn
On Oct 7, 2008, at 12:07 PM, Jeff Squyres wrote:

Ralph and I just talked about this a bit:
1. In all released versions of OMPI, we *do* source  
the .profile

file
on the target node if it exists (because vanilla Bourne  
shells do

not
source anything on remote nodes -- Bash does, though, per the  
FAQ).
However, looking in 1.2.7, it looks like it might not be  
executing

that code -- there *may* be a bug in this area.  We're checking
into it.
2. You might want to check your configuration to see if
your .bashrc
is dumping out early because it's a non-interactive shell.   
Check

the
output of:
ssh othernode env
vs.
ssh othernode
env
(i.e., a non-interactive running of "env" vs. an interactive  
login

and
running "env")
On Oct 7, 2008, at 8:53 AM, Ralph Castain wrote:
I am unaware of anything in the code that would  
"source .profile"

for you. I believe the FAQ page is in error here.
Ralph
On Oct 6, 2008, at 7:47 PM, Hahn Kim wrote:
Great, that worked, thanks!  However, it still concerns me  
that

the
FAQ page says that mpirun will execute .profile which doesn't
seem
to work for me.  Are there any configuration issues that  
could

possibly be preventing mpirun from doing this?  It wou

Re: [OMPI users] Problems with OpenMPI running with Rmpi

2008-10-17 Thread Dirk Eddelbuettel

On 17 October 2008 at 12:42, Simone Giannerini wrote:
| Dear all,
| 
| I managed to install successfully Rmpi 0.5-5 on a quad opteron machine (8
| cores overall) running on OpenSUSE 11.0 and Open MPI 1.5.2.
| 
| this is what I get
| 
| > library(Rmpi)
| [gauss:24207] mca: base: component_find: unable to open osc pt2pt: file not
| found (ignored)
| libibverbs: Fatal: couldn't read uverbs ABI version.
| --
| [0,0,0]: OpenIB on host gauss was unable to find any HCAs.
| Another transport will be used instead, although this may result in
| lower performance.
| --

I am surprised that your googling did not lead you to stumble on the dozens of
posts on this telling you that the config file

/etc/openmpi/openmpi-mca-params.conf(location for Debian etc)

can be changed to explicitly setting btl to 'no openib' as in

# Disable the use of InfiniBand
#   btl = ^openib
btl = ^openib

which will suppress the warning by suppressing the load of IB.  Better still,
newer Open MPI releases do this by default.

| I have searched the archives and found that the following suggestion was
| given for a similar problem:
| 
| > Open MPI has Infiniband module compiled but there is no IB device found
| > on your host. Try to add "--mca btl ^openib" string to your command
| > line.

That's one way of suppressing it, but not the only one.
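Another way, which fits the Rmpi case where there is no mpirun command line to edit: Open MPI also reads MCA parameters from OMPI_MCA_<name> environment variables, so the openib BTL can be disabled before R is even started (a sketch; same effect as the config-file entry above):

```shell
# Equivalent of "btl = ^openib" in openmpi-mca-params.conf, but set
# through the environment before launching R / library(Rmpi):
export OMPI_MCA_btl=^openib
echo "btl = $OMPI_MCA_btl"   # prints: btl = ^openib
```

With the variable exported, library(Rmpi) should initialize without probing the InfiniBand transports.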

| Since I am not calling mpi directly but through Rmpi  I do not know where to
| put that flag, I might contact the Rmpi mantainer, in any case, I would be
| grateful if you had further suggestions.

There is nothing Rmpi can do there, so contacting Dr Yu, while generally a
good idea for actual Rmpi issues, is not really advised here.

Cheers, Dirk

-- 
Three out of two people have difficulties with fractions.


Re: [OMPI users] Debian MPI -- mpirun missing

2008-10-17 Thread Ashley Pittman
On Sat, 2008-10-18 at 00:16 +0900, Raymond Wan wrote:
> 
> Is there a package that I neglected to install?  I did an "aptitude 
> search openmpi" and installed everything listed...  :-)  Or perhaps I 
> haven't removed all trace of mpich?

According to packages.debian.org there isn't an openmpi package which
contains mpirun, which, as you note, is unexpected.  There is an orterun,
however, which you could use instead.

The Etch version of openmpi is very old; openmpi has made a lot of
progress since 1.1-2.3.  I'd recommend building from source if you are
able to.

Ashley.



Re: [OMPI users] Debian MPI -- mpirun missing

2008-10-17 Thread Dirk Eddelbuettel

On 18 October 2008 at 00:16, Raymond Wan wrote:
| 
| Hi all,
| 
| I'm very new to MPI and am trying to install it on to a Debian Etch 
| system.  I did have mpich installed and I believe that is causing me 

Etch is getting old, and its Open MPI 1.1 packages were in suboptimal shape.
A few of us got together as a new Open MPI team within Debian, and the 1.2.*
packages are in much better shape. So please try to get the 1.2 packages.

| problems.  I completely uninstalled it and then ran:
| 
| update-alternatives --remove-all mpicc
| 
| Then, I installed the following packages:
| 
| libibverbs1 openmpi-bin openmpi-common openmpi-libs0 openmpi-dbg openmpi-dev
| 
| And it now says:
| 
|  >> update-alternatives --display mpicc
| mpicc - status is auto.
|  link currently points to /usr/bin/mpicc.openmpi
| /usr/bin/mpicc.openmpi - priority 40
|  slave mpif90: /usr/bin/mpif90.openmpi
|  slave mpiCC: /usr/bin/mpic++.openmpi
|  slave mpic++: /usr/bin/mpic++.openmpi
|  slave mpif77: /usr/bin/mpif77.openmpi
|  slave mpicxx: /usr/bin/mpic++.openmpi
| Current `best' version is /usr/bin/mpicc.openmpi.
| 
| which seems ok to me...  So, I tried to compile something (I had sample 
| code from a book I purchased a while back, but for mpich), however, I 
| can run the program as-is, but I think I should be running it with 
| mpirun -- the FAQ suggests there is one?  But, there is no mpirun 
| anywhere.  It's not in /usr/bin.  I updated the filename database 
| (updatedb) and tried a "locate mpirun", and I get only one hit:

Well, when I use Open MPI I go with the new convention and call orterun
instead of mpirun; I think you should too.  Maybe a local alias in your
~/.bashrc can do the trick.

Current packages do have mpirun.openmpi, but we were unable to devise a
bullet-proof scheme between lam, mpich and Open MPI for sharing / updating /
... the alternatives links, as there are subtle differences that prevent us
from switching all these aliases consistently.

Hope this helps, Dirk

| 
| /usr/include/openmpi/ompi/runtime/mpiruntime.h
| 
| Is there a package that I neglected to install?  I did an "aptitude 
| search openmpi" and installed everything listed...  :-)  Or perhaps I 
| haven't removed all trace of mpich?
| 
| Thank you in advance!
| 
| Ray
| 
| 
| ___
| users mailing list
| us...@open-mpi.org
| http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Three out of two people have difficulties with fractions.


Re: [OMPI users] Debian MPI -- mpirun missing

2008-10-17 Thread Terry Frankcombe

> Well, when I use Open MPI I go with the new convention and call orterun
> instead of mpirun; I think you should too.  Maybe a local alias in your
> ~/.bashrc can do the trick.
> 
> Current packages do have mpirun.openmpi, but we were unable to devise a
> bullet-proof scheme between lam, mpich and Open MPI for sharing / updating /
> ... the alternatives links, as there are subtle differences that prevent us
> from switching all these aliases consistently.

Eh?  Surely it's a simple case of conflict?  If you want multiple
packages providing similar functionality, it's up to you to specify how
the user should choose which one they want to run.  Breaking any
particular package (or all packages) seems like a particularly poor
choice, but that's only my opinion.

I would argue that orterun is a very long way from a "new convention".
I'd draw attention to section 8.8 of the MPI 2.1 standard.

But again, this is a discussion for the Debian list.





Re: [OMPI users] OpenMPI 1.2.8 on Solaris: configure problems

2008-10-17 Thread Ethan Mallove
On Fri, Oct/17/2008 05:53:07PM, Paul Kapinos wrote:
> Hi guys,
>
> did you test OpenMPI 1.2.8 on Solaris at all?!

We built 1.2.8 on Solaris successfully a few days ago:

  http://www.open-mpi.org/mtt/index.php?do_redir=869

But due to hardware/software/man-hour resource limitations,
there are often combinations of configure options, mpirun
options, etc. that end up going untested. E.g., I see you're
using some configure options we haven't tried:

 * --enable-ltdl-convenience 
 * --no-create 
 * --no-recursion
 * GCC on Solaris 


> We tried to compile OpenMPI 1.2.8 on Solaris on Sparc and on Opteron, for 
> both GCC and the SUN Studio compiler, in 32bit and 64bit versions, all 
> 2*2*2=8 combinations, in the very same manner in which we installed the 
> 1.2.5 and 1.2.6 versions.
>
>
> The configure process runs through, but when "gmake all" is called, it seems 
> that the configure stage restarts or is resumed:
>
> ..
> orte/mca/smr/bproc/Makefile.am:47: Libtool library used but `LIBTOOL' is 
> undefined
> orte/mca/smr/bproc/Makefile.am:47:   The usual way to define `LIBTOOL' is 
> to add `AC_PROG_LIBTOOL'
> orte/mca/smr/bproc/Makefile.am:47:   to `configure.ac' and run `aclocal' 
> and `autoconf' again.
> orte/mca/smr/bproc/Makefile.am:47:   If `AC_PROG_LIBTOOL' is in 
> `configure.ac', make sure
> orte/mca/smr/bproc/Makefile.am:47:   its definition is in aclocal's search 
> path.
> test/support/Makefile.am:29: library used but `RANLIB' is undefined
> test/support/Makefile.am:29:   The usual way to define `RANLIB' is to add 
> `AC_PROG_RANLIB'
> test/support/Makefile.am:29:   to `configure.ac' and run `autoconf' again.
>
> . and breaks.

I'm confused why aclocal (or are these automake errors?) is
getting invoked in "gmake all". Did you try running
"aclocal" and "autoconf" in the top-level directory? (You
shouldn't have to do that, but it might resolve this
problem.) Make sure "ranlib" is in your PATH, mine's at
/usr/ccs/bin/ranlib.

(Also, we don't have a sys/bproc.h file on our lab machine,
so the above might be an untested scenario.)

>
> If "gmake all" again we also see error messages like:
>
> *** Fortran 77 compiler
> checking for gfortran... gfortran
> checking whether we are using the GNU Fortran 77 compiler... yes
> checking whether gfortran accepts -g... yes
> checking if Fortran 77 compiler works... yes
> checking gfortran external symbol convention... ./configure: line 26340: 
> ./conftest.o: Permission denied
> ./configure: line 26342: ./conftest.o: Permission denied
> ./configure: line 26344: ./conftest.o: Permission denied
> ./configure: line 26346: ./conftest.o: Permission denied
> ./configure: line 26348: ./conftest.o: Permission denied
> configure: error: Could not determine Fortran naming convention.
>

We didn't test 1.2.8 with GCC/Solaris. Let me see if we can
reproduce this, and get back to you.

>
> Looking at the configure script, we see these lines in ./configure:
>
> if $NM conftest.o | grep foo_bar__ >/dev/null 2>&1 ; then
>   ompi_cv_f77_external_symbol="double underscore"
> elif $NM conftest.o | grep foo_bar_ >/dev/null 2>&1 ; then
> ompi_cv_f77_external_symbol="single underscore"
> elif $NM conftest.o | grep FOO_bar >/dev/null 2>&1 ; then
> ompi_cv_f77_external_symbol="mixed case"
> elif $NM conftest.o | grep foo_bar >/dev/null 2>&1 ; then
> ompi_cv_f77_external_symbol="no underscore"
> elif $NM conftest.o | grep FOO_BAR >/dev/null 2>&1 ; then
> ompi_cv_f77_external_symbol="upper case"
> else
> $NM conftest.o >conftest.out 2>&1
>
> and searching through ./configure tells us that $NM is never set 
> (neither in ./configure nor in our environment)
>

Is "nm" in your path? I have this in my config.log file:

  NM='/usr/ccs/bin/nm -p'

Thanks,
Ethan


>
> So, we think that something is not OK with the ./configure script.  Note 
> that we were able to install 1.2.5 and 1.2.6 some time ago on the same 
> boxes without problems.
>
> Or maybe we are doing something wrong?
>

> best regards,
> Paul Kapinos
> HPC Group RZ RWTH Aachen
>
> P.S. Folks, has somebody compiled OpenMPI 1.2.8 on some Solaris box 
> successfully?
>
>
> This file contains any messages produced by compilers while
> running configure, to aid debugging if configure makes a mistake.
> 
> It was created by Open MPI configure 1.2.8, which was
> generated by GNU Autoconf 2.61.  Invocation command line was
> 
>   $ ./configure --with-devel-headers CFLAGS=-O2 -m64 CXXFLAGS=-O2 -m64 
> FFLAGS=-O2 -m64 FCFLAGS=-O2 -m64 LDFLAGS=-O2 -m64 
> --prefix=/rwthfs/rz/SW/MPI/openmpi-1.2.8/solx8664/gcc CC=gcc CXX=g++ 
> FC=gfortran --enable-ltdl-convenience --no-create --no-recursion
> 
> ## - ##
> ## Platform. ##
> ## - ##
> 
> hostname = sunoc63.rz.RWTH-Aachen.DE
> uname -m = i86pc
> uname -r = 5.10
> uname -s = SunOS
> uname -v = Generic_137112-06
> 
> /usr/bin/uname -p = i386
> /bin/uname -X = System = SunOS
> Node = su

Re: [OMPI users] Debian MPI -- mpirun missing

2008-10-17 Thread Dirk Eddelbuettel

On 18 October 2008 at 03:30, Terry Frankcombe wrote:
| 
| > Well, when I use Open MPI I go with the new convention and call orterun
| > instead of mpirun; I think you should too.  Maybe a local alias in your
| > ~/.bashrc can do the trick.
| > 
| > Current packages do have mpirun.openmpi, but we were unable to devise a
| > bullet-proof scheme between lam, mpich and Open MPI for sharing / updating /
| > ... the alternatives links, as there are subtle differences that prevent us
| > from switching all these aliases consistently.
| 
| Eh?  Surely it's a simple case of conflict?  If you want multiple

It is not simple, or else we'd have done it. Trust us, several folks tried.

IIRC one of the issues was that among mpich, lam and Open MPI, the sets of
supplied and potentially conflicting apps (and their manual pages etc.) do
not overlap perfectly.

| packages providing similar functionality, it's up to you to specify how
| the user should choose which one they want to run.  Breaking any
| particular package (or all packages) seems like a particularly poor
| choice, but that's only my opinion.
| 
| I would argue that orterun is a very long way from a "new convention".
| I'd draw attention to section 8.8 of the MPI 2.1 standard.
| 
| But again, this is a discussion for the Debian list.

In particularly for the 'package Open MPI maintainers' list at

http://lists.alioth.debian.org/mailman/listinfo/pkg-openmpi-maintainers

so if you want to continue this discussion, please take it there.

We can also point you to a couple of discussion in the Debian bug tracking
system, for example

http://bugs.debian.org/452047

where Manuel actually goes through the motions.  If you think you have fixes
for this 'simple case of conflict', as you call it, do not hold back and tell
us, but please do so over on that list.

Thank you,  Dirk

-- 
Three out of two people have difficulties with fractions.


Re: [OMPI users] OpenMPI 1.2.8 on Solaris: configure problems

2008-10-17 Thread George Bosilca


On Oct 17, 2008, at 12:59 PM, Ethan Mallove wrote:


* --enable-ltdl-convenience
* --no-create
* --no-recursion
* GCC on Solaris


A user is not usually supposed to add these options. They are added by
default when the build system detects that one of the configure files
(configure.ac or one of the m4 files) has been modified and that a
regeneration of configure is required.


I have had such errors in the past. I figured out that they were generated
by a mismatch between the original version of the autotools (used to create
the first configure and the cache files) and the version used by the build
system when it had to rebuild configure.


If you use NFS, you should check that your NTP is doing what it is
supposed to do; a wrong time-stamp on one of the m4 files might be the
reason for this.
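George's timestamp point can be seen with a toy Makefile. This is a hypothetical mini-rule, not Open MPI's actual automake rebuild machinery: make fires the regeneration rule as soon as an m4 prerequisite carries a newer timestamp than configure, which NFS clock skew can produce spuriously.

```shell
dir=$(mktemp -d) && cd "$dir"
# Stand-in rule: "configure" depends on an m4 file.
printf 'configure: acinclude.m4\n\t@touch configure; echo REGENERATING\n' > Makefile
touch configure
sleep 1
touch acinclude.m4    # clock skew over NFS can make this look newer
make -s | tee /tmp/regen.txt
```

Because acinclude.m4 now looks newer than configure, make runs the rule and prints REGENERATING, exactly the "configure stage restarts" effect seen in Paul's build.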


  george.



[OMPI users] MPI_ERR_TRUNCATE

2008-10-17 Thread Nick Collier

Hi,

I'm getting an error I don't quite understand. The code:

MPI_Irecv(recv->data, recv->count, recv->datatype, recv->sender_id,
          recv->agent_type, MPI_COMM_WORLD, &recv->request);

...

recv = (AgentRequestRecv*) item->data;
MPI_Wait(&recv->request, &status);
receive_complete(process, recv);

And under some conditions, I get the error:

[3] [belafonte.home:04938] *** An error occurred in MPI_Wait
[3] [belafonte.home:04938] *** on communicator MPI_COMM_WORLD
[3] [belafonte.home:04938] *** MPI_ERR_TRUNCATE: message truncated
[3] [belafonte.home:04938] *** MPI_ERRORS_ARE_FATAL (goodbye)

When I do get the error, tracking the send and receive counts shows
them as equal.  And what I don't understand is that the
receive_complete function in the above executes, and the recv struct
actually contains the data that was sent.  So, I'm confused about the
error and what it's trying to tell me, as it looks like everything
worked OK.


This is on OSX 10.5.5 with OpenMPI 1.2.6.

thanks,

Nick



Re: [OMPI users] Debian MPI -- mpirun missing

2008-10-17 Thread Raymond Wan


Hi all,


Dirk Eddelbuettel wrote:

On 18 October 2008 at 03:30, Terry Frankcombe wrote:
| 
| But again, this is a discussion for the Debian list.


In particularly for the 'package Open MPI maintainers' list at

http://lists.alioth.debian.org/mailman/listinfo/pkg-openmpi-maintainers

so if you want to continue this discussion, please take it there.

Thanks a lot, Dirk; I'll take my Debian problems over to that list 
then!  I didn't realize that this had to be a Debian-specific problem; I 
know so little, I was even open to a response like, "No, there is no 
mpirun anymore".


Of course, if mpirun is just an alias to orterun, then I will just do 
that (use orterun instead).  The system administrator of one of the 
machines I'll use prefers to stick to Debian packages, despite their 
age, so unless I can find a good reason (serious security flaw), I guess 
doing this is far easier (politically) than installing from source.


Thank you all for your help!

Ray








[OMPI users] Bus Error in ompi_free_list_grow

2008-10-17 Thread Allen Barnett
Hi: A customer is running our parallel application on an SGI Altix
machine. They compiled OMPI 1.2.8 themselves. The Altix uses IB
interfaces and they recently upgraded to OFED 1.3 (in SGI Propack 6).
They are receiving a bus error in ompi_free_list_grow:

[r1i0n0:01321] *** Process received signal ***
[r1i0n0:01321] Signal: Bus error (7)
[r1i0n0:01321] Signal code:  (2)
[r1i0n0:01321] Failing at address: 0x2b04ba07c4a0
[r1i0n0:01321] [ 0] /lib64/libpthread.so.0 [0x2b04b00cfc00]
[r1i0n0:01321] [ 1] 
/usr/local/attila/severian-0.3.2-beta/lib/x86_64-Linux/libmpi.so.0(ompi_free_list_grow+0x14a)
 
[0x2b04af7dc058]
[r1i0n0:01321] [ 2] 
/usr/local/attila/severian-0.3.2-beta/lib/x86_64-Linux/openmpi/mca_btl_sm.so(mca_btl_sm_alloc+0x321)
 
[0x2b04b38c8e35]
[r1i0n0:01321] [ 3] 
/usr/local/attila/severian-0.3.2-beta/lib/x86_64-Linux/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x26d)
 
[0x2b04b3378f91]
[r1i0n0:01321] [ 4] 
/usr/local/attila/severian-0.3.2-beta/lib/x86_64-Linux/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x546)
 
[0x2b04b3370c7e]
[r1i0n0:01321] [ 5] 
/usr/local/attila/severian-0.3.2-beta/lib/x86_64-Linux/libmpi.so.0(MPI_Send+0x28)
 
[0x2b04af814098]

Here is some more information about the machine:

SGI Altix ICE 8200 cluster; each node has two quad core Xeons with 16GB
SUSE Linux Enterprise Server 10 Service Pack 2
GNU C Library stable release version 2.4 (20080421)
gcc (GCC) 4.1.2 20070115 (SUSE Linux)
SGI Propack 6 (just upgraded from Propack 5 SP3: changed from 
OFED 1.2 to 1.3)

The output from ompi_info is attached.

I would appreciate any help debugging this.

Thanks,
Allen

-- 
Allen Barnett
E-Mail: al...@transpireinc.com
Skype:  allenbarnett
Ph: 518-887-2930



ompi_info.txt.bz2
Description: application/bzip