Re: [OMPI users] Problems on large clusters

2011-06-27 Thread Thorsten Schuett
As I said, there seems to be a problem starting the app on all nodes. I am 
planning to run some tests with orte-ps, in the hope of getting some 
information about why the app didn't start.

Thorsten

On Saturday, June 25, 2011, Jeff Squyres wrote:
> Did this issue get resolved?  You might also want to look at our FAQ
> category for large clusters:
> 
> http://www.open-mpi.org/faq/?category=large-clusters
> 
> On Jun 22, 2011, at 9:43 AM, Thorsten Schuett wrote:
> > Thanks for the tip. I can't tell yet whether it helped or not. However,
> > with your settings I get the following warning:
> > WARNING: Open MPI will create a shared memory backing file in a
> > directory that appears to be mounted on a network filesystem.
> > 
> > I repeated the run with my settings and noticed that on at least one
> > node my app didn't come up. I can see an orted daemon on that node, but
> > no other process, and this was 30 minutes after the app started.
> > 
> > orted -mca ess tm -mca orte_ess_jobid 125894656 -mca orte_ess_vpid 63 -mc
> > a orte_ess_num_procs 255 --hnp-uri ...
> > 
> > Thorsten
> > 
> > On Wednesday, June 22, 2011, Gilbert Grosdidier wrote:
> >> Bonjour Thorsten,
> >> 
> >>  I'm not surprised about the cluster type, indeed,
> >> but I do not remember seeing the specific hang-up you mention.
> >> 
> >>  Anyway, I suspect SGI Altix is a little bit special for OpenMPI,
> >> and I usually run with the following setup:
> >> - you need to create a job-specific tmp area for each job,
> >> like "/scratch/ggg/uuu/run/tmp/pbs.${PBS_JOBID}"
> >> - then use something like this:
> >> 
> >> setenv TMPDIR "/scratch/ggg/uuu/run/tmp/pbs.${PBS_JOBID}"
> >> setenv OMPI_PREFIX_ENV "/scratch/ggg/uuu/run/tmp/pbs.${PBS_JOBID}"
> >> setenv OMPI_MCA_mpi_leave_pinned_pipeline 1
> >> 
> >> - then, for running: many of these -mca options are probably useless
> >> for your app, while others may prove useful. Adapt them to your own
> >> needs ...
> >> 
> >> mpiexec -mca coll_tuned_use_dynamic_rules 1 -hostfile $PBS_NODEFILE -
> >> mca rmaps seq -mca btl_openib_rdma_pipeline_send_length 65536 -mca
> >> btl_openib_rdma_pipeline_frag_size 65536 -mca
> >> btl_openib_min_rdma_pipeline_size 65536 -mca
> >> btl_self_rdma_pipeline_send_length 262144 -mca
> >> btl_self_rdma_pipeline_frag_size 262144 -mca plm_rsh_num_concurrent
> >> 4096 -mca mpi_paffinity_alone 1 -mca mpi_leave_pinned_pipeline 1 -mca
> >> btl_sm_max_send_size 128 -mca
> >> coll_tuned_pre_allocate_memory_comm_size_limit 1048576 -mca
> >> btl_openib_cq_size 128 -mca btl_ofud_rd_num 128 -mca
> >> mpi_preconnect_mpi 0 -mca mpool_sm_min_size 131072 -mca btl
> >> sm,openib,self -mca btl_openib_want_fork_support 0 -mca
> >> opal_set_max_sys_limits 1 -mca osc_pt2pt_no_locks 1 -mca
> >> osc_rdma_no_locks 1 YOUR_APP
> >> 
> >>  (Careful: this must all be on a single line ...)
> >>  
> >>  This should be suitable for up to 8k cores.
> >>  
> >>  
> >>  HTH,   Best,G.
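Gilbert's setenv lines above are csh; a rough bash equivalent, with a made-up PBS_JOBID for illustration (on a real PBS system the scheduler exports it), could look like this:

```shell
#!/usr/bin/env bash
# Bash version of the per-job scratch setup above.
# PBS exports PBS_JOBID on a real system; a fake value is used here
# so the sketch runs anywhere.
PBS_JOBID=${PBS_JOBID:-123456.fakehost}

# Job-private tmp area, mirroring "/scratch/ggg/uuu/run/tmp/pbs.${PBS_JOBID}"
JOBTMP="$PWD/scratch-demo/tmp/pbs.${PBS_JOBID}"
mkdir -p "$JOBTMP"

export TMPDIR="$JOBTMP"
export OMPI_PREFIX_ENV="$JOBTMP"
export OMPI_MCA_mpi_leave_pinned_pipeline=1

echo "per-job tmp area: $TMPDIR"
```

The point is that Open MPI's shared-memory backing files land in a job-private local directory rather than a network-mounted one, which is exactly what the warning Thorsten quoted complains about.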
> >> 
> >> On 22 June 2011, at 09:13, Thorsten Schuett wrote:
> >>> Sure. It's an SGI ICE cluster with dual-rail IB. The HCAs are Mellanox
> >>> ConnectX IB DDR.
> >>> 
> >>> This is a 2040 cores job. I use 255 nodes with one MPI task on each
> >>> node and
> >>> use 8-way OpenMP.
> >>> 
> >>> I don't need -np and -machinefile, because mpiexec picks up this
> >>> information
> >>> from PBS.
> >>> 
> >>> Thorsten
> >>> 
> >>> On Tuesday, June 21, 2011, Gilbert Grosdidier wrote:
>  Bonjour Thorsten,
>  
>  Could you please be a little bit more specific about the cluster itself?
>  
>  G.
>  
>  On 21 June 2011, at 17:46, Thorsten Schuett wrote:
> > Hi,
> > 
> > I am running openmpi 1.5.3 on an IB cluster and I have problems
> > starting jobs
> > on larger node counts. With small numbers of tasks, it usually
> > works. But now
> > the startup failed three times in a row using 255 nodes. I am using
> > 255 nodes
> > with one MPI task per node and the mpiexec looks as follows:
> > 
> > mpiexec --mca btl self,openib --mca mpi_leave_pinned 0 ./a.out
> > 
> > After ten minutes, I pulled a stack trace on all nodes and killed
> > the job, because there was no progress. Below you will find the
> > stack trace generated with gdb's "thread apply all bt". The backtrace
> > looks basically the same on all nodes. It seems to hang in MPI_Init.
> > 
> > Any help is appreciated,
> > 
> > Thorsten
> > 
> > Thread 3 (Thread 46914544122176 (LWP 28979)):
> > #0  0x2b6ee912d9a2 in select () from /lib64/libc.so.6
> > #1  0x2b6eeabd928d in service_thread_start (context=<value optimized out>)
> > at btl_openib_fd.c:427
> > #2  0x2b6ee835e143 in start_thread () from /lib64/
> > libpthread.so.0
> > #3  0x2b6ee9133b8d in clone () from /lib64/libc.so.6
> > #4  0x in ?? ()
> > 
> > Thread 2 (Thread 46916594338112 (LWP 28980)):

Re: [OMPI users] Building OpenMPI v. 1.4.3 in VS2008

2011-06-27 Thread Shiqing Fan

Hi Alan,



Thanks Shiqing,

It turns out that I was able to get the build to work in VS2008 by 
stepping back to CMake 2.6 to build it.  Not sure why that did the 
trick, but I'm not complaining...




Which version did you use before?

Another question:  Is it possible to run an OpenMPI-based executable 
on a single process, without invoking mpiexec?  Say for example that I 
compiled a simple "Hello world" program such as 
src/examples/hello_cxx.cc that calls MPI_Init(...).  I'd like to start 
it up (as a serial process) by just calling it from the command line, 
without reference to mpiexec.  Can this be done?  If so, could you 
provide an example syntax?  I'm used to doing this with MPICH2, where 
it works with no problem.




Yes, just run your compiled hello_cxx.exe from the command line prompt; 
nothing special is needed.



Regards,
Shiqing


Thanks,

Alan Nichols

AWR - STAAR

11520 N. Port Washington Rd.

Mequon, WI 53092

P: 1.262.240.0291 x 103

F: 1.262.240.0294

E: anich...@awrcorp.com 

http://www.awrcorp.com 

From: Shiqing Fan [mailto:f...@hlrs.de]
Sent: Tuesday, June 21, 2011 3:10 AM
To: Alan Nichols
Cc: Open MPI Users
Subject: Re: [OMPI users] Building OpenMPI v. 1.4.3 in VS2008


Hi Alan,

I was able to test it again on a machine that has VS2008 installed, 
and everything worked just fine for me. I looked into the generated 
config file (build_dir/opal/include/opal_config.h): the CMake build 
system didn't find stdint.h, but it still compiled.


So it was probably some other issue on your platform. It would be 
very helpful for figuring out the problem if you could provide more 
information, e.g. the configure log, compilation error messages, and so on.



Regards,
Shiqing

On 2011-06-10 8:34 PM, Alan Nichols wrote:

Hi Shiqing,

OK, I'll give this a try... however, after some Google searching in the 
aftermath of my previous attempt to build on VS2008, I realized that the 
file I'm missing on that platform is shipped with VS2010.


So I suspect that building on VS2010 will go smoothly, as you said. My 
problem is that my current effort is part of a much larger project 
that is built on VS2008. On the one hand, I don't want to shift that 
larger code base from VS2008 to VS2010 (and fight the numerous 
problems that always follow an upheaval of that sort); on the other 
hand, I'm dubious about building my parallel support library on VS2010 
and the rest of the code on VS2008.


Is there a way to do what I really want to do, which is build the 
openmpi source on VS2008?


Alan Nichols


From: Shiqing Fan [mailto:f...@hlrs.de]
Sent: Thursday, June 09, 2011 6:43 PM
To: Open MPI Users
Cc: Alan Nichols
Subject: Re: [OMPI users] Building OpenMPI v. 1.4.3 in VS2008


Hi Alan,

It looks like a problem of using the wrong generator in the CMake GUI. I 
double-checked with a freshly downloaded 1.4.3 on my Win7 machine with 
VS2010, and everything worked well.


Please check that:
1.  a proper CMake generator is used;
2.  the CMAKE_BUILD_TYPE in the CMake GUI and the build type in VS are 
both Release.


If the error still happens, please send me the file name and line 
number where the error is triggered.


Regards,
Shiqing

On 2011-06-07 5:37 PM, Alan Nichols wrote:

Hello,

I'm currently trying to build OpenMPI v. 1.4.3 from source, in 
VS2008.  The platform is Win7 with SP1 installed. (I realize that this is 
possibly not an ideal approach, as v. 1.5.3 has installers for Windows 
binaries.  However, for compatibility with other programs I need to use 
v. 1.4.3 if at all possible; also, as I have many other libraries 
built under VS2008, I need to use the VS2008 compiler if at all possible.)


Following the README.WINDOWS file, I used CMake to build a 
Windows .sln file.  I accepted the default CMake settings, with the 
exception that I only created a Release build of OpenMPI.  On my 
first attempt to build the solution, I got an error about a missing 
file, stdint.h.  I was able to fix this by including the stdint.h from 
VS2010.  However, I now get new errors referencing:


__attribute__((__always_inline__))

__asm__ __volatile__("": : :"memory")

These look to me like Linux-specific constructs -- is it even possible 
to do what I'm attempting, or are the code base and compiler 
fundamentally at odds here?  If it is possible, can you explain where 
my error lies?


Thanks for your help,

Alan Nichols

  
  
___

users mailing list
us...@open-mpi.org  
http://www.open-mpi.org/mailman/listinfo.cgi/users





--
---
Shiqing Fan
High Performance Computing Center Stuttgart (HLRS)
Tel: ++49(0)

[OMPI users] OpenMPI with NAG compiler and gcc 4.6

2011-06-27 Thread Ning Li

Hello,

I built OpenMPI 1.5.3 using the NAG compiler v5.2 on a new system running 
Fedora 15 (with gcc 4.6). OpenMPI builds successfully, but when I 
compile a Fortran MPI application I get an error at the link stage:


gcc: error: unrecognized option '--export-dynamic'

Note that the NAG Fortran compiler generates intermediate C code and 
actually calls gcc to build the application.


The GCC 4.6 release notes contain the following: "GCC now has stricter 
checks for invalid command-line options. In particular, when gcc was 
called to link object files rather than compile source code, it would 
previously accept and ignore all options starting with --, including 
linker options such as --as-needed and --export-dynamic, although 
such options would result in errors if any source code was compiled. 
Such options, if unknown to the compiler, are now rejected in all cases; 
if the intent was to pass them to the linker, options such as 
-Wl,--as-needed should be used."


My next step was to track down where the illegal syntax was generated, 
using the '-showme' option provided by the OpenMPI compiler wrapper and 
the '-dryrun' option provided by the NAG compiler.


[lining@combe pi]$ /home/lining/software/openmpi/1.5.3/nag/bin/mpif90 
--showme pi.f90 -o pi.exe_nag
nagfor pi.f90 -o pi.exe_nag 
-I/home/lining/software/openmpi/1.5.3/nag/include -pthread 
-I/home/lining/software/openmpi/1.5.3/nag/lib 
-L/home/lining/software/openmpi/1.5.3/nag/lib -lmpi_f90 -lmpi_f77 -lmpi 
-lnsl -lutil -lm -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl


[lining@combe pi]$ /home/lining/software/openmpi/1.5.3/nag/bin/mpif90 
-dryrun pi.f90 -o tpi.exe_nag

NAG Fortran Compiler Release 5.2(721)
Option warning: Unrecognised option -pthread passed to loader
/home/lining/software/NAG_Fortran/lib/forcomp -checkversion 5.2 721 
-I/home/lining/software/openmpi/1.5.3/nag/include 
-I/home/lining/software/openmpi/1.5.3/nag/lib -library 
/home/lining/software/NAG_Fortran/lib -o /tmp/pi.02.c pi.f90
/usr/bin/gcc -I/home/lining/software/NAG_Fortran/lib -c -DANSI_C 
-DINT64=long long -funsigned-char -march=i686 -Wno-pointer-sign -o pi.o 
/tmp/pi.02.c
/usr/bin/gcc -o pi.exe_nag 
/home/lining/software/NAG_Fortran/lib/quickfit.o pi.o -pthread 
-L/home/lining/software/openmpi/1.5.3/nag/lib -lmpi_f90 -lmpi_f77 -lmpi 
-lnsl -lutil -lm -ldl -lnsl -lutil -lm -ldl 
-Wl,-rpath,/home/lining/software/NAG_Fortran/lib 
/home/lining/software/NAG_Fortran/lib/libf52.so 
/home/lining/software/NAG_Fortran/lib/libf52.a -lm --export-dynamic


So OpenMPI generates the '-Wl,--export-dynamic' flag. When this is 
passed to the NAG compiler, it interprets it as "pass the 
'--export-dynamic' flag to the linker (gcc)" (which I believe is the 
correct behaviour). But gcc 4.6 expects to see '-Wl,--export-dynamic'.


My temporary solution, as supplied by the NAG compiler developers, is to 
edit share/openmpi/*-wrapper-data.txt and put the flag 
'-Wl,-Wl,,--export-dynamic' there.
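That edit can be sketched as a one-liner. The file below is a fabricated one-line stand-in for share/openmpi/mpif90-wrapper-data.txt (the real file has many more keys); the sed pattern applies the double-wrapping NAG suggested, so that nagfor's unwrapping of one -Wl, layer still leaves a -Wl,--export-dynamic for gcc:

```shell
# Fabricated minimal stand-in for share/openmpi/mpif90-wrapper-data.txt;
# only the linker flags line matters for this demonstration.
cat > mpif90-wrapper-data.txt <<'EOF'
linker_flags=-Wl,--export-dynamic
EOF

# Double-wrap the option: nagfor strips one -Wl, layer when handing the
# flags to gcc, leaving gcc with the -Wl,--export-dynamic it expects.
sed 's/-Wl,--export-dynamic/-Wl,-Wl,,--export-dynamic/' \
    mpif90-wrapper-data.txt > mpif90-wrapper-data.txt.new \
    && mv mpif90-wrapper-data.txt.new mpif90-wrapper-data.txt

cat mpif90-wrapper-data.txt
# prints: linker_flags=-Wl,-Wl,,--export-dynamic
```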


Ning

--
Ning Li
Technical Consultant
Numerical Algorithms Group




The Numerical Algorithms Group Ltd is a company registered in England
and Wales with company number 1249803. The registered office is:
Wilkinson House, Jordan Hill Road, Oxford OX2 8DR, United Kingdom.



Re: [OMPI users] File seeking with shared filepointer issues

2011-06-27 Thread pascal . deveze

Christian,

Suppose you have N processes calling the first MPI_File_get_position_shared().

Some of them run faster and may execute the call to
MPI_File_seek_shared() before all the others have got their file position.
(Note that a "collective" primitive is not a synchronization. In this
case, all parameters are broadcast to process 0 and checked by process 0;
the other processes are not blocked.)

So the slow processes can get a file position that has just been
modified by the faster ones.

That is why, in your program, it is necessary to synchronize all
processes just before the call to MPI_File_seek_shared().

Pascal

users-boun...@open-mpi.org wrote on 25/06/2011 12:54:32:

> From: Jeff Squyres 
> To: Open MPI Users 
> Date: 25/06/2011 12:55
> Subject: Re: [OMPI users] File seeking with shared filepointer issues
> Sent by: users-boun...@open-mpi.org
>
> I'm not super-familiar with the IO portions of MPI, but I think that
> you might be running afoul of the definition of "collective."
> "Collective," in MPI terms, does *not* mean "synchronize."  It just
> means that all processes must invoke it, potentially with the same
> (or similar) parameters.
>
> Hence, I think you're seeing cases where MPI processes are showing
> correct values, but only because the updates have not completed in
> the background.  Using a barrier is forcing those updates to
> complete before you query for the file position.
>
> ...although, as I type that out, that seems weird.  A barrier should
> not (be guaranteed to) force the completion of collectives (file-
> based or otherwise).  That could be a side-effect of linear message
> passing behind the scenes, but that seems like a weird interface.
>
> Rob -- can you comment on this, perchance?  Is this a bug in ROMIO,
> or if not, how is one supposed to use this interface to get
> consistent answers in all MPI processes?
>
>
> On Jun 23, 2011, at 10:04 AM, Christian Anonymous wrote:
>
> > I'm having some issues with MPI_File_seek_shared. Consider the
> following small test C++ program
> >
> >
> > #include <mpi.h>
> > #include <iostream>
> >
> >
> > #define PATH "simdata.bin"
> >
> > using namespace std;
> >
> > int ThisTask;
> >
> > int main(int argc, char *argv[])
> > {
> > MPI_Init(&argc,&argv); /* Initialize MPI */
> > MPI_Comm_rank(MPI_COMM_WORLD,&ThisTask);
> >
> > MPI_File fh;
> > int success;
> > success = MPI_File_open(MPI_COMM_WORLD,(char *)
> PATH,MPI_MODE_RDONLY,MPI_INFO_NULL,&fh);
> >
> > if(success != MPI_SUCCESS){ // Successful open?
> > char err[256];
> > int err_length, err_class;
> >
> > MPI_Error_class(success,&err_class);
> > MPI_Error_string(err_class,err,&err_length);
> > cout << "Task " << ThisTask << ": " << err << endl;
> > MPI_Error_string(success,err,&err_length);
> > cout << "Task " << ThisTask << ": " << err << endl;
> >
> > MPI_Abort(MPI_COMM_WORLD,success);
> > }
> >
> >
> > /* START SEEK TEST */
> > MPI_Offset cur_filepos, eof_filepos;
> >
> > MPI_File_get_position_shared(fh,&cur_filepos);
> >
> > //MPI_Barrier(MPI_COMM_WORLD);
> > MPI_File_seek_shared(fh,0,MPI_SEEK_END); /* Seek is collective */
> >
> > MPI_File_get_position_shared(fh,&eof_filepos);
> >
> > //MPI_Barrier(MPI_COMM_WORLD);
> > MPI_File_seek_shared(fh,0,MPI_SEEK_SET);
> >
> > cout << "Task " << ThisTask << " reports a filesize of " << eof_filepos
> > << "-" << cur_filepos << "=" << eof_filepos-cur_filepos << endl;
> > /* END SEEK TEST */
> >
> > /* Finalizing */
> > MPI_File_close(&fh);
> > MPI_Finalize();
> > return 0;
> > }
> >
> > Note the comments before each MPI_Barrier. When the program is run
> by mpirun -np N (N strictly greater than 1), task 0 reports the
> correct filesize, while every other process reports either 0, minus
> the filesize, or the correct filesize. Uncommenting the MPI_Barrier calls
> makes each process report the correct filesize. Is this working as
> intended? Since MPI_File_seek_shared is a collective, blocking
> function, each process has to synchronize at the return point of the
> function, but not when the function is called. It seems that the use
> of MPI_File_seek_shared without an MPI_Barrier call first is very
> dangerous, or am I missing something?
> >
> >
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>






[OMPI users] Problems with Mpi Accept - ORTE_ERROR_LOG

2011-06-27 Thread Rodrigo Oliveira
Hi there.

I am developing a server/client application using Open MPI 1.5.3. At one
point in the server code I open a port to receive connections from a
client. After that, I call the function MPI_Comm_accept, and on the
client side I call MPI_Comm_connect. Sometimes I get an
ORTE_ERROR_LOG, as shown below.

before accept in host hydra9 port name =
4108386304.0;tcp://150.164.3.204:48761;tcp://192.168.63.9:48761+4108386305.0tcp://150.164.3.204:49211;tcp://192.168.63.9:49211:300
[hydra9:11199] [[62689,1],0] ORTE_ERROR_LOG: Not found in file
base/grpcomm_base_allgather.c at line 220
[hydra9:11199] [[62689,1],0] ORTE_ERROR_LOG: Not found in file
base/grpcomm_base_modex.c at line 116
[hydra9:11199] [[62689,1],0] ORTE_ERROR_LOG: Not found in file
grpcomm_bad_module.c at line 608
[hydra9:11199] [[62689,1],0] ORTE_ERROR_LOG: Not found in file
dpm_orte.c at line 379
MPI 2 C++ exception throwing is disabled, MPI::mpi_errno has the error
code
after accept in host hydra9 error code = 17
MPI 2 C++ exception throwing is disabled, MPI::mpi_errno has the error code

The mpi_errno is 17 and I could not find a clear explanation of
this error. It occurs sporadically: sometimes the application works,
sometimes it does not.


Any ideas?

Thanks