Sure. It's an SGI ICE cluster with dual-rail IB. The HCAs are Mellanox ConnectX IB DDR.
This is a 2,040-core job. I use 255 nodes with one MPI task on each node and 8-way OpenMP. I don't need -np and -machinefile, because mpiexec picks up this information from PBS.

Thorsten

On Tuesday, June 21, 2011, Gilbert Grosdidier wrote:
> Bonjour Thorsten,
>
> Could you please be a little bit more specific about the cluster itself?
>
> G.
>
> Le 21 juin 11 à 17:46, Thorsten Schuett a écrit :
> > Hi,
> >
> > I am running Open MPI 1.5.3 on an IB cluster and I have problems
> > starting jobs on larger node counts. With small numbers of tasks, it
> > usually works. But now the startup failed three times in a row using
> > 255 nodes. I am using 255 nodes with one MPI task per node and the
> > mpiexec looks as follows:
> >
> > mpiexec --mca btl self,openib --mca mpi_leave_pinned 0 ./a.out
> >
> > After ten minutes, I pulled a stack trace on all nodes and killed the
> > job, because there was no progress. In the following, you will find
> > the stack trace generated with gdb "thread apply all bt". The
> > backtrace looks basically the same on all nodes. It seems to hang in
> > MPI_Init.
> >
> > Any help is appreciated,
> >
> > Thorsten
> >
> > Thread 3 (Thread 46914544122176 (LWP 28979)):
> > #0  0x00002b6ee912d9a2 in select () from /lib64/libc.so.6
> > #1  0x00002b6eeabd928d in service_thread_start (context=<value optimized out>) at btl_openib_fd.c:427
> > #2  0x00002b6ee835e143 in start_thread () from /lib64/libpthread.so.0
> > #3  0x00002b6ee9133b8d in clone () from /lib64/libc.so.6
> > #4  0x0000000000000000 in ?? ()
> >
> > Thread 2 (Thread 46916594338112 (LWP 28980)):
> > #0  0x00002b6ee912b8b6 in poll () from /lib64/libc.so.6
> > #1  0x00002b6eeabd7b8a in btl_openib_async_thread (async=<value optimized out>) at btl_openib_async.c:419
> > #2  0x00002b6ee835e143 in start_thread () from /lib64/libpthread.so.0
> > #3  0x00002b6ee9133b8d in clone () from /lib64/libc.so.6
> > #4  0x0000000000000000 in ?? ()
> >
> > Thread 1 (Thread 47755361533088 (LWP 28978)):
> > #0  0x00002b6ee9133fa8 in epoll_wait () from /lib64/libc.so.6
> > #1  0x00002b6ee87745db in epoll_dispatch (base=0xb79050, arg=0xb558c0, tv=<value optimized out>) at epoll.c:215
> > #2  0x00002b6ee8773309 in opal_event_base_loop (base=0xb79050, flags=<value optimized out>) at event.c:838
> > #3  0x00002b6ee875ee92 in opal_progress () at runtime/opal_progress.c:189
> > #4  0x0000000039f00001 in ?? ()
> > #5  0x00002b6ee87979c9 in std::ios_base::Init::~Init () at ../../.././libstdc++-v3/src/ios_init.cc:123
> > #6  0x00007fffc32c8cc8 in ?? ()
> > #7  0x00002b6ee9d20955 in orte_grpcomm_bad_get_proc_attr (proc=<value optimized out>, attribute_name=0x2b6ee88e5780 " \020322351n+", val=0x2b6ee875ee92, size=0x7fffc32c8cd0) at grpcomm_bad_module.c:500
> > #8  0x00002b6ee86dd511 in ompi_modex_recv_key_value (key=<value optimized out>, source_proc=<value optimized out>, value=0xbb3a00, dtype=14 '\016') at runtime/ompi_module_exchange.c:125
> > #9  0x00002b6ee86d7ea1 in ompi_proc_set_arch () at proc/proc.c:154
> > #10 0x00002b6ee86db1b0 in ompi_mpi_init (argc=15, argv=0x7fffc32c92f8, requested=<value optimized out>, provided=0x7fffc32c917c) at runtime/ompi_mpi_init.c:699
> > #11 0x00007fffc32c8e88 in ?? ()
> > #12 0x00002b6ee77f8348 in ?? ()
> > #13 0x00007fffc32c8e60 in ?? ()
> > #14 0x00007fffc32c8e20 in ?? ()
> > #15 0x0000000009efa994 in ?? ()
> > #16 0x0000000000000000 in ?? ()
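[Editor's note] For reference, a launch along the lines described above (255 nodes, one MPI rank per node, 8 OpenMP threads per rank, with mpiexec taking the host list from PBS) would typically look roughly like the sketch below. The resource-request directives and walltime are site- and PBS-flavor-dependent assumptions; only the mpiexec options come from the thread.

    #!/bin/bash
    # Hypothetical PBS Pro request: 255 chunks, 1 MPI rank and 8 cores each
    # (adjust to the site's queueing conventions).
    #PBS -l select=255:ncpus=8:mpiprocs=1
    #PBS -l walltime=01:00:00

    cd "$PBS_O_WORKDIR"

    # 8-way OpenMP inside each single-rank node.
    export OMP_NUM_THREADS=8

    # No -np/-machinefile needed: mpiexec reads the allocation from PBS.
    mpiexec --mca btl self,openib --mca mpi_leave_pinned 0 ./a.out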
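[Editor's note] A backtrace like the one quoted above can be captured non-interactively by attaching gdb in batch mode to the hung process on each node. The sketch below is one way to do it; the binary name a.out matches the thread, while the use of pgrep and the per-node output file are assumptions for illustration.

    #!/bin/bash
    # Attach to the newest a.out process owned by the current user on this
    # node and dump all thread backtraces without interactive prompts.
    pid=$(pgrep -u "$USER" -n a.out)
    gdb -p "$pid" -batch -ex "thread apply all bt" > "bt.$(hostname).txt" 2>&1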