Did this issue get resolved?  You might also want to look at our FAQ category for large clusters:

    http://www.open-mpi.org/faq/?category=large-clusters

On Jun 22, 2011, at 9:43 AM, Thorsten Schuett wrote:

> Thanks for the tip. I can't tell yet whether it helped or not. However, with
> your settings I get the following warning:
>
>   WARNING: Open MPI will create a shared memory backing file in a
>   directory that appears to be mounted on a network filesystem.
>
> I repeated the run with my settings and I noticed that on at least one node
> my app didn't come up. I can see an orted daemon on this node, but no other
> process. And this was 30 minutes after the app started.
>
>   orted -mca ess tm -mca orte_ess_jobid 125894656 -mca orte_ess_vpid 63 -mca orte_ess_num_procs 255 --hnp-uri ...
>
> Thorsten
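That warning generally means that the directory Open MPI picked for its session files (which is where the shared memory backing file lives) is on a network-mounted filesystem; it is a shared-memory performance concern rather than a correctness one. If your compute nodes have local scratch space, one way around it is to point the session directory somewhere node-local. A minimal sketch, assuming the nodes have a local /tmp (the path is an assumption; use whatever node-local filesystem you actually have):

    # Put the per-job session directory (and the sm backing file with it)
    # on node-local storage; keep the rest of your usual options as they are.
    mpiexec --mca orte_tmpdir_base /tmp ./a.out

The same thing can be set in a job script through the OMPI_MCA_orte_tmpdir_base environment variable, if that is more convenient.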
> On Wednesday, June 22, 2011, Gilbert Grosdidier wrote:
>> Bonjour Thorsten,
>>
>> I'm not surprised about the cluster type, indeed,
>> but I don't remember ever seeing the specific hang you mention.
>>
>> Anyway, I suspect SGI Altix is a little bit special for Open MPI,
>> and I usually run with the following setup:
>>
>> - you need to create a job-specific tmp area for each job,
>>   like "/scratch/ggg/uuu/run/tmp/pbs.${PBS_JOBID}"
>>
>> - then use something like this:
>>
>>   setenv TMPDIR "/scratch/ggg/uuu/run/tmp/pbs.${PBS_JOBID}"
>>   setenv OMPI_PREFIX_ENV "/scratch/ggg/uuu/run/tmp/pbs.${PBS_JOBID}"
>>   setenv OMPI_MCA_mpi_leave_pinned_pipeline 1
>>
>> - then, for running: many of these -mca options are probably useless for
>>   your app, while others may turn out to be useful. Pick your own mix:
>>
>>   mpiexec -mca coll_tuned_use_dynamic_rules 1 -hostfile $PBS_NODEFILE -mca rmaps seq -mca btl_openib_rdma_pipeline_send_length 65536 -mca btl_openib_rdma_pipeline_frag_size 65536 -mca btl_openib_min_rdma_pipeline_size 65536 -mca btl_self_rdma_pipeline_send_length 262144 -mca btl_self_rdma_pipeline_frag_size 262144 -mca plm_rsh_num_concurrent 4096 -mca mpi_paffinity_alone 1 -mca mpi_leave_pinned_pipeline 1 -mca btl_sm_max_send_size 128 -mca coll_tuned_pre_allocate_memory_comm_size_limit 1048576 -mca btl_openib_cq_size 128 -mca btl_ofud_rd_num 128 -mca mpi_preconnect_mpi 0 -mca mpool_sm_min_size 131072 -mca btl sm,openib,self -mca btl_openib_want_fork_support 0 -mca opal_set_max_sys_limits 1 -mca osc_pt2pt_no_locks 1 -mca osc_rdma_no_locks 1 YOUR_APP
>>
>> (Careful: that mpiexec is all one single command line.)
>>
>> This should be suitable for up to 8k cores.
>>
>> HTH, Best, G.
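A side note on that command line: the same settings can also live in a per-user MCA parameter file, which keeps the mpiexec invocation short and avoids the one-very-long-line problem. A minimal sketch, transcribing just a few of the options above into $HOME/.openmpi/mca-params.conf (whether each one actually helps is app-dependent, so treat this as a starting point rather than a recipe):

    # $HOME/.openmpi/mca-params.conf: read automatically by mpiexec.
    # One "name = value" pair per line, same names as the -mca options above.
    btl = sm,openib,self
    mpi_paffinity_alone = 1
    mpi_leave_pinned_pipeline = 1
    coll_tuned_use_dynamic_rules = 1
    opal_set_max_sys_limits = 1

Anything given on the command line with -mca, or via an OMPI_MCA_* environment variable, still overrides what is in the file.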
>> On Jun 22, 2011, at 09:13, Thorsten Schuett wrote:
>>> Sure. It's an SGI ICE cluster with dual-rail IB. The HCAs are Mellanox
>>> ConnectX IB DDR.
>>>
>>> This is a 2040-core job. I use 255 nodes with one MPI task on each node
>>> and use 8-way OpenMP.
>>>
>>> I don't need -np and -machinefile, because mpiexec picks up this
>>> information from PBS.
>>>
>>> Thorsten
>>>
>>> On Tuesday, June 21, 2011, Gilbert Grosdidier wrote:
>>>> Bonjour Thorsten,
>>>>
>>>> Could you please be a little bit more specific about the cluster itself?
>>>>
>>>> G.
>>>>
>>>> On Jun 21, 2011, at 17:46, Thorsten Schuett wrote:
>>>>> Hi,
>>>>>
>>>>> I am running Open MPI 1.5.3 on an IB cluster and I have problems
>>>>> starting jobs on larger node counts. With small numbers of tasks it
>>>>> usually works, but now the startup has failed three times in a row
>>>>> using 255 nodes. I am using one MPI task per node, and the mpiexec
>>>>> invocation looks as follows:
>>>>>
>>>>>   mpiexec --mca btl self,openib --mca mpi_leave_pinned 0 ./a.out
>>>>>
>>>>> After ten minutes I pulled a stack trace on all nodes and killed the
>>>>> job, because there was no progress. Below you will find the stack
>>>>> trace, generated with gdb's "thread apply all bt". The backtrace looks
>>>>> basically the same on all nodes. It seems to hang in MPI_Init.
>>>>>
>>>>> Any help is appreciated,
>>>>>
>>>>> Thorsten
>>>>>
>>>>> Thread 3 (Thread 46914544122176 (LWP 28979)):
>>>>> #0  0x00002b6ee912d9a2 in select () from /lib64/libc.so.6
>>>>> #1  0x00002b6eeabd928d in service_thread_start (context=<value optimized out>) at btl_openib_fd.c:427
>>>>> #2  0x00002b6ee835e143 in start_thread () from /lib64/libpthread.so.0
>>>>> #3  0x00002b6ee9133b8d in clone () from /lib64/libc.so.6
>>>>> #4  0x0000000000000000 in ?? ()
>>>>>
>>>>> Thread 2 (Thread 46916594338112 (LWP 28980)):
>>>>> #0  0x00002b6ee912b8b6 in poll () from /lib64/libc.so.6
>>>>> #1  0x00002b6eeabd7b8a in btl_openib_async_thread (async=<value optimized out>) at btl_openib_async.c:419
>>>>> #2  0x00002b6ee835e143 in start_thread () from /lib64/libpthread.so.0
>>>>> #3  0x00002b6ee9133b8d in clone () from /lib64/libc.so.6
>>>>> #4  0x0000000000000000 in ?? ()
>>>>>
>>>>> Thread 1 (Thread 47755361533088 (LWP 28978)):
>>>>> #0  0x00002b6ee9133fa8 in epoll_wait () from /lib64/libc.so.6
>>>>> #1  0x00002b6ee87745db in epoll_dispatch (base=0xb79050, arg=0xb558c0, tv=<value optimized out>) at epoll.c:215
>>>>> #2  0x00002b6ee8773309 in opal_event_base_loop (base=0xb79050, flags=<value optimized out>) at event.c:838
>>>>> #3  0x00002b6ee875ee92 in opal_progress () at runtime/opal_progress.c:189
>>>>> #4  0x0000000039f00001 in ?? ()
>>>>> #5  0x00002b6ee87979c9 in std::ios_base::Init::~Init () at ../../.././libstdc++-v3/src/ios_init.cc:123
>>>>> #6  0x00007fffc32c8cc8 in ?? ()
>>>>> #7  0x00002b6ee9d20955 in orte_grpcomm_bad_get_proc_attr (proc=<value optimized out>, attribute_name=0x2b6ee88e5780 " \020322351n+", val=0x2b6ee875ee92, size=0x7fffc32c8cd0) at grpcomm_bad_module.c:500
>>>>> #8  0x00002b6ee86dd511 in ompi_modex_recv_key_value (key=<value optimized out>, source_proc=<value optimized out>, value=0xbb3a00, dtype=14 '\016') at runtime/ompi_module_exchange.c:125
>>>>> #9  0x00002b6ee86d7ea1 in ompi_proc_set_arch () at proc/proc.c:154
>>>>> #10 0x00002b6ee86db1b0 in ompi_mpi_init (argc=15, argv=0x7fffc32c92f8, requested=<value optimized out>, provided=0x7fffc32c917c) at runtime/ompi_mpi_init.c:699
>>>>> #11 0x00007fffc32c8e88 in ?? ()
>>>>> #12 0x00002b6ee77f8348 in ?? ()
>>>>> #13 0x00007fffc32c8e60 in ?? ()
>>>>> #14 0x00007fffc32c8e20 in ?? ()
>>>>> #15 0x0000000009efa994 in ?? ()
>>>>> #16 0x0000000000000000 in ?? ()
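For what it's worth, when a job hangs like this across a couple of hundred nodes, it can help to script the backtrace collection rather than attaching gdb by hand. A rough sketch, assuming the job is still running, the binary is called a.out, and you can ssh to the compute nodes listed in $PBS_NODEFILE (all of which are assumptions; adapt to your site):

    # Collect "thread apply all bt" from the a.out process on every node,
    # one output file per node, all in parallel.
    for node in $(sort -u $PBS_NODEFILE); do
        ssh $node 'gdb -batch -ex "thread apply all bt" -p $(pgrep -u $USER -n a.out)' > bt.$node.txt 2>&1 &
    done
    wait

Comparing the files side by side should show whether every rank is stuck at the same point in MPI_Init, or whether a few nodes (like the one where only the orted came up) never got that far.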
>> --
>> *---------------------------------------------------------------------*
>>  Gilbert Grosdidier                 gilbert.grosdid...@in2p3.fr
>>  LAL / IN2P3 / CNRS                 Phone : +33 1 6446 8909
>>  Faculté des Sciences, Bat. 200     Fax   : +33 1 6446 8546
>>  B.P. 34, F-91898 Orsay Cedex (FRANCE)
>> *---------------------------------------------------------------------*

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/