Re: [OMPI users] large memory usage and hangs when preconnecting beyond 1000 cpus
At those sizes it is possible you are running into resource exhaustion issues. Some of the resource exhaustion code paths still lead to hangs. If the code does not need to be fully connected, I would suggest not using mpi_preconnect_mpi and instead tracking down why the initial MPI_Allreduce hangs. I would suggest the Stack Trace Analysis Tool (STAT); it might help you narrow down where the problem is occurring.

-Nathan Hjelm
HPC-5, LANL

On Tue, Oct 21, 2014 at 01:12:21PM +1100, Marshall Ward wrote:
> Thanks, it's at least good to know that the behaviour isn't normal!
>
> Could it be some sort of memory leak in the call? The code in
>
>     ompi/runtime/ompi_mpi_preconnect.c
>
> looks reasonably safe, though maybe doing thousands of isend/irecv
> pairs is causing problems with the buffer used in ptp messages?
>
> I'm trying to see if valgrind can see anything, but nothing from
> ompi_init_preconnect_mpi is coming up (although there are some other
> warnings).
>
> On Sun, Oct 19, 2014 at 2:37 AM, Ralph Castain wrote:
> >
> >> On Oct 17, 2014, at 3:37 AM, Marshall Ward wrote:
> >>
> >> I currently have a numerical model that, for reasons unknown, requires
> >> preconnection to avoid hanging on an initial MPI_Allreduce call.
> >
> > That is indeed odd - it might take a while for all the connections to
> > form, but it shouldn't hang.
> >
> >> But when we try to scale out beyond around 1000 cores, we are unable to
> >> get past MPI_Init's preconnection phase.
> >>
> >> To test this, I have a basic C program containing only MPI_Init() and
> >> MPI_Finalize() named `mpi_init`, which I compile and run using
> >> `mpirun -mca mpi_preconnect_mpi 1 mpi_init`.
> >
> > I doubt preconnect has been tested in a rather long time, as I'm unaware
> > of anyone still using it (we originally provided it for some legacy code
> > that otherwise took a long time to initialize). However, I could give it
> > a try and see what happens. FWIW: because it was so targeted and hasn't
> > been used in a long time, the preconnect algo is really not very
> > efficient. Still, it shouldn't have anything to do with memory footprint.
> >
> >> This preconnection seems to consume a large amount of memory, and is
> >> exceeding the available memory on our nodes (~2 GiB/core) as the number
> >> gets into the thousands (~4000 or so). If we try to preconnect to
> >> around ~6000, we start to see hangs and crashes.
> >>
> >> A failed 5600-core preconnection gave this warning (~10k times) while
> >> hanging for 30 minutes:
> >>
> >>     [warn] opal_libevent2021_event_base_loop: reentrant invocation.
> >>     Only one event_base_loop can run on each event_base at once.
> >>
> >> A failed 6000-core preconnection job crashed almost immediately with
> >> the following error:
> >>
> >>     [r104:18459] [[32743,0],0] ORTE_ERROR_LOG: File open failure in
> >>     file ras_tm_module.c at line 159
> >>     [r104:18459] [[32743,0],0] ORTE_ERROR_LOG: File open failure in
> >>     file ras_tm_module.c at line 85
> >>     [r104:18459] [[32743,0],0] ORTE_ERROR_LOG: File open failure in
> >>     file base/ras_base_allocate.c at line 187
> >
> > This doesn't have anything to do with preconnect - it indicates that
> > mpirun was unable to open the Torque allocation file. However, it
> > shouldn't have "crashed", but instead simply exited with an error message.
> >
> >> Should we expect to use very large amounts of memory for
> >> preconnections of thousands of CPUs? And can these
> >>
> >> I am using Open MPI 1.8.2 on Linux 2.6.32 (CentOS) and an FDR InfiniBand
> >> network. This is probably not enough information, but I'll try to
> >> provide more if necessary. My knowledge of the implementation is
> >> unfortunately very limited.
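For reference, a minimal test program of the kind Marshall describes (nothing but MPI_Init and MPI_Finalize) might look like the sketch below; the file name and build line are illustrative, not taken from his setup.

    /* mpi_init.c - exercise MPI startup (and, with mpi_preconnect_mpi=1, the
     * preconnect phase) and nothing else.
     * Build: mpicc -o mpi_init mpi_init.c
     * Run:   mpirun -mca mpi_preconnect_mpi 1 ./mpi_init
     */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);   /* all preconnect work happens inside MPI_Init */
        MPI_Finalize();
        return 0;
    }

Running it at increasing scale with and without -mca mpi_preconnect_mpi 1 separates the preconnect cost from the rest of MPI_Init.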
[OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris
Hi,

I installed openmpi-dev-124-g91e9686 on Solaris 10 Sparc with gcc-4.9.1 to track down the error with my small Java program. I started single stepping in orterun.c at line 1081 and continued until I got the segmentation fault. I get "jdata = 0x0" in version openmpi-1.8.2a1r31804, which is the last one that works with Java in my environment, while I get "jdata = 0x100125250" in this version. Unfortunately I don't know which files or variables are important to look at. Perhaps somebody can look at the following lines of code and tell me which information I should provide to solve the problem. I know that Solaris is no longer on your list of supported systems, but perhaps we can get it working again if you tell me what you need and I do the debugging.

/usr/local/gdb-7.6.1_64_gcc/bin/gdb mpiexec
GNU gdb (GDB) 7.6.1
...
(gdb) run -np 1 java InitFinalizeMain
Starting program: /usr/local/openmpi-1.9.0_64_gcc/bin/mpiexec \
  -np 1 java InitFinalizeMain
[Thread debugging using libthread_db enabled]
[New Thread 1 (LWP 1)]
[New LWP 2]
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x7ea3c7f0, pid=13064, tid=2
...
[LWP 2 exited]
[New Thread 2]
[Switching to Thread 1 (LWP 1)]
sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to satisfy query
(gdb) thread 1
[Switching to thread 1 (LWP 1)]
#0  0x7f6173d0 in rtld_db_dlactivity () from /usr/lib/sparcv9/ld.so.1
(gdb) b orterun.c:1081
Breakpoint 1 at 0x170dc: file ../../../../openmpi-dev-124-g91e9686/orte/tools/orterun/orterun.c, line 1081.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y

Starting program: /usr/local/openmpi-1.9.0_64_gcc/bin/mpiexec -np 1 java InitFinalizeMain
[Thread debugging using libthread_db enabled]
[New Thread 1 (LWP 1)]
[New LWP 2]
[Switching to Thread 1 (LWP 1)]

Breakpoint 1, orterun (argc=5, argv=0x7fffe0d8)
    at ../../../../openmpi-dev-124-g91e9686/orte/tools/orterun/orterun.c:1081
1081        rc = orte_plm.spawn(jdata);
(gdb) print jdata
$1 = (orte_job_t *) 0x100125250
(gdb) s
rsh_launch (jdata=0x100125250)
    at ../../../../../openmpi-dev-124-g91e9686/orte/mca/plm/rsh/plm_rsh_module.c:876
876         if (ORTE_FLAG_TEST(jdata, ORTE_JOB_FLAG_RESTART)) {
(gdb) s
881         ORTE_ACTIVATE_JOB_STATE(jdata, ORTE_JOB_STATE_INIT);
(gdb)
orte_util_print_name_args (name=0x100118380)
    at ../../openmpi-dev-124-g91e9686/orte/util/name_fns.c:122
122         if (NULL == name) {
(gdb)
142         job = orte_util_print_jobids(name->jobid);
(gdb)
orte_util_print_jobids (job=2502885376)
    at ../../openmpi-dev-124-g91e9686/orte/util/name_fns.c:170
170         ptr = get_print_name_buffer();
(gdb)
get_print_name_buffer ()
    at ../../openmpi-dev-124-g91e9686/orte/util/name_fns.c:92
92          if (!fns_init) {
(gdb)
101         ret = opal_tsd_getspecific(print_args_tsd_key, (void**)&ptr);
(gdb)
opal_tsd_getspecific (key=1, valuep=0x7fffd990)
    at ../../openmpi-dev-124-g91e9686/opal/threads/tsd.h:163
163         *valuep = pthread_getspecific(key);
(gdb)
164         return OPAL_SUCCESS;
(gdb)
165     }
(gdb)
get_print_name_buffer ()
    at ../../openmpi-dev-124-g91e9686/orte/util/name_fns.c:102
102         if (OPAL_SUCCESS != ret) return NULL;
(gdb)
104         if (NULL == ptr) {
(gdb)
113         return (orte_print_args_buffers_t*) ptr;
(gdb)
114     }
(gdb)
orte_util_print_jobids (job=2502885376)
    at ../../openmpi-dev-124-g91e9686/orte/util/name_fns.c:172
172         if (NULL == ptr) {
(gdb)
178         if (ORTE_PRINT_NAME_ARG_NUM_BUFS == ptr->cntr) {
(gdb)
182         if (ORTE_JOBID_INVALID == job) {
(gdb)
184         } else if (ORTE_JOBID_WILDCARD == job) {
(gdb)
187         tmp1 = ORTE_JOB_FAMILY((unsigned long)job);
(gdb)
188         tmp2 = ORTE_LOCAL_JOBID((unsigned long)job);
(gdb)
189         snprintf(ptr->buffers[ptr->cntr++],
(gdb)
193         return ptr->buffers[ptr->cntr-1];
(gdb)
194     }
(gdb)
orte_util_print_name_args (name=0x100118380)
    at ../../openmpi-dev-124-g91e9686/orte/util/name_fns.c:143
143         vpid = orte_util_print_vpids(name->vpid);
(gdb)
orte_util_print_vpids (vpid=0)
    at ../../openmpi-dev-124-g91e9686/orte/util/name_fns.c:260
260         ptr = get_print_name_buffer();
(gdb)
get_print_name_buffer ()
    at ../../openmpi-dev-124-g91e9686/orte/util/name_fns.c:92
92          if (!fns_init) {
(gdb)
101         ret = opal_tsd_getspecific(print_args_tsd_key, (void**)&ptr);
(gdb)
opal_tsd_getspecific (key=1, valuep=0x7fffd9a0)
    at ../../openmpi-dev-124-g91e9686/opal/threads/tsd.h:163
163         *valuep = pthread_getspecific(key);
(gdb)
164         return OPAL_SUCCESS;
(gdb)
165     }
(gdb)
get_print_name_buffer ()
    at ../../openmpi-dev-124-g91e9686/orte/util/name_fns.c:102
102         if (OPAL_SUCCESS != ret) return NULL;
(gdb)
104         if (NULL == ptr
[OMPI users] New ib locked pages behavior?
I've set up several clusters over the years with OpenMPI. I often get the below error:

    WARNING: It appears that your OpenFabrics subsystem is configured to only
    allow registering part of your physical memory. This can cause MPI jobs to
    run with erratic performance, hang, and/or crash.
    ...
    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

    Local host:          c2-31
    Registerable memory: 32768 MiB
    Total memory:        64398 MiB

I'm well aware of the normal fixes and have implemented them in puppet to ensure compute nodes get the changes. To be paranoid I've implemented all the changes, and they all worked under Ubuntu 13.10. However, with Ubuntu 14.04 it seems like they are not working, thus the above message.

As recommended by the FAQs I've implemented:
1) ulimit -l unlimited in /etc/profile.d/slurm.sh
2) PropagateResourceLimitsExcept=MEMLOCK in slurm.conf
3) UsePAM=1 in slurm.conf
4) in /etc/security/limits.conf:
   * hard memlock unlimited
   * soft memlock unlimited
   * hard stack unlimited
   * soft stack unlimited

My changes seem to be working; if I submit this to slurm:

    #!/bin/bash -l
    ulimit -l
    hostname
    mpirun bash -c 'ulimit -l'
    mpirun ./relay 1 131072

I get:

    unlimited
    c2-31
    unlimited
    unlimited
    unlimited
    unlimited

Is there some new kernel parameter, ofed parameter, or similar that controls locked pages now? The kernel is 3.13.0-36 and the libopenmpi-dev package is 1.6.5. Since the ulimit -l is getting to both the slurm-launched script and also to the mpirun-launched binaries, I'm pretty puzzled. Any suggestions?
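Since the open question is whether the memlock limit actually propagates into the MPI processes, a tiny MPI program can report RLIMIT_MEMLOCK from inside each rank; this is only a sketch for cross-checking, not part of Bill's relay test.

    /* memlock_check.c - print the locked-memory limit as seen by each MPI rank. */
    #include <stdio.h>
    #include <sys/resource.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        struct rlimit rl;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0) {
            if (rl.rlim_cur == RLIM_INFINITY)
                printf("rank %d: memlock soft limit = unlimited\n", rank);
            else
                printf("rank %d: memlock soft limit = %llu bytes\n",
                       rank, (unsigned long long)rl.rlim_cur);
        }

        MPI_Finalize();
        return 0;
    }

If the ranks report unlimited but the registration warning still appears, the cap is likely coming from the mlx4 MTT settings rather than from ulimit.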
Re: [OMPI users] New ib locked pages behavior?
Hi Bill

Maybe you're missing these settings in /etc/modprobe.d/mlx4_core.conf?

http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem

I hope this helps,
Gus Correa

On 10/21/2014 06:36 PM, Bill Broadley wrote:
> I've set up several clusters over the years with OpenMPI. I often get the
> below error:
>
>     WARNING: It appears that your OpenFabrics subsystem is configured to only
>     allow registering part of your physical memory. This can cause MPI jobs
>     to run with erratic performance, hang, and/or crash.
>     ...
>
>     Local host:          c2-31
>     Registerable memory: 32768 MiB
>     Total memory:        64398 MiB
>
> Is there some new kernel parameter, ofed parameter, or similar that controls
> locked pages now? The kernel is 3.13.0-36 and the libopenmpi-dev package is
> 1.6.5. Since the ulimit -l is getting to both the slurm-launched script and
> also to the mpirun-launched binaries, I'm pretty puzzled. Any suggestions?
Re: [OMPI users] New ib locked pages behavior?
On 10/21/2014 04:18 PM, Gus Correa wrote:
> Hi Bill
>
> Maybe you're missing these settings in /etc/modprobe.d/mlx4_core.conf?
>
> http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem

Ah, that helped. Although:

    /lib/modules/3.13.0-36-generic/kernel/drivers/net/ethernet/mellanox/mlx4$ modinfo mlx4_core | grep "^parm"

lists some promising-looking parameters:

    parm: log_mtts_per_seg:Log2 number of MTT entries per segment (1-7) (int)

The FAQ recommends log_num_mtt or num_mtt and NOT log_mtts_per_seg; sadly:

    $ modinfo mlx4_core | grep "^parm" | grep mtt
    parm: log_mtts_per_seg:Log2 number of MTT entries per segment (1-7) (int)
    $

Looks like the best I can do is bump log_mtts_per_seg. I tried:

    $ cat /etc/modprobe.d/mlx4_core.conf
    options mlx4_core log_num_mtt=24
    $

But:

    [6.691959] mlx4_core: unknown parameter 'log_num_mtt' ignored

I ended up with:

    options mlx4_core log_mtts_per_seg=2

I'm hoping that doubles the registerable memory, although I did see a recommendation to raise it to double the system RAM (in this case 64 GB RAM / 128 GB lockable). Maybe an update to the FAQ is needed?
Re: [OMPI users] New ib locked pages behavior?
Hi Bill

I have a 2.6.X CentOS stock kernel. I set both parameters, and it works. Maybe the parameter names changed in the 3.X kernels? (Which is really bad ...)

You could check whether there is more information in:

    /sys/module/mlx4_core/parameters/

There seems to be a thread on the list about this (but apparently no solution):

http://www.open-mpi.org/community/lists/users/2013/02/21430.php

Maybe Mellanox has more information about this?

Gus Correa

On 10/21/2014 08:15 PM, Bill Broadley wrote:
> The FAQ recommends log_num_mtt or num_mtt and NOT log_mtts_per_seg; sadly:
>
>     $ modinfo mlx4_core | grep "^parm" | grep mtt
>     parm: log_mtts_per_seg:Log2 number of MTT entries per segment (1-7) (int)
>     $
>
> Looks like the best I can do is bump log_mtts_per_seg.
>
> I'm hoping that doubles the registerable memory, although I did see a
> recommendation to raise it to double the system RAM (in this case 64 GB
> RAM / 128 GB lockable). Maybe an update to the FAQ is needed?
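For what it's worth, the registerable-memory arithmetic from the FAQ linked earlier (roughly num_mtt * 2^log_mtts_per_seg * page size, as I read it) can be estimated from the files under /sys/module/mlx4_core/parameters/ that Gus mentions. The sketch below assumes that formula and that path; on kernels where log_num_mtt is no longer exposed (as in Bill's case) it just says so rather than guessing a default.

    /* reg_mem_estimate.c - rough estimate of max registerable memory for mlx4. */
    #include <stdio.h>
    #include <unistd.h>

    static long read_param(const char *name)
    {
        char path[256];
        long val = -1;
        FILE *fp;

        snprintf(path, sizeof(path), "/sys/module/mlx4_core/parameters/%s", name);
        fp = fopen(path, "r");
        if (fp != NULL) {
            if (fscanf(fp, "%ld", &val) != 1)
                val = -1;
            fclose(fp);
        }
        return val;
    }

    int main(void)
    {
        long page_size    = sysconf(_SC_PAGESIZE);
        long log_num_mtt  = read_param("log_num_mtt");     /* absent on newer drivers */
        long log_mtts_seg = read_param("log_mtts_per_seg");

        if (log_mtts_seg < 0) {
            fprintf(stderr, "log_mtts_per_seg not found; is mlx4_core loaded?\n");
            return 1;
        }
        if (log_num_mtt < 0) {
            fprintf(stderr, "log_num_mtt is not exposed by this driver, so the "
                            "total cannot be computed from sysfs alone\n");
            return 1;
        }

        double reg_mem = (double)(1L << log_num_mtt) *
                         (double)(1L << log_mtts_seg) * (double)page_size;
        printf("estimated max registerable memory: %.0f MiB\n",
               reg_mem / (1024.0 * 1024.0));
        return 0;
    }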
[OMPI users] low CPU utilization with OpenMPI
Because of a permissions problem (OpenMPI cannot write temporary files to the default /tmp directory), I changed TMPDIR to my local directory (export TMPDIR=/home/user/tmp) and then the MPI program can run. But the CPU utilization is very low, under 20% (8 MPI ranks running on an 8-core Intel Xeon CPU).

And I also got some messages when I run with OpenMPI:

    [cn3:28072] 9 more processes have sent help message help-opal-shmem-mmap.txt / mmap on nfs
    [cn3:28072] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Any idea?
Thanks

Vincent
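One quick way to see why that help message appears is to check which filesystem the chosen TMPDIR actually lives on. The sketch below is Linux-specific; statfs() and the magic constants are assumptions about the platform, not something taken from Vincent's report.

    /* tmpdir_fs_check.c - report whether a directory lives on NFS or tmpfs (Linux). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/vfs.h>
    #include <linux/magic.h>   /* NFS_SUPER_MAGIC, TMPFS_MAGIC */

    int main(int argc, char **argv)
    {
        const char *dir = (argc > 1) ? argv[1] : getenv("TMPDIR");
        struct statfs fs;

        if (dir == NULL)
            dir = "/tmp";
        if (statfs(dir, &fs) != 0) {
            perror("statfs");
            return 1;
        }
        if (fs.f_type == NFS_SUPER_MAGIC)
            printf("%s is on NFS - mmap-backed shared memory will be slow\n", dir);
        else if (fs.f_type == TMPFS_MAGIC)
            printf("%s is on tmpfs (RAM-backed)\n", dir);
        else
            printf("%s is on a local filesystem (f_type=0x%lx)\n",
                   dir, (unsigned long)fs.f_type);
        return 0;
    }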
Re: [OMPI users] low CPU utilization with OpenMPI
Doing special files on NFS can be weird; try the other /tmp/ locations:

    /var/tmp/
    /dev/shm  (ram disk, careful!)

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734) 936-1985

> On Oct 21, 2014, at 10:18 PM, Vinson Leung wrote:
>
> Because of a permissions problem (OpenMPI cannot write temporary files to the
> default /tmp directory), I changed TMPDIR to my local directory (export
> TMPDIR=/home/user/tmp) and then the MPI program can run. But the CPU
> utilization is very low, under 20% (8 MPI ranks running on an 8-core Intel
> Xeon CPU).
>
> And I also got some messages when I run with OpenMPI:
>
>     [cn3:28072] 9 more processes have sent help message help-opal-shmem-mmap.txt / mmap on nfs
>     [cn3:28072] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>
> Any idea?
> Thanks
>
> Vincent
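If TMPDIR is pointed at /dev/shm as suggested, the "ram disk, careful!" caveat is worth quantifying, since session directories and shared-memory backing files there count against RAM. A small sketch to check the available space before a run (statvfs() is POSIX; the /dev/shm path is the usual Linux default):

    /* shm_space_check.c - print free space on /dev/shm. */
    #include <stdio.h>
    #include <sys/statvfs.h>

    int main(void)
    {
        struct statvfs vfs;

        if (statvfs("/dev/shm", &vfs) != 0) {
            perror("statvfs /dev/shm");
            return 1;
        }
        double free_mib = (double)vfs.f_bavail * (double)vfs.f_frsize
                          / (1024.0 * 1024.0);
        printf("/dev/shm: %.0f MiB available\n", free_mib);
        return 0;
    }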
Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris
Hi Siegmar,

mpiexec and java run as distinct processes, and your JRE message says it is the java process that raises the SEGV. So you should trace the java process, not the mpiexec process. Moreover, your JRE message says the crash happened outside the Java Virtual Machine, in native code, so the usual Java debugger is of no use; you need to trace the native-code part of the java process. Unfortunately I don't know how to debug such a case. The log file written by the JRE may help you:

> # An error report file with more information is saved as:
> # /home/fd1026/work/skripte/master/parallel/prog/mpi/java/hs_err_pid13080.log

Regards,
Takahiro

> Hi,
>
> I installed openmpi-dev-124-g91e9686 on Solaris 10 Sparc with
> gcc-4.9.1 to track down the error with my small Java program.
> I started single stepping in orterun.c at line 1081 and
> continued until I got the segmentation fault. I get
> "jdata = 0x0" in version openmpi-1.8.2a1r31804, which is the
> last one that works with Java in my environment, while I get
> "jdata = 0x100125250" in this version. Unfortunately I don't
> know which files or variables are important to look at. Perhaps
> somebody can look at the following lines of code and tell me
> which information I should provide to solve the problem. I know
> that Solaris is no longer on your list of supported systems,
> but perhaps we can get it working again if you tell me what
> you need and I do the debugging.
>
> /usr/local/gdb-7.6.1_64_gcc/bin/gdb mpiexec
> GNU gdb (GDB) 7.6.1
> ...
> (gdb) run -np 1 java InitFinalizeMain
> Starting program: /usr/local/openmpi-1.9.0_64_gcc/bin/mpiexec \
>   -np 1 java InitFinalizeMain
> [Thread debugging using libthread_db enabled]
> [New Thread 1 (LWP 1)]
> [New LWP 2]
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x7ea3c7f0, pid=13064, tid=2
> ...