Re: [OMPI users] large memory usage and hangs when preconnecting beyond 1000 cpus
At those sizes it is possible you are running into resource exhaustion issues. Some of the resource exhaustion code paths still lead to hangs. If the code does not need to be fully connected, I would suggest not using mpi_preconnect_mpi and instead tracking down why the initial MPI_Allreduce hangs. I would suggest the Stack Trace Analysis Tool (STAT); it might help you narrow down where the problem is occurring.

-Nathan Hjelm
HPC-5, LANL

On Tue, Oct 21, 2014 at 01:12:21PM +1100, Marshall Ward wrote:
> Thanks, it's at least good to know that the behaviour isn't normal!
>
> Could it be some sort of memory leak in the call? The code in
>
>     ompi/runtime/ompi_mpi_preconnect.c
>
> looks reasonably safe, though maybe doing thousands of isend/irecv
> pairs is causing problems with the buffer used in ptp messages?
>
> I'm trying to see if valgrind can see anything, but nothing from
> ompi_init_preconnect_mpi is coming up (although there are some other
> warnings).
>
> On Sun, Oct 19, 2014 at 2:37 AM, Ralph Castain wrote:
> >
> >> On Oct 17, 2014, at 3:37 AM, Marshall Ward wrote:
> >>
> >> I currently have a numerical model that, for reasons unknown, requires
> >> preconnection to avoid hanging on an initial MPI_Allreduce call.
> >
> > That is indeed odd - it might take a while for all the connections to
> > form, but it shouldn't hang.
> >
> >> But when we try to scale out beyond around 1000 cores, we are unable to
> >> get past MPI_Init's preconnection phase.
> >>
> >> To test this, I have a basic C program containing only MPI_Init() and
> >> MPI_Finalize() named `mpi_init`, which I compile and run using
> >> `mpirun -mca mpi_preconnect_mpi 1 mpi_init`.
> >
> > I doubt preconnect has been tested in a rather long time, as I'm unaware
> > of anyone still using it (we originally provided it for some legacy code
> > that otherwise took a long time to initialize). However, I could give it
> > a try and see what happens. FWIW: because it was so targeted and hasn't
> > been used in a long time, the preconnect algo is really not very
> > efficient. Still, it shouldn't have anything to do with memory footprint.
> >
> >> This preconnection seems to consume a large amount of memory, and is
> >> exceeding the available memory on our nodes (~2 GiB/core) as the number
> >> gets into the thousands (~4000 or so). If we try to preconnect to
> >> around ~6000, we start to see hangs and crashes.
> >>
> >> A failed 5600-core preconnection gave this warning (~10k times) while
> >> hanging for 30 minutes:
> >>
> >>     [warn] opal_libevent2021_event_base_loop: reentrant invocation.
> >>     Only one event_base_loop can run on each event_base at once.
> >>
> >> A failed 6000-core preconnection job crashed almost immediately with
> >> the following error:
> >>
> >>     [r104:18459] [[32743,0],0] ORTE_ERROR_LOG: File open failure in
> >>     file ras_tm_module.c at line 159
> >>     [r104:18459] [[32743,0],0] ORTE_ERROR_LOG: File open failure in
> >>     file ras_tm_module.c at line 85
> >>     [r104:18459] [[32743,0],0] ORTE_ERROR_LOG: File open failure in
> >>     file base/ras_base_allocate.c at line 187
> >
> > This doesn't have anything to do with preconnect - it indicates that
> > mpirun was unable to open the Torque allocation file. However, it
> > shouldn't have "crashed", but instead simply exited with an error message.
> >
> >> Should we expect to use very large amounts of memory for
> >> preconnections of thousands of CPUs? And can these
> >>
> >> I am using Open MPI 1.8.2 on Linux 2.6.32 (CentOS) and an FDR InfiniBand
> >> network. This is probably not enough information, but I'll try to
> >> provide more if necessary. My knowledge of the implementation is
> >> unfortunately very limited.
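For reference, a minimal test program of the kind Marshall describes (nothing but MPI_Init and MPI_Finalize) might look like the sketch below; the file name and build line are illustrative, not taken from his setup.

    /* mpi_init.c - exercise MPI startup (and, with mpi_preconnect_mpi=1, the
     * preconnect phase) and nothing else.
     * Build: mpicc -o mpi_init mpi_init.c
     * Run:   mpirun -mca mpi_preconnect_mpi 1 ./mpi_init
     */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);   /* all preconnect work happens inside MPI_Init */
        MPI_Finalize();
        return 0;
    }

Running it at increasing scale with and without -mca mpi_preconnect_mpi 1 separates the preconnect cost from the rest of MPI_Init.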
[OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris
Hi,

I installed openmpi-dev-124-g91e9686 on Solaris 10 Sparc with gcc-4.9.1 to track down the error with my small Java program. I started single stepping in orterun.c at line 1081 and continued until I got the segmentation fault. I get "jdata = 0x0" in version openmpi-1.8.2a1r31804, which is the last one that works with Java in my environment, while I get "jdata = 0x100125250" in this version. Unfortunately I don't know which files or variables are important to look at. Perhaps somebody can look at the following lines of code and tell me which information I should provide to solve the problem. I know that Solaris is no longer on your list of supported systems, but perhaps we can get it working again if you tell me what you need and I do the debugging.

/usr/local/gdb-7.6.1_64_gcc/bin/gdb mpiexec
GNU gdb (GDB) 7.6.1
...
(gdb) run -np 1 java InitFinalizeMain
Starting program: /usr/local/openmpi-1.9.0_64_gcc/bin/mpiexec \
  -np 1 java InitFinalizeMain
[Thread debugging using libthread_db enabled]
[New Thread 1 (LWP 1)]
[New LWP 2]
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x7ea3c7f0, pid=13064, tid=2
...
[LWP 2 exited]
[New Thread 2]
[Switching to Thread 1 (LWP 1)]
sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to satisfy query
(gdb) thread 1
[Switching to thread 1 (LWP 1)]
#0  0x7f6173d0 in rtld_db_dlactivity () from /usr/lib/sparcv9/ld.so.1
(gdb) b orterun.c:1081
Breakpoint 1 at 0x170dc: file ../../../../openmpi-dev-124-g91e9686/orte/tools/orterun/orterun.c, line 1081.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y

Starting program: /usr/local/openmpi-1.9.0_64_gcc/bin/mpiexec -np 1 java InitFinalizeMain
[Thread debugging using libthread_db enabled]
[New Thread 1 (LWP 1)]
[New LWP 2]
[Switching to Thread 1 (LWP 1)]

Breakpoint 1, orterun (argc=5, argv=0x7fffe0d8)
    at ../../../../openmpi-dev-124-g91e9686/orte/tools/orterun/orterun.c:1081
1081        rc = orte_plm.spawn(jdata);
(gdb) print jdata
$1 = (orte_job_t *) 0x100125250
(gdb) s
rsh_launch (jdata=0x100125250)
    at ../../../../../openmpi-dev-124-g91e9686/orte/mca/plm/rsh/plm_rsh_module.c:876
876         if (ORTE_FLAG_TEST(jdata, ORTE_JOB_FLAG_RESTART)) {
(gdb) s
881         ORTE_ACTIVATE_JOB_STATE(jdata, ORTE_JOB_STATE_INIT);
(gdb)
orte_util_print_name_args (name=0x100118380)
    at ../../openmpi-dev-124-g91e9686/orte/util/name_fns.c:122
122         if (NULL == name) {
(gdb)
142         job = orte_util_print_jobids(name->jobid);
(gdb)
orte_util_print_jobids (job=2502885376)
    at ../../openmpi-dev-124-g91e9686/orte/util/name_fns.c:170
170         ptr = get_print_name_buffer();
(gdb)
get_print_name_buffer ()
    at ../../openmpi-dev-124-g91e9686/orte/util/name_fns.c:92
92          if (!fns_init) {
(gdb)
101         ret = opal_tsd_getspecific(print_args_tsd_key, (void**)&ptr);
(gdb)
opal_tsd_getspecific (key=1, valuep=0x7fffd990)
    at ../../openmpi-dev-124-g91e9686/opal/threads/tsd.h:163
163         *valuep = pthread_getspecific(key);
(gdb)
164         return OPAL_SUCCESS;
(gdb)
165     }
(gdb)
get_print_name_buffer ()
    at ../../openmpi-dev-124-g91e9686/orte/util/name_fns.c:102
102         if (OPAL_SUCCESS != ret) return NULL;
(gdb)
104         if (NULL == ptr) {
(gdb)
113         return (orte_print_args_buffers_t*) ptr;
(gdb)
114     }
(gdb)
orte_util_print_jobids (job=2502885376)
    at ../../openmpi-dev-124-g91e9686/orte/util/name_fns.c:172
172         if (NULL == ptr) {
(gdb)
178         if (ORTE_PRINT_NAME_ARG_NUM_BUFS == ptr->cntr) {
(gdb)
182         if (ORTE_JOBID_INVALID == job) {
(gdb)
184         } else if (ORTE_JOBID_WILDCARD == job) {
(gdb)
187         tmp1 = ORTE_JOB_FAMILY((unsigned long)job);
(gdb)
188         tmp2 = ORTE_LOCAL_JOBID((unsigned long)job);
(gdb)
189         snprintf(ptr->buffers[ptr->cntr++],
(gdb)
193         return ptr->buffers[ptr->cntr-1];
(gdb)
194     }
(gdb)
orte_util_print_name_args (name=0x100118380)
    at ../../openmpi-dev-124-g91e9686/orte/util/name_fns.c:143
143         vpid = orte_util_print_vpids(name->vpid);
(gdb)
orte_util_print_vpids (vpid=0)
    at ../../openmpi-dev-124-g91e9686/orte/util/name_fns.c:260
260         ptr = get_print_name_buffer();
(gdb)
get_print_name_buffer ()
    at ../../openmpi-dev-124-g91e9686/orte/util/name_fns.c:92
92          if (!fns_init) {
(gdb)
101         ret = opal_tsd_getspecific(print_args_tsd_key, (void**)&ptr);
(gdb)
opal_tsd_getspecific (key=1, valuep=0x7fffd9a0)
    at ../../openmpi-dev-124-g91e9686/opal/threads/tsd.h:163
163         *valuep = pthread_getspecific(key);
(gdb)
164         return OPAL_SUCCESS;
(gdb)
165     }
(gdb)
get_print_name_buffer ()
    at ../../openmpi-dev-124-g91e9686/orte/util/name_fns.c:102
102         if (OPAL_SUCCESS != ret) return NULL;
(gdb)
104         if (NULL == ptr
[OMPI users] New ib locked pages behavior?
I've set up several clusters over the years with OpenMPI. I often get the below error:

    WARNING: It appears that your OpenFabrics subsystem is configured to only
    allow registering part of your physical memory. This can cause MPI jobs to
    run with erratic performance, hang, and/or crash.
    ...
    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

    Local host:          c2-31
    Registerable memory: 32768 MiB
    Total memory:        64398 MiB

I'm well aware of the normal fixes and have implemented them in puppet to ensure compute nodes get the changes. To be paranoid I've implemented all the changes, and they all worked under Ubuntu 13.10. However, with Ubuntu 14.04 it seems like they are not working, thus the above message.

As recommended by the FAQs I've implemented:
1) ulimit -l unlimited in /etc/profile.d/slurm.sh
2) PropagateResourceLimitsExcept=MEMLOCK in slurm.conf
3) UsePAM=1 in slurm.conf
4) in /etc/security/limits.conf:
   * hard memlock unlimited
   * soft memlock unlimited
   * hard stack unlimited
   * soft stack unlimited

My changes seem to be working; if I submit this to slurm:

    #!/bin/bash -l
    ulimit -l
    hostname
    mpirun bash -c 'ulimit -l'
    mpirun ./relay 1 131072

I get:

    unlimited
    c2-31
    unlimited
    unlimited
    unlimited
    unlimited

Is there some new kernel parameter, ofed parameter, or similar that controls locked pages now? The kernel is 3.13.0-36 and the libopenmpi-dev package is 1.6.5. Since the ulimit -l is getting to both the slurm-launched script and also to the mpirun-launched binaries, I'm pretty puzzled. Any suggestions?
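Since the open question is whether the memlock limit actually propagates into the MPI processes, a tiny MPI program can report RLIMIT_MEMLOCK from inside each rank; this is only a sketch for cross-checking, not part of Bill's relay test.

    /* memlock_check.c - print the locked-memory limit as seen by each MPI rank. */
    #include <stdio.h>
    #include <sys/resource.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        struct rlimit rl;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0) {
            if (rl.rlim_cur == RLIM_INFINITY)
                printf("rank %d: memlock soft limit = unlimited\n", rank);
            else
                printf("rank %d: memlock soft limit = %llu bytes\n",
                       rank, (unsigned long long)rl.rlim_cur);
        }

        MPI_Finalize();
        return 0;
    }

If the ranks report unlimited but the registration warning still appears, the cap is likely coming from the mlx4 MTT settings rather than from ulimit.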
Re: [OMPI users] New ib locked pages behavior?
Hi Bill

Maybe you're missing these settings in /etc/modprobe.d/mlx4_core.conf?

http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem

I hope this helps,
Gus Correa

On 10/21/2014 06:36 PM, Bill Broadley wrote:
> I've set up several clusters over the years with OpenMPI. I often get the
> below error:
>
>     WARNING: It appears that your OpenFabrics subsystem is configured to only
>     allow registering part of your physical memory. This can cause MPI jobs
>     to run with erratic performance, hang, and/or crash.
>     ...
>
>     Local host:          c2-31
>     Registerable memory: 32768 MiB
>     Total memory:        64398 MiB
>
> Is there some new kernel parameter, ofed parameter, or similar that controls
> locked pages now? The kernel is 3.13.0-36 and the libopenmpi-dev package is
> 1.6.5. Since the ulimit -l is getting to both the slurm-launched script and
> also to the mpirun-launched binaries, I'm pretty puzzled. Any suggestions?
Re: [OMPI users] New ib locked pages behavior?
On 10/21/2014 04:18 PM, Gus Correa wrote:
> Hi Bill
>
> Maybe you're missing these settings in /etc/modprobe.d/mlx4_core.conf?
>
> http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem

Ah, that helped. Although:

    /lib/modules/3.13.0-36-generic/kernel/drivers/net/ethernet/mellanox/mlx4$ modinfo mlx4_core | grep "^parm"

lists some promising-looking parameters:

    parm: log_mtts_per_seg:Log2 number of MTT entries per segment (1-7) (int)

The FAQ recommends log_num_mtt or num_mtt and NOT log_mtts_per_seg; sadly:

    $ modinfo mlx4_core | grep "^parm" | grep mtt
    parm: log_mtts_per_seg:Log2 number of MTT entries per segment (1-7) (int)
    $

Looks like the best I can do is bump log_mtts_per_seg. I tried:

    $ cat /etc/modprobe.d/mlx4_core.conf
    options mlx4_core log_num_mtt=24
    $

But:

    [6.691959] mlx4_core: unknown parameter 'log_num_mtt' ignored

I ended up with:

    options mlx4_core log_mtts_per_seg=2

I'm hoping that doubles the registerable memory, although I did see a recommendation to raise it to double the system RAM (in this case 64 GB RAM / 128 GB lockable). Maybe an update to the FAQ is needed?
Re: [OMPI users] New ib locked pages behavior?
Hi Bill

I have a 2.6.X CentOS stock kernel. I set both parameters, and it works. Maybe the parameter names changed in the 3.X kernels? (Which is really bad ...)

You could check whether there is more information in:

    /sys/module/mlx4_core/parameters/

There seems to be a thread on the list about this (but apparently no solution):

http://www.open-mpi.org/community/lists/users/2013/02/21430.php

Maybe Mellanox has more information about this?

Gus Correa

On 10/21/2014 08:15 PM, Bill Broadley wrote:
> The FAQ recommends log_num_mtt or num_mtt and NOT log_mtts_per_seg; sadly:
>
>     $ modinfo mlx4_core | grep "^parm" | grep mtt
>     parm: log_mtts_per_seg:Log2 number of MTT entries per segment (1-7) (int)
>     $
>
> Looks like the best I can do is bump log_mtts_per_seg.
>
> I'm hoping that doubles the registerable memory, although I did see a
> recommendation to raise it to double the system RAM (in this case 64 GB
> RAM / 128 GB lockable). Maybe an update to the FAQ is needed?
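For what it's worth, the registerable-memory arithmetic from the FAQ linked earlier (roughly num_mtt * 2^log_mtts_per_seg * page size, as I read it) can be estimated from the files under /sys/module/mlx4_core/parameters/ that Gus mentions. The sketch below assumes that formula and that path; on kernels where log_num_mtt is no longer exposed (as in Bill's case) it just says so rather than guessing a default.

    /* reg_mem_estimate.c - rough estimate of max registerable memory for mlx4. */
    #include <stdio.h>
    #include <unistd.h>

    static long read_param(const char *name)
    {
        char path[256];
        long val = -1;
        FILE *fp;

        snprintf(path, sizeof(path), "/sys/module/mlx4_core/parameters/%s", name);
        fp = fopen(path, "r");
        if (fp != NULL) {
            if (fscanf(fp, "%ld", &val) != 1)
                val = -1;
            fclose(fp);
        }
        return val;
    }

    int main(void)
    {
        long page_size    = sysconf(_SC_PAGESIZE);
        long log_num_mtt  = read_param("log_num_mtt");     /* absent on newer drivers */
        long log_mtts_seg = read_param("log_mtts_per_seg");

        if (log_mtts_seg < 0) {
            fprintf(stderr, "log_mtts_per_seg not found; is mlx4_core loaded?\n");
            return 1;
        }
        if (log_num_mtt < 0) {
            fprintf(stderr, "log_num_mtt is not exposed by this driver, so the "
                            "total cannot be computed from sysfs alone\n");
            return 1;
        }

        double reg_mem = (double)(1L << log_num_mtt) *
                         (double)(1L << log_mtts_seg) * (double)page_size;
        printf("estimated max registerable memory: %.0f MiB\n",
               reg_mem / (1024.0 * 1024.0));
        return 0;
    }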
[OMPI users] low CPU utilization with OpenMPI
Because of a permissions problem (OpenMPI cannot write temporary files to the default /tmp directory), I changed TMPDIR to my local directory (export TMPDIR=/home/user/tmp) and then the MPI program can run. But the CPU utilization is very low, under 20% (8 MPI ranks running on an 8-core Intel Xeon CPU).

And I also got some messages when I run with OpenMPI:

    [cn3:28072] 9 more processes have sent help message help-opal-shmem-mmap.txt / mmap on nfs
    [cn3:28072] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Any idea?
Thanks

Vincent
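One quick way to see why that help message appears is to check which filesystem the chosen TMPDIR actually lives on. The sketch below is Linux-specific; statfs() and the magic constants are assumptions about the platform, not something taken from Vincent's report.

    /* tmpdir_fs_check.c - report whether a directory lives on NFS or tmpfs (Linux). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/vfs.h>
    #include <linux/magic.h>   /* NFS_SUPER_MAGIC, TMPFS_MAGIC */

    int main(int argc, char **argv)
    {
        const char *dir = (argc > 1) ? argv[1] : getenv("TMPDIR");
        struct statfs fs;

        if (dir == NULL)
            dir = "/tmp";
        if (statfs(dir, &fs) != 0) {
            perror("statfs");
            return 1;
        }
        if (fs.f_type == NFS_SUPER_MAGIC)
            printf("%s is on NFS - mmap-backed shared memory will be slow\n", dir);
        else if (fs.f_type == TMPFS_MAGIC)
            printf("%s is on tmpfs (RAM-backed)\n", dir);
        else
            printf("%s is on a local filesystem (f_type=0x%lx)\n",
                   dir, (unsigned long)fs.f_type);
        return 0;
    }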
Re: [OMPI users] low CPU utilization with OpenMPI
Doing special files on NFS can be weird; try the other /tmp/ locations:

    /var/tmp/
    /dev/shm  (ram disk, careful!)

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734) 936-1985

> On Oct 21, 2014, at 10:18 PM, Vinson Leung wrote:
>
> Because of a permissions problem (OpenMPI cannot write temporary files to the
> default /tmp directory), I changed TMPDIR to my local directory (export
> TMPDIR=/home/user/tmp) and then the MPI program can run. But the CPU
> utilization is very low, under 20% (8 MPI ranks running on an 8-core Intel
> Xeon CPU).
>
> And I also got some messages when I run with OpenMPI:
>
>     [cn3:28072] 9 more processes have sent help message help-opal-shmem-mmap.txt / mmap on nfs
>     [cn3:28072] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>
> Any idea?
> Thanks
>
> Vincent
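If TMPDIR is pointed at /dev/shm as suggested, the "ram disk, careful!" caveat is worth quantifying, since session directories and shared-memory backing files there count against RAM. A small sketch to check the available space before a run (statvfs() is POSIX; the /dev/shm path is the usual Linux default):

    /* shm_space_check.c - print free space on /dev/shm. */
    #include <stdio.h>
    #include <sys/statvfs.h>

    int main(void)
    {
        struct statvfs vfs;

        if (statvfs("/dev/shm", &vfs) != 0) {
            perror("statvfs /dev/shm");
            return 1;
        }
        double free_mib = (double)vfs.f_bavail * (double)vfs.f_frsize
                          / (1024.0 * 1024.0);
        printf("/dev/shm: %.0f MiB available\n", free_mib);
        return 0;
    }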
Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris
Hi Siegmar,

mpiexec and java run as distinct processes, and your JRE message says it is the java process that raises the SEGV. So you should trace the java process, not the mpiexec process. Moreover, your JRE message says the crash happened outside the Java Virtual Machine, in native code, so the usual Java debugger is of no use; you need to trace the native-code part of the java process. Unfortunately I don't know how to debug such a case. The log file written by the JRE may help you:

> # An error report file with more information is saved as:
> # /home/fd1026/work/skripte/master/parallel/prog/mpi/java/hs_err_pid13080.log

Regards,
Takahiro

> Hi,
>
> I installed openmpi-dev-124-g91e9686 on Solaris 10 Sparc with
> gcc-4.9.1 to track down the error with my small Java program.
> I started single stepping in orterun.c at line 1081 and
> continued until I got the segmentation fault. I get
> "jdata = 0x0" in version openmpi-1.8.2a1r31804, which is the
> last one that works with Java in my environment, while I get
> "jdata = 0x100125250" in this version. Unfortunately I don't
> know which files or variables are important to look at. Perhaps
> somebody can look at the following lines of code and tell me
> which information I should provide to solve the problem. I know
> that Solaris is no longer on your list of supported systems,
> but perhaps we can get it working again if you tell me what
> you need and I do the debugging.
>
> /usr/local/gdb-7.6.1_64_gcc/bin/gdb mpiexec
> GNU gdb (GDB) 7.6.1
> ...
> (gdb) run -np 1 java InitFinalizeMain
> Starting program: /usr/local/openmpi-1.9.0_64_gcc/bin/mpiexec \
>   -np 1 java InitFinalizeMain
> [Thread debugging using libthread_db enabled]
> [New Thread 1 (LWP 1)]
> [New LWP 2]
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x7ea3c7f0, pid=13064, tid=2
> ...