[OMPI users] Intel MPI Benchmark(IMB) using OpenMPI - Segmentation-fault error message.

2008-04-30 Thread Mukesh K Srivastava
Hi.

I am using IMB-3.1, the Intel MPI Benchmark suite, with Open MPI (v1.2.5). In
the /IMB-3.1/src/make_mpich file, I only set MPI_HOME, which takes care of
CC, OPTFLAGS and CLINKER. IMB-MPI1, IMB-EXT and IMB-IO all build
successfully.

I get proper IMB results when I run mpirun IMB-MPI1 with "-np 1", but with
"-np 2" I get the errors below:

-
[mukesh@n161 src]$ mpirun -np 2 IMB-MPI1
[n161:13390] *** Process received signal ***
[n161:13390] Signal: Segmentation fault (11)
[n161:13390] Signal code: Address not mapped (1)
[n161:13390] Failing at address: (nil)
[n161:13390] [ 0] /lib64/tls/libpthread.so.0 [0x399e80c4f0]
[n161:13390] [ 1] /home/mukesh/openmpi/prefix/lib/openmpi/mca_btl_sm.so [0x2a9830f8b4]
[n161:13390] [ 2] /home/mukesh/openmpi/prefix/lib/openmpi/mca_btl_sm.so [0x2a983109e3]
[n161:13390] [ 3] /home/mukesh/openmpi/prefix/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0xbc) [0x2a9830fc50]
[n161:13390] [ 4] /home/mukesh/openmpi/prefix/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x4b) [0x2a97fce447]
[n161:13390] [ 5] /home/mukesh/openmpi/prefix/lib/libopen-pal.so.0(opal_progress+0xbc) [0x2a958fc343]
[n161:13390] [ 6] /home/mukesh/openmpi/prefix/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_wait+0x22) [0x2a962e9e22]
[n161:13390] [ 7] /home/mukesh/openmpi/prefix/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_recv+0x677) [0x2a962f1aab]
[n161:13390] [ 8] /home/mukesh/openmpi/prefix/lib/libopen-rte.so.0(mca_oob_recv_packed+0x46) [0x2a9579d243]
[n161:13390] [ 9] /home/mukesh/openmpi/prefix/lib/openmpi/mca_gpr_proxy.so(orte_gpr_proxy_put+0x2f3) [0x2a96508c8f]
[n161:13390] [10] /home/mukesh/openmpi/prefix/lib/libopen-rte.so.0(orte_smr_base_set_proc_state+0x425) [0x2a957c391d]
[n161:13390] [11] /home/mukesh/openmpi/prefix/lib/libmpi.so.0(ompi_mpi_init+0xa1e) [0x2a9559f042]
[n161:13390] [12] /home/mukesh/openmpi/prefix/lib/libmpi.so.0(PMPI_Init_thread+0xcb) [0x2a955e1c5b]
[n161:13390] [13] IMB-MPI1(main+0x33) [0x403543]
[n161:13390] [14] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x399e11c3fb]
[n161:13390] [15] IMB-MPI1 [0x40347a]
[n161:13390] *** End of error message ***
[n161:13391] *** Process received signal ***
[n161:13391] Signal: Segmentation fault (11)
[n161:13391] Signal code: Address not mapped (1)
[n161:13391] Failing at address: (nil)
[n161:13391] [ 0] /lib64/tls/libpthread.so.0 [0x399e80c4f0]
[n161:13391] [ 1] /home/mukesh/openmpi/prefix/lib/openmpi/mca_btl_sm.so [0x2a9830f8b4]
[n161:13391] [ 2] /home/mukesh/openmpi/prefix/lib/openmpi/mca_btl_sm.so [0x2a983109e3]
[n161:13391] [ 3] /home/mukesh/openmpi/prefix/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0xbc) [0x2a9830fc50]
[n161:13391] [ 4] /home/mukesh/openmpi/prefix/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x4b) [0x2a97fce447]
[n161:13391] [ 5] /home/mukesh/openmpi/prefix/lib/libopen-pal.so.0(opal_progress+0xbc) [0x2a958fc343]
[n161:13391] [ 6] /home/mukesh/openmpi/prefix/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_wait+0x22) [0x2a962e9e22]
[n161:13391] [ 7] /home/mukesh/openmpi/prefix/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_recv+0x677) [0x2a962f1aab]
[n161:13391] [ 8] /home/mukesh/openmpi/prefix/lib/libopen-rte.so.0(mca_oob_recv_packed+0x46) [0x2a9579d243]
[n161:13391] [ 9] /home/mukesh/openmpi/prefix/lib/libopen-rte.so.0 [0x2a9579e910]
[n161:13391] [10] /home/mukesh/openmpi/prefix/lib/libopen-rte.so.0(mca_oob_xcast+0x140) [0x2a9579d824]
[n161:13391] [11] /home/mukesh/openmpi/prefix/lib/libmpi.so.0(ompi_mpi_init+0xaf1) [0x2a9559f115]
[n161:13391] [12] /home/mukesh/openmpi/prefix/lib/libmpi.so.0(PMPI_Init_thread+0xcb) [0x2a955e1c5b]
[n161:13391] [13] IMB-MPI1(main+0x33) [0x403543]
[n161:13391] [14] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x399e11c3fb]
[n161:13391] [15] IMB-MPI1 [0x40347a]
[n161:13391] *** End of error message ***

-

Query#1: Any clue for above?
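
One way to read the two backtraces above is to reduce each frame to its module and symbol. What follows is a minimal Python sketch and not part of the original post: the helper name parse_frames and the SAMPLE text (a few frames copied from the trace, with wrapped lines rejoined) are illustrative assumptions. In both processes frame [1], the first frame after the libpthread entry that is typically the signal-handling trampoline, lies in mca_btl_sm.so, Open MPI's shared-memory transport component.

```python
import os
import re

# Matches one backtrace frame as printed by Open MPI's signal handler, e.g.
# "[n161:13390] [ 3] /path/mca_btl_sm.so(symbol+0xbc) [0x2a9830fc50]".
FRAME_RE = re.compile(
    r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\] "
    r"\[\s*(?P<idx>\d+)\]\s+"
    r"(?P<module>[^\s(]+)"
    r"(?:\((?P<symbol>[^)+]+)\+0x[0-9a-f]+\))?"
    r"\s+\[(?P<addr>0x[0-9a-f]+)\]$"
)

# A few frames from the trace above (wrapped lines rejoined by hand).
SAMPLE = """\
[n161:13390] *** Process received signal ***
[n161:13390] [ 0] /lib64/tls/libpthread.so.0 [0x399e80c4f0]
[n161:13390] [ 1] /home/mukesh/openmpi/prefix/lib/openmpi/mca_btl_sm.so [0x2a9830f8b4]
[n161:13390] [ 2] /home/mukesh/openmpi/prefix/lib/openmpi/mca_btl_sm.so [0x2a983109e3]
[n161:13390] [ 3] /home/mukesh/openmpi/prefix/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0xbc) [0x2a9830fc50]
"""


def parse_frames(log_text):
    """Return {pid: [(frame index, module file name, symbol or None), ...]}."""
    frames = {}
    for line in log_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue  # not a frame line (banner, signal description, ...)
        entry = (
            int(match["idx"]),
            os.path.basename(match["module"]),
            match["symbol"],
        )
        frames.setdefault(int(match["pid"]), []).append(entry)
    return frames


if __name__ == "__main__":
    for pid, entries in parse_frames(SAMPLE).items():
        for idx, module, symbol in entries:
            print(pid, idx, module, symbol or "?")
```

Frames without a symbol name are reported with None as the symbol, so the module name is the only hint for those entries.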

Query#2: How can I benchmark a separate executable with IMB? For example,
writing a hello.c that uses elementary MPI API calls, compiling it with mpicc,
and then running IMB against that executable.

BR


[OMPI users] (no subject)

2008-04-30 Thread Gabriele FATIGATI
Hi,
I tried to run the SkaMPI benchmark on an IBM-BladeCenterLS21-BCX system with
256 processors, but the test stopped in the "AlltoAll-length" routine, at
count=8192, for some reason.

I launched the test with:
--mca btl_openib_eager_limit 1024

The same tests with 128 processors or fewer finished successfully.

Different values of the eager limit don't solve the problem. Thanks in advance.
-- 
Gabriele Fatigati

CINECA Systems & Technologies Department

Supercomputing  Group

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy

www.cineca.it    Tel: 39 051 6171722

g.fatig...@cineca.it   

Re: [OMPI users] blcr_checkpoint_peer: execvp returned -1

2008-04-30 Thread Josh Hursey


On Apr 29, 2008, at 7:18 AM, Leonardo Fialho wrote:


Josh,

Last night I made some changes, checked out a new SVN version, and completely
redid the BLCR installation. It's working fine now. I suspect two different
things:

1) cached or old files (configured with the older BLCR version's path) in
autom4te, configure or dependencies;
2) some misconfiguration in the BLCR header files.

When I checkpoint/restart a non-MPI application, it probably uses the correct
libraries, but the BLCR module was probably compiled against the older headers
(cache?).

I'm trying to reproduce the error, but before these changes (when it was not
working) BLCR returned the "bad file descriptor" (EBADF) error, and the blcr
module doesn't catch this error; it only returns (-1) "child failed".


I'll take a look at this and try to have the Open MPI BLCR module  
return something more representative of the actual error message.
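
To illustrate the kind of reporting discussed here, the following is a minimal Python sketch, not the Open MPI BLCR module's code. The helper name run_and_report and the placeholder command name are hypothetical. The idea is to keep the errno of a failed exec (or the child's stderr) instead of collapsing every failure into -1.

```python
import errno
import os
import subprocess


def run_and_report(argv):
    """Run argv and return (status, message).

    If the exec itself fails, the message carries the symbolic errno name
    and its text rather than only a bare -1.
    """
    try:
        proc = subprocess.run(argv, capture_output=True, text=True)
    except OSError as exc:
        name = errno.errorcode.get(exc.errno, str(exc.errno))
        return -1, f"exec of {argv[0]!r} failed: {name} ({exc.strerror})"
    if proc.returncode != 0:
        # The child ran but failed; pass its own diagnostic along.
        return proc.returncode, proc.stderr.strip() or "no stderr output"
    return 0, ""


if __name__ == "__main__":
    # Placeholder command name, assumed not to exist on this machine.
    print(run_and_report(["hypothetical-missing-checkpoint-tool"]))
    # The errno behind "Bad file descriptor" is EBADF.
    print(errno.errorcode[errno.EBADF], "=", os.strerror(errno.EBADF))
```

With a message like this, a missing binary and a checkpoint that started but failed can be told apart from the log alone.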


-- Josh




Thanks,
Leonardo Fialho

Josh Hursey escribió:

I don't think I have ever seen this one before. :(

So you are trying to checkpoint the MPI process by hand or a non-MPI
process? Can you confirm that you can successfully checkpoint/restart
a non-MPI process on these machines? What version of the Open MPI
trunk are you using? Have you made any changes to the trunk to produce
this build?

Can you send me the info described here (off-list is ok):
 http://www.open-mpi.org/community/help/

-- Josh

On Apr 28, 2008, at 5:10 AM, Leonardo Fialho wrote:



Changing some parameters (blcr_checkpoint_cmd):

[aogrd01:08552] crs:blcr: checkpoint(8552, ---)
[aogrd01:08552] crs:blcr: checkpoint_peer(8552, --)
[aogrd01:08552] crs:blcr: get_checkpoint_filename(--, 8552)
[aogrd01:08552] crs:blcr: checkpoint_cmd(8552)
[aogrd01:08552] crs:blcr: blcr_checkpoint_peer: exec :(/softs/blcr-0.6.5/bin/cr_checkpoint, /softs/blcr-0.6.5/bin/cr_checkpoint --pid 8552 --file /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552):
[aogrd01:08552] crs:blcr: thread_callback()
[aogrd01:08552] crs:blcr: thread_callback: Continue.
[aogrd01:08552] crs:blcr: blcr_checkpoint_peer: Thread finished with status 2
Checkpoint failed: Bad file descriptor
chmod: cannot access `/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552': No such file or directory
[aogrd01:08552] crs:blcr: move(): Error: Unable to execute the command : [256].
crs:blcr chmod: Resource temporarily unavailable
[aogrd01:08552] crs:blcr: checkpoint(): Error: Unable to chmod the checkpoint file (ompi_blcr_context.8552 in the directory (/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552) :[256].
crs:blcr: checkpoint: Invalid argument
[aogrd01:08552] opal_cr: inc_core: Error: The checkpoint failed. 256

BLCR doesn't generate the context file
(/tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552). If I execute the
checkpoint command manually (/softs/blcr-0.6.5/bin/cr_checkpoint --pid 8552
--file /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.8552), it returns
the same error: Checkpoint failed: Bad file descriptor

Thanks,
Leonardo Fialho

Leonardo Fialho escribió:


Hi All,

Has anybody experienced this error?

[aogrdini:09070] Global) Receive a command message from [[13242,0],0].
...
[aogrd02:07642] Local) Receive a command message.
...
[aogrd01:07938] Local) Receive a command message.
...
[aogrd01:07941] App) signal_handler: Receive Checkpoint Request.
...
[aogrd02:07645] App) signal_handler: Receive Checkpoint Request.
...
[aogrd01:07941] crs:blcr: checkpoint(7941, ---)
[aogrd01:07941] crs:blcr: checkpoint_peer(7941, --)
[aogrd01:07941] crs:blcr: get_checkpoint_filename(--, 7941)
[aogrd01:07941] crs:blcr: checkpoint_cmd(7941)
[aogrd01:07941] crs:blcr: blcr_checkpoint_peer: exec :(cr_checkpoint, cr_checkpoint --pid 7941 --file /tmp/opal_snapshot_0.ckpt/ompi_blcr_context.7941):
[aogrd01:07941] crs:blcr: blcr_checkpoint_peer: Child failed to execute :(-1):
[aogrd01:07941] crs:blcr: blcr_checkpoint_peer: execvp returned -1
...
[aogrd02:07645] crs:blcr: blcr_checkpoint_peer: exec :(cr_checkpoint, cr_checkpoint --pid 7645 --file /tmp/opal_snapshot_1.ckpt/ompi_blcr_context.7645):
[aogrd02:07645] crs:blcr: blcr_checkpoint_peer: Child failed to execute :(-1):
[aogrd02:07645] crs:blcr: blcr_checkpoint_peer: execvp returned -1
...
[aogrd02:07642] Local)   Location:[/tmp/opal_snapshot_1.ckpt]


The application stops here and doesn't continue execution. It's using libcr
version 0.6.5:
$ lsof -p 7518
/softs/blcr-0.6.5/0.6.5/lib/libcr.so.0.2.1

After the orte-checkpoint command, the application process is duplicated on
the nodes, as a child of the original process.
When I run an application with this version and take a checkpoint
manually, I have no problem...
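
In the log above, the child is asked to exec the bare name cr_checkpoint, and execvp resolves bare names through the PATH of the calling process. Whether PATH was the cause here is an assumption; the thread does not confirm it. A minimal Python sketch for checking how a command name would resolve (the helper name resolve_command is illustrative, and shutil.which only approximates the search that execvp performs):

```python
import os
import shutil


def resolve_command(name, search_path=None):
    """Return the absolute path a PATH search finds for name, or None.

    search_path defaults to the PATH of the current process, which may
    differ from the PATH seen by a process started through mpirun.
    """
    found = shutil.which(name, path=search_path)
    return os.path.abspath(found) if found else None


if __name__ == "__main__":
    # The bare name from the first log, and the full path that appears in
    # the later log after blcr_checkpoint_cmd was changed.
    for candidate in ("cr_checkpoint", "/softs/blcr-0.6.5/bin/cr_checkpoint"):
        print(candidate, "->", resolve_command(candidate))
```

Running the same check inside the environment that launches the MPI processes shows whether the bare name is resolvable there.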

Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos
Phone: +34-93-581-2888
Fax: +34-93-581-2478

