Re: [OMPI users] mpirun (signal 15 Termination) urgent

2009-01-15 Thread Hana Milani
Hello Simon, 

To run the program in parallel, I write:  mpirun -np 4 ~/program  output

Within a second, I receive the message: mpirun noticed that job rank 0 with PID 9477 on node linux-4pel exited on signal 15 (Terminated).

At the end of the output file, I see: "3 additional processes aborted (not shown)".

Please help me; I am really short of time to get this program running. If any other information is needed, please let me know and I will provide it.

regards,
Hana





Re: [OMPI users] mpirun (signal 15 Termination)

2009-01-15 Thread Hana Milani
Please tell me how to get rid of this message and how to run the parallel job.

I have another code that runs directly under mpirun without any problem, but this one, which needs BLACS and ScaLAPACK, is giving me trouble.

If there is any solution, please let me know.

Regards,
hana





Re: [OMPI users] mpirun (signal 15 Termination)

2009-01-15 Thread jody
Without any details it's difficult to make a diagnosis, but it looks like one of your processes is crashing, perhaps from a segmentation fault.

Have you run it with a debugger?
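For reference, a common Open MPI recipe for this (a sketch, assuming a working X11 display; the program path is the one from your earlier message):

$ mpirun -np 4 xterm -e gdb ~/program
# One xterm per rank opens; type "run" inside each gdb session and see
# which rank crashes and where.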

Jody

On Thu, Jan 15, 2009 at 9:39 AM, Hana Milani  wrote:
> Please tell me how to get rid of this message and how to run the parallel
> job.
>
> I have another code that runs directly under mpirun without any problem, but
> this one, which needs BLACS and ScaLAPACK, is giving me trouble.
>
> If there is any solution, please let me know.
>
> Regards,
> hana
>
>


Re: [OMPI users] mpirun (signal 15 Termination)

2009-01-15 Thread Jeff Squyres
Have you checked to ensure that the job manager is not killing your job? As I mentioned yesterday, SIGTERM usually means that some external agent killed your job.
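One way to check this (a sketch, assuming the job runs under a batch scheduler such as SGE; the job id below is hypothetical):

$ qacct -j 12345
# Look at the "failed" and "exit_status" fields; a non-zero failed value,
# or an exit_status of 143 (128 + SIGTERM), points to the scheduler or a
# resource limit killing the job rather than the application itself.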



On Jan 15, 2009, at 3:39 AM, Hana Milani wrote:

Please tell me how to get rid of this message and how to run the parallel job.

I have another code that runs directly under mpirun without any problem, but this one, which needs BLACS and ScaLAPACK, is giving me trouble.

If there is any solution, please let me know.

Regards,
hana




--
Jeff Squyres
Cisco Systems



[OMPI users] Timeout problem

2009-01-15 Thread Gabriele Fatigati
Dear OpenMPI developers,
I'm running my MPI application over an InfiniBand network on 128 processors. During the execution of my application, I get a strange timeout error:

checkPAMRESActionTab: action 63 connecting to RES on host  timed
out after 200 seconds

Is this a network problem or an application problem? How can I solve it?

Thanks in advance.


-- 
Ing. Gabriele Fatigati

Parallel programmer

CINECA Systems & Technologies Department

Supercomputing Group

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy

www.cineca.it    Tel: +39 051 6171722

g.fatigati [AT] cineca.it


[OMPI users] delay in launch?

2009-01-15 Thread Jeff Dusenberry
I'm trying to launch multiple xterms under OpenMPI 1.2.8 and the SGE job 
scheduler for purposes of running a serial debugger.  I'm experiencing 
file-locking problems on the .Xauthority file.


I tried to fix this by asking for a delay between successive launches, 
to reduce the chances of contention for the lock by:


~$ qrsh -pe mpi 4 -P CIS  /share/apps/openmpi/bin/mpiexec --mca 
pls_rsh_debug 1 --mca pls_rsh_delay 5  xterm


The 'pls_rsh_delay 5' parameter seems to have no effect.  I tried 
replacing 'pls_rsh_debug 1' with 'orte_debug 1', which gave me 
additional debugging output, but didn't fix the file locking problem.
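One way to check whether this Open MPI build recognises the parameter at all (a sketch; the grep is only illustrative):

$ /share/apps/openmpi/bin/ompi_info --param pls rsh | grep delay
# If no pls_rsh_delay entry is listed, this rsh launcher has no delay
# parameter and the setting is silently ignored.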


Sometimes the above commands will work and I will get all 4 xterms, but 
more often I will get an error:


/usr/bin/X11/xauth:  error in locking authority file 
/export/home/duse/.Xauthority


followed by

X11 connection rejected because of wrong authentication.
xterm Xt error: Can't open display: localhost:11.0

and one or more of the xterms will fail to open.

Am I missing something?  Is there another debug flag I need to set?  Any 
suggestions for a better way to do this would be appreciated.


Thanks,
Jeff



Re: [OMPI users] Problem with openmpi and infiniband

2009-01-15 Thread Biagio Lucini

Jeff Squyres wrote:

On Jan 7, 2009, at 6:28 PM, Biagio Lucini wrote:


[[5963,1],13][btl_openib_component.c:2893:handle_wc] from node24 to:
node11 error polling LP CQ with status RECEIVER NOT READY RETRY
EXCEEDED ERROR status number 13 for wr_id 37779456 opcode 0 qp_idx 0


Ah! If we're dealing with an RNR retry exceeded, this is *usually* a physical
layer problem on the IB fabric.

Have you run a complete layer 0 / physical set of diagnostics on the
fabric to know that it is completely working properly?



Once again, apologies for the delayed answer, but I always need to find 
a free spot to perform checks without disrupting the activity of the 
other users, who seem to be happy with the present status (this includes 
the other users of infiniband).


What I have done is to run the Intel MPI Benchmark in stress mode over 40 nodes, and then my code on exactly the same nodes. The errors from my code are attached. I do not attach the Intel benchmark output, since it is 100k and might upset someone, but I can send it on request. If I pick a random test:


#---------------------------------------------------
# Benchmarking Exchange
# #processes = 40
#---------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]  Mbytes/sec
            0         1000        19.70        20.37        19.87        0.00
            1         1000        12.80        13.61        13.25        0.28
            2         1000        12.94        13.73        13.39        0.56
            4         1000        12.93        13.24        13.14        1.15
            8         1000        12.46        12.89        12.65        2.37
           16         1000        14.59        15.35        15.00        3.98
           32         1000        12.83        13.42        13.26        9.09
           64         1000        13.17        13.49        13.31       18.10
          128         1000        13.83        14.40        14.20       33.90
          256         1000        16.47        17.34        16.89       56.33
          512         1000        22.72        23.29        22.99       83.85
         1024         1000        35.09        36.30        35.72      107.62
         2048         1000        71.28        72.46        71.91      107.81
         4096         1000       139.78       141.55       140.72      110.38
         8192         1000       237.86       240.13       239.10      130.14
        16384         1000       481.37       486.15       484.10      128.56
        32768         1000       864.89       872.48       869.35      143.27
        65536          640      1607.97      1629.53      1620.19      153.42
       131072          320      3106.92      3196.91      3160.10      156.40
       262144          160      5970.66      6333.02      6185.35      157.90
       524288           80     16322.10     18509.40     17627.17      108.05
      1048576           40     31194.17     40981.73     37056.97       97.60
      2097152           20     38023.90     77308.80     61021.08      103.48
      4194304           10     20423.82    143447.80     84832.93      111.54
#---------------------------------------------------


As you can see, the Intel benchmark runs fine on this set of nodes; I have been running it for a few hours without any problem. On the other hand, my job still has this problem. To recap:

both are compiled with Open MPI; the benchmark looks fine, while my job refuses to establish communication among processes, giving no error message with OMPI 1.2.x (various x) and giving the attached error message with 1.3rc2.


I have tried ibcheckerrors, which reports:

#warn: counter SymbolErrors = 65535 (threshold 10)
#warn: counter LinkDowned = 20  (threshold 10)
#warn: counter XmtDiscards = 65535  (threshold 100)
Error check on lid 1 (MT47396 Infiniscale-III Mellanox Technologies) 
port all:  FAILED

#warn: counter SymbolErrors = 65535 (threshold 10)
Error check on lid 1 (MT47396 Infiniscale-III Mellanox Technologies) 
port 10:  FAILED

# Checked Switch: nodeguid 0x000b8c002347 with failure
#warn: counter XmtDiscards = 65535  (threshold 100)
Error check on lid 1 (MT47396 Infiniscale-III Mellanox Technologies) 
port 1:  FAILED


## Summary: 25 nodes checked, 0 bad nodes found
##  48 ports checked, 2 ports have errors beyond threshold

Admittedly, not encouraging. The output of ibnetdiscover is attached.
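For what it's worth, one common way to distinguish stale counters from an active problem (a sketch, assuming the standard infiniband-diags tools are installed):

$ ibclearerrors     # zero the error counters across the fabric
$ # ... run the failing job again ...
$ ibcheckerrors     # see whether SymbolErrors/XmtDiscards climb back up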

I should add that the cluster (including InfiniBand) is currently in use. Unfortunately, my experience with InfiniBand is not adequate to take this investigation much further on my own.

Any further clue on possible problems is very welcome.

Many thanks for your attention,
Biagio

--
=

Dr. Biagio Lucini   
Department of Physics, Swansea University
Singleton Park, SA2 8PP Swansea (UK)
Tel. +44 (0)1792 602284

=

Re: [OMPI users] mpirun (signal 15 Termination)

2009-01-15 Thread Hana Milani
Dear all,

1. I have not run it with a debugger; could you tell me how to do it?

2. How can I make sure whether the job manager is or is not killing my job?

Sorry if my questions seem weird, but I have to solve this problem immediately.

Thanks for helping me





Re: [OMPI users] delay in launch?

2009-01-15 Thread Reuti

Am 15.01.2009 um 16:20 schrieb Jeff Dusenberry:

I'm trying to launch multiple xterms under OpenMPI 1.2.8 and the  
SGE job scheduler for purposes of running a serial debugger.  I'm  
experiencing file-locking problems on the .Xauthority file.


I tried to fix this by asking for a delay between successive  
launches, to reduce the chances of contention for the lock by:


~$ qrsh -pe mpi 4 -P CIS  /share/apps/openmpi/bin/mpiexec --mca  
pls_rsh_debug 1 --mca pls_rsh_delay 5  xterm


The 'pls_rsh_delay 5' parameter seems to have no effect.  I tried  
replacing 'pls_rsh_debug 1' with 'orte_debug 1', which gave me  
additional debugging output, but didn't fix the file locking problem.


Sometimes the above commands will work and I will get all 4 xterms,  
but more often I will get an error:


/usr/bin/X11/xauth:  error in locking authority file /export/home/ 
duse/.Xauthority


followed by

X11 connection rejected because of wrong authentication.
xterm Xt error: Can't open display: localhost:11.0

and one or more of the xterms will fail to open.

Am I missing something?  Is there another debug flag I need to  
set?  Any suggestions for a better way to do this would be  
appreciated.


You are right that it's neither Open MPI's nor SGE's fault, but a race condition in the SSH startup. You defined SSH with X11 forwarding in SGE (qconf -mconf), right? Then you first have an SSH connection from your workstation to the login machine, then one from the login machine to the node where mpiexec runs, and then one for each slave node (which means an additional one on the machine where mpiexec is already running).
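For reference, the ssh-with-X11 part of the global SGE configuration typically looks something like this (a sketch; the exact paths are assumptions for your installation):

$ qconf -mconf
...
rsh_command                  /usr/bin/ssh -X
rsh_daemon                   /usr/sbin/sshd -i
...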


Although it might be possible to give every started sshd a unique .Xauthority file, it's not straightforward to implement due to SGE's startup of the daemons, and you would need a sophisticated ~/.ssh/rc to create the files at different locations and use them in the forthcoming xterm.


If you just want to open a bunch of xterms, you could also use a script like this:


$ cat multi.sh
#!/bin/sh
. /usr/sge/default/common/settings.sh
for node in `cat $TMPDIR/machines`; do
    qrsh -inherit $node xterm &
    sleep 1
done
wait

The $TMPDIR/machines file is usually created for MPICH(1)'s parallel startup, but not for Open MPI, as it doesn't need it. Nevertheless, you could define it for your Open MPI PE or create another PE with the line:


$ qconf -sp mpi
...
start_proc_args        /usr/sge/mpi/startmpi.sh $pe_hostfile

When you run the script with "qrsh -pe mpi 4 ~/multi.sh" you should  
get the xterms.


(It might be advisable to define "execd_params ENABLE_ADDGRP_KILL=1" in your SGE configuration, to have the ability to kill all the created xterm processes from SGE; a sketch of that change follows below.)
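A sketch of that change (the parameter goes into the global configuration; the value is the one mentioned above):

$ qconf -mconf
...
execd_params                 ENABLE_ADDGRP_KILL=1
...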


HTH - Reuti