Re: [OMPI users] mpirun (signal 15 Termination) urgent
Hello Simon,

To run the program in parallel, I write:

mpirun -np 4 ~/program output

Within about a second I receive the message:

mpirun noticed that job rank 0 with PID 9477 on node linux-4pel exited on signal 15 (Terminated).

and at the end of the output file I see:

"3 additional processes aborted (not shown)"

Please help me; I am really short of time to run this program. If anything else is needed, please let me know and I will provide it.

regards,
Hana
Re: [OMPI users] mpirun (signal 15 Termination)
Please tell me how to get rid of the message and how to run the parallel job.

I have another code that runs directly under mpirun without a problem, but this one, which needs BLACS and ScaLAPACK, is playing with me.

If there is any solution, please let me have it.

Regards,
hana
Re: [OMPI users] mpirun (signal 15 Termination)
Without any details it is difficult to make a diagnosis, but it looks like one of your processes crashes, perhaps from a segmentation fault. Have you run it with a debugger?

Jody

On Thu, Jan 15, 2009 at 9:39 AM, Hana Milani wrote:
> please tell me how to get rid of the message and how to run the parallel job?
>
> I have another code running directly by mpirun without a problem, but this
> one that needed blacs and scalapack is playing with me.
>
> please if there is any solution let me have it.
>
> Regards,
> hana
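A quick way to do that with Open MPI is to start one debugger per rank in its own xterm. This is only a sketch: it assumes gdb and xterm are installed on the nodes and that X connections back to your display work:

$ mpirun -np 4 xterm -e gdb ~/program

Then type "run output" (i.e. "run" followed by your program's arguments) in each gdb window and wait for the crash; "bt" then shows where it happened. If you cannot open X windows from the compute nodes, an alternative is to allow core dumps and inspect one afterwards (core file names, and whether the limit propagates to remote nodes, depend on your system):

$ ulimit -c unlimited
$ mpirun -np 4 ~/program output
$ gdb ~/program core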
Re: [OMPI users] mpirun (signal 15 Termination)
Have you checked to ensure that the job manager is not killing your job? As I mentioned yesterday, SIGTERM usually means that some external agent killed your job.

On Jan 15, 2009, at 3:39 AM, Hana Milani wrote:
> please tell me how to get rid of the message and how to run the parallel job?
>
> I have another code running directly by mpirun without a problem, but this
> one that needed blacs and scalapack is playing with me.
>
> please if there is any solution let me have it.
>
> Regards,
> hana

--
Jeff Squyres
Cisco Systems
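If you want to see whether the TERM signal really does come from outside, one crude check is to start the application through a small wrapper script that logs the signal before forwarding it. This is only a sketch (the script name and log location are made up here), assuming a POSIX /bin/sh on the nodes:

$ cat run_wrapped.sh
#!/bin/sh
# Hypothetical wrapper around the real program: log any SIGTERM delivered
# from outside (scheduler, admin, cleanup script), then forward it.
log=$HOME/sigterm-`hostname`-$$.log

~/program "$@" &
child=$!

trap 'echo "`date`: wrapper $$ on `hostname` got SIGTERM, forwarding to pid $child" >> "$log"; kill -TERM $child' TERM

wait $child

Make it executable and launch it with "mpirun -np 4 ./run_wrapped.sh output"; if the log files appear on the nodes, something outside the application sent the signal.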
[OMPI users] Timeout problem
Dear Open MPI developers,

I'm running my MPI application over an InfiniBand network on 128 processors. During the execution of my application, I get a strange timeout error:

checkPAMRESActionTab: action 63 connecting to RES on host timed out after 200 seconds

Is this a network problem or an application problem? How can I solve it?

Thanks in advance.

--
Ing. Gabriele Fatigati
Parallel programmer
CINECA Systems & Technologies Department
Supercomputing Group
Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
www.cineca.it   Tel: +39 051 6171722
g.fatigati [AT] cineca.it
[OMPI users] delay in launch?
I'm trying to launch multiple xterms under OpenMPI 1.2.8 and the SGE job scheduler for purposes of running a serial debugger. I'm experiencing file-locking problems on the .Xauthority file. I tried to fix this by asking for a delay between successive launches, to reduce the chances of contention for the lock:

~$ qrsh -pe mpi 4 -P CIS /share/apps/openmpi/bin/mpiexec --mca pls_rsh_debug 1 --mca pls_rsh_delay 5 xterm

The 'pls_rsh_delay 5' parameter seems to have no effect. I tried replacing 'pls_rsh_debug 1' with 'orte_debug 1', which gave me additional debugging output but didn't fix the file-locking problem.

Sometimes the above command will work and I will get all 4 xterms, but more often I will get an error:

/usr/bin/X11/xauth: error in locking authority file /export/home/duse/.Xauthority

followed by

X11 connection rejected because of wrong authentication.
xterm Xt error: Can't open display: localhost:11.0

and one or more of the xterms will fail to open.

Am I missing something? Is there another debug flag I need to set? Any suggestions for a better way to do this would be appreciated.

Thanks,
Jeff
Re: [OMPI users] Problem with openmpi and infiniband
Jeff Squyres wrote:
> On Jan 7, 2009, at 6:28 PM, Biagio Lucini wrote:
>> [[5963,1],13][btl_openib_component.c:2893:handle_wc] from node24 to: node11 error polling LP CQ
>> with status RECEIVER NOT READY RETRY EXCEEDED ERROR status number 13 for wr_id 37779456 opcode 0 qp_idx 0
>
> Ah! If we're dealing with an RNR retry exceeded, this is *usually* a physical-layer problem on the
> IB fabric. Have you run a complete layer 0 / physical set of diagnostics on the fabric to know that
> it is completely working properly?

Once again, apologies for the delayed answer, but I always need to find a free spot to perform checks without disrupting the activity of the other users, who seem to be happy with the present status (this includes the other users of InfiniBand).

What I have done is to run the Intel MPI Benchmark in stress mode over 40 nodes, and then my code on exactly the same nodes. The errors from my code are attached. I do not attach the Intel benchmark file, since it is 100k and might upset someone, but I can send it on request. If I pick a random test:

#-----------------------------------------------------------------------------
# Benchmarking Exchange
# #processes = 40
#-----------------------------------------------------------------------------
   #bytes  #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]  Mbytes/sec
        0          1000        19.70        20.37        19.87        0.00
        1          1000        12.80        13.61        13.25        0.28
        2          1000        12.94        13.73        13.39        0.56
        4          1000        12.93        13.24        13.14        1.15
        8          1000        12.46        12.89        12.65        2.37
       16          1000        14.59        15.35        15.00        3.98
       32          1000        12.83        13.42        13.26        9.09
       64          1000        13.17        13.49        13.31       18.10
      128          1000        13.83        14.40        14.20       33.90
      256          1000        16.47        17.34        16.89       56.33
      512          1000        22.72        23.29        22.99       83.85
     1024          1000        35.09        36.30        35.72      107.62
     2048          1000        71.28        72.46        71.91      107.81
     4096          1000       139.78       141.55       140.72      110.38
     8192          1000       237.86       240.13       239.10      130.14
    16384          1000       481.37       486.15       484.10      128.56
    32768          1000       864.89       872.48       869.35      143.27
    65536           640      1607.97      1629.53      1620.19      153.42
   131072           320      3106.92      3196.91      3160.10      156.40
   262144           160      5970.66      6333.02      6185.35      157.90
   524288            80     16322.10     18509.40     17627.17      108.05
  1048576            40     31194.17     40981.73     37056.97       97.60
  2097152            20     38023.90     77308.80     61021.08      103.48
  4194304            10     20423.82    143447.80     84832.93      111.54
#-----------------------------------------------------------------------------

As you can see, the Intel benchmark runs fine on this set of nodes; I have been running it for a few hours without any problem. On the other hand, my job still has this problem. To recap: both are compiled with Open MPI; the benchmark looks fine, while my job refuses to establish communication among processes without giving any error message with OMPI 1.2.x (various x), and gives the attached error message with 1.3rc2.

I have tried ibcheckerrors, which reports:

#warn: counter SymbolErrors = 65535 (threshold 10)
#warn: counter LinkDowned = 20 (threshold 10)
#warn: counter XmtDiscards = 65535 (threshold 100)
Error check on lid 1 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter SymbolErrors = 65535 (threshold 10)
Error check on lid 1 (MT47396 Infiniscale-III Mellanox Technologies) port 10: FAILED
# Checked Switch: nodeguid 0x000b8c002347 with failure
#warn: counter XmtDiscards = 65535 (threshold 100)
Error check on lid 1 (MT47396 Infiniscale-III Mellanox Technologies) port 1: FAILED
## Summary: 25 nodes checked, 0 bad nodes found
## 48 ports checked, 2 ports have errors beyond threshold

Admittedly, not encouraging. The output of ibnetdiscover is attached. I should add that the cluster (including InfiniBand) is currently being used. Unfortunately, my experience with InfiniBand is not adequate to take this much further on my own; any further clue on possible problems is very welcome.

Many thanks for your attention,
Biagio

--
=========================================
Dr. Biagio Lucini
Department of Physics, Swansea University
Singleton Park, SA2 8PP Swansea (UK)
Tel. +44 (0)1792 602284
=========================================

[node17:25443
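For reference, the kind of layer-0 sweep being asked about here usually looks something like the following. This is only a sketch: it assumes the OFED diagnostic tools (infiniband-diags and ibutils) are installed on a host with access to the fabric, and option names differ between releases:

$ ibstat           # local HCA port state, link width and speed
$ ibdiagnet        # sweep the whole fabric; reports bad links and ports with high error counters
$ ibclearerrors    # zero the port error counters on the fabric ...
  (re-run the job or benchmark)
$ ibcheckerrors    # ... then re-check, so only errors produced during the run are counted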
Re: [OMPI users] mpirun (signal 15 Termination)
Dear all,

1. I have not run it with a debugger; could you tell me how to do it?
2. How can I make sure whether or not the job manager is killing my job?

Sorry if my questions seem weird, but I have to solve the problem immediately.

Thanks for helping me
Re: [OMPI users] delay in launch?
On 15.01.2009, at 16:20, Jeff Dusenberry wrote:
> I'm trying to launch multiple xterms under OpenMPI 1.2.8 and the SGE job scheduler for purposes of
> running a serial debugger. I'm experiencing file-locking problems on the .Xauthority file. [...]
> Am I missing something? Is there another debug flag I need to set? Any suggestions for a better way
> to do this would be appreciated.

You are right that it's neither Open MPI's nor SGE's fault, but a race condition in the SSH startup. You defined SSH with X11 forwarding in SGE (qconf -mconf), right? Then you have first an ssh connection from your workstation to the login machine, then one from the login machine to the node where mpiexec runs, and then one for each slave node (which means an additional one on the machine where mpiexec is already executing). Although it might be possible to give every started sshd a unique .Xauthority file, it's not straightforward to implement, due to SGE's startup of the daemons, and you would need a sophisticated ~/.ssh/rc to create the files in different locations and use them in the forthcoming xterm.

If you just want to open a bunch of xterms, you could also use a script like this:

$ cat multi.sh
#!/bin/sh
. /usr/sge/default/common/settings.sh
for node in `cat $TMPDIR/machines`; do
    qrsh -inherit $node xterm &
    sleep 1
done
wait

The $TMPDIR/machines file is usually created for MPICH(1)'s parallel startup, but not for Open MPI, as it doesn't need it. Nevertheless, you could define it for your Open MPI PE, or create another PE with the line:

$ qconf -sp mpi
...
start_proc_args    /usr/sge/mpi/startmpi.sh $pe_hostfile

When you run the script with "qrsh -pe mpi 4 ~/multi.sh" you should get the xterms. (It might be advisable to define "execd_params ENABLE_ADDGRP_KILL=1" in your SGE configuration, to have the ability to kill all the created xterm processes from SGE.)

HTH - Reuti