Hi,
On 24.12.2008, at 07:55, Sangamesh B wrote:
Thanks Reuti. That sorted out the problem.
Now mpiblast is able to run, but only on a single node, i.e. mpiformatdb
with 4 fragments and mpiblast with 4 processes. Since each node has 4
cores, the job runs on a single node and works fine. With 8
processes, the job fails with the following error message:
I would suggest searching the SGE mailing list archive for
"mpiblast" in the mail body - there are several entries about solving
this issue, which might also apply to your case.
-- Reuti
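(Editorial sketch, not part of the original thread: the 8-process case
described above would be requested by raising the slot count in the submit
script; splitting the database into 8 fragments with mpiformatdb is also
shown, though the exact option name is an assumption.)

$ mpiformatdb --nfrags=8 -i Mtub_CDC1551_.faa    (option name assumed)

and in sge_submit.sh:

#$ -pe orte 8
/opt/openmpi_intel/1.2.8/bin/mpirun -np $NSLOTS \
    /opt/apps/mpiblast-150-pio_OMPI/bin/mpiblast -p blastp -d \
    Mtub_CDC1551_.faa -i 586_seq.fasta -o test.out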
$ cat err.108.OMPI-Blast-Job
[0,1,7][btl_openib_component.c:1371:btl_openib_component_progress]
from compute-0-5.local to: compute-0-11.local error polling HP CQ with
status LOCAL LENGTH ERROR status number 1 for wr_id 12002616 opcode 42
[compute-0-11.local:09692] [0,0,0]-[0,1,2] mca_oob_tcp_msg_recv: readv
failed: Connection reset by peer (104)
[compute-0-11.local:09692] [0,0,0]-[0,1,4] mca_oob_tcp_msg_recv: readv
failed: Connection reset by peer (104)
4 0.674234 Bailing out with signal 15
[compute-0-5.local:10032] MPI_ABORT invoked on rank 4 in communicator
MPI_COMM_WORLD with errorcode 0
5 1.324 Bailing out with signal 15
[compute-0-5.local:10033] MPI_ABORT invoked on rank 5 in communicator
MPI_COMM_WORLD with errorcode 0
6 1.32842 Bailing out with signal 15
[compute-0-5.local:10034] MPI_ABORT invoked on rank 6 in communicator
MPI_COMM_WORLD with errorcode 0
[compute-0-11.local:09692] [0,0,0]-[0,1,3] mca_oob_tcp_msg_recv: readv
failed: Connection reset by peer (104)
0 0.674561 Bailing out with signal 15
[compute-0-11.local:09782] MPI_ABORT invoked on rank 0 in communicator
MPI_COMM_WORLD with errorcode 0
1 0.808846 Bailing out with signal 15
[compute-0-11.local:09783] MPI_ABORT invoked on rank 1 in communicator
MPI_COMM_WORLD with errorcode 0
2 0.81484 Bailing out with signal 15
[compute-0-11.local:09784] MPI_ABORT invoked on rank 2 in communicator
MPI_COMM_WORLD with errorcode 0
3 1.32249 Bailing out with signal 15
[compute-0-11.local:09785] MPI_ABORT invoked on rank 3 in communicator
MPI_COMM_WORLD with errorcode 0
I think it's a problem with Open MPI; it's not able to communicate with
processes on another node.
Please help me to get it working on multiple nodes.
Thanks,
Sangamesh
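(Editorial sketch, not part of the original thread: one way to check whether
the openib BTL is at fault is to rerun the same job over TCP only, e.g. with
"--mca btl self,sm,tcp". If the 8-process job then runs across nodes, the
problem is in the InfiniBand path rather than in SGE/Open MPI process
startup.)

/opt/openmpi_intel/1.2.8/bin/mpirun -np $NSLOTS --mca btl self,sm,tcp \
    /opt/apps/mpiblast-150-pio_OMPI/bin/mpiblast -p blastp -d \
    Mtub_CDC1551_.faa -i 586_seq.fasta -o test.out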
On Tue, Dec 23, 2008 at 4:45 PM, Reuti <re...@staff.uni-marburg.de>
wrote:
Hi,
On 23.12.2008, at 12:03, Sangamesh B wrote:
Hello,
I've compiled the mpiBLAST-1.5.0-pio app on a Rocks 4.3, Voltaire
InfiniBand based Linux cluster using Open MPI 1.2.8 and the Intel 10
compilers.
The job is not running. Let me explain the configuration:
SGE job script:
$ cat sge_submit.sh
#!/bin/bash
#$ -N OMPI-Blast-Job
#$ -S /bin/bash
#$ -cwd
#$ -e err.$JOB_ID.$JOB_NAME
#$ -o out.$JOB_ID.$JOB_NAME
#$ -pe orte 4
/opt/openmpi_intel/1.2.8/bin/mpirun -np $NSLOTS \
    /opt/apps/mpiblast-150-pio_OMPI/bin/mpiblast -p blastp -d \
    Mtub_CDC1551_.faa -i 586_seq.fasta -o test.out
The PE orte is:
$ qconf -sp orte
pe_name orte
slots 999
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $fill_up
control_slaves FALSE
job_is_first_task TRUE
you will need here:
control_slaves TRUE
job_is_first_task FALSE
(a corrected listing with these two changes is sketched after the PE output below)
-- Reuti
urgency_slots min
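(Editorial sketch, not part of the original thread: with Reuti's two changes
applied, the PE would be edited with "qconf -mp orte" and end up looking
roughly like this; all other lines stay as posted above.)

$ qconf -sp orte
pe_name            orte
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min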
# /opt/openmpi_intel/1.2.8/bin/ompi_info | grep gridengine
MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.8)
MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.8)
The SGE error and output files for the job are as follows:
$ cat err.88.OMPI-Blast-Job
error: executing task of job 88 failed:
[compute-0-1.local:06151] ERROR: A daemon on node compute-0-1.local
failed to start as expected.
[compute-0-1.local:06151] ERROR: There may be more information available from
[compute-0-1.local:06151] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[compute-0-1.local:06151] ERROR: If the problem persists, please restart the
[compute-0-1.local:06151] ERROR: Grid Engine PE job
[compute-0-1.local:06151] ERROR: The daemon exited unexpectedly with
status 1.
$ cat out.88.OMPI-Blast-Job
There is nothing in the output file.
qstat shows that the job is running on some node, but on that node
there are no mpiblast processes running, as seen with the top command.
The ps command shows this:
# ps -ef | grep mpiblast
locuz 4018 4017 0 16:25 ? 00:00:00
/opt/openmpi_intel/1.2.8/bin/mpirun -np 4
/opt/apps/mpiblast-150-pio_OMPI/bin/mpiblast -p blastp -d
Mtub_CDC1551_.faa -i 586_seq.fasta -o test.out
root 4120 4022 0 16:27 pts/0 00:00:00 grep mpiblast
The ibv_rc_pingpong tests work fine. The output of lsmod:
# lsmod | grep ib
ib_sdp 57788 0
rdma_cm 38292 3 rdma_ucm,rds,ib_sdp
ib_addr 11400 1 rdma_cm
ib_local_sa 14864 1 rdma_cm
ib_mthca 157396 2
ib_ipoib 83928 0
ib_umad 20656 0
ib_ucm 21256 0
ib_uverbs 46896 8 rdma_ucm,ib_ucm
ib_cm 42536 3 rdma_cm,ib_ipoib,ib_ucm
ib_sa 28512 4 rdma_cm,ib_local_sa,ib_ipoib,ib_cm
ib_mad 43432 5 ib_local_sa,ib_mthca,ib_umad,ib_cm,ib_sa
ib_core 70544 14 rdma_ucm,rds,ib_sdp,rdma_cm,iw_cm,ib_local_sa,ib_mthca,ib_ipoib,ib_umad,ib_ucm,ib_uverbs,ib_cm,ib_sa,ib_mad
ipv6 285089 23 ib_ipoib
libata 124585 1 ata_piix
scsi_mod 144529 2 libata,sd_mod
What might be the problem?
We've used the Voltaire OFA Roll (GridStack) from Rocks.
Thanks,
Sangamesh
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users