Thanks, Reuti. That sorted out the problem.
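
For the record, the orte PE now carries the two settings you suggested
(changed with qconf -mp orte; the rest of the PE is left as before):

control_slaves    TRUE
job_is_first_task FALSE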

Now mpiBLAST is able to run, but only on a single node: the database was
split into 4 fragments with mpiformatdb and mpiblast was started with 4
processes, so the job fits on one 4-core node and works fine (the launch
commands are sketched just after the log below). With 8 processes, the
job fails with the following error message:

$ cat err.108.OMPI-Blast-Job
[0,1,7][btl_openib_component.c:1371:btl_openib_component_progress]
from compute-0-5.local to: compute-0-11.local error polling HP CQ with
status LOCAL LENGTH ERROR status number 1 for wr_id 12002616 opcode 42
[compute-0-11.local:09692] [0,0,0]-[0,1,2] mca_oob_tcp_msg_recv: readv
failed: Connection reset by peer (104)
[compute-0-11.local:09692] [0,0,0]-[0,1,4] mca_oob_tcp_msg_recv: readv
failed: Connection reset by peer (104)
4       0.674234        Bailing out with signal 15
[compute-0-5.local:10032] MPI_ABORT invoked on rank 4 in communicator
MPI_COMM_WORLD with errorcode 0
5       1.324   Bailing out with signal 15
[compute-0-5.local:10033] MPI_ABORT invoked on rank 5 in communicator
MPI_COMM_WORLD with errorcode 0
6       1.32842 Bailing out with signal 15
[compute-0-5.local:10034] MPI_ABORT invoked on rank 6 in communicator
MPI_COMM_WORLD with errorcode 0
[compute-0-11.local:09692] [0,0,0]-[0,1,3] mca_oob_tcp_msg_recv: readv
failed: Connection reset by peer (104)
0       0.674561        Bailing out with signal 15
[compute-0-11.local:09782] MPI_ABORT invoked on rank 0 in communicator
MPI_COMM_WORLD with errorcode 0
1       0.808846        Bailing out with signal 15
[compute-0-11.local:09783] MPI_ABORT invoked on rank 1 in communicator
MPI_COMM_WORLD with errorcode 0
2       0.81484 Bailing out with signal 15
[compute-0-11.local:09784] MPI_ABORT invoked on rank 2 in communicator
MPI_COMM_WORLD with errorcode 0
3       1.32249 Bailing out with signal 15
[compute-0-11.local:09785] MPI_ABORT invoked on rank 3 in communicator
MPI_COMM_WORLD with errorcode 0
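
For reference, the two runs were launched roughly as below (a sketch only;
the mpiformatdb fragment option is written as --nfrags here following
mpiBLAST's usual usage and may differ in 1.5.0-pio):

# database split into 4 fragments (protein DB, hence -p T):
mpiformatdb --nfrags=4 -i Mtub_CDC1551_.faa -p T

# 4 processes -> fits on one 4-core node, works:
/opt/openmpi_intel/1.2.8/bin/mpirun -np 4 \
    /opt/apps/mpiblast-150-pio_OMPI/bin/mpiblast -p blastp \
    -d Mtub_CDC1551_.faa -i 586_seq.fasta -o test.out

# 8 processes -> spans two nodes, fails with the error above:
/opt/openmpi_intel/1.2.8/bin/mpirun -np 8 \
    /opt/apps/mpiblast-150-pio_OMPI/bin/mpiblast -p blastp \
    -d Mtub_CDC1551_.faa -i 586_seq.fasta -o test.out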

I think it's a problem with Open MPI: it's not able to communicate with
the processes on the other node.
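
To narrow it down, here is what I can try next, run from the same SGE job
script so Open MPI picks up the node list from the PE (a rough sketch; the
--mca option and the ulimit test are generic Open MPI / Linux diagnostics,
nothing mpiBLAST-specific):

# 1. Confirm a trivial MPI job spans both nodes at all:
/opt/openmpi_intel/1.2.8/bin/mpirun -np 8 hostname

# 2. Take the openib BTL out of the picture by forcing TCP; if the
#    8-process run then works, the problem is isolated to InfiniBand:
/opt/openmpi_intel/1.2.8/bin/mpirun -np 8 --mca btl tcp,self \
    /opt/apps/mpiblast-150-pio_OMPI/bin/mpiblast -p blastp \
    -d Mtub_CDC1551_.faa -i 586_seq.fasta -o test.out

# 3. Check the locked-memory limit the MPI processes actually get on each
#    node; a low "max locked memory" is a common cause of openib errors:
/opt/openmpi_intel/1.2.8/bin/mpirun -np 8 bash -c 'ulimit -l'
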
Please help me to get it working on multiple nodes.

Thanks,
Sangamesh


On Tue, Dec 23, 2008 at 4:45 PM, Reuti <re...@staff.uni-marburg.de> wrote:
> Hi,
>
> Am 23.12.2008 um 12:03 schrieb Sangamesh B:
>
>> Hello,
>>
>>   I've compiled the mpiBLAST-1.5.0-pio app on a Rocks 4.3, Voltaire
>> InfiniBand-based Linux cluster using Open MPI 1.2.8 and the Intel 10
>> compilers.
>>
>>  The job is not running. Let me explain the configs:
>>
>> SGE job script:
>>
>>  $ cat sge_submit.sh
>> #!/bin/bash
>>
>> #$ -N OMPI-Blast-Job
>>
>> #$ -S /bin/bash
>>
>> #$ -cwd
>>
>> #$ -e err.$JOB_ID.$JOB_NAME
>>
>> #$ -o out.$JOB_ID.$JOB_NAME
>>
>> #$ -pe orte 4
>>
>> /opt/openmpi_intel/1.2.8/bin/mpirun -np $NSLOTS
>> /opt/apps/mpiblast-150-pio_OMPI/bin/mpiblast -p blastp -d
>> Mtub_CDC1551_.faa -i 586_seq.fasta -o test.out
>>
>> The PE orte is:
>>
>> $ qconf -sp orte
>> pe_name           orte
>> slots             999
>> user_lists        NONE
>> xuser_lists       NONE
>> start_proc_args   /bin/true
>> stop_proc_args    /bin/true
>> allocation_rule   $fill_up
>> control_slaves    FALSE
>> job_is_first_task TRUE
>
> you will need here:
>
> control_slaves    TRUE
> job_is_first_task FALSE
>
> -- Reuti
>
>
>> urgency_slots     min
>>
>> # /opt/openmpi_intel/1.2.8/bin/ompi_info | grep gridengine
>>                 MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.8)
>>                 MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.8)
>>
>> The SGE error and output files for the job are as follows:
>>
>> $ cat err.88.OMPI-Blast-Job
>> error: executing task of job 88 failed:
>> [compute-0-1.local:06151] ERROR: A daemon on node compute-0-1.local
>> failed to start as expected.
>> [compute-0-1.local:06151] ERROR: There may be more information available
>> from
>> [compute-0-1.local:06151] ERROR: the 'qstat -t' command on the Grid
>> Engine tasks.
>> [compute-0-1.local:06151] ERROR: If the problem persists, please restart
>> the
>> [compute-0-1.local:06151] ERROR: Grid Engine PE job
>> [compute-0-1.local:06151] ERROR: The daemon exited unexpectedly with
>> status 1.
>>
>> $ cat out.88.OMPI-Blast-Job
>>
>> There is nothing in the output file.
>>
>> qstat shows that the job is running on some node, but on that node no
>> mpiblast processes are running, as seen with top.
>>
>> The ps command shows only the mpirun process, with no mpiblast ranks:
>>
>> # ps -ef | grep mpiblast
>> locuz     4018  4017  0 16:25 ?        00:00:00
>> /opt/openmpi_intel/1.2.8/bin/mpirun -np 4
>> /opt/apps/mpiblast-150-pio_OMPI/bin/mpiblast -p blastp -d
>> Mtub_CDC1551_.faa -i 586_seq.fasta -o test.out
>> root      4120  4022  0 16:27 pts/0    00:00:00 grep mpiblast
>>
>> The ibv_rc_pingpong tests work fine. The output of lsmod:
>>
>> # lsmod | grep ib
>> ib_sdp                 57788  0
>> rdma_cm                38292  3 rdma_ucm,rds,ib_sdp
>> ib_addr                11400  1 rdma_cm
>> ib_local_sa            14864  1 rdma_cm
>> ib_mthca              157396  2
>> ib_ipoib               83928  0
>> ib_umad                20656  0
>> ib_ucm                 21256  0
>> ib_uverbs              46896  8 rdma_ucm,ib_ucm
>> ib_cm                  42536  3 rdma_cm,ib_ipoib,ib_ucm
>> ib_sa                  28512  4 rdma_cm,ib_local_sa,ib_ipoib,ib_cm
>> ib_mad                 43432  5 ib_local_sa,ib_mthca,ib_umad,ib_cm,ib_sa
>> ib_core                70544  14
>>
>> rdma_ucm,rds,ib_sdp,rdma_cm,iw_cm,ib_local_sa,ib_mthca,ib_ipoib,ib_umad,ib_ucm,ib_uverbs,ib_cm,ib_sa,ib_mad
>> ipv6                  285089  23 ib_ipoib
>> libata                124585  1 ata_piix
>> scsi_mod              144529  2 libata,sd_mod
>>
>> What might be the problem?
>> We've used the Voltaire OFA roll (GridStack) from Rocks.
>>
>> Thanks,
>> Sangamesh
