Can you get on one of the nodes and see the job's processes? If so, can
you then attach a debugger to one of them and get a stack trace? I
wonder if the processes are stuck in MPI_Init.
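For example (a quick sketch, assuming gdb is installed on the node;
<PID> is a placeholder for whatever ps reports for one of your mpihello
or orted processes):
[mpiuser@hp430a ~]$ ps -ef | grep mpihello
[mpiuser@hp430a ~]$ gdb -p <PID>
(gdb) thread apply all bt
(gdb) detach
If the backtrace shows the process sitting inside MPI_Init, that would
narrow things down.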
--td
On 6/7/2012 6:06 AM, Duke wrote:
Hi again,
Somehow the verbose flag (-v) did not work for me. I tried
--debug-daemons and got:
[mpiuser@fantomfs40a ~]$ mpirun --debug-daemons -np 3 --machinefile
/home/mpiuser/.mpi_hostfile ./test/mpihello
Daemon was launched on hp430a - beginning to initialize
Daemon [[34432,0],1] checking in as pid 3011 on host hp430a
<stuck here>
Somehow the program got stuck while the daemons were checking in on the
hosts. The secure log on hp430a showed that mpiuser logged in just fine:
tail /var/log/secure
Jun 7 17:07:31 hp430a sshd[3007]: Accepted publickey for mpiuser from
192.168.0.101 port 34037 ssh2
Jun 7 17:07:31 hp430a sshd[3007]: pam_unix(sshd:session): session
opened for user mpiuser by (uid=0)
Any idea what to check next, or where to look?
Thanks,
D.
On 6/7/12 4:38 PM, Duke wrote:
Hi Jingcha,
On 6/7/12 4:28 PM, Jingcha Joba wrote:
Hello Duke,
Welcome to the forum.
The way Open MPI schedules by default is to fill all the slots on a
host before moving on to the next host.
Check this link for some info:
http://www.open-mpi.org/faq/?category=running#mpirun-scheduling
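Concretely, with the default byslot scheduling and the hostfile you
posted, -np 2 fills the two slots on fantomfs40a before any other node
is used, which is why both ranks printed fantomfs40a. If you want
round-robin placement by node instead, Open MPI 1.5 has a --bynode
option; a sketch against your setup (untested on my end):
mpirun -np 2 --bynode --machinefile /home/mpiuser/.mpi_hostfile ./test/mpihello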
Thanks for the quick answer. I checked the FAQ and tried with more than
2 processes, but somehow it stalled:
[mpiuser@fantomfs40a ~]$ mpirun -v -np 4 --machinefile
/home/mpiuser/.mpi_hostfile ./test/mpihello
^Cmpirun: killing job...
I tried the --host flag and it stalled as well:
[mpiuser@fantomfs40a ~]$ mpirun -v -np 4 --host hp430a,hp430b
./test/mpihello
My configuration must be wrong somewhere. Any idea how I can check the
system?
Thanks,
D.
--
Jingcha
On Thu, Jun 7, 2012 at 2:11 AM, Duke <duke.li...@gmx.com> wrote:
Hi folks,
Please be gentle with the newest member of the Open MPI community; I am
totally new to this field. I just built a test cluster with 3 boxes
running Scientific Linux 6.2 and Open MPI 1.5.3, and I wanted to test
how the cluster works, but I can't figure out what is happening. On my
master node, I have the hostfile:
[mpiuser@fantomfs40a ~]$ cat .mpi_hostfile
# The Hostfile for Open MPI
fantomfs40a slots=2
hp430a slots=4 max-slots=4
hp430b slots=4 max-slots=4
To test, I used the following C code:
[mpiuser@fantomfs40a ~]$ cat test/mpihello.c
/* program hello */
/* Adapted from mpihello.f by drs */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>  /* for gethostname() */

int main(int argc, char **argv)
{
    int rank;
    char hostname[256];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank */
    gethostname(hostname, 255);
    printf("Hello world! I am process number: %d on host %s\n",
           rank, hostname);
    MPI_Finalize();
    return 0;
}
and then compiled and ran:
[mpiuser@fantomfs40a ~]$ mpicc -o test/mpihello test/mpihello.c
[mpiuser@fantomfs40a ~]$ mpirun -np 2 --machinefile
/home/mpiuser/.mpi_hostfile ./test/mpihello
Hello world! I am process number: 0 on host fantomfs40a
Hello world! I am process number: 1 on host fantomfs40a
Unfortunately the result did not show what I wanted. I expected to see
something like:
Hello world! I am process number: 0 on host hp430a
Hello world! I am process number: 1 on host hp430b
Does anybody have any idea what I am doing wrong?
Thank you in advance,
D.
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com