Re: [O-MPI users] Error on mpirun in Redhat Fedora Core 4

2005-11-10 Thread Clement Chu
Thanks for your help. kfc is the machine name and clement is the username on this machine. Do you think that is the problem? I then tried removing the kfc machine and running again. This time I can run the MPI program and there is no error message, but there is no program output either. I think it is som

Re: [O-MPI users] Error on mpirun in Redhat Fedora Core 4

2005-11-10 Thread Jeff Squyres
One minor thing that I notice in your ompi_info output -- your build and run machines are different (kfc vs. clement). Are these both FC4 machines, or are they different OS's/distros? On Nov 10, 2005, at 10:01 AM, Clement Chu wrote: [clement@kfc TestMPI]$ mpirun -d -np 2 test [kfc:29199] pr

Re: [O-MPI users] Error on mpirun in Redhat Fedora Core 4

2005-11-10 Thread Jeff Squyres
The name of the launcher is "rsh", but it actually defaults to trying to fork/exec ssh. Unfortunately, your backtrace doesn't tell much because there are no debugging symbols. Can you recompile OMPI with debugging enabled and send a new backtrace? Use: ./configure CFLAGS=-g

Re: [O-MPI users] mpif90 error: undefined reference to `mpi_reduce0dr8`

2005-11-10 Thread Jeff Squyres
Clarification on this -- my earlier response wasn't quite right... We actually do not provide F90 bindings for MPI_Reduce (and several other collectives) because they have 2 user-provided buffers. This means that for N intrinsic types, there are N^2 possible overloads for this function (becau
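As a rough illustration of the N^2 growth described above, here is a minimal sketch that enumerates the type pairs a two-buffer routine would need if each buffer could independently be any intrinsic type. The type list and the function name `two_buffer_overloads` are hypothetical and chosen only for illustration; they are not taken from Open MPI's F90 bindings.

```python
from itertools import product

# Hypothetical, shortened list of Fortran intrinsic types (an assumption for
# illustration, not the set Open MPI actually generates interfaces for).
INTRINSIC_TYPES = [
    "integer",
    "real",
    "double precision",
    "complex",
    "logical",
    "character",
]


def two_buffer_overloads(types):
    """Return every (first buffer type, second buffer type) pair.

    With two independently typed buffers, the number of pairs is
    len(types) ** 2.
    """
    return list(product(types, repeat=2))


if __name__ == "__main__":
    overloads = two_buffer_overloads(INTRINSIC_TYPES)
    print(f"{len(INTRINSIC_TYPES)} types -> {len(overloads)} overloads")
```

With one buffer the count would grow linearly in N; the second buffer is what makes it quadratic.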

Re: [O-MPI users] Infiniband performance problems (mvapi)

2005-11-10 Thread Tim S. Woodall
Mike, I believe this issue has been corrected on the trunk, and should be in the next release candidate, probably by the end of the week. Thanks, Tim Mike Houston wrote: mpirun -mca btl_mvapi_rd_min 128 -mca btl_mvapi_rd_max 256 -np 2 -hostfile /u/mhouston/mpihosts mpi_bandwidth 21 131072 13

Re: [O-MPI users] Error on mpirun in Redhat Fedora Core 4

2005-11-10 Thread Clement Chu
Here is the backtrace result (now I am using 8085). Does mpirun start rsh? I think I need ssh instead of rsh. [clement@kfc tmp]$ gdb mpirun core.17766 GNU gdb Red Hat Linux (6.3.0.0-1.21rh) Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public L

Re: [O-MPI users] Error on mpirun in Redhat Fedora Core 4

2005-11-10 Thread Clement Chu
[clement@kfc TestMPI]$ mpirun -d -np 2 test [kfc:29199] procdir: (null) [kfc:29199] jobdir: (null) [kfc:29199] unidir: /tmp/openmpi-sessions-clement@kfc_0/default-universe [kfc:29199] top: openmpi-sessions-clement@kfc_0 [kfc:29199] tmp: /tmp [kfc:29199] [0,0,0] setting up session dir with [kfc:291

Re: [O-MPI users] Error on mpirun in Redhat Fedora Core 4

2005-11-10 Thread Jeff Squyres
I'm sorry -- I wasn't entirely clear: 1. Are you using a 1.0 nightly tarball or a 1.1 nightly tarball? We have made a bunch of fixes to the 1.1 tree (i.e., the Subversion trunk), but have not fully vetted them yet, so they have not been taken to the 1.0 release branch yet. If you have no

Re: [O-MPI users] Error on mpirun in Redhat Fedora Core 4

2005-11-10 Thread Clement Chu
I have tried the latest version (rc5 8053), but the error is still here. Jeff Squyres wrote: We've actually made quite a few bug fixes since RC4 (RC5 is not available yet). Would you mind trying with a nightly snapshot tarball? (there were some SVN commits last night after the nightly snaps

Re: [O-MPI users] can't get openmpi to run across two multi-NIC machines

2005-11-10 Thread Marty Humphrey
Here's a core I'm getting... [humphrey@zelda01 humphrey]$ mpiexec --mca btl_tcp_if_include eth0 --mca oob_tcp_include eth0 -np 2 a.out mpiexec noticed that job rank 1 with PID 20028 on node "localhost" exited on signal 11. 1 process killed (possibly by Open MPI) [humphrey@zelda01 humphrey]$ gdb

Re: [O-MPI users] can't get openmpi to run across two multi-NIC machines

2005-11-10 Thread Marty Humphrey
By the way, it just *feels* like a race condition somewhere, because the very next invocation worked (I ctrl-C'd it)... [humphrey@zelda01 humphrey]$ mpiexec -d --mca btl_tcp_if_include eth0 --mca oob_tcp_include eth0 -np 2 a.out [zelda01.localdomain:19923] procdir: (null) [zelda01.localdomain:19

Re: [O-MPI users] can't get openmpi to run across two multi-NIC machines

2005-11-10 Thread Marty Humphrey
I'm not seeing any cores -- I'll see if there's anything stopping them from being produced. I've attached to one of the hanging "a.out"s (this is with the "mpiexec" invocation that includes "--mca oob_tcp_include eth0") (gdb) bt #0 0x001e3007 in sched_yield () from /lib/tls/libc.so.6 #1 0x00512

Re: [O-MPI users] mpif90 error: undefined reference to `mpi_reduce0dr8`

2005-11-10 Thread Jeff Squyres
Great Leaping Lizards, Batman! Unbelievably, the MPI_Reduce interfaces were left out. I'm going to do a complete F90 audit right now to ensure that no other interfaces were unintentionally excluded; I'll commit a fix today. Thanks for catching this! On Nov 9, 2005, at 8:15 PM, Brent LEBACK

Re: [O-MPI users] Error on mpirun in Redhat Fedora Core 4

2005-11-10 Thread Jeff Squyres
We've actually made quite a few bug fixes since RC4 (RC5 is not available yet). Would you mind trying with a nightly snapshot tarball? (there were some SVN commits last night after the nightly snapshot was made; I've just initiated another snapshot build -- r8085 should be on the web site

[O-MPI users] Error on mpirun in Redhat Fedora Core 4

2005-11-10 Thread Clement Chu
Hi, I got an error when I tried to run an MPI program with mpirun. The following is the error message: [clement@kfc TestMPI]$ mpicc -g -o test main.c [clement@kfc TestMPI]$ mpirun -np 2 test mpirun noticed that job rank 1 with PID 0 on node "localhost" exited on signal 11. [kfc:28466] ERROR: A daemon on