Re: [OMPI users] OpenMPI job initializing problem

2014-03-03 Thread Beichuan Yan
1. Info from a compute node:
    -bash-4.1$ hostname
    r32i1n1
    -bash-4.1$ df -h /home
    Filesystem                                Size  Used Avail Use% Mounted on
    10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1  1.2P  136T  1.1P  12% /work1
    -bash-4.1$ mount
    devpts on /dev/pts type devpts (rw,gid=5,mode=620)
    tmpfs

Re: [OMPI users] OpenMPI job initializing problem

2014-03-03 Thread Gus Correa
Hi Beichuan OK, it says "unclassified.html", so I presume it is not a problem. The web site says the computer is an SGI ICE X. I am not familiar with it, so what follows are guesses. The SGI site brochure suggests that the nodes/blades have local disks: https://www.sgi.com/pdfs/4330.pdf The file

Re: [OMPI users] OpenMPI job initializing problem

2014-03-03 Thread Beichuan Yan
Gus, I am using this system: http://centers.hpc.mil/systems/unclassified.html#Spirit. I don't know the exact configuration of the file system. Here is the output of "df -h":
    Filesystem  Size  Used Avail Use% Mounted on
    /dev/sda6   919G   16G  857G   2% /
    tmpfs

Re: [OMPI users] 1.7.5rc1, error "COLL-ML ml_discover_hierarchy exited with error."

2014-03-03 Thread Rolf vandeVaart
There is something going wrong with the ml collective component, so if you disable it, things work. I just reconfigured without any CUDA-aware support, and I see the same failure, so it has nothing to do with CUDA. It looks like Jeff Squyres just filed a bug for it. https://svn.open-mpi.org/trac/o

Re: [OMPI users] 1.7.5rc1, error "COLL-ML ml_discover_hierarchy exited with error."

2014-03-03 Thread Filippo Spiga
Dear Rolf, your suggestion works!
    $ mpirun -np 4 --map-by ppr:1:socket -bind-to core --mca coll ^ml osu_alltoall
    # OSU MPI All-to-All Personalized Exchange Latency Test v4.2
    # Size       Avg Latency(us)
    1                       8.02
    2                       2.96
    4                       2.91
    8

Re: [OMPI users] OpenMPI job initializing problem

2014-03-03 Thread Gus Correa
Hi Beichuan If you are using the university cluster, chances are that /home is not local, but on an NFS share, or perhaps Lustre (which you may have mentioned before, I don't remember). Maybe "df -h" will show what is local and what is not. It works for NFS: it prefixes file systems with the serv
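
For illustration, a hypothetical "df -h" listing showing how the Filesystem column distinguishes local disks from network mounts (the first and third lines echo output quoted earlier in this thread; the NFS line and all sizes are made up):

    $ df -h
    Filesystem                                Size  Used Avail Use% Mounted on
    /dev/sda6                                 919G   16G  857G   2% /        <- local disk (plain device name)
    nfs-server:/export/home                    10T    2T    8T  20% /home    <- NFS (host:/path prefix)
    10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1  1.2P  136T  1.1P  12% /work1   <- Lustre (server NIDs prefix)

In short: a plain /dev/... device means a node-local filesystem, while a host:/path or addr@o2ib:... prefix means the mount comes over the network.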

Re: [OMPI users] 1.7.5rc1, error "COLL-ML ml_discover_hierarchy exited with error."

2014-03-03 Thread Rolf vandeVaart
Can you try running with --mca coll ^ml and see if things work? Rolf
>-----Original Message-----
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Filippo Spiga
>Sent: Monday, March 03, 2014 7:14 PM
>To: Open MPI Users
>Subject: [OMPI users] 1.7.5rc1, error "COLL-ML ml_discover_hiera
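
As an aside, the same component exclusion can be expressed without the command-line flag; a minimal sketch, assuming the usual per-user MCA parameter file location:

    # one-off, via the environment
    export OMPI_MCA_coll="^ml"
    mpirun -np 4 osu_alltoall

    # or persistently, in $HOME/.openmpi/mca-params.conf
    coll = ^ml

Either form tells Open MPI's component framework to skip the ml collective module when selecting collectives.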

[OMPI users] 1.7.5rc1, error "COLL-ML ml_discover_hierarchy exited with error."

2014-03-03 Thread Filippo Spiga
Dear Open MPI developers, I hit an unexpected error running the OSU osu_alltoall benchmark using Open MPI 1.7.5rc1. Here is the error:
    $ mpirun -np 4 --map-by ppr:1:socket -bind-to core osu_alltoall
    In bcol_comm_query hmca_bcol_basesmuma_allocate_sm_ctl_memory failed
    In bcol_comm_query hmca_bcol_basesm

Re: [OMPI users] OpenMPI job initializing problem

2014-03-03 Thread Gus Correa
Hi If I remember right, there were issues in the past with setting TMPDIR on an NFS share. Maybe the same problem happens on Lustre? http://arc.liv.ac.uk/pipermail/gridengine-users/2009-November/027767.html FWIW, we leave it to the default local /tmp, and it works. Gus Correa On 03/03/2014 0

Re: [OMPI users] OpenMPI job initializing problem

2014-03-03 Thread Beichuan Yan
How do I set TMPDIR to a local filesystem? Is /home/yanb/tmp a local filesystem? I don't know how to tell whether a directory is on a local or a network file system.
-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres (jsquyres)
Sent: Monday, March 03

Re: [OMPI users] OpenMPI job initializing problem

2014-03-03 Thread Jeff Squyres (jsquyres)
How about setting TMPDIR to a local filesystem?
On Mar 3, 2014, at 3:43 PM, Beichuan Yan wrote:
> I agree there are two cases for pure-MPI mode: 1. the job fails with no apparent reason; 2. the job complains about the shared-memory file being on a network file system, which can be resolved by "export TMPDIR=/home
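
A minimal sketch of how that might look in the batch script, assuming /tmp is node-local on the compute nodes (the application name is a placeholder):

    # in the job script, before launching
    export TMPDIR=/tmp                        # point Open MPI's session directory at local /tmp
    mpirun ./my_app

    # roughly equivalent MCA-level knob (check ompi_info for your build):
    mpirun --mca orte_tmpdir_base /tmp ./my_app

The point is simply that the shared-memory backing files then land on a local disk rather than on Lustre or NFS.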

Re: [OMPI users] OpenMPI job initializing problem

2014-03-03 Thread Beichuan Yan
I agree there are two cases for pure-MPI mode: 1. the job fails with no apparent reason; 2. the job complains about the shared-memory file being on a network file system, which can be resolved by "export TMPDIR=/home/yanb/tmp", where /home/yanb/tmp is my local directory. The default TMPDIR points to a Lustre directory. There

Re: [OMPI users] OpenMPI job initializing problem

2014-03-03 Thread Jeff Squyres (jsquyres)
On Mar 3, 2014, at 1:48 PM, Beichuan Yan wrote:
> 1. After the sysadmin installed the libibverbs-devel package, I built Open MPI 1.7.4 successfully with the command:
> ./configure --prefix=/work4/projects/openmpi/openmpi-1.7.4-gcc-compilers-4.7.3 --with-tm=/opt/pbs/default --with-verbs=/hafs_x86

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-03-03 Thread Brice Goglin
On 03/03/2014 23:02, Gus Correa wrote:
> I rebooted the node and ran hwloc-gather-topology again. This time it didn't throw any errors in the terminal window, which may be a good sign.
> [root@node14 ~]# hwloc-gather-topology /tmp/`date +"%Y%m%d%H%M"`.$(uname -n)
> Hierarchy gathered in
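
For reference, the usual way to capture and eyeball a node's topology with the hwloc tools (the output prefix below is just an example):

    # dump the topology into an archive that can be attached to a bug report
    hwloc-gather-topology /tmp/$(uname -n)
    # display the topology of the current node interactively or as text
    lstopo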

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-03-03 Thread Gus Correa
Hi Brice Here are answers to your questions, and my latest attempt to solve the problem: 1) Kernel version: The nodes with new motherboards (node14 and node16) have the same kernel as the nodes with original motherboards (e.g. node15), as they were cloned from the same node image: [root@node14

Re: [OMPI users] OpenMPI job initializing problem

2014-03-03 Thread Beichuan Yan
Hi, 1. After the sysadmin installed the libibverbs-devel package, I built Open MPI 1.7.4 successfully with the command:
    ./configure --prefix=/work4/projects/openmpi/openmpi-1.7.4-gcc-compilers-4.7.3 --with-tm=/opt/pbs/default --with-verbs=/hafs_x86_64/devel/usr --with-verbs-libdir=/hafs_x86_64/devel/us
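
For context, the general shape of such a build, with the install prefix and library paths left as site-specific placeholders:

    ./configure --prefix=<install-dir> \
                --with-tm=<PBS-install-dir> \
                --with-verbs=<verbs-prefix> \
                --with-verbs-libdir=<verbs-lib-dir>
    make -j 8 all
    make install

--with-tm enables the PBS/Torque launcher and --with-verbs enables the InfiniBand (openib) transport; both are only picked up if the corresponding development headers and libraries are found.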

Re: [OMPI users] new map-by-obj has a problem

2014-03-03 Thread Ralph Castain
Your patch looks fine to me, so I'll apply it. As for this second issue - good catch. Yes, if the binding directive was provided in the default MCA param file, then the proc would attempt to bind itself on startup. The best fix actually is to just tell them not to do so. We already have that mec
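
For anyone wondering what such a directive looks like, a binding policy set in the default MCA parameter file might read roughly like this (parameter name as used in the 1.7 series; this is an illustrative sketch, check ompi_info --param for your build):

    # $prefix/etc/openmpi-mca-params.conf (or $HOME/.openmpi/mca-params.conf)
    hwloc_base_binding_policy = core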

Re: [OMPI users] new map-by-obj has a problem

2014-03-03 Thread tmishima
Hi Ralph, I misunderstood the point of the problem. The problem is that BIND_TO_OBJ is retried and applied in orte_ess_base_proc_binding @ ess_base_fns.c, even though you switch to BIND_TO_NONE in rmaps_rr_mapper.c when it's oversubscribed. Furthermore, binding in orte_ess_base_proc_binding does not sup