1. Info from a compute node:
-bash-4.1$ hostname
r32i1n1
-bash-4.1$ df -h /home
Filesystem                                 Size  Used Avail Use% Mounted on
10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1   1.2P  136T  1.1P  12% /work1
-bash-4.1$ mount
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs
Hi Beichuan
OK, it says "unclassified.html", so I presume it is not a problem.
The web site says the computer is an SGI ICE X.
I am not familiar with it, so what follows are guesses.
The SGI site brochure suggests that the nodes/blades have local disks:
https://www.sgi.com/pdfs/4330.pdf
The file
Gus,
I am using this system:
http://centers.hpc.mil/systems/unclassified.html#Spirit. I don't know the exact
configuration of the file system. Here is the output of "df -h":
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda6       919G   16G  857G   2% /
tmpfs
There is something going wrong with the ml collective component. So, if you
disable it, things work.
I just reconfigured without any CUDA-aware support, and I see the same failure,
so it has nothing to do with CUDA.
It looks like Jeff Squyres just filed a bug for it:
https://svn.open-mpi.org/trac/o
Dear Rolf,
Your suggestion works!
$ mpirun -np 4 --map-by ppr:1:socket -bind-to core --mca coll ^ml osu_alltoall
# OSU MPI All-to-All Personalized Exchange Latency Test v4.2
# Size       Avg Latency(us)
1                       8.02
2                       2.96
4                       2.91
8
Hi Beichuan
If you are using the university cluster, chances are that /home is not
local, but on an NFS share, or perhaps Lustre (which you may have
mentioned before, I don't remember).
Maybe "df -h" will show what is local what is not.
It works for NFS, it prefixes file systems
with the serv
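As an illustration, here is a hedged sketch (the server, device, and size
values are made up) of how "df" distinguishes the two cases: NFS file systems
carry a "server:/export" prefix, local disks show a /dev device, and the -T
flag adds a Type column:

$ df -hT /home
Filesystem               Type  Size  Used Avail Use% Mounted on
fileserver:/export/home  nfs   2.0T  1.1T  900G  56% /home

$ df -hT /tmp
Filesystem               Type  Size  Used Avail Use% Mounted on
/dev/sda2                ext4   50G   12G   36G  25% /

A Lustre mount likewise shows a "lustre" type and an address:/fsname source,
like the 10.148.18.45@o2ib:...:/fs1 entry earlier in the thread.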
Can you try running with --mca coll ^ml and see if things work?
Rolf
>-----Original Message-----
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Filippo Spiga
>Sent: Monday, March 03, 2014 7:14 PM
>To: Open MPI Users
>Subject: [OMPI users] 1.7.5rc1, error "COLL-ML ml_discover_hiera
Dear Open MPI developers,
I hit an unexpected error running the OSU osu_alltoall benchmark using Open MPI
1.7.5rc1. Here is the error:
$ mpirun -np 4 --map-by ppr:1:socket -bind-to core osu_alltoall
In bcol_comm_query hmca_bcol_basesmuma_allocate_sm_ctl_memory failed
In bcol_comm_query hmca_bcol_basesm
Hi
If I remember right, there were issues in the past with setting TMPDIR
on an NFS share. Maybe the same problem happens on Lustre?
http://arc.liv.ac.uk/pipermail/gridengine-users/2009-November/027767.html
FWIW, we leave it to the default local /tmp, and it works.
Gus Correa
On 03/03/2014 0
How do I set TMPDIR to a local filesystem? Is /home/yanb/tmp a local
filesystem? I don't know how to tell whether a directory is on a local file
system or on a network file system.
-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres
(jsquyres)
Sent: Monday, March 03
How about setting TMPDIR to a local filesystem?
On Mar 3, 2014, at 3:43 PM, Beichuan Yan wrote:
> I agree there are two cases for pure-MPI mode: 1. The job fails for no
> apparent reason; 2. The job complains about a shared-memory file on a network
> file system, which can be resolved by "export TMPDIR=/home
I agree there are two cases for pure-MPI mode: 1. The job fails for no apparent
reason; 2. The job complains about a shared-memory file on a network file
system, which can be resolved by "export TMPDIR=/home/yanb/tmp", where
/home/yanb/tmp is my local directory. The default TMPDIR points to a Lustre
directory.
There
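For reference, a minimal sketch of that workaround (the paths are the ones from
the message above; the application name is hypothetical, and whether
/home/yanb/tmp is truly local still needs to be verified):

$ mkdir -p /home/yanb/tmp            # create the scratch directory once
$ export TMPDIR=/home/yanb/tmp       # point Open MPI's session dir here
$ mpirun -np 4 ./my_app              # hypothetical application name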
On Mar 3, 2014, at 1:48 PM, Beichuan Yan wrote:
> 1. After the sysadmin installed the libibverbs-devel package, I built Open
> MPI 1.7.4 successfully with the command:
> ./configure
> --prefix=/work4/projects/openmpi/openmpi-1.7.4-gcc-compilers-4.7.3
> --with-tm=/opt/pbs/default --with-verbs=/hafs_x86
On 03/03/2014 23:02, Gus Correa wrote:
> I rebooted the node and ran hwloc-gather-topology again.
> This time it didn't throw any errors in the terminal window,
> which may be a good sign.
>
> [root@node14 ~]# hwloc-gather-topology /tmp/`date
> +"%Y%m%d%H%M"`.$(uname -n)
> Hierarchy gathered in
Hi Brice
Here are answers to your questions,
and my latest attempt to solve the problem:
1) Kernel version:
The nodes with new motherboards (node14 and node16) have the
same kernel as the nodes with original motherboards (e.g. node15),
as they were cloned from the same node image:
[root@node14
Hi,
1. After the sysadmin installed the libibverbs-devel package, I built Open MPI
1.7.4 successfully with the command:
./configure --prefix=/work4/projects/openmpi/openmpi-1.7.4-gcc-compilers-4.7.3
--with-tm=/opt/pbs/default --with-verbs=/hafs_x86_64/devel/usr
--with-verbs-libdir=/hafs_x86_64/devel/us
Your patch looks fine to me, so I'll apply it. As for this second issue - good
catch. Yes, if the binding directive was provided in the default MCA param
file, then the proc would attempt to bind itself on startup. The best fix
actually is to just tell them not to do so. We already have that mec
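For context, such a binding directive in the system-default MCA parameter file
would look something like this (a sketch; hwloc_base_binding_policy is the
1.7-series parameter name, and the path depends on the configure prefix):

# In $prefix/etc/openmpi-mca-params.conf:
hwloc_base_binding_policy = core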
Hi Ralph, I misunderstood the point of the problem.
The problem is that BIND_TO_OBJ is re-tried and done in
orte_ess_base_proc_binding @ ess_base_fns.c, although you try to
BIND_TO_NONE in rmaps_rr_mapper.c when it's oversubscribed.
Furthermore, binding in orte_ess_base_proc_binding does not sup