Thanks everyone for all your assistance. The problem seems to be resolved
now, although I'm not entirely sure why these changes made a difference.
There were two things I changed:
(1) I had some additional `export ...` lines in .bashrc before the `export
PATH=...` and `export LD_LIBRARY_PATH=...`
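For context, the two lines in question look roughly like this (the install prefix /home/user/openmpi_install is taken from the .bashrc shown later in the thread; the lib path is an assumption based on the usual layout):
export PATH=/home/user/openmpi_install/bin:$PATH
export LD_LIBRARY_PATH=/home/user/openmpi_install/lib:$LD_LIBRARY_PATH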
Xie Bin,
According to the man page, -N is equivalent to --npernode, which in turn is
equivalent to --map-by ppr:N:node.
This is *not* equivalent to --map-by node:
the former packs tasks onto the same node, while the latter scatters tasks
across the nodes.
[gilles@login ~]$ mpirun --host n0:2,n1:2 -N 2
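To illustrate the difference (hostnames follow the example above; the placement noted in the comments is the expected default behaviour, shown only as a sketch):
mpirun --host n0:2,n1:2 --map-by ppr:2:node hostname   # packs: ranks 0,1 on n0; ranks 2,3 on n1
mpirun --host n0:2,n1:2 -np 4 --map-by node hostname   # scatters: ranks alternate n0, n1, n0, n1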
Hi, George:
My command lines are:
1) single node
mpirun --allow-run-as-root -mca btl self,tcp(or openib) -mca
btl_tcp_if_include eth2 -mca btl_openib_if_include mlx5_0 -x
OMP_NUM_THREADS=2 -n 32 myapp
2) 2-node cluster
mpirun --allow-run-as-root -mca btl ^tcp(or ^openib) -mca
btl_tcp_if_include eth
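(For readers of the archive, a complete two-node invocation along these lines might look like the sketch below; hostnames and slot counts are placeholders, not the poster's actual values:)
mpirun --allow-run-as-root --host node1:16,node2:16 -mca btl ^openib \
    -mca btl_tcp_if_include eth2 -x OMP_NUM_THREADS=2 -n 32 myapp     # TCP only
mpirun --allow-run-as-root --host node1:16,node2:16 -mca btl ^tcp \
    -mca btl_openib_if_include mlx5_0 -x OMP_NUM_THREADS=2 -n 32 myapp    # openib only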
Hi, John:
You are right about the network setup. I indeed have no IB switch; the two
servers are just connected with an IB cable. I did not even start the
opensmd service because it seemed unnecessary in this situation. Could this
be the reason why IB performs more poorly?
Interconnection details are in the attachment.
In the initial report, the /usr/bin/ssh process was in the 'T' state
(which generally hints that the process has been attached by a debugger), while
/usr/bin/ssh -x b09-32 orted
did behave as expected (i.e. orted was executed, exited with an error since
the command line is invalid, and the error message was received).
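(A quick way to see what is holding an ssh process in that state; these are standard Linux commands, not taken from the thread, and <pid> is whatever ps reports:)
ps -o pid,stat,wchan:20,cmd -C ssh             # STAT 'T' means stopped or traced
grep -E 'State|TracerPid' /proc/<pid>/status   # a non-zero TracerPid means something is ptrace-attached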
Yes, that "T" state is quite puzzling. You didn't attach a debugger or hit the
ssh process with a signal, did you?
(We had a similar situation on the devel list recently, but it only happened
with a very old version of Slurm. We concluded that it was a Slurm bug that
has since been fixed. And just t
You got that error because the orted is looking for its rank on the cmd line
and not finding it.
> On May 14, 2018, at 12:37 PM, Max Mellette wrote:
>
> Hi Gus,
>
> Thanks for the suggestions. The correct version of openmpi seems to be
> getting picked up; I also prepended .bashrc with the installation path like
> you suggested, but it didn't seem to help:
Hi Gus,
Thanks for the suggestions. The correct version of openmpi seems to be
getting picked up; I also prepended .bashrc with the installation path like
you suggested, but it didn't seem to help:
user@b09-30:~$ cat .bashrc
export PATH=/home/user/openmpi_install/bin:/usr/local/sbin:/usr/local/
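(One additional check, not in the original message: what the non-interactive remote shell sees, since that is roughly the environment in which mpirun's ssh launcher starts orted:)
ssh b09-32 'echo $PATH'
ssh b09-32 which orted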
Hi Max
Just in case, since environment mix-ups often happen:
could it be that you are inadvertently picking up another
installation of OpenMPI, perhaps installed from packages
in /usr or /usr/local?
That's easy to check with 'which mpiexec' or
'which mpicc', for instance.
Have you tried to prepend (as
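(Along the same lines, ompi_info reports the install prefix of whichever Open MPI build is first in the PATH; shown here as an illustration, not part of the original message:)
which mpiexec && which mpicc
ompi_info | grep -i 'prefix'
mpirun --version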
Still looks to me like MPI_Scan is what you want. You just need three additional
communicators (one for each direction). With a recursive-doubling MPI_Scan
implementation it is O(log n) in time, compared to O(n).
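For archive readers, a minimal sketch of that suggestion: build one sub-communicator per grid direction with MPI_Cart_sub, then run MPI_Scan along it. This is illustrative code under an assumed 2D decomposition, not code from the thread.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs, rank, dims[2] = {0, 0}, periods[2] = {0, 0}, coords[2];
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 2, dims);        /* pick a 2D decomposition */

    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);
    MPI_Comm_rank(cart, &rank);
    MPI_Cart_coords(cart, rank, 2, coords);

    int remain[2] = {1, 0};                  /* keep axis 0, drop axis 1 */
    MPI_Comm axis0;                          /* one such comm per direction */
    MPI_Cart_sub(cart, remain, &axis0);

    double my_value = 1.0 + coords[0];       /* the local scalar to cumulate */
    double partial;
    /* inclusive prefix sum along axis 0: each rank gets the sum of the
       values held by itself and all lower-coordinate ranks on that axis */
    MPI_Scan(&my_value, &partial, 1, MPI_DOUBLE, MPI_SUM, axis0);

    printf("coords (%d,%d): cumulative value along axis 0 = %g\n",
           coords[0], coords[1], partial);

    MPI_Comm_free(&axis0);
    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}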
> On May 14, 2018, at 8:42 AM, Pierre Gubernatis
> wrote:
>
> Thank you to all of yo
Shared memory communication is important for multi-core platforms,
especially when you have multiple processes per node. But this is only part
of your issue here.
You haven't specified how your processes will be mapped onto your resources.
As a result, ranks 0 and 1 will be on the same node, so you ar
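(A note for archive readers: the placement can be forced explicitly so that a 2-process run actually crosses the network, as in the line below; hostnames are placeholders, not from the thread:)
mpirun -np 2 --host node1,node2 --map-by node ./osu_latency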
Thank you to all of you for your answers (I was away until now).
Actually my question wasn't well posed. I stated it more clearly in this
post, with the answer:
https://stackoverflow.com/questions/50130688/mpi-cartesian-grid-cumulate-a-scalar-value-through-the-procs-of-a-given-axis-o?noredirect=1#c
John,
Thanks for the suggestions. In this case there is no cluster manager / job
scheduler; these are just a couple of individual hosts in a rack. The
reason for the generic names is that I anonymized the full network address
in the previous posts, truncating to just the host name.
My home direct
On Wed, May 9, 2018 at 9:45 PM, Howard Pritchard wrote:
>
> You either need to go and buy a connectx4/5 HCA from mellanox (and maybe a
> switch), and install that
> on your system, or else install xpmem (https://github.com/hjelmn/xpmem).
> Note there is a bug right now
> in UCX that you may hit if
Xie Bin, I do hate to ask this. You say "in a two-node cluster (IB
direct-connected)."
Does that mean that you have no IB switch, and that there is a single IB
cable joining up these two servers?
If so, please run: ibstatus, ibhosts, and ibdiagnet.
I am trying to check if the IB fabric is functional.
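(For the archive, those commands spelled out, plus a subnet-manager check that matters for a switchless back-to-back link; these are standard InfiniBand diagnostics, not quoted from the thread:)
ibstatus      # port state should be ACTIVE and the link layer InfiniBand
ibhosts       # both HCAs should appear in the fabric
ibdiagnet     # overall fabric health check
sminfo        # even a direct cable link needs a subnet manager (e.g. opensm) running on one node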
Hi, Nathan:
Thanks for your reply.
1) It was my mistake not to notice the usage of osu_latency. Now it works
well, but performance is still poorer with openib.
2) I did not use sm or vader because I wanted to compare performance between
tcp and openib. Besides, I will run the application on a cluster, so vader is
not s
One very, very stupid question here. This arose over on the Slurm list
actually.
Those hostnames look like quite generic names, i.e. are they part of an HPC
cluster?
Do they happen to have independent home directories for your userid?
Could that possibly make a difference to the MPI launcher?
On 14