Open MPI IB Gurus,
I have some slightly older InfiniBand-equipped nodes which have less
RAM than we'd like, and on which we tend to run jobs that can span
16-32 nodes of this type. The jobs themselves tend to run on the heavy
side in terms of their own memory requirements.
When we used t
As a reminder, I responded to the firmware part of this earlier:
http://www.open-mpi.org/community/lists/users/2011/12/18014.php
Thank you,
V. Ram
Hello,
On Mon, Dec 19, 2011, at 03:30 PM, Yevgeny Kliteynik wrote:
> Hi,
>
> What's the smallest number of nodes that are needed to reproduce this
> problem? Does it happen with just two HCAs, one process per node?
I believe so, but I will work with some users to verify this.
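For the verification, I assume the kind of minimal run you have in mind
is something like

    mpirun -np 2 -npernode 1 -H node01,node02 ./ib_test

(host names and the test binary are placeholders, and I may not have the
option spellings exactly right for our installed version).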
> Let's get you to
ing from?
Thank you.
V. Ram
> On Dec 15, 2011, at 7:24 PM, V. Ram wrote:
>
> > Hi Terry,
> >
> > Thanks so much for the response. My replies are in-line below.
> >
> > On Thu, Dec 15, 2011, at 07:00 AM, TERRY DONTJE wrote:
> >> IIRC, RNR
Our Slurm partitions are defined by hardware type, and we do not allow
users to run jobs across different hardware types using InfiniBand. If
they want to run embarrassingly parallel jobs across different hardware
types, we mandate that they use Ethernet only (which does work as
expected).
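(For the record, restricting those mixed-hardware jobs to TCP amounts to
something like

    mpirun --mca btl tcp,self -np 64 ./app

where the process count and binary are placeholders; the point is just
that the openib BTL is excluded entirely.)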
Thank you.
It doesn't seem, based on the limited number of observable parameters
I'm aware of, to be dependent on the number of nodes involved.
It is an intermittent problem, but when it does occur, it occurs at job
launch, and it happens on the majority of launches.
Thanks,
V. Ram
> --td
> >
> > Open MP
Open MPI InfiniBand gurus and/or Mellanox: could I please get some
assistance with this? Any suggestions on tunables or debugging
parameters to try?
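In case it helps frame suggestions: I assume the relevant knobs are the
openib BTL parameters that ompi_info reports, e.g.

    ompi_info --param btl openib

and that tunables along the lines of btl_openib_ib_timeout or
btl_openib_ib_rnr_retry (if I have the names right) are the sort of
thing to experiment with, but I don't know which, if any, apply to this
failure.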
Thank you very much.
On Mon, Dec 12, 2011, at 10:42 AM, V. Ram wrote:
> Hello,
>
> We are running a cluster that has a good number of ol
HCAs use the same InfiniBand fabric
continuously without any issue, so I don't think it's the fabric/switch.
I'm at a loss for what to do next to try to find the root cause of the
issue. I suspect something perhaps having to do with the mthca
support/drivers, but how can I track that down?
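The only HCA-level checks I can think of are the generic ones, e.g.

    ibv_devinfo
    ibstat
    dmesg | grep -i mthca

to confirm firmware and port state and to look for driver messages, but
beyond that I am not sure where to look.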
Terry Frankcombe wrote:
> Isn't it up to the OS scheduler what gets run where?
I was under the impression that the processor affinity API was designed
to let the OS (at least Linux) know how a given task preferred to be
bound in terms of the system topology.
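My mental model is that this is the same mechanism exposed by taskset
or by Open MPI's own affinity support, e.g.

    taskset -c 0,2,4,6 ./app
    mpirun --mca mpi_paffinity_alone 1 -np 16 ./app

(the CPU list, process count, and binary are placeholders, and I may
not have the MCA parameter name exactly right): the task or launcher
states a binding and the kernel scheduler then respects it, rather than
the scheduler choosing freely.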
--
V. Ram
v_r_...@fastmail.fm
ot have any easy way to tell that without a
hostfile, etc.
--
V. Ram
v_r_...@fastmail.fm
't see whole sockets (all 4 cores) active at a time on this job.
Does this make more sense?
--
V. Ram
v_r_...@fastmail.fm
such functionality is technically possible via PLPA. Is
there in fact a way to specify such a thing with 1.2.8, and if not, will
1.3 support these kinds of arguments?
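To make the question concrete, what I am hoping for is an invocation
roughly like

    mpirun -np 32 -bysocket -bind-to-socket ./app

where the flag names are just my guess at what 1.3 might offer, not
something I have verified.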
Thank you.
--
V. Ram
v_r_...@fastmail.fm
help anyone else experiencing the same issues.
Thanks Leonardo!
OMPI devs: does this imply bug(s) in the e1000 driver/chip? Should I
contact the driver authors?
On Fri, 10 Oct 2008 12:42:19 -0400, "V. Ram" said:
> Leonardo,
>
> These nodes are all using Intel e1000 chips. As t
eth1". You should
> > try to restrict Open MPI to use only one of the available networks by
> > using the --mca btl_tcp_if_include ethx parameter to mpirun, where x
> > is the network interface that is always connected to the same logical
> > and physical network on
s connected to the same logical
> and physical network on your machine.
I was pretty sure this wasn't the problem since basically all the nodes
only have one interface configured, but I had the user try the --mca
btl_tcp_if_include parameter. The same result / crash occurred.
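For concreteness, the restricted run is of the form

    mpirun --mca btl_tcp_if_include eth0 -np 16 ./app

(eth0 and the process count are placeholders here), and it crashes in
the same way.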
OMPI installation/software on the system? We have tried
"--debug-daemons" with no new/interesting information being revealed.
Is there a way to trap segfault messages or more detailed MPI
transaction information or anything else that could help diagnose this?
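The only generic things I can think of are enabling core dumps and
turning up BTL verbosity, e.g.

    ulimit -c unlimited
    mpirun --mca btl_base_verbose 30 -np 16 ./app

(process count and binary are placeholders), but I don't know whether
either would surface anything useful here.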
Thanks.
--
V. Ram