[OMPI users] IB Memory Requirements, adjusting for reduced memory consumption

2012-01-12 Thread V. Ram
Open MPI IB Gurus, I have some slightly older InfiniBand-equipped nodes which have less RAM than we'd like, and on which we tend to run jobs that can span 16-32 nodes of this type. The jobs themselves tend to run on the heavy side in terms of their own memory requirements. When we used t

Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes

2012-01-04 Thread V. Ram
reminder, I responded to the firmware part of this earlier: http://www.open-mpi.org/community/lists/users/2011/12/18014.php Thank you, V. Ram -- http://www.fastmail.fm - Access your email from home and the web

Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes

2011-12-20 Thread V. Ram
Hello, On Mon, Dec 19, 2011, at 03:30 PM, Yevgeny Kliteynik wrote: > Hi, > > What's the smallest number of nodes that are needed to reproduce this > problem? Does it happen with just two HCAs, one process per node? I believe so, but I will work with some users to verify this. > Let's get you to

Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes

2011-12-19 Thread V. Ram
ing from? Thank you. V. Ram > On Dec 15, 2011, at 7:24 PM, V. Ram wrote: > > > Hi Terry, > > > > Thanks so much for the response. My replies are in-line below. > > > > On Thu, Dec 15, 2011, at 07:00 AM, TERRY DONTJE wrote: > >> IIRC, RNR

Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes

2011-12-19 Thread V. Ram
Our Slurm partitions are defined by hardware type, and we do not allow users to run jobs across different hardware types using InfiniBand. If they want to run embarrassingly parallel jobs across different hardware types, we mandate that they use Ethernet only (which does work as expected). Thank y

Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes

2011-12-15 Thread V. Ram
't seem, based on the limited number of observable parameters I'm aware of, to be dependent on the number of nodes involved. It is an intermittent problem, but when it happens, it happens at job launch, and it does occur most of the time. Thanks, V. Ram > --td > > > > Open MP

Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes

2011-12-14 Thread V. Ram
Open MPI InfiniBand gurus and/or Mellanox: could I please get some assistance with this? Any suggestions on tunables or debugging parameters to try? Thank you very much. On Mon, Dec 12, 2011, at 10:42 AM, V. Ram wrote: > Hello, > > We are running a cluster that has a good number of ol

[OMPI users] Error launching w/ 1.5.3 on IB mthca nodes

2011-12-12 Thread V. Ram
HCAs use the same InfiniBand fabric continuously without any issue, so I don't think it's the fabric/switch. I'm at a loss for what to do next to try and find the root cause of the issue. I suspect something perhaps having to do with the mthca support/drivers, but how can I track

Re: [OMPI users] Processor/core selection/affinity for large shared memory systems

2008-12-05 Thread V. Ram
Terry Frankcombe wrote: > Isn't it up to the OS scheduler what gets run where? I was under the impression that the processor affinity API was designed to let the OS (at least Linux) know how a given task preferred to be bound in terms of the system topology. -- V. Ram v_r_...@fas

Re: [OMPI users] Processor/core selection/affinity for large shared memory systems

2008-12-05 Thread V. Ram
ot have any easy way to tell that without a hostfile, etc. -- V. Ram v_r_...@fastmail.fm -- http://www.fastmail.fm - Or how I learned to stop worrying and love email again

Re: [OMPI users] Processor/core selection/affinity for large shared memory systems

2008-12-04 Thread V. Ram
't see whole sockets (all 4 cores) active at a time on this job. Does this make more sense? -- V. Ram v_r_...@fastmail.fm -- http://www.fastmail.fm - A no graphics, no pop-ups email service

[OMPI users] Processor/core selection/affinity for large shared memory systems

2008-12-04 Thread V. Ram
such functionality is technically possible via PLPA. Is there in fact a way to specify such a thing with 1.2.8, and if not, will 1.3 support these kinds of arguments? Thank you. -- V. Ram v_r_...@fastmail.fm -- http://www.fastmail.fm - Or how I learned to stop worryin

Re: [OMPI users] Crashes over TCP/ethernet but not on shared memory

2008-10-24 Thread V. Ram
elp anyone else experiencing the same issues. Thanks Leonardo! OMPI devs: does this imply bug(s) in the e1000 driver/chip? Should I contact the driver authors? On Fri, 10 Oct 2008 12:42:19 -0400, "V. Ram" said: > Leonardo, > > These nodes are all using Intel e1000 chips. As t

Re: [OMPI users] Crashes over TCP/ethernet but not on shared memory

2008-10-10 Thread V. Ram
eth1". You should > > try to restrict Open MPI to use only one of the available networks by > > using the --mca btl_tcp_if_include ethx parameter to mpirun, where x > > is the network interface that is always connected to the same logical > > and physical network on

Re: [OMPI users] Crashes over TCP/ethernet but not on shared memory

2008-10-10 Thread V. Ram
s connected to the same logical > and physical network on your machine. I was pretty sure this wasn't the problem since basically all the nodes only have one interface configured, but I had the user try the --mca btl_tcp_if_include parameter. The same result / crash occurred. > >

[OMPI users] Crashes over TCP/ethernet but not on shared memory

2008-10-01 Thread V. Ram
nstallation/software on the system? We have tried "--debug-daemons" with no new/interesting information being revealed. Is there a way to trap segfault messages or more detailed MPI transaction information or anything else that could help diagnose this? Thanks. -- V. Ram

[OMPI users] Crash in code using OMPI 1.2.7 - Debugging assistance sought

2008-09-24 Thread V. Ram
OMPI installation/software on the system? We have tried "--debug-daemons" with no new/interesting information being revealed. Is there a way to trap segfault messages or more detailed MPI transaction information or anything else that could help diagnose this? Thanks. -- V. Ram v_r_