[slurm-dev] gmail spam filters?

2015-07-30 Thread Michael Di Domenico
is anyone else having an issue using a gmail address for the slurm mailing lists? Gmail keeps blocking all the slurm mail for my account and marking it as Spam. A little yellow box pops up and says this message is in violation of Gmail's bulk sender policy or something similar

[slurm-dev] add/remove node from partition

2015-11-25 Thread Michael Di Domenico
is it possible to add or remove just a single node from a partition without having to re-establish the whole list of nodes? for example if i have nodes[001-100] and i want to remove only node 049. is there some incantation that will allow me to do that without having to say nodes[001-048,050-10
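as far as i know there is no incremental add/remove; the Nodes= list for a partition is replaced wholesale, so the usual approach is a sketch like the following (partition name hypothetical):

    scontrol update PartitionName=batch Nodes=node[001-048,050-100]
    # to make the change survive a restart, mirror it in slurm.conf and run:
    scontrol reconfigure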

[slurm-dev] Re: add/remove node from partition

2015-11-25 Thread Michael Di Domenico
sorry for the double post. i'm not sure why gmail decided to send the message again, i didn't send it twice... On Wed, Nov 25, 2015 at 9:33 AM, Michael Di Domenico wrote: > > is it possible to add or remove just a single node from a partition > without having to re-establis

[slurm-dev] srun and openmpi

2015-12-16 Thread Michael Di Domenico
i just compiled and installed Slurm 14.11.4 and OpenMPI 1.10.0, but i seem to have an srun oddity i've not seen before and i'm not exactly sure how to debug it: srun -n 4 hello_world - does not run, hangs in MPI_INIT; srun -n 4 -N1 hello_world - does not run, hangs in MPI_INIT; srun -n 4 -N 4 - r

[slurm-dev] Re: srun and openmpi

2015-12-16 Thread Michael Di Domenico
> Hey Michael > > Check ompi_info and ensure that the PMI support was built - you have to > explicitly ask for it and provide the path to pmi.h > > >> On Dec 16, 2015, at 6:48 AM, Michael Di Domenico >> wrote: >> >> >> i just compiled and installed Slur
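a quick way to verify the suggestion above (a sketch; configure paths are illustrative, not taken from the thread):

    # check whether the OpenMPI build has PMI support compiled in
    ompi_info | grep -i pmi
    # OpenMPI 1.10.x needs it requested explicitly at configure time, e.g.:
    #   ./configure --with-slurm --with-pmi=/usr ...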

[slurm-dev] Re: srun and openmpi

2015-12-16 Thread Michael Di Domenico
On most slurm installs I believe the MPI type defaults to none, > have you tried adding --mpi=pmi2 or --mpi=openmpi to your srun command? > Ian > > On Wed, Dec 16, 2015 at 9:50 AM, Michael Di Domenico > > wrote: > >> >> Yes, i have PMI support included into op
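for reference, the flag being suggested (assuming PMI2 support is present in both Slurm and the OpenMPI build):

    srun --mpi=pmi2 -n 4 hello_world
    # or set it cluster-wide in slurm.conf:
    #   MpiDefault=pmi2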

[slurm-dev] Re: srun and openmpi

2015-12-16 Thread Michael Di Domenico
ensure it picked up the right > ess component. > > I can try to replicate here, but it will take me a little while to get to > it > >> On Dec 16, 2015, at 8:50 AM, Michael Di Domenico >> wrote: >> >> >> Yes, i have PMI support included into openmpi >

[slurm-dev] Re: srun and openmpi

2015-12-16 Thread Michael Di Domenico
to add some additional info, i let it sit for a long time and finally got PSM returned unhandled/unknown connect error: Operation timed out PSM EP connect error (uknown connect error) so perhaps my old friend psm and srun aren't getting along again... On 12/16/15, Michael Di Domenico

[slurm-dev] Re: srun and openmpi

2015-12-16 Thread Michael Di Domenico
Adding OMPI_MCA_mtl=^psm to my environment and re-running the 'srun -n4 hello_world' seems to fix the issue, so i guess we've isolated the problem to slurm/srun and psm, but now the question is what's broken On 12/16/15, Michael Di Domenico wrote: > > to add some add
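the workaround described above, spelled out as commands (^psm tells OpenMPI to exclude the PSM MTL so another transport is selected):

    export OMPI_MCA_mtl=^psm
    srun -n 4 hello_world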

[slurm-dev] Re: Regards Postgres Plugin for SLURM

2016-03-21 Thread Michael Di Domenico
just to jump on the wagon, we would prefer a postgres option as well. i can't "Create, debug and contribute back" a plugin, but i could help in some fashion On Sun, Mar 20, 2016 at 5:41 PM, Simpson Lachlan wrote: > I think we would like a PostgreSQL plugin too - if you start building one, > ple

[slurm-dev] Re: gmail spam filters?

2016-07-08 Thread Michael Di Domenico
On Thu, Jul 30, 2015 at 7:43 AM, Michael Di Domenico wrote: > is anyone else having an issue using a gmail address for the slurm > mailing lists? Gmail keeps blocking all the slurm mail for my > account and marking it as Spam. A little yellow box pops up and says > this m

[slurm-dev] Re: gmail spam filters?

2016-07-08 Thread Michael Di Domenico
On Fri, Jul 8, 2016 at 1:22 PM, Tim Wickberg wrote: > > I've made a few minor changes to our SPF records, and fixed the reverse IP > record for the mailing list server. > > I highly recommend filtering based on the list ID (slurmdev.schedmd.com), > which has remained unchanged for a long time, an

[slurm-dev] Re: strange going-ons with OpenMPI and Infiniband

2016-08-25 Thread Michael Di Domenico
I'm seeing this presently on our new cluster. I'm not sure what's going on. Did this ever get resolved? I can confirm that we have compiled openmpi with the slurm options. we have other clusters which work fine, albeit this is our first mellanox based IB cluster, so i'm not sure if that has an

[slurm-dev] Re: strange going-ons with OpenMPI and Infiniband

2016-08-25 Thread Michael Di Domenico
although i see this with and without slurm, so there very well may be something wrong with my ompi compile On Thu, Aug 25, 2016 at 2:04 PM, Michael Di Domenico wrote: > > I'm seeing this presently on our new cluster. I'm not sure what's > going on. Did this every

[slurm-dev] Re: strange going-ons with OpenMPI and Infiniband

2016-08-26 Thread Michael Di Domenico
On Thu, Aug 25, 2016 at 3:26 PM, r...@open-mpi.org wrote: > > Check your IB setup, Michael - you probably don’t have UD enabled on it. is it off by default? we're running the default openib stack in rhel 6.7. i'm not even sure where to check for it being on/off, i've never had to specifically

[slurm-dev] Re: Gres issue

2016-11-16 Thread Michael Di Domenico
this might be nothing, but i usually call --gres with an equals: srun --gres=gpu:k10:8. i'm not sure if the equals is optional or not On Wed, Nov 16, 2016 at 4:34 AM, Dmitrij S. Kryzhevich wrote: > > Hi, > > I have some issues with gres usage. I'm running slurm of 16.05.4 version and > I have
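for reference, the form used here; the request is gpu:<type>:<count>, and the type only resolves if the node's gres.conf defines it (the gres.conf line below is illustrative):

    srun --gres=gpu:k10:8 --pty bash
    # node-side gres.conf would need a matching type, e.g.:
    #   Name=gpu Type=k10 File=/dev/nvidia[0-7]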

[slurm-dev] limiting scontrol

2016-11-17 Thread Michael Di Domenico
i'm a little hazy on account security controls, so i might need some correction on this. as i understand it, users have accounts inside /etc/passwd; users can also have accounts inside slurm; and then there's the root account. if i don't add anyone to the slurm accounts, everyone is basically at the lo
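for reference, two related knobs (a sketch, not a full answer): state-changing scontrol subcommands are refused for anyone other than root/SlurmUser, and PrivateData in slurm.conf limits what unprivileged users can view:

    # slurm.conf (illustrative value list)
    PrivateData=jobs,nodes,partitions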

[slurm-dev] Re: Node selection for serial tasks

2016-12-08 Thread Michael Di Domenico
On Thu, Dec 8, 2016 at 5:48 AM, Nigella Sanders wrote: > > All 30 tasks always run in the first two allocated nodes (torus6001 and > torus6002). > > However, I would like to get these tasks to use only the second and > third nodes (torus6002 and torus6003). > Does anyone have an idea about how to

[slurm-dev] Re: Question about -m cyclic and --exclusive options to slurm

2017-01-03 Thread Michael Di Domenico
what behaviour do you get if you leave off the exclusive and cyclic options? which selecttype are you using? On Tue, Jan 3, 2017 at 12:19 PM, Koziol, Lucas wrote: > Dear Vendor, > > > > > > What I want to do is run a large number of single-CPU tasks, and have them > distributed evenly over al
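for reference, the distribution options under discussion (a sketch; cyclic assigns consecutive tasks round-robin across nodes, block fills each node before moving to the next):

    srun -n 32 -m cyclic ./task              # -m is short for --distribution
    srun -n 32 --distribution=block ./task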

[slurm-dev] Re: Job temporary directory

2017-01-20 Thread Michael Di Domenico
On Fri, Jan 20, 2017 at 11:16 AM, John Hearns wrote: > As I remember, in SGE and in PBS Pro a job has a directory created for it on > the execution host which is a temporary directory, named with the jobid. > you can define in the batch system configuration where the root of these > directories is.

[slurm-dev] RE: Slurm for render farm

2017-01-20 Thread Michael Di Domenico
On Fri, Jan 20, 2017 at 11:09 AM, John Hearns wrote: > I plan to have this wrapper script run the actual render through slurm. > The script will have to block until the job completes I think - else the > RenderPal server will report it has finished. > Is it possible to block and wait till an sbat
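one way to make the wrapper block (a sketch; sbatch --wait exists in newer Slurm releases, and srun always blocks until the step finishes):

    # option 1: submit and wait for completion (newer Slurm)
    sbatch --wait render_job.sh
    # option 2: run the render step directly under srun, which blocks
    srun -N 1 -n 1 render_command args...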

[slurm-dev] Re: Unable to allocate Gres by type

2017-02-06 Thread Michael Di Domenico
On Mon, Feb 6, 2017 at 10:17 AM, Hans-Nikolai Viessmann wrote: > > I had just added the DebugFlags setting to slurm.conf on the head node > and did not synchronise it with the nodes. I doubt that this could cause the > problem I described as it was occurring before I made the change to > slurm.conf

[slurm-dev] Re: Unable to allocate Gres by type

2017-02-06 Thread Michael Di Domenico
On Mon, Feb 6, 2017 at 11:50 AM, Hans-Nikolai Viessmann wrote: > Hi Michael, > > That's an interesting suggestion, and this works for you? I'm > a bit confused then because the man-page for gres.conf > states otherwise: https://slurm.schedmd.com/gres.conf.html, > indicating that it must match one o

[slurm-dev] Re: Unable to allocate Gres by type

2017-02-08 Thread Michael Di Domenico
On Mon, Feb 6, 2017 at 1:55 PM, Hans-Nikolai Viessmann wrote: > Hi Michael, > > Yes, on all the compute nodes there is a gres.conf, and all the GPU nodes > except gpu08 have the following defined: > > Name=gpu Count=1 > Name=mic Count=0 > > The head node has this defined: > > Name=gpu Count=0 > N
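to allocate gres by type, the node's gres.conf generally needs a Type= field that matches the job request, and the slurm.conf node entry has to advertise the same string; an illustrative sketch (device paths and type names hypothetical):

    # gres.conf on the GPU node
    Name=gpu Type=k20 File=/dev/nvidia0
    # slurm.conf
    NodeName=gpu08 Gres=gpu:k20:1 ...
    # job request
    srun --gres=gpu:k20:1 ...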

[slurm-dev] Re: Potential CUDA Memory Allocation Issues

2017-04-27 Thread Michael Di Domenico
You can stop tracking memory by changing SelectTypeParameters=CR_Core_Memory to SelectTypeParameters=CR_Core. doing this will mean slurm is no longer tracking memory at all and jobs could in theory stomp on one another if they allocate too much physical memory. we haven't started tracking memory on
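the change described above, as it would appear in slurm.conf (followed by restarting or reconfiguring slurmctld and the slurmds):

    # before: cores and memory are consumable resources
    SelectTypeParameters=CR_Core_Memory
    # after: only cores are tracked
    SelectTypeParameters=CR_Core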

[slurm-dev] detecting gpu's in use

2017-07-31 Thread Michael Di Domenico
do people here running slurm with gres-based gpu's check that the gpu is actually usable before launching the job? if so, can you detail how you're doing it? my cluster is currently using slurm, but we run htcondor on the nodes in the background. when a node isn't currently allocated through s

[slurm-dev] Re: detecting gpu's in use

2017-07-31 Thread Michael Di Domenico
On Mon, Jul 31, 2017 at 11:58 AM, Sean McGrath wrote: > We do check that the GPU drivers are working on nodes before launching a job > on > them. > > Our prolog calls another in-house script (we should move to NHC to be > honest) > that does the following: > > if [[ -n $(lspci | grep
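a minimal sketch of that style of prolog health check (assumptions: NVIDIA GPUs with nvidia-smi installed, and a prolog running with enough privilege to drain the node):

    #!/bin/bash
    # prolog fragment: drain the node if a GPU is present but not responding
    if lspci | grep -qi nvidia; then
        if ! nvidia-smi >/dev/null 2>&1; then
            scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="GPU health check failed"
            exit 1
        fi
    fi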

[slurm-dev] node selection

2017-10-18 Thread Michael Di Domenico
is there any way after a job starts to determine why the scheduler chose the series of nodes it did? for some reason on an empty cluster when i spin up a large job it's staggering the allocation across a seemingly random set of nodes. we're using backfill/cons_res + gres, and all the nodes

[slurm-dev] Re: node selection

2017-10-20 Thread Michael Di Domenico
On Thu, Oct 19, 2017 at 3:14 AM, Steffen Grunewald wrote: >> for some reason on an empty cluster when i spin up a large job it's >> staggering the allocation across a seemingly random allocation of >> nodes > > Have you looked into topology? With topology.conf, you may group nodes > by (virtually
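an illustrative topology.conf (switch and node names hypothetical); with TopologyPlugin=topology/tree set in slurm.conf, the scheduler tries to keep a job's nodes under as few switches as possible, which can explain otherwise odd-looking placement:

    # topology.conf
    SwitchName=leaf1 Nodes=node[001-018]
    SwitchName=leaf2 Nodes=node[019-036]
    SwitchName=spine Switches=leaf[1-2]
    # slurm.conf
    #   TopologyPlugin=topology/tree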

[slurm-dev] Re: CPU/GPU Affinity Not Working

2017-10-27 Thread Michael Di Domenico
On Thu, Oct 26, 2017 at 1:39 PM, Kilian Cavalotti wrote: > and for a 4-GPU node which has a gres.conf like this (don't ask, some > vendors like their CPU ids alternating between sockets): > > NodeName=sh-114-03 Name=gpu File=/dev/nvidia[0-1] > CPUs=0,2,4,6,8,10,12,14,16,18 > NodeName=sh-114-

[slurm-dev] Long delay with SlurmctldProlog enabled

2013-05-21 Thread Michael Di Domenico
I have Slurm 2.5.3 installed and running, however, when i enable the slurmctldprolog script my jobs seem to stall for a long time. when i srun a job (srun hostname) i get "Job step creation disabled, retrying" -- it will sit for usually between 1-2 mins while the slurmctld.log file shows a repeating synta