Is anyone else having an issue using a Gmail address for the slurm
mailing lists? Gmail keeps blocking all the slurm mail for my
account and marking it as spam. A little yellow box pops up and says
this message is in violation of Gmail's bulk-sender something-or-other.
Is it possible to add or remove just a single node from a partition
without having to re-establish the whole list of nodes?
For example,
if I have nodes[001-100] and I want to remove only node 049, is there
some incantation that will allow me to do that without having to say
nodes[001-048,050-100]?
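(Not an answer from the thread, just a sketch of the two approaches I'm aware of, assuming the partition is named "batch"; draining sidesteps editing the node list entirely:)

  # keep node049 in the partition but stop scheduling work onto it
  scontrol update NodeName=node049 State=DRAIN Reason="pulled from service"

  # or rewrite the partition's node list, which is what the question is trying to avoid
  scontrol update PartitionName=batch Nodes=nodes[001-048,050-100]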
Sorry for the double post. I'm not sure why Gmail decided to send the
message again; I didn't send it twice...
On Wed, Nov 25, 2015 at 9:33 AM, Michael Di Domenico
wrote:
>
> is it possible to add or remove just a single node from a partition
> without having to re-establis
I just compiled and installed Slurm 14.11.4 and OpenMPI 1.10.0, but I
seem to have an srun oddity I've not seen before and I'm not exactly
sure how to debug it:
srun -n 4 hello_world
- does not run, hangs in MPI_INIT
srun -n 4 -N1 hello_world
- does not run, hangs in MPI_INIT
srun -n 4 -N 4
- r
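(One quick check worth sketching in here: srun can report which MPI plugin types the slurm build actually supports:)

  srun --mpi=list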
> Hey Michael
>
> Check ompi_info and ensure that PMI support was built - you have to
> explicitly ask for it and provide the path to pmi.h
>
>
>> On Dec 16, 2015, at 6:48 AM, Michael Di Domenico
>> wrote:
>>
>>
>> i just compiled and installed Slur
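(A sketch of the check being suggested; the configure flags below are only an example of how PMI support is typically requested when building OpenMPI against slurm, and /usr is a placeholder path:)

  # confirm PMI support was compiled in
  ompi_info | grep -i pmi

  # if nothing shows up, OpenMPI needs to be rebuilt with something like:
  ./configure --with-slurm --with-pmi=/usr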
> On most slurm installs I believe the MPI type defaults to none,
> have you tried adding --mpi=pmi2 or --mpi=openmpi to your srun command?
> Ian
>
> On Wed, Dec 16, 2015 at 9:50 AM, Michael Di Domenico
> > wrote:
>
>>
>> Yes, i have PMI support included into op
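(For reference, a sketch of the two places this can be set; MpiDefault is the slurm.conf side, and pmi2 is chosen only as an example:)

  # per invocation
  srun --mpi=pmi2 -n 4 hello_world

  # or cluster-wide in slurm.conf
  MpiDefault=pmi2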
> ensure it picked up the right
> ess component.
>
> I can try to replicate here, but it will take me a little while to get to
> it
>
>> On Dec 16, 2015, at 8:50 AM, Michael Di Domenico
>> wrote:
>>
>>
>> Yes, i have PMI support included into openmpi
>
To add some additional info, I let it sit for a long time and finally got:
PSM returned unhandled/unknown connect error: Operation timed out
PSM EP connect error (uknown connect error)
So perhaps my old friend PSM and srun aren't getting along again...
On 12/16/15, Michael Di Domenico
Adding
OMPI_MCA_mtl=^psm
to my environment and re-running 'srun -n4 hello_world' seems to
fix the issue.
So I guess we've isolated the problem to slurm/srun and PSM, but now
the question is what's broken.
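(The workaround as I'd sketch it in a job environment, using the variable named above:)

  # disable the PSM MTL so OpenMPI falls back to another transport
  export OMPI_MCA_mtl=^psm
  srun -n4 hello_world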
On 12/16/15, Michael Di Domenico wrote:
>
> to add some add
Just to jump on the bandwagon, we would prefer a Postgres option as well.
I can't "Create, debug and contribute back" a plugin, but I could help
in some fashion.
On Sun, Mar 20, 2016 at 5:41 PM, Simpson Lachlan
wrote:
> I think we would like a PostgreSQL plugin too - if you start building one,
> ple
On Thu, Jul 30, 2015 at 7:43 AM, Michael Di Domenico
wrote:
> is anyone else having an issue using a gmail address for the slurm
> mailing lists? Gmail keeps blocking all the slurm mail for my
> account and marking it as Spam. A little yellow box pops up and says
> this m
On Fri, Jul 8, 2016 at 1:22 PM, Tim Wickberg wrote:
>
> I've made a few minor changes to our SPF records, and fixed the reverse IP
> record for the mailing list server.
>
> I highly recommend filtering based on the list ID (slurmdev.schedmd.com),
> which has remained unchanged for a long time, an
I'm seeing this presently on our new cluster. I'm not sure what's
going on. Did this ever get resolved?
I can confirm that we have compiled OpenMPI with the slurm options.
We have other clusters which work fine, albeit this is our first
Mellanox-based IB cluster, so I'm not sure if that has an
although I see this with and without slurm, so there may very well be
something wrong with my ompi compile
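(For reference, a sketch of the build options I'd double-check in that OpenMPI compile; the install prefix and PMI path are assumptions:)

  ./configure --prefix=/opt/openmpi-1.10.0 \
              --with-slurm --with-pmi=/usr --with-verbs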
On Thu, Aug 25, 2016 at 2:04 PM, Michael Di Domenico
wrote:
>
> I'm seeing this presently on our new cluster. I'm not sure what's
> going on. Did this ever
On Thu, Aug 25, 2016 at 3:26 PM, r...@open-mpi.org wrote:
>
> Check your IB setup, Michael - you probably don’t have UD enabled on it
Is it off by default? We're running the default OpenIB stack in RHEL
6.7. I'm not even sure where to check for it being on/off; I've
never had to specifically
This might be nothing, but I usually call --gres with an equals sign:
srun --gres=gpu:k10:8
I'm not sure if the equals sign is optional or not.
On Wed, Nov 16, 2016 at 4:34 AM, Dmitrij S. Kryzhevich wrote:
>
> Hi,
>
> I have some issues with gres usage. I'm running slurm of 16.05.4 version and
> I have
I'm a little hazy on account security controls, so I might need some
correction on this.
As I understand it:
users have accounts inside /etc/passwd
users can also have accounts inside slurm
and then there's the root account
If I don't add anyone to the slurm accounts, everyone is basically at
the lo
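(Not from the thread, but a sketch of how the slurm-side accounts usually get created, with made-up account and user names:)

  # create an accounting entity and associate a user with it
  sacctmgr add account research Description="research group"
  sacctmgr add user alice Account=research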
On Thu, Dec 8, 2016 at 5:48 AM, Nigella Sanders
wrote:
>
> All 30 tasks run always in the first two allocated nodes (torus6001 and
> torus6002).
>
> However, I would like to get these tasks using only the second and then
> third nodes (torus6002 and torus6003).
> Does anyone have an idea about how to
What behaviour do you get if you leave off the exclusive and cyclic
options? Which SelectType are you using?
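(Not something suggested in the thread, but one way to pin a step onto specific nodes of an allocation, using the node names from the question; sketch only, the binary name is made up:)

  # run the step only on the second and third allocated nodes
  srun -N 2 -w torus[6002-6003] ./my_task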
On Tue, Jan 3, 2017 at 12:19 PM, Koziol, Lucas
wrote:
> Dear Vendor,
>
>
>
>
>
> What I want to do is run a large number of single-CPU tasks, and have them
> distributed evenly over al
On Fri, Jan 20, 2017 at 11:16 AM, John Hearns wrote:
> As I remember, in SGE and in PbsPro a job has a directory created for it on
> the execution host which is a temporary directory, named with the jobid.
> You can define in the batch system configuration where the root of these
> directories is.
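(Not from the thread, but a sketch of the usual slurm equivalent via node prolog/epilog scripts; the scratch root is an assumption:)

  # Prolog (runs as root on each node when the job starts)
  mkdir -p /scratch/${SLURM_JOB_ID}
  chown ${SLURM_JOB_UID} /scratch/${SLURM_JOB_ID}

  # Epilog
  rm -rf /scratch/${SLURM_JOB_ID}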
On Fri, Jan 20, 2017 at 11:09 AM, John Hearns wrote:
> I plan to have this wrapper script run the actual render through slurm.
> The script will have to block until the job completes I think - else the
> RenderPal server will report it has finished.
> Is it possible to block and wait till an sbat
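(Not from the thread, but a sketch of the blocking options I'm aware of; the script name is made up:)

  # srun blocks until the step finishes
  srun render_job.sh

  # newer slurm releases also have sbatch --wait, which returns only when the batch job completes
  sbatch --wait render_job.sh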
On Mon, Feb 6, 2017 at 10:17 AM, Hans-Nikolai Viessmann wrote:
>
> I had just added the DebugFlags setting to slurm.conf on the head node
> and did not synchronise it with the nodes. I doubt that this could cause the
> problem I described as it was occurring before I made the change to
> slurm.conf
On Mon, Feb 6, 2017 at 11:50 AM, Hans-Nikolai Viessmann wrote:
> Hi Michael,
>
> That's an interesting suggestion, and this works for you? I'm
> a bit confused then because the man-page for gres.conf
> states otherwise: https://slurm.schedmd.com/gres.conf.html,
> indicating that it must match one o
On Mon, Feb 6, 2017 at 1:55 PM, Hans-Nikolai Viessmann wrote:
> Hi Michael,
>
> Yes, on all the compute nodes there is a gres.conf, and all the GPU nodes
> except gpu08 have the following defined:
>
> Name=gpu Count=1
> Name=mic Count=0
>
> The head node has this defined:
>
> Name=gpu Count=0
> N
You can stop tracking memory by changing
SelectTypeParameters=CR_Core_Memory to SelectTypeParameters=CR_Core.
Doing this will mean slurm is no longer tracking memory at all, and
jobs could in theory stomp on one another if they allocate too much
physical memory.
We haven't started tracking memory on
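(For reference, a sketch of the relevant slurm.conf lines, assuming the cons_res select plugin:)

  SelectType=select/cons_res
  # with memory tracking:
  SelectTypeParameters=CR_Core_Memory
  # without:
  #SelectTypeParameters=CR_Core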
Do people here running slurm with GRES-based GPUs check that the GPU
is actually usable before launching the job? If so, can you detail
how you're doing it?
My cluster is currently using slurm, but we run HTCondor on the nodes
in the background. When a node isn't currently allocated through
s
On Mon, Jul 31, 2017 at 11:58 AM, Sean McGrath wrote:
> We do check that the GPU drivers are working on nodes before launching a job
> on
> them.
>
> Our prolog calls another in-house script (we should move to NHC, to be
> honest),
> that does the following:
>
> if [[ -n $(lspci | grep
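(A sketch of the kind of check being described, swapping nvidia-smi in for the truncated lspci test; the drain-on-failure step is my own assumption:)

  #!/bin/bash
  # prolog fragment: verify the GPUs respond before the job lands
  if ! nvidia-smi -L > /dev/null 2>&1; then
      scontrol update NodeName=$(hostname -s) State=DRAIN Reason="GPU check failed"
      exit 1
  fi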
Is there any way, after a job starts, to determine why the scheduler
chose the series of nodes it did?
For some reason, on an empty cluster, when I spin up a large job it's
staggering the allocation across a seemingly random set of
nodes.
We're using backfill/cons_res + gres, and all the nodes
On Thu, Oct 19, 2017 at 3:14 AM, Steffen Grunewald
wrote:
>> for some reason on an empty cluster when i spin up a large job it's
>> staggering the allocation across a seemingly random allocation of
>> nodes
>
> Have you looked into topology? With topology.conf, you may group nodes
> by (virtually
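(For reference, a sketch of what topology.conf looks like; the switch and node names are made up, and slurm.conf also needs TopologyPlugin=topology/tree:)

  # topology.conf: group nodes by the switch they hang off
  SwitchName=leaf1 Nodes=node[001-018]
  SwitchName=leaf2 Nodes=node[019-036]
  SwitchName=spine Switches=leaf[1-2]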
On Thu, Oct 26, 2017 at 1:39 PM, Kilian Cavalotti
wrote:
> and for a 4-GPU node which has a gres.conf like this (don't ask, some
> vendors like their CPU ids alternating between sockets):
>
> NodeName=sh-114-03 Name=gpu File=/dev/nvidia[0-1] CPUs=0,2,4,6,8,10,12,14,16,18
> NodeName=sh-114-
I have Slurm 2.5.3 installed and running; however, when I enable the
slurmctld prolog script my jobs seem to stall for a long time.
When I srun a job:
srun hostname
Job step creation disabled, retrying
-- it will sit for usually between 1-2 mins
while the slurmctld.log file shows a repeating synta
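(One quick sanity check I'd sketch in here, just to confirm which prolog scripts are actually configured:)

  scontrol show config | grep -i prolog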