> #AcctGatherEnergyType=acct_gather_energy/rapl
> #AcctGatherNodeFreq=30
>
> #Memory
> #DefMemPerCPU=1024 # 1GB
> #MaxMemPerCPU=3072 # 3GB
>
>
>
> # COMPUTE NODES
> NodeName=foner[11-14] Procs=20 RealMemory=258126 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 State=UNKNOWN
>
> NodeName=foner[101-142] CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=64398 State=UNKNOWN
>
> PartitionName=thin Nodes=foner[103-142] Shared=NO PreemptMode=CANCEL State=UP MaxTime=4320 MinNodes=2
> PartitionName=thin_test Nodes=foner[101,102] Default=YES Shared=NO PreemptMode=CANCEL State=UP MaxTime=60 MaxNodes=1
> PartitionName=fat Nodes=foner[11-14] Shared=NO PreemptMode=CANCEL State=UP MaxTime=4320 MaxNodes=1
>
> ##END SLURM.CONF###
>
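For reference, a job matching the layout above could be submitted along these lines; this is only a sketch, and the script body and binary name are illustrative rather than taken from the thread:

  #!/bin/bash
  #SBATCH --partition=thin          # foner[103-142], MinNodes=2, MaxTime=4320
  #SBATCH --nodes=2
  #SBATCH --ntasks-per-node=20      # 2 sockets x 10 cores, ThreadsPerCore=1
  #SBATCH --time=01:00:00

  srun ./my_mpi_app                 # placeholder application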
--
---
Mehdi Denou
International HPC support
+336 45 57 66 56
es on the first node and
> the other half on the second. Also tried to remove --nodes=2.
>
> ---
>
> It seems that it's the way sbatch influences srun. Is there any way to
> see which parameters the sbatch call transfers to srun?
>
> Thanks,
> Joan
>
>
>
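One way to inspect what sbatch hands down to srun is to dump the SLURM_* environment from inside the batch script, since srun takes its defaults from those variables; a minimal sketch (the options shown are illustrative):

  #!/bin/bash
  #SBATCH --nodes=2
  #SBATCH --ntasks=25

  # sbatch exports variables such as SLURM_NTASKS and SLURM_JOB_NUM_NODES,
  # and srun picks them up as implicit defaults.
  env | grep '^SLURM_' | sort

  srun hostname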
the first one and 5 on the
> second one.
>
>
> Thanks and sorry for the confusion,
> Joan
>
>
>
> On 04/04/14 13:22, Mehdi Denou wrote:
>> It's a little bit confusing:
>>
>> When in sbatch I specify that I want to allocate 25 nodes and I execute
to create job step: More processors requested than
> permitted
>
> Thanks
>
> On 04/04/14 13:50, Mehdi Denou wrote:
>> Try with:
>> srun -N 1 -n 25
>>
>> On 04/04/2014 13:47, Joan Arbona wrote:
>>> Excuse me, I confused "Nodes" with "Tasks".
Could you provide us with the slurm.conf?
On 04/04/2014 14:46, Joan Arbona wrote:
> Doesn't work either. I also tried with -m block:block with no luck...
>
> On 04/04/14 14:13, Mehdi Denou wrote:
>> Of course, -N 1 is wrong since you request more CPUs than are available on one
>> node.
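For completeness, spreading the 25 tasks over the two allocated nodes can be requested explicitly; a sketch (the application name is a placeholder):

  # 20 tasks fill the first 20-core node, the remaining 5 go to the second
  sbatch --nodes=2 --ntasks=25 --distribution=block:block \
         --wrap="srun ./my_app"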
Sysadmin who will need to implement any changes) at
the appropriate documentation?
Many thanks,
Fiona
--
---
Mehdi Denou
International HPC support
+336 45 57 66 56
,
Loris
--
---
Mehdi Denou
International HPC support
+336 45 57 66 56
--
---
Mehdi Denou
International HPC support
+336 45 57 66 56
’ with sacctmgr
from what I can tell.
It looks like in a situation like this, the qos should have the
“NoReserve” flag set too, correct?
Thanks!
Chris
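For reference, a QOS flag like this is set with sacctmgr roughly as follows; the QOS name below is illustrative:

  # Set the NoReserve flag on an existing QOS (replaces any previously set flags)
  sacctmgr modify qos normal set Flags=NoReserve

  # Verify
  sacctmgr show qos normal format=Name,Flags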
--
---
Mehdi Denou
International HPC support
+336 45 57 66 56
? The only default I have found is DefaultTime.
http://slurm.schedmd.com/slurm.conf.html
Regards,
Brian
--
---
Mehdi Denou
International HPC support
+336 45 57 66 56
as to create
> a separate list for "more advanced topics".
>
> Any news on this?
>
> cheers,
> marcin
--
---
Mehdi Denou
International HPC support
+336 45 57 66 56
>> REASON               USER   TIMESTAMP            NODELIST
>> Not responding       root   2015-03-10T14:21:11  democlient1
>> Low socket*core*thre root   2015-03-10T14:37:51  demomaster1
>> And I am attaching configuration file too.
>> Kindly see to it.
>>
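A "Low socket*core*thread count" reason usually means the hardware slurmd detects does not match the node definition in slurm.conf; once that is reconciled, the node can be returned to service, for example:

  # Compare what slurmd detects with what slurm.conf declares
  slurmd -C

  # After fixing the mismatch, clear the reason and resume the node
  scontrol update NodeName=demomaster1 State=RESUME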
ent where compute nodes are not on a physically secure private
> network)
>
> thanks
> --
> Simon Michnowicz
> Monash e-Research Centre
> PH: (03) 9902 0794
> Mob: 0418 302 046
> www.monash.edu.au/eresearch <http://www.monash.edu.au/eresearch>
--
---
Mehdi Denou
International HPC support
+336 45 57 66 56
> Regards
> Simon
>
> On 27 March 2015 at 21:47, Mehdi Denou <mehdi.de...@atos.net> wrote:
>
> Hi Simon,
>
> As far as I know, munge allows the communication to be
> authenticated, but it is not encrypted.
> If the key is compromised, you may c
> HealthCheckNodeState=CYCLE
> and
> decreasing/increasing bf_yield_interval/bf_yield_sleep without any apparent
> impact (please see
> slurm.conf attached).
>
> Any advice would be gratefully received.
>
> Many thanks -
>
> Stuart
>
>
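For context, the parameters mentioned above are set in slurm.conf along these lines; the values shown are illustrative defaults (microseconds for the backfill knobs), not the settings from the attached file:

  HealthCheckProgram=/usr/sbin/nhc        # placeholder health-check script
  HealthCheckInterval=300
  HealthCheckNodeState=CYCLE
  SchedulerParameters=bf_yield_interval=2000000,bf_yield_sleep=500000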
--
---
Mehdi Denou
International HPC support
+336 45 57 66 56
>
> > #SlurmctldTimeout=120
> > #SlurmdTimeout=300
> > #
> > # SCHEDULING
> > FastSchedule=1
> > SchedulerType=sched/backfill
> > #SchedulerPort=7321
> > #SelectType=select/serial
> > SelectType=select/cons_res
> > SelectTypeParameters=CR_CORE
> > #
> > # LOGGING AND ACCOUNTING
> > AccountingStorageType=accounting_storage/none
> > ClusterName=MESA-Web
> > #JobAcctGatherFrequency=30
> > JobAcctGatherType=jobacct_gather/none
> > SlurmctldDebug=3
> > SlurmctldLogFile=/var/log/slurm/slurmctld.log
> > SlurmdDebug=3
> > SlurmdLogFile=/var/log/slurm/slurmd.log
> > #
> > # COMPUTE NODES
> > NodeName=sod264 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=128940 TmpDisk=19895
> > PartitionName=compute Nodes=sod264 Default=YES STATE=UP
> >
> > Kind Regards,
> >
> > Carl
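With select/cons_res and CR_CORE on a two-core node, at most two single-core jobs can share sod264; a quick way to confirm what the controller actually loaded (a sketch, not output from the thread):

  # Check the consumable-resource settings as seen by slurmctld
  scontrol show config | grep -E 'SelectType|FastSchedule'

  # Two of these should run side by side on sod264 under CR_CORE
  sbatch --partition=compute --ntasks=1 --wrap="sleep 60"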
>
>
--
---
Mehdi Denou
International HPC support
+336 45 57 66 56
>>> not responding, setting DOWN
>> Now the nodes stop responding (not before).
>>
>>> From these logs, it looks like the compute nodes are not
>>> responding to the control node (master node).
>>>
>>> Not sure how to debug this - any tips?
>> I would suggest looking at the slurmd logs on the compute nodes to see
>> if they report any problems, and check to see what state the processes
>> are in - especially if they're stuck in a 'D' state waiting on some form
>> of device I/O.
>>
>> I know some people have reported strange interactions between Slurm
>> being on an NFSv4 mount (NFSv3 is fine).
>>
>> Good luck!
>> Chris
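A quick way to act on that advice on an unresponsive compute node (the log path assumes a default /var/log/slurm layout; adjust to the configured SlurmdLogFile):

  # Recent slurmd messages on the compute node
  tail -n 100 /var/log/slurm/slurmd.log

  # Processes stuck in uninterruptible sleep ('D'), typically blocked on I/O
  ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'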
--
---
Mehdi Denou
International HPC support
+336 45 57 66 56
ory /scratch/merri/jobs/2096417 has been allocated
> /slurm/uid_500/job_2096417/step_0
> salloc: Relinquishing job allocation 2096417
> salloc: Job allocation 2096417 has been revoked.
> [samuel@merri ~]$
>
>
> Hope that helps!
>
> All the best,
> Chris
--
---
Mehdi Denou
International HPC support
+336 45 57 66 56
>
> Thanks.
>
> -Paul Edmon-
--
---
Mehdi Denou
International HPC support
+336 45 57 66 56
--
---
Mehdi Denou
International HPC support
+336 45 57 66 56
> scontrol show job
>
>
> Regards
> A.Anandaraman
>
> --
> "Religion is just a set of symbols. It is being hijacked for political
> and monetary gains."
--
---
Mehdi Denou
Bull/Atos
International HPC support
+336 45 57 66 56
the first node where the job starts.
>
> if I run "srun/sbatch --mem 512" the job will fail because there is
> not enough memory on the other nodes.
>
> I have played with the Prolog/PrologSlurmctld variable to try to
> define RSS limits without success because prolog script
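If the goal is enforcing per-job memory (RSS) limits, the usual alternative to a prolog script is cgroup-based containment; a minimal sketch, not the poster's configuration:

  # slurm.conf
  TaskPlugin=task/cgroup
  SelectTypeParameters=CR_Core_Memory   # treat memory as a consumable resource

  # cgroup.conf
  ConstrainRAMSpace=yes                 # enforce --mem / --mem-per-cpu via cgroups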
switches may perform some firewall
functions by themselves?
Firewalls must be off between Slurm compute nodes as well as the
controller host. See
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
/Ole
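Where a firewall must stay enabled on the controller, the approach in the linked guide is roughly the following (default Slurm ports assumed; compute nodes simply run without a firewall):

  # Controller: open the slurmctld port (default 6817/tcp)
  firewall-cmd --permanent --add-port=6817/tcp
  firewall-cmd --reload

  # Compute nodes: disable the firewall entirely
  systemctl disable --now firewalld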
--
---
Mehdi Denou
Bull/Atos international HPC support
+336 45 57 66 56
--
Véronique Legrand
IT engineer – scientific calculation & software development
https://research.pasteur.fr/en/member/veronique-legrand/
Cluster and computing group
IT department
Institut Pasteur Paris
Tel : 95 03
--
---
Mehdi Denou
Bull/Atos international HPC support
+336 45 57 66 56