[slurm-dev] Re: Jobstep distribution among nodes

2014-04-04 Thread Mehdi Denou
> #AcctGatherEnergyType=acct_gather_energy/rapl
> #AcctGatherNodeFreq=30
>
> #Memoria
> #DefMemPerCPU=1024 # 1GB
> #MaxMemPerCPU=3072 # 3GB
>
> # COMPUTE NODES
> NodeName=foner[11-14] Procs=20 RealMemory=258126 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 State=UNKNOWN
> NodeName=foner[101-142] CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=64398 State=UNKNOWN
>
> PartitionName=thin Nodes=foner[103-142] Shared=NO PreemptMode=CANCEL State=UP MaxTime=4320 MinNodes=2
> PartitionName=thin_test Nodes=foner[101,102] Default=YES Shared=NO PreemptMode=CANCEL State=UP MaxTime=60 MaxNodes=1
> PartitionName=fat Nodes=foner[11-14] Shared=NO PreemptMode=CANCEL State=UP MaxTime=4320 MaxNodes=1
>
> ##END SLURM.CONF###

--
---
Mehdi Denou
International HPC support
+336 45 57 66 56

[slurm-dev] Re: Jobstep distribution among nodes

2014-04-04 Thread Mehdi Denou
> ...es on the first node and the other half on the second. Also tried to remove --nodes=2.
>
> ---
>
> It seems that it's the way sbatch influences srun. Is there any way to see which parameters the sbatch call transfers to srun?
>
> Thanks,
> Joan
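One way to see exactly what sbatch hands down: sbatch exports its allocation as SLURM_* environment variables inside the batch script, and srun picks those up as its defaults. A minimal sketch of such a job script (the srun command is only illustrative):

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks=25
    # List the SLURM_* variables sbatch sets; these are the parameters
    # that an srun launched inside the script inherits as defaults.
    env | grep '^SLURM_' | sort
    srun hostname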

[slurm-dev] Re: Jobstep distribution among nodes

2014-04-04 Thread Mehdi Denou
> ...the first one and 5 on the second one.
>
> Thanks and sorry for the confusion,
> Joan
>
> On 04/04/14 13:22, Mehdi Denou wrote:
>> It's a little bit confusing:
>>
>> When in sbatch I specify that I want to allocate 25 nodes and I execute

[slurm-dev] Re: Jobstep distribution among nodes

2014-04-04 Thread Mehdi Denou
> ...to create job step: More processors requested than permitted
>
> Thanks
>
> On 04/04/14 13:50, Mehdi Denou wrote:
>> Try with:
>> srun -N 1 -n 25
>>
>> On 04/04/2014 13:47, Joan Arbona wrote:
>>> Excuse me, I confused "Nodes" with "Tas

[slurm-dev] Re: Jobstep distribution among nodes

2014-04-07 Thread Mehdi Denou
Could you provide us with the slurm.conf?

On 04/04/2014 14:46, Joan Arbona wrote:
> Doesn't work either. I also tried with -m block:block with no luck...
>
> On 04/04/14 14:13, Mehdi Denou wrote:
>> Of course, -N 1 is wrong since you request more CPUs than are available on 1
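If the goal is an even split of the 25 tasks over the two 20-core nodes, it may be simpler to state the placement explicitly instead of relying on the default block placement; a sketch, with ./my_app as a placeholder:

    # cap the tasks per node so the 25 tasks land as 13 + 12
    srun --nodes=2 --ntasks=25 --ntasks-per-node=13 ./my_app
    # or ask for round-robin placement of the task ranks across the nodes
    srun --nodes=2 --ntasks=25 --distribution=cyclic ./my_app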

[slurm-dev] Re: Limiting no. of jobs per user for batch/interactive access

2014-12-02 Thread Mehdi Denou
Sysadmin who will need to implement any changes) at the appropriate documentation? Many thanks, Fiona -- --- Mehdi Denou International HPC support +336 45 57 66 56
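For reference, per-user job limits are usually attached to a QOS and only take effect when limit enforcement is on; a sketch with illustrative values, assuming a QOS named "normal":

    # slurm.conf
    AccountingStorageEnforce=limits,qos

    # cap running and queued jobs per user on the QOS
    sacctmgr modify qos where name=normal set MaxJobsPerUser=5 MaxSubmitJobsPerUser=20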

[slurm-dev] Re: Restricting number of jobs per user in partition

2015-01-14 Thread Mehdi Denou
, Loris -- --- Mehdi Denou International HPC support +336 45 57 66 56
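One possible approach for a per-partition cap, assuming slurmdbd with associations is in place, is to set MaxJobs on the user's partition-specific association (the user, account, and partition names below are hypothetical):

    sacctmgr add user joan account=research partition=thin
    sacctmgr modify user where name=joan partition=thin set MaxJobs=2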

[slurm-dev] Re: Paramater Analagous to MAXLOAD on Torque/Maui?

2015-01-29 Thread Mehdi Denou
Ryan Novosielski - Senior Technologist
novos...@rutgers.edu - 973/972.0922 (2x0922)
OIRT/High Perf & Res Comp - MSB C630, Newark
Rutgers Biomedical and Health Sciences

--
---
Mehdi Denou
International HPC support
+336 45 57 66 56

[slurm-dev] Re: Possible to exclude acct from priority weight factors?

2015-02-17 Thread Mehdi Denou
’ with sacctmgr from what I can tell. It looks like in a situation like this, the qos should have the “NoReserve” flag set too, correct? Thanks! Chris -- --- Mehdi Denou International HPC support +336 45 57 66 56
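For reference, the flag in question is set with sacctmgr; a sketch assuming a QOS named "low":

    sacctmgr modify qos where name=low set Flags=NoReserve
    sacctmgr show qos format=Name,Flags,Priority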

[slurm-dev] Re: Submission Defaults

2015-02-19 Thread Mehdi Denou
? The only default I have found is DefaultTime. http://slurm.schedmd.com/slurm.conf.html Regards, Brian -- --- Mehdi Denou International HPC support +336 45 57 66 56
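Most submission defaults live on the partition definition in slurm.conf; a sketch with illustrative values (the node list, times, and memory figure are placeholders):

    PartitionName=batch Nodes=node[01-10] Default=YES DefaultTime=01:00:00 MaxTime=24:00:00 DefMemPerCPU=1024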

[slurm-dev] Re: node getting again and again to drain or down state

2015-03-10 Thread Mehdi Denou

[slurm-dev] Re: Separate slurm-realdev list

2015-03-10 Thread Mehdi Denou
as to create > a separate list for "more advanced topics". > > Any news on this? > > cheers, > marcin -- --- Mehdi Denou International HPC support +336 45 57 66 56

[slurm-dev] Re: node getting again and again to drain or down state

2015-03-10 Thread Mehdi Denou
>> ...IMESTAMP            NODELIST
>> Not responding        root  2015-03-10T14:21:11  democlient1
>> Low socket*core*thre  root  2015-03-10T14:37:51  demomaster1
>> And I am attaching configuration file too.
>> Kindly see to it.
>>
>> -Original Message-
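The "Low socket*core*thread count" reason usually means the NodeName= line in slurm.conf does not match the hardware slurmd detects; a quick check and recovery sketch, using the node name from the output above:

    # print the hardware slurmd actually sees and compare it with slurm.conf
    slurmd -C
    # after fixing the NodeName= definition, clear the drain/down state
    scontrol update NodeName=demomaster1 State=RESUME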

[slurm-dev] Re: Slurm and MUNGE security

2015-03-27 Thread Mehdi Denou
> ...ent where compute nodes are not on a physically secure private network)
>
> thanks
> --
> Simon Michnowicz
> Monash e-Research Centre
> PH: (03) 9902 0794
> Mob: 0418 302 046
> www.monash.edu.au/eresearch

--
---
Mehdi Denou
International HPC support
+336 45 57 66 56

[slurm-dev] Re: Slurm and MUNGE security

2015-03-27 Thread Mehdi Denou
> ...rds
> Simon
>
> On 27 March 2015 at 21:47, Mehdi Denou <mehdi.de...@atos.net> wrote:
>
> Hi Simon,
>
> As far as I know, munge allows the communication to be authenticated, but it is not encrypted.
> If the key is compromised, you may c
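A quick way to confirm that MUNGE credentials verify across hosts (the hostname compute01 is a placeholder); note that this only checks authentication, the traffic itself is not encrypted:

    munge -n | unmunge                 # local round trip
    munge -n | ssh compute01 unmunge   # credential created here, decoded on the remote node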

[slurm-dev] Re: slurmctld thread number blowups leading to deadlock in 14.11.4

2015-03-27 Thread Mehdi Denou
> ...ealthCheckNodeState=CYCLE and decreasing/increasing bf_yield_interval/bf_yield_sleep without any apparent impact (please see slurm.conf attached).
>
> Any advice would be gratefully received.
>
> Many thanks -
>
> Stuart

--
---
Mehdi Denou
International HPC support
+336 45 57 66 56
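For reference, the backfill knobs mentioned above are set through SchedulerParameters, and sdiag shows the controller's current server thread count; a sketch with purely illustrative values (defaults and accepted options vary by release):

    # slurm.conf
    SchedulerParameters=bf_interval=60,bf_yield_interval=2000000,bf_yield_sleep=500000,bf_max_job_test=100

    # watch slurmctld thread count and backfill statistics
    sdiag | grep -iE 'thread|backfill'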

[slurm-dev] Re: Configuration Issues

2015-03-30 Thread Mehdi Denou
> #SlurmctldTimeout=120
> #SlurmdTimeout=300
> #
> # SCHEDULING
> FastSchedule=1
> SchedulerType=sched/backfill
> #SchedulerPort=7321
> #SelectType=select/serial
> SelectType=select/cons_res
> SelectTypeParameters=CR_CORE
> #
> # LOGGING AND ACCOUNTING
> AccountingStorageType=accounting_storage/none
> ClusterName=MESA-Web
> #JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=3
> SlurmctldLogFile=/var/log/slurm/slurmctld.log
> SlurmdDebug=3
> SlurmdLogFile=/var/log/slurm/slurmd.log
> #
> # COMPUTE NODES
> NodeName=sod264 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=128940 TmpDisk=19895
>
> PartitionName=compute Nodes=sod264 Default=YES STATE=UP
>
> Kind Regards,
> Carl

--
---
Mehdi Denou
International HPC support
+336 45 57 66 56

[slurm-dev] Re: Problems running job

2015-03-31 Thread Mehdi Denou
>>> ... not responding, setting DOW
>> Now the nodes stop responding (not before).
>>> From these logs, it looks like the compute nodes are not responding to the control node (master node).
>>>
>>> Not sure how to debug this - any tips?
>> I would suggest looking at the slurmd logs on the compute nodes to see if they report any problems, and check to see what state the processes are in - especially if they're stuck in a 'D' state waiting on some form of device I/O.
>>
>> I know some people have reported strange interactions between Slurm being on an NFSv4 mount (NFSv3 is fine).
>>
>> Good luck!
>> Chris

--
---
Mehdi Denou
International HPC support
+336 45 57 66 56
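A small diagnostic sketch along those lines, run on a compute node (the log path is assumed to match the slurm.conf excerpts elsewhere in this archive):

    # list processes stuck in uninterruptible I/O wait ('D' state)
    ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /D/'
    # and check the local slurmd log for errors
    tail -n 100 /var/log/slurm/slurmd.log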

[slurm-dev] Re: cgroups support in slurm (sbatch vs salloc)

2015-05-07 Thread Mehdi Denou
> ...ory /scratch/merri/jobs/2096417 has been allocated
> /slurm/uid_500/job_2096417/step_0
> salloc: Relinquishing job allocation 2096417
> salloc: Job allocation 2096417 has been revoked.
> [samuel@merri ~]$
>
> Hope that helps!
>
> All the best,
> Chris

--
---
Mehdi Denou
International HPC support
+336 45 57 66 56
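For reference, a minimal cgroup setup of that era looked roughly like the sketch below; exact option names vary between Slurm releases, so treat this as an outline rather than a drop-in config:

    # slurm.conf
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup

    # cgroup.conf
    CgroupAutomount=yes
    ConstrainCores=yes
    ConstrainRAMSpace=yes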

[slurm-dev] Re: Slurmctld Thread Count

2015-07-09 Thread Mehdi Denou
> > Thanks. > > -Paul Edmon- -- --- Mehdi Denou International HPC support +336 45 57 66 56

[slurm-dev] Re: Nodes are getting DOWN state

2015-08-25 Thread Mehdi Denou
--
---
Mehdi Denou
International HPC support
+336 45 57 66 56

[slurm-dev] Re: scontrol command allows all the users to see all the job detail

2015-08-27 Thread Mehdi Denou
> scontrol show job
>
> Regards
> A.Anandaraman
>
> --
> "Religion is just a set of symbols. It is being hijacked for political and monetary gains."

--
---
Mehdi Denou
Bull/Atos
International HPC support
+336 45 57 66 56
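For reference, hiding other users' job details from squeue/scontrol is done with PrivateData in slurm.conf; a minimal sketch:

    # regular users can then only see their own jobs and usage
    PrivateData=jobs,usage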

[slurm-dev] Re: Problem to run a job with more memory only on the node where the job start

2015-10-20 Thread Mehdi Denou
> ...he first node where the job starts.
>
> If I run "srun/sbatch --mem 512" the job will fail because there is not enough memory on the other nodes.
>
> I have played with the Prolog/PrologSlurmctld variables to try to define RSS limits, without success because the prolog script
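Worth keeping in mind: --mem is a per-node limit that applies to every node of the allocation, while --mem-per-cpu scales with the CPUs granted on each node; a sketch (job.sh and the numbers are placeholders):

    sbatch --nodes=2 --ntasks=20 --mem-per-cpu=512 job.sh   # 512 MB per allocated CPU
    sbatch --nodes=2 --ntasks=20 --mem=512 job.sh           # 512 MB on each node, regardless of CPUs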

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-05 Thread Mehdi Denou
switches may perform some firewall functions by themselves? Firewalls must be off between Slurm compute nodes as well as the controller host. See https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons /Ole -- --- Mehdi Denou Bull/Atos international HPC support +336 45 57 66 56
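For reference, slurmctld and slurmd listen on fixed ports (6817 and 6818 by default, settable with SlurmctldPort/SlurmdPort), while srun uses ephemeral ports unless SrunPortRange is set; if the firewall cannot simply be disabled, a firewalld sketch assuming the default ports:

    firewall-cmd --permanent --add-port=6817-6818/tcp
    firewall-cmd --reload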

[slurm-dev] Re: slurm database purge,

2017-10-23 Thread Mehdi Denou
...ique

--
Véronique Legrand
IT engineer – scientific calculation & software development
https://research.pasteur.fr/en/member/veronique-legrand/
Cluster and computing group
IT department
Institut Pasteur Paris
Tel : 95 03

--
---
Mehdi Denou
Bull/Atos
international HPC support
+336 45 57 66 56
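For reference, accounting retention is controlled by the purge settings in slurmdbd.conf; a sketch with illustrative values:

    PurgeEventAfter=1month
    PurgeSuspendAfter=1month
    PurgeStepAfter=1month
    PurgeJobAfter=12month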