Hi, I see that parts of my message were scrubbed. I will try to post the relevant info below (if that does not abide by the mailing list rules, please let me know and point me in the right direction for conveying this kind of information).
slurmctld.log

[2017-11-06T11:38:26.623] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=0
[2017-11-06T11:40:45.063] _slurm_rpc_submit_batch_job JobId=1489 usec=505
[2017-11-06T11:40:45.625] backfill: Started JobId=1489 in main_compute on cn_burebista
[2017-11-06T11:40:45.697] _pick_step_nodes: Configuration for job 1489 is complete
[2017-11-06T11:44:48.289] slurmctld: agent retry_list size is 101
[2017-11-06T11:44:48.289] retry_list msg_type=7009,7009,7009,7009,7009
[2017-11-06T11:51:12.132] slurmctld: agent retry_list size is 101
[2017-11-06T11:51:12.132] retry_list msg_type=7009,7009,7009,7009,7009
[2017-11-06T11:52:12.835] job_step_signal step 1489.56 not found
[2017-11-06T11:52:12.835] job_step_signal step 1489.59 not found
[2017-11-06T11:52:12.835] job_step_signal step 1489.63 not found
[2017-11-06T11:52:12.835] job_step_signal step 1489.52 not found
[2017-11-06T11:52:12.838] job_step_signal step 1489.53 not found
[2017-11-06T11:52:12.842] job_step_signal step 1489.54 not found
[2017-11-06T11:52:12.842] job_step_signal step 1489.60 not found
[2017-11-06T11:52:12.856] job_step_signal step 1489.58 not found
[2017-11-06T11:52:12.856] job_step_signal step 1489.61 not found
[2017-11-06T11:52:12.862] job_step_signal step 1489.55 not found
[2017-11-06T11:52:12.875] job_step_signal step 1489.62 not found
[2017-11-06T11:52:12.884] job_step_signal step 1489.51 not found
[2017-11-06T11:52:13.007] job_step_signal step 1489.57 not found
[2017-11-06T11:52:13.191] job_step_signal step 1489.50 not found
[2017-11-06T11:52:58.625] job_step_signal step 1489.61 not found
[2017-11-06T11:52:58.625] job_step_signal step 1489.57 not found
[2017-11-06T11:52:58.625] job_step_signal step 1489.55 not found

script.sh

#!/bin/sh
# job parameters
#SBATCH --job-name=preprocess_movies
#SBATCH --output=preprocess_movies.log
# needed resources
#SBATCH --ntasks=50
#SBATCH --mem-per-cpu=2GB

shopt -s globstar

FILES=$(find ./full_training/raw -name "*.skv")
OUTPUT=./output_files

# operations
echo "[preprocess] Job started at $(date)"

# job steps
for file in ${FILES}
do
    echo "[preprocess] Processing file: ${file##*/}"
    echo "[preprocess] Output to: $OUTPUT/${file##*/}"
    srun -n1 --exclusive ./preprocessing-r 1 0 1 ${file} $OUTPUT/${file##*/} &
done
wait

echo "[preprocess] Job ended at $(date)"

sinfo, squeue output

% sinfo -Nle
Mon Nov 6 12:31:51 2017
NODELIST      NODES PARTITION      STATE CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
cn_burebista      1 main_compute*  mixed   56 2:14:2 256000        0      1 (null)   none

% squeue
 JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
  1489 main_comp preproce mcetatea  R 51:11     1 cn_burebista
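For reference, a minimal sketch (not taken from the thread) of the same workload written as a Slurm job array, where the %50 throttle caps how many array tasks run at once instead of backgrounding srun steps. The 0-369 range is an assumption based on the roughly 370 input files mentioned in the original message below, the paths and program arguments are copied from script.sh above, and the file list is assumed to stay unchanged between submission and execution.

#!/bin/bash
# Hypothetical job-array variant of script.sh: one array task per input file,
# at most 50 tasks running concurrently ("%50" throttle).
#SBATCH --job-name=preprocess_movies_array
#SBATCH --output=preprocess_%A_%a.log
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=2GB
#SBATCH --array=0-369%50

# Build the same (sorted) file list in every array task and pick the entry
# matching this task's index; assumes filenames contain no whitespace.
FILES=( $(find ./full_training/raw -name "*.skv" | sort) )
FILE=${FILES[$SLURM_ARRAY_TASK_ID]}
OUTPUT=./output_files

echo "[preprocess] Processing file: ${FILE##*/}"
srun ./preprocessing-r 1 0 1 "${FILE}" "${OUTPUT}/${FILE##*/}"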
slurm.conf

# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=draco
ControlMachine=zalmoxis
#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/tmp
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
#FirstJobId=
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
UsePAM=1
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
# OpenHPC default configuration
PropagateResourceLimitsExcept=MEMLOCK
SlurmdLogFile=/var/log/slurm.log
SlurmctldLogFile=/var/log/slurmctld.log
Epilog=/etc/slurm/slurm.epilog.clean
# ?? How to set up more node names ??
NodeName=cn_burebista Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=256000 State=UNKNOWN
PartitionName=normal Nodes=cn_burebista Default=YES MaxTime=24:00:00 State=UP
ReturnToService=1
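Regarding the "?? How to set up more node names ??" comment in the config above: a hedged sketch of the usual pattern, using a bracketed hostname range. The cn[01-04] names are invented for illustration, and the hardware figures are simply copied from the existing cn_burebista definition.

# Hypothetical example only - hostnames cn[01-04] are made up; adjust
# Sockets/CoresPerSocket/ThreadsPerCore/RealMemory to the real hardware.
NodeName=cn[01-04] Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=256000 State=UNKNOWN
PartitionName=main_compute Nodes=cn_burebista,cn[01-04] Default=YES MaxTime=24:00:00 State=UP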
----------------------------------------------------------------------

Message: 1
Date: Tue, 7 Nov 2017 09:12:17 +0000
From: Marius Cetateanu <mcetate...@softkinetic.com>
To: "slurm-users@lists.schedmd.com" <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Having errors trying to run a packed jobs script
Message-ID: <am0pr0502mb36829b5109e51c3c823badbbd9...@am0pr0502mb3682.eurprd05.prod.outlook.com>
Content-Type: text/plain; charset="iso-8859-1"

Hi,

I am new to Slurm and I'm having some issues scheduling my tasks correctly. I have a very small cluster (if it can even be called a cluster) with only one node for the moment; the node is a dual Xeon with 14 cores/socket, hyper-threaded, and 256GB of memory, running CentOS 7.3. I have a single-threaded process which I would like to run over a series of input files (around 370). I have found that the packed-jobs scenario fits what I'm trying to achieve, so I would like to run 50 instances of my process at the same time over different input files. The moment I submit my script I can see 50 instances of my process started and running, but just a bit afterwards only 5 or so of them are still running - so I only get full load for the first 50 instances and not afterwards.

In the slurmctld.log I can see this type of message:

"[2017-11-06T11:56:39.228] job_step_signal step 1489.107 not found"

and in my script output file I can see:

"srun: Job step creation temporarily disabled, retrying"

At this point I'm sifting through documentation and online info trying to figure out what is going on. I have attached my slurmctld log file, slurm config file, script and the output I get from sinfo, stat and the like.

Any pointers on how to attack this problem would be much appreciated.

Thank you

--
Marius Cetateanu | Senior Software Engineer
T +32 2 888 42 60
F +32 2 647 48 55
E m...@softkinetic.com
YT www.youtube.com/softkinetic

Boulevard de la Plaine 11, 1050, Brussels, Belgium
Registration No: RPM/RPR Brussels 0811 784 189

Our e-mail communication disclaimers & liability are available at:
www.softkinetic.com/disclaimer.aspx
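For reference, a minimal set of standard Slurm commands that can show what the steps of job 1489 (the job id from the log above) are doing while it runs. This is a sketch, not part of the original message; note that sacct only returns data if job accounting is enabled, which the posted slurm.conf leaves commented out.

# List the job steps the controller currently knows about
squeue --steps

# Full controller view of the job: allocated CPUs/memory, limits, reasons
scontrol show job 1489

# Per-step accounting (requires accounting storage / jobacct_gather)
sacct -j 1489 --format=JobID,JobName,State,Elapsed,AllocCPUS,MaxRSS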
End of slurm-users Digest, Vol 1, Issue 5
*****************************************