Hi,
I am new to Slurm and I'm having some issues scheduling my tasks correctly.

I have a very small cluster (if it can even be called a cluster) with only one node for the moment. The node is a dual Xeon with 14 cores per socket, hyper-threaded, with 256 GB of memory, running CentOS 7.3.

I have a single-threaded process which I would like to run over a series of input files (around 370). I have found that the packed jobs scenario fits what I'm trying to achieve, so I would like to run 50 instances of my process at the same time over different input files.

When I submit my script, I can see that 50 instances of my process are started and running, but shortly afterwards I can see only 5 or so of them running. So I only get full load for the first 50 instances and not afterwards.

In slurmctld.log I can see messages like this:
"[2017-11-06T11:56:39.228] job_step_signal step 1489.107 not found"
and in my script's output file I can see:
"srun: Job step creation temporarily disabled, retrying"

At this point I'm sifting through documentation and online information, trying to figure out what is going on.

I have attached my slurmctld log file, Slurm config file, script, and the output I get from sinfo, stat and the like.

Any pointers on how to attack this problem would be much appreciated.

Thank you
<pre>
--
Marius Cetateanu | Senior Software Engineer
T +32 2 888 42 60
F +32 2 647 48 55
E m...@softkinetic.com
YT www.youtube.com/softkinetic
Boulevard de la Plaine 11, 1050, Brussels, Belgium
Registration No: RPM/RPR Brussels 0811 784 189
Our e-mail communication disclaimers & liability are available at: www.softkinetic.com/disclaimer.aspx
</pre>
Attachments:
slurm.conf
tools.out
preprocess.sh
[2017-11-06T11:38:26.623] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=0
[2017-11-06T11:40:45.063] _slurm_rpc_submit_batch_job JobId=1489 usec=505
[2017-11-06T11:40:45.625] backfill: Started JobId=1489 in main_compute on cn_burebista
[2017-11-06T11:40:45.697] _pick_step_nodes: Configuration for job 1489 is complete
[2017-11-06T11:44:48.289] slurmctld: agent retry_list size is 101
[2017-11-06T11:44:48.289] retry_list msg_type=7009,7009,7009,7009,7009
[2017-11-06T11:51:12.132] slurmctld: agent retry_list size is 101
[2017-11-06T11:51:12.132] retry_list msg_type=7009,7009,7009,7009,7009
[2017-11-06T11:52:12.835] job_step_signal step 1489.56 not found
[2017-11-06T11:52:12.835] job_step_signal step 1489.59 not found
[2017-11-06T11:52:12.835] job_step_signal step 1489.63 not found
[2017-11-06T11:52:12.835] job_step_signal step 1489.52 not found
[2017-11-06T11:52:12.838] job_step_signal step 1489.53 not found
[2017-11-06T11:52:12.842] job_step_signal step 1489.54 not found
[2017-11-06T11:52:12.842] job_step_signal step 1489.60 not found
[2017-11-06T11:52:12.856] job_step_signal step 1489.58 not found
[2017-11-06T11:52:12.856] job_step_signal step 1489.61 not found
[2017-11-06T11:52:12.862] job_step_signal step 1489.55 not found
[2017-11-06T11:52:12.875] job_step_signal step 1489.62 not found
[2017-11-06T11:52:12.884] job_step_signal step 1489.51 not found
[2017-11-06T11:52:13.007] job_step_signal step 1489.57 not found
[2017-11-06T11:52:13.191] job_step_signal step 1489.50 not found
[2017-11-06T11:52:58.625] job_step_signal step 1489.61 not found
[2017-11-06T11:52:58.625] job_step_signal step 1489.57 not found
[2017-11-06T11:52:58.625] job_step_signal step 1489.55 not found
[2017-11-06T11:52:58.629] job_step_signal step 1489.63 not found
[2017-11-06T11:52:58.629] job_step_signal step 1489.51 not found
[2017-11-06T11:53:07.261] job_step_signal step 1489.52 not found
[2017-11-06T11:53:10.916] job_step_signal step 1489.62 not found
[2017-11-06T11:53:15.863] job_step_signal step 1489.54 not found
[2017-11-06T11:53:15.863] job_step_signal step 1489.53 not found
[2017-11-06T11:53:15.863] job_step_signal step 1489.59 not found
[2017-11-06T11:53:15.864] job_step_signal step 1489.58 not found
[2017-11-06T11:53:15.864] job_step_signal step 1489.50 not found
[2017-11-06T11:53:15.864] job_step_signal step 1489.60 not found
[2017-11-06T11:53:15.866] job_step_signal step 1489.56 not found
[2017-11-06T11:56:39.048] slurmctld: agent retry_list size is 101
[2017-11-06T11:56:39.048] retry_list msg_type=7009,7009,7009,7009,7009
[2017-11-06T11:56:39.216] job_step_signal step 1489.100 not found
[2017-11-06T11:56:39.216] job_step_signal step 1489.99 not found
[2017-11-06T11:56:39.228] job_step_signal step 1489.95 not found
[2017-11-06T11:56:39.228] job_step_signal step 1489.107 not found
[2017-11-06T11:56:39.228] job_step_signal step 1489.109 not found
[2017-11-06T11:56:39.228] job_step_signal step 1489.96 not found
[2017-11-06T11:56:39.228] job_step_signal step 1489.94 not found
[2017-11-06T11:56:39.228] job_step_signal step 1489.101 not found
[2017-11-06T11:56:39.228] job_step_signal step 1489.98 not found
[2017-11-06T11:56:39.240] job_step_signal step 1489.106 not found
[2017-11-06T11:56:39.240] job_step_signal step 1489.103 not found
[2017-11-06T11:56:39.240] job_step_signal step 1489.105 not found
[2017-11-06T11:56:39.240] job_step_signal step 1489.104 not found
[2017-11-06T11:56:39.240] job_step_signal step 1489.102 not found
[2017-11-06T11:56:39.275] job_step_signal step 1489.108 not found
[2017-11-06T11:56:39.286] job_step_signal step 1489.97 not found