Hi,

I am new to Slurm and am having trouble getting my tasks scheduled
correctly.

I have a very small cluster (if it can even be called a cluster) with only
one node for the moment. The node is a dual-socket Xeon with 14 cores per
socket, hyper-threaded, with 256 GB of memory, running CentOS 7.3.


I have a single-threaded process that I would like to run over a series of
input files (around 370). The packed-jobs scenario seems to fit what I am
trying to achieve, so I would like to run 50 instances of my process at the
same time, each on a different input file.
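
To make the intended behaviour concrete: the goal is a fixed number of
single-threaded workers, each picking up the next input file as soon as a
slot frees up. The Python sketch below shows that pattern with a local
thread pool instead of Slurm job steps, so it only illustrates the target
behaviour and is not the attached preprocess.sh. The `run_all` helper, the
placeholder command and the file names are all hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_all(command, input_files, max_parallel=50):
    """Run `command <file>` for every input file, at most max_parallel at once.

    Returns a dict mapping each input file to the exit code of its process.
    """
    def run_one(path):
        # Each worker thread blocks on one child process, so the pool size
        # bounds the number of processes running at the same time.
        proc = subprocess.run(list(command) + [path], stdout=subprocess.DEVNULL)
        return path, proc.returncode

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return dict(pool.map(run_one, input_files))


# Placeholder command and file names (assumptions, not the real workload):
# a child Python process that ignores its argument and exits with status 0.
files = [f"input_{i:03d}.dat" for i in range(5)]
placeholder = [sys.executable, "-c", "import sys; sys.exit(0)"]
results = run_all(placeholder, files, max_parallel=2)
for name, code in results.items():
    print(name, code)
```

With around 370 input files and `max_parallel=50`, the number of running
processes should stay at 50 until fewer than 50 files remain, which is the
load profile expected from the packed job.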


As soon as I submit my script I can see 50 instances of my process started
and running, but shortly afterwards only 5 or so are still running. So I
only get full load for the first 50 instances and not afterwards.


In slurmctld.log I see messages of this type:

"[2017-11-06T11:56:39.228] job_step_signal step 1489.107 not found"

and in my script's output file I see:

"srun: Job step creation temporarily disabled, retrying"


At this point I'm sifting through the documentation and online information
trying to figure out what is going on. I have attached my slurmctld log
file, my Slurm config file, my script, and the output I get from sinfo,
stat and the like.
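
To see which step IDs are involved and how often each is reported, the
"job_step_signal ... not found" lines can be tallied with a few lines of
Python. This is only a rough sketch: the regular expression is written
against the message format quoted earlier in this mail, and the
`summarize_missing_steps` helper is a hypothetical name, not a Slurm tool.

```python
import re
from collections import Counter

# Matches lines such as:
# [2017-11-06T11:56:39.228] job_step_signal step 1489.107 not found
STEP_NOT_FOUND = re.compile(
    r"^\[(?P<ts>[^\]]+)\] job_step_signal step (?P<job>\d+)\.(?P<step>\d+) not found"
)


def summarize_missing_steps(lines):
    """Count 'job_step_signal ... not found' messages per (job, step) pair."""
    counts = Counter()
    for line in lines:
        match = STEP_NOT_FOUND.match(line.strip())
        if match:
            key = (int(match.group("job")), int(match.group("step")))
            counts[key] += 1
    return counts


# A few lines copied from the log; a real run would pass open("slurmctld.log").
sample = [
    "[2017-11-06T11:52:12.835] job_step_signal step 1489.56 not found",
    "[2017-11-06T11:52:12.856] job_step_signal step 1489.61 not found",
    "[2017-11-06T11:52:58.625] job_step_signal step 1489.61 not found",
    "[2017-11-06T11:56:39.048] slurmctld: agent retry_list size is 101",
]
counts = summarize_missing_steps(sample)
for (job, step), n in sorted(counts.items()):
    print(f"{job}.{step}: {n}")
```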


Any pointers on how to attack this problem would be much appreciated.


Thank you



--
Marius Cetateanu | Senior Software Engineer

Attachment: slurm.conf
Attachment: tools.out
Attachment: preprocess.sh

[2017-11-06T11:38:26.623] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=0
[2017-11-06T11:40:45.063] _slurm_rpc_submit_batch_job JobId=1489 usec=505
[2017-11-06T11:40:45.625] backfill: Started JobId=1489 in main_compute on cn_burebista
[2017-11-06T11:40:45.697] _pick_step_nodes: Configuration for job 1489 is complete
[2017-11-06T11:44:48.289] slurmctld: agent retry_list size is 101
[2017-11-06T11:44:48.289]    retry_list msg_type=7009,7009,7009,7009,7009
[2017-11-06T11:51:12.132] slurmctld: agent retry_list size is 101
[2017-11-06T11:51:12.132]    retry_list msg_type=7009,7009,7009,7009,7009
[2017-11-06T11:52:12.835] job_step_signal step 1489.56 not found
[2017-11-06T11:52:12.835] job_step_signal step 1489.59 not found
[2017-11-06T11:52:12.835] job_step_signal step 1489.63 not found
[2017-11-06T11:52:12.835] job_step_signal step 1489.52 not found
[2017-11-06T11:52:12.838] job_step_signal step 1489.53 not found
[2017-11-06T11:52:12.842] job_step_signal step 1489.54 not found
[2017-11-06T11:52:12.842] job_step_signal step 1489.60 not found
[2017-11-06T11:52:12.856] job_step_signal step 1489.58 not found
[2017-11-06T11:52:12.856] job_step_signal step 1489.61 not found
[2017-11-06T11:52:12.862] job_step_signal step 1489.55 not found
[2017-11-06T11:52:12.875] job_step_signal step 1489.62 not found
[2017-11-06T11:52:12.884] job_step_signal step 1489.51 not found
[2017-11-06T11:52:13.007] job_step_signal step 1489.57 not found
[2017-11-06T11:52:13.191] job_step_signal step 1489.50 not found
[2017-11-06T11:52:58.625] job_step_signal step 1489.61 not found
[2017-11-06T11:52:58.625] job_step_signal step 1489.57 not found
[2017-11-06T11:52:58.625] job_step_signal step 1489.55 not found
[2017-11-06T11:52:58.629] job_step_signal step 1489.63 not found
[2017-11-06T11:52:58.629] job_step_signal step 1489.51 not found
[2017-11-06T11:53:07.261] job_step_signal step 1489.52 not found
[2017-11-06T11:53:10.916] job_step_signal step 1489.62 not found
[2017-11-06T11:53:15.863] job_step_signal step 1489.54 not found
[2017-11-06T11:53:15.863] job_step_signal step 1489.53 not found
[2017-11-06T11:53:15.863] job_step_signal step 1489.59 not found
[2017-11-06T11:53:15.864] job_step_signal step 1489.58 not found
[2017-11-06T11:53:15.864] job_step_signal step 1489.50 not found
[2017-11-06T11:53:15.864] job_step_signal step 1489.60 not found
[2017-11-06T11:53:15.866] job_step_signal step 1489.56 not found
[2017-11-06T11:56:39.048] slurmctld: agent retry_list size is 101
[2017-11-06T11:56:39.048]    retry_list msg_type=7009,7009,7009,7009,7009
[2017-11-06T11:56:39.216] job_step_signal step 1489.100 not found
[2017-11-06T11:56:39.216] job_step_signal step 1489.99 not found
[2017-11-06T11:56:39.228] job_step_signal step 1489.95 not found
[2017-11-06T11:56:39.228] job_step_signal step 1489.107 not found
[2017-11-06T11:56:39.228] job_step_signal step 1489.109 not found
[2017-11-06T11:56:39.228] job_step_signal step 1489.96 not found
[2017-11-06T11:56:39.228] job_step_signal step 1489.94 not found
[2017-11-06T11:56:39.228] job_step_signal step 1489.101 not found
[2017-11-06T11:56:39.228] job_step_signal step 1489.98 not found
[2017-11-06T11:56:39.240] job_step_signal step 1489.106 not found
[2017-11-06T11:56:39.240] job_step_signal step 1489.103 not found
[2017-11-06T11:56:39.240] job_step_signal step 1489.105 not found
[2017-11-06T11:56:39.240] job_step_signal step 1489.104 not found
[2017-11-06T11:56:39.240] job_step_signal step 1489.102 not found
[2017-11-06T11:56:39.275] job_step_signal step 1489.108 not found
[2017-11-06T11:56:39.286] job_step_signal step 1489.97 not found
