I've got a weird problem on our Slurm cluster. If I submit lots of R jobs to
the queue, then as soon as more than about 7 of them are running at the same
time I start to get failures saying:

/bi/apps/R/4.0.4/lib64/R/bin/exec/R: error while loading shared libraries: libpcre2-8.so.0: cannot open shared object file: No such file or directory

This makes no sense, because that library is definitely there, and other jobs
on the same nodes worked both before and after the failed jobs. I recently ran
500 identical jobs and 152 of them failed in this way.

There are no errors in the log files on the compute nodes where this failed,
and it happens across multiple nodes, so it's not a single node being strange.
The R binary is on an Isilon network share, but the libpcre2 library is on the
local disk of each node.
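
To narrow this down, a check along the following lines could run at the start
of each job, before R is launched. This is only a minimal sketch: it assumes
python3 and ldd are available on the compute nodes, the function names
(missing_libraries, check_binary, report) are made up for illustration, and
the binary path is simply the one from the error message above.

```python
import os
import socket
import subprocess

# Path copied from the error message; adjust if the layout differs.
R_BINARY = "/bi/apps/R/4.0.4/lib64/R/bin/exec/R"


def missing_libraries(ldd_output):
    """Return the names of libraries that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            # Lines look like: "libpcre2-8.so.0 => not found"
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(path):
    """Run ldd on the binary and return any unresolved library names."""
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=False
    )
    return missing_libraries(result.stdout)


def report(path=R_BINARY):
    """Build a one-line summary of the node, job, and unresolved libraries."""
    return "host={} job={} LD_LIBRARY_PATH={} missing={}".format(
        socket.gethostname(),
        os.environ.get("SLURM_JOB_ID", "unknown"),
        os.environ.get("LD_LIBRARY_PATH", ""),
        check_binary(path),
    )
```

Printing report() to the job's stderr in every job would show whether the
failures line up with particular nodes, or with a different LD_LIBRARY_PATH,
at the moment the library lookup fails.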

Anyone come across anything like this before?  Any suggestions for fixes?

Thanks

Simon.
