On 28/06/2022 23:14, Chris Samuel wrote:
> On 28/6/22 12:19 pm, Jean-Christophe HAESSIG wrote:

Hi,

> I suspect this is where your error is happening:
>
> https://github.com/SchedMD/slurm/blob/1ce55318222f89fbc862ce559edfd17e911fee38/src/common/plugin.c#L284
>

Yes, I also found it, and that's where I saw the detailed debug3 & debug4 calls.

> it's when it's checking it can load the plugin and not hit any
> unresolved library symbols. The fact you are hitting this sounds like
> you're missing libraries from the compute nodes that are present on the
> login node (or there's some reason they're not getting found if present).

Reading the code, it's not 100% clear where these libraries are loaded from. I think it's all the stuff from /usr/lib/x86_64-linux-gnu/slurm-wlm/, but everything seems to be there. These libraries in turn have their own dependencies, but I don't see how libraries could still have undefined symbols once all the dependency loading/resolution is over.

> This depends on what part of Slurm is generating these errors, is this
> something like sbatch or srun? If so using multiple -v's will increase
> the debug level so you can pick those up. If it's from slurmd then
> you'll want to set SlurmdDebug to "debug3" in your slurm.conf.

No, the job is placed through the DRMAA API, which lets programs place jobs in a cluster-agnostic way. The program doesn't know it is talking to Slurm. The DRMAA library does the translation and loads libslurm36, which is where the messages come from. That's why I don't know how to tell libslurm to log more, since its use is hidden behind DRMAA. I have both a test using the Python binding for DRMAA and a test in pure C, and they behave the same.

Thank you,
J.C. Haessig
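
For reference, the unresolved-symbol check discussed above can be approximated outside of DRMAA. This is only a minimal sketch, not what Slurm's plugin.c does internally: it assumes a Linux node with Python 3 and loads each shared object given on the command line with immediate symbol binding (RTLD_NOW), printing the dynamic loader's own error message on failure. The helper name try_load is made up for illustration. A plugin that expects symbols from the program loading it may report them as undefined when loaded alone, so a failure here is a hint rather than a diagnosis.

```python
import ctypes
import os
import sys


def try_load(path):
    """Load a shared object with immediate symbol binding.

    Returns None on success, or the loader's error message on failure.
    (Hypothetical helper for illustration; not part of Slurm or DRMAA.)
    """
    try:
        # RTLD_NOW asks the dynamic loader to resolve every symbol at load
        # time, so a missing dependency or undefined symbol shows up here.
        ctypes.CDLL(path, mode=os.RTLD_NOW)
    except OSError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # Paths come from the command line, e.g. the .so files under
    # /usr/lib/x86_64-linux-gnu/slurm-wlm/ (the directory mentioned above).
    for plugin in sys.argv[1:]:
        error = try_load(plugin)
        print(f"{plugin}: {'OK' if error is None else error}")
```

Running it on both the login node and a compute node with the same plugin paths would show whether the loader reports different errors on the two machines.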