Hi, I made some progress in understanding the problem I reported some weeks ago:
https://lists.schedmd.com/pipermail/slurm-users/2023-May/010027.html

I noticed that the intermittent connection timeout I am experiencing occurs only when the TCP-based direct connection is used to establish communication between stepd on different nodes. When I disable the optimized direct connection using

export SLURM_PMIX_DIRECT_CONN=false

the submission of hetjobs is stable and no connection timeout occurs anymore.

Any idea what can go wrong when using the TCP-based direct connection together with hetjobs?

Cheers,
Denis

---------
Denis Bertini
Abteilung: CIT
Ort: SB3 2.265a
Tel: +49 6159 71 2240
Fax: +49 6159 71 2986
E-Mail: d.bert...@gsi.de

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de
Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung: Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats: Ministerialdirigent Dr. Volkmar Dietz
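
P.S. A minimal sketch of the workaround described above, for anyone who wants to try it. Only the SLURM_PMIX_DIRECT_CONN setting comes from this message; the two-component srun line, the task counts, and the program names ./app_a and ./app_b are placeholder assumptions and would need to be adapted to the actual hetjob.

```shell
#!/bin/bash
# Workaround from the message: disable the PMIx direct-connection path
# between the stepd on different nodes.
export SLURM_PMIX_DIRECT_CONN=false

# Hypothetical heterogeneous step with two components separated by ":".
# ./app_a and ./app_b are placeholders, not the real applications.
if command -v srun >/dev/null 2>&1; then
    srun --mpi=pmix -N 1 -n 1 ./app_a : -N 1 -n 1 ./app_b
else
    # Outside a Slurm environment, just report the setting.
    echo "srun not found; SLURM_PMIX_DIRECT_CONN=${SLURM_PMIX_DIRECT_CONN}"
fi
```

Exporting the variable before the srun call means every component of the step inherits it, which is presumably what makes the setting apply to all stepd involved.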