On 06/10/20 13:45, Riebs, Andy wrote:

Well, the cluster is quite heterogeneous, and node bl0-02 only has 24 threads
available:

str957-bl0-02:~$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              24
On-line CPU(s) list: 0-23
Thread(s) per core:  2
Core(s) per socket:  6
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               45
Model name:          Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
Stepping:            7
CPU MHz:             1943.442
CPU max MHz:         2500,0000
CPU min MHz:         1200,0000
BogoMIPS:            4000.26
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            15360K
NUMA node0 CPU(s):   0-5,12-17
NUMA node1 CPU(s):   6-11,18-23
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts

str957-bl0-03:~$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping:            2
CPU MHz:             2400.142
CPU max MHz:         2300,0000
CPU min MHz:         1200,0000
BogoMIPS:            4800.28
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            20480K
NUMA node0 CPU(s):   0-7,16-23
NUMA node1 CPU(s):   8-15,24-31
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti intel_ppin tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm arat pln pts

Another couple of nodes do have 32 threads, but with AMD CPUs... The same
problem happened in the past, and it seemed to "move" between nodes even with
no changes in the config. While trying to fix it I added
    mtl = psm2
to /etc/openmpi/openmpi-mca-params.conf, but only installing gdb and its
dependencies apparently "worked". As I feared, though, it was just a mask,
not a solution.
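Since that mtl = psm2 line gets applied to every node, one thing I still want
to verify on bl0-03 is whether the psm2 MTL is actually available and selected
there at all. A rough sketch of what I have in mind (whether these print
anything useful depends on how the Debian Open MPI package was built, and
./mpi_hello is just a placeholder name for the little test program quoted at
the end of this mail):
-8<--
# is the psm2 MTL component built into this Open MPI at all?
ompi_info | grep -i psm2

# is there an Omni-Path HFI device on the node? (AFAIU psm2 needs one)
ls -l /dev/hfi1* 2>/dev/null

# which MTL does Open MPI actually pick at run time?
mpirun --mca mtl_base_verbose 100 -np 2 ./mpi_hello 2>&1 | grep -i mtl
-8<--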
>> The problem is with a single, specific node: str957-bl0-03. The same
>> job script works if it gets allocated to another node, even with more
>> ranks (tested up to 224/4 on mtx-* nodes).
>
> Ahhh... here's where the details help. So it appears that the problem is on a
> single node, and probably not a general configuration or system problem. I
> suggest starting with something like this to help figure out why node bl0-03
> is different:
>
> $ sudo ssh str957-bl0-02 lscpu
> $ sudo ssh str957-bl0-03 lscpu
>
> Andy
>
> -----Original Message-----
> From: Diego Zuccato [mailto:diego.zucc...@unibo.it]
> Sent: Tuesday, October 6, 2020 3:13 AM
> To: Riebs, Andy <andy.ri...@hpe.com>; Slurm User Community List
> <slurm-users@lists.schedmd.com>
> Subject: Re: [slurm-users] Segfault with 32 processes, OK with 30 ???
>
> On 05/10/20 14:18, Riebs, Andy wrote:
>
> Thanks for considering my query.
>
>> You need to provide some hints! What we know so far:
>> 1. What we see here is a backtrace from (what looks like) an Open MPI/PMI-x
>> backtrace.
> Correct.
>
>> 2. Your decision to address this to the Slurm mailing list suggests that you
>> think that Slurm might be involved.
> At least I couldn't replicate it launching manually (it always says "no
> slots available" unless I use mpirun -np 16 ...). I'm no MPI expert
> (actually less than a noob!) so I can't rule out that it's unrelated to
> Slurm. I mostly hope that on this list I can find someone with enough
> experience with both Slurm and MPI.
>
>> 3. You have something (a job? a program?) that segfaults when you go from 30
>> to 32 processes.
> Multiple programs, actually.
>
>> a. What operating system?
> Debian 10.5. The only extension is PBIS-open, to authenticate users from AD.
>
>> b. Are you seeing this while running Slurm? What version?
> 18.04, Debian packages
>
>> c. What version of Open MPI?
> openmpi-bin/stable,now 3.1.3-11 amd64
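One more check I want to do, to rule out package skew between a good node and
the bad one (assuming I can ssh to both from the login node; the grep pattern
is only a guess at the packages that matter):
-8<--
for n in str957-bl0-02 str957-bl0-03; do
    ssh "$n" "dpkg -l | grep -E 'openmpi|pmix|slurm|psm'" > /tmp/pkgs.$n
done
diff /tmp/pkgs.str957-bl0-02 /tmp/pkgs.str957-bl0-03
-8<--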
>
>> d. Are you building your own PMI-x, or are you using what's provided by Open
>> MPI and Slurm?
> Using Debian packages
>
>> e. What does your hardware configuration look like -- particularly, what cpu
>> type(s), and how many cores/node?
> The node uses dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz for a total
> of 32 threads (hyperthreading is enabled: 2 sockets, 8 cores per socket,
> 2 threads per core).
>
>> f. What does your Slurm configuration look like (assuming you're seeing this
>> with Slurm)? I suggest purging your configuration files of node names and IP
>> addresses, and including them with your query.
> Here it is:
> -8<--
> SlurmCtldHost=str957-cluster(*.*.*.*)
> AuthType=auth/munge
> CacheGroups=0
> CryptoType=crypto/munge
> #DisableRootJobs=NO
> EnforcePartLimits=YES
> JobSubmitPlugins=lua
> MpiDefault=none
> MpiParams=ports=12000-12999
> ReturnToService=2
> SlurmctldPidFile=/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/run/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/lib/slurm/slurmd
> SlurmUser=slurm
> StateSaveLocation=/var/lib/slurm/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/cgroup
> TmpFS=/mnt/local_data/
> UsePAM=1
> GetEnvTimeout=20
> InactiveLimit=0
> KillWait=120
> MinJobAge=300
> SlurmctldTimeout=20
> SlurmdTimeout=30
> FastSchedule=0
> SchedulerType=sched/backfill
> SchedulerPort=7321
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
> PriorityFlags=MAX_TRES
> PriorityType=priority/multifactor
> PreemptMode=CANCEL
> PreemptType=preempt/partition_prio
> AccountingStorageEnforce=safe,qos
> AccountingStorageHost=str957-cluster
> #AccountingStorageLoc=
> #AccountingStoragePass=
> #AccountingStoragePort=6819
> #AccountingStorageTRES=
> AccountingStorageType=accounting_storage/slurmdbd
> #AccountingStorageUser=
> AccountingStoreJobComment=YES
> AcctGatherNodeFreq=300
> ClusterName=oph
> JobCompLoc=/var/spool/slurm/jobscompleted.txt
> JobCompType=jobcomp/filetxt
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/linux
> SlurmctldDebug=3
> SlurmctldLogFile=/var/log/slurm/slurmctld.log
> SlurmdDebug=3
> SlurmdLogFile=/var/log/slurm/slurmd.log
> NodeName=DEFAULT Sockets=2 ThreadsPerCore=2 State=UNKNOWN
> NodeName=str957-bl0-0[1-2] CoresPerSocket=6 Feature=ib,blade,intel
> NodeName=str957-bl0-0[3-5] CoresPerSocket=8 Feature=ib,blade,intel
> NodeName=str957-bl0-[15-16] CoresPerSocket=4 Feature=ib,nonblade,intel
> NodeName=str957-bl0-[17-18] CoresPerSocket=6 ThreadsPerCore=1 Feature=nonblade,amd
> NodeName=str957-bl0-[19-20] Sockets=4 CoresPerSocket=8 ThreadsPerCore=1 Feature=nonblade,amd
> NodeName=str957-mtx-[00-15] CoresPerSocket=14 Feature=ib,nonblade,intel
> -8<--
>
>> g. What does your command line look like? Especially, are you trying to run
>> 32 processes on a single node? Spreading them out across 2 or more nodes?
> The problem is with a single, specific node: str957-bl0-03. The same
> job script works if it gets allocated to another node, even with more
> ranks (tested up to 224/4 on mtx-* nodes).
>
>> h. Can you reproduce the problem if you substitute `hostname` or `true` for
>> the program in the command line? What about a simple MPI-enabled "hello
>> world?"
> I'll try ASAP w/ a simple 'hostname'. But I expect it to work.
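This is what I plan to try first, pinned to the suspect node with the same
task count that fails, plus a check that what slurmd detects on bl0-03 matches
its NodeName line above (a sketch, not run yet):
-8<--
# trivial payloads, forced onto the suspect node, with the failing task count
srun -w str957-bl0-03 -n 32 hostname
srun -w str957-bl0-03 -n 32 true

# does the hardware slurmd sees match the slurm.conf entry for the node?
sudo ssh str957-bl0-03 slurmd -C
scontrol show node str957-bl0-03
-8<--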
> The original problem is with a complex program run by a user. To try to
> debug the issue I'm using what I think is the simplest MPI program possible:
> -8<--
> #include "mpi.h"
> #include <stdio.h>
> #include <stdlib.h>
> #define MASTER 0
>
> int main (int argc, char *argv[])
> {
>   int numtasks, taskid, len;
>   char hostname[MPI_MAX_PROCESSOR_NAME];
>
>   MPI_Init(&argc, &argv);
>   // int provided=0;
>   // MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>   // printf("MPI provided threads: %d\n", provided);
>   MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
>   MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
>
>   if (taskid == MASTER)
>     printf("This is an MPI parallel code for Hello World with no communication\n");
>   //MPI_Barrier(MPI_COMM_WORLD);
>
>   MPI_Get_processor_name(hostname, &len);
>
>   printf("Hello from task %d on %s!\n", taskid, hostname);
>
>   if (taskid == MASTER)
>     printf("MASTER: Number of MPI tasks is: %d\n", numtasks);
>
>   MPI_Finalize();
>
>   printf("END OF CODE from task %d\n", taskid);
>   return 0;
> }
> -8<--
> And I got failures with it, too.

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
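P.S.: in case it helps, this is a sketch of how I rebuild and resubmit that
test against the suspect node (file and binary names are placeholders, and
mpirun under sbatch is just one way to launch it; the user's real job has its
own batch script):
-8<--
mpicc mpi_hello.c -o mpi_hello

# ask Slurm for the suspect node explicitly, with the task count that fails
sbatch --nodelist=str957-bl0-03 --ntasks=32 --wrap="mpirun ./mpi_hello"
-8<--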