Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-12 Thread Kirill 'kkm' Katsnelson
On Wed, Mar 11, 2020 at 9:57 PM Chris Samuel wrote:
> If so move it out of the way somewhere safe (just in case) and try again.

Aaah, that's a cool find! I never really looked inside my nodes for more than a year since I debugged all my stuff so it "just works". They are conjured out of nothing …
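
A quick way to run the same sanity check (the node name and counts below are illustrative, not taken from the thread) is to compare what slurmd detects with what the kernel exposes:

  # What slurmd would register for this node
  $ slurmd -C
  NodeName=node01 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=64264

  # Logical CPUs the kernel sees
  $ nproc
  16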

Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-11 Thread Kirill 'kkm' Katsnelson
> the problem.
>
> -mike
>
> *Michael Tie*
> Technical Director
> Mathematics, Statistics, and Computer Science
> One North College Street   phn: 507-222-4067
> Northfield, MN 55057       cel: 952-212-8933
> m...@carleton.edu

Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-10 Thread Kirill 'kkm' Katsnelson
On Tue, Mar 10, 2020 at 1:41 PM mike tie wrote:
> Here is the output of lstopo
>
> $ lstopo -p
> Machine (63GB)
>   Package P#0 + L3 (16MB)
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#0
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#1
>     L2 (4096KB …
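
The same topology can be cross-checked with lscpu; a minimal sketch, with values that will of course differ per machine:

  $ lscpu | egrep '^(CPU\(s\)|Thread|Core|Socket)'
  CPU(s):                8
  Thread(s) per core:    1
  Core(s) per socket:    8
  Socket(s):             1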

Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-10 Thread Kirill 'kkm' Katsnelson
Yes, it's odd. -kkm

On Mon, Mar 9, 2020 at 7:44 AM mike tie wrote:
> Interesting. I'm still confused by where slurmd -C is getting the data. When I think of where the kernel stores info about the processor, I normally think of /proc/cpuinfo. (By the way, I am running CentOS 7 in …
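
slurmd can be built against hwloc, in which case its topology view is hwloc's rather than a raw read of /proc/cpuinfo; a minimal comparison (commands only, the counts will vary) is:

  # Logical CPUs per the kernel
  $ grep -c '^processor' /proc/cpuinfo

  # Physical cores per socket, as the kernel reports them
  $ grep '^cpu cores' /proc/cpuinfo | sort -u

  # What hwloc sees (and what lstopo renders)
  $ lstopo -p --of console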

Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-08 Thread Kirill 'kkm' Katsnelson
To answer your direct question: the ground truth for 'slurmd -C' is what the kernel thinks the hardware is (what you see in lscpu, except it probably employs some tricks for VMs with an odd topology). And it got severely confused by what the kernel reported to it. I know from experience that cert …
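
When a VM presents an odd topology, one common workaround (a sketch only, with made-up names and counts, not something from this thread) is to pin the node definition in slurm.conf instead of trusting autodetection:

  # slurm.conf fragment (illustrative values)
  NodeName=vnode[01-04] Sockets=1 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=63000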

Re: [slurm-users] How to show state of CLOUD nodes

2020-02-28 Thread Kirill 'kkm' Katsnelson
I'm running clusters entirely in Google Cloud. I'm not sure I understand the issue: do the nodes disappear from view entirely only when they fail to power up within ResumeTimeout? Failures of this kind happen in GCE when resources are momentarily unavailable, but the nodes are still there, …
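
On the visibility point, two knobs worth knowing (check the slurm.conf and sinfo man pages for your version, as this is from memory):

  # slurm.conf: keep powered-down CLOUD nodes visible in sinfo/scontrol
  PrivateData=cloud

  # The power state then shows as a suffix on the node state,
  # e.g. idle~ (powered down) or idle# (powering up)
  $ sinfo -N -o '%N %t'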

Re: [slurm-users] Is this a bug in slurm array completion logic or expected behaviour

2020-02-01 Thread Kirill 'kkm' Katsnelson
On Thu, Jan 30, 2020 at 7:54 AM Antony Cleave wrote:
> epilog jobid=513,arraytaskid=4,SLURM_ARRAY_JOB_ID=509,JobState(509)=RUNNING
> epilog jobid=514,arraytaskid=5,SLURM_ARRAY_JOB_ID=509,JobState(509)=RUNNING
> epilog jobid=515,arraytaskid=6,SLURM_ARRAY_JOB_ID=509,JobState(509)=RUNNING
> epilog j …
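
For context, a rough sketch of an epilog that would emit log lines like the above (the log path and the squeue call are my assumptions, not Antony's actual script):

  #!/bin/bash
  # Epilog sketch: for array tasks, log the parent array job's state
  # as each task finishes. SLURM_ARRAY_JOB_ID is set only for array jobs.
  if [ -n "$SLURM_ARRAY_JOB_ID" ]; then
      state=$(squeue -h -j "$SLURM_ARRAY_JOB_ID" -o '%T' | head -n1)
      echo "epilog jobid=$SLURM_JOB_ID,arraytaskid=$SLURM_ARRAY_TASK_ID,SLURM_ARRAY_JOB_ID=$SLURM_ARRAY_JOB_ID,JobState($SLURM_ARRAY_JOB_ID)=$state" \
        >> /var/log/slurm/epilog-array.log
  fi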