Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-12 Thread Ryan Novosielski
From what I know of how this works, no, it’s not getting it from a local file or the master node. I don’t believe it even makes a network connection, nor requires a slurm.conf in order to run. If you can run it fresh on a node with no config and that’s what it comes up with, it’s probably gettin

Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-12 Thread Christopher Samuel
On 3/12/20 9:37 PM, Kirill 'kkm' Katsnelson wrote:
> Aaah, that's a cool find! I never really looked inside my nodes for more than a year since I debugged all my stuff so it "just works". They are conjured out of nothing and dissolve back into nothing after 10 minutes of inactivity. But good to

Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-12 Thread Kirill 'kkm' Katsnelson
On Wed, Mar 11, 2020 at 9:57 PM Chris Samuel wrote:
> If so move it out of the way somewhere safe (just in case) and try again.

Aaah, that's a cool find! I never really looked inside my nodes for more than a year since I debugged all my stuff so it "just works". They are conjured out of nothin

Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-11 Thread Chris Samuel
On 10/3/20 1:40 pm, mike tie wrote:
> Here is the output of lstopo

Hmm, well I believe Slurm should be using hwloc (which provides lstopo) to get its information (at least it calls the xcpuinfo_hwloc_topo_get() function for that), so if lstopo works then slurmd should too. Ah, looking a bit
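Since the suggestion above is to compare what lstopo reports with what slurmd reports, here is a minimal Python sketch that counts the Package, Core and PU objects in lstopo's text output. The function name and the shortened sample are illustrative only (they are not part of hwloc or Slurm), and the parsing assumes the "Core P#n" / "PU P#n" layout shown in this thread.

```python
import re

def count_lstopo_objects(lstopo_text):
    """Count Package, Core and PU objects in lstopo text output.

    Hypothetical helper: it only pattern-matches tokens such as
    "Core P#3" (physical index, `lstopo -p`) or "Core L#3" (logical index).
    """
    counts = {}
    for kind in ("Package", "Core", "PU"):
        counts[kind] = len(re.findall(r"\b%s [PL]#\d+" % kind, lstopo_text))
    return counts

# Shortened sample in the same layout as the output quoted in this thread.
sample = """\
Machine (63GB)
  Package P#0 + L3 (16MB)
    L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#0
    L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#1
    L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#2
"""

print(count_lstopo_objects(sample))  # {'Package': 1, 'Core': 3, 'PU': 3}
```

On a real node you would feed it the captured output of `lstopo -p` and compare the Core count with the core count that slurmd -C prints.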

Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-11 Thread Kirill 'kkm' Katsnelson
Yup, I think if you get stuck so badly, the first thing is to make sure the node does not get the number 10 from the controller, and the second is to just reimage the VM fresh. It may not be the quickest way, but it is at least predictable in terms of time spent. Good luck! -kkm On Wed, Mar 11, 2020 at

Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-11 Thread mike tie
Yep, slurmd -C is obviously getting the data from somewhere, either a local file or the master node. Hence my email to the group; I was hoping that someone would just say: "yeah, modify file ". But oh well. I'll start playing with strace and gdb later this week; looking through the so

Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-10 Thread Kirill 'kkm' Katsnelson
On Tue, Mar 10, 2020 at 1:41 PM mike tie wrote:
> Here is the output of lstopo
>
> $ lstopo -p
> Machine (63GB)
>   Package P#0 + L3 (16MB)
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#0
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#1
>     L2 (4096KB

Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-10 Thread mike tie
Here is the output of lstopo

$ lstopo -p
Machine (63GB)
  Package P#0 + L3 (16MB)
    L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#0
    L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#1
    L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#2
    L2 (4096KB) + L1d

Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-10 Thread Kirill 'kkm' Katsnelson
Yes, it's odd. -kkm

On Mon, Mar 9, 2020 at 7:44 AM mike tie wrote:
> Interesting. I'm still confused about where slurmd -C is getting the data. When I think of where the kernel stores info about the processor, I normally think of /proc/cpuinfo. (By the way, I am running CentOS 7 in

Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-09 Thread Chris Samuel
On 9/3/20 7:44 am, mike tie wrote:
> Specifically, how is slurmd -C getting that info? Maybe this is a kernel issue, but other than lscpu and /proc/cpuinfo, I don't know where to look. Maybe I should be looking at the slurmd source?

It would be worth looking at what something like "lstopo" fr

Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-09 Thread mike tie
Interesting. I'm still confused about where slurmd -C is getting the data. When I think of where the kernel stores info about the processor, I normally think of /proc/cpuinfo. (By the way, I am running CentOS 7 in the VM. The VM hypervisor is VMware.) /proc/cpuinfo does show 16 cores. I unde
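For anyone wanting to double-check the /proc/cpuinfo side of this, a rough Python sketch follows. It counts logical CPUs ("processor" entries) and distinct (physical id, core id) pairs. The function name is made up for illustration, the sample text is a trimmed stand-in rather than output from the VM discussed here, and the field names are the ones x86 Linux kernels normally emit.

```python
def summarize_cpuinfo(cpuinfo_text):
    """Return (logical_cpus, physical_cores) from /proc/cpuinfo-style text.

    Hypothetical helper: assumes x86-style fields "processor",
    "physical id" and "core id", one blank-line-separated block per CPU.
    """
    logical_cpus = 0
    physical_cores = set()
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" in fields:
            logical_cpus += 1
            if "physical id" in fields and "core id" in fields:
                # A core is identified by its socket plus its core id.
                physical_cores.add((fields["physical id"], fields["core id"]))
    return logical_cpus, len(physical_cores)

# Trimmed, made-up sample with two logical CPUs on one socket.
sample = """\
processor   : 0
physical id : 0
core id     : 0

processor   : 1
physical id : 0
core id     : 1
"""

print(summarize_cpuinfo(sample))  # (2, 2)
```

On a node, `summarize_cpuinfo(open("/proc/cpuinfo").read())` gives the kernel's view to set against the numbers slurmd -C prints.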

Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-08 Thread Kirill 'kkm' Katsnelson
To answer your direct question, the ground truth of 'slurmd -C' is what the kernel thinks the hardware is (what you see in lscpu, except it probably employs some tricks for VMs with an odd topology). And it got severely confused by what the kernel reported to it. I know from experience that cert
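One quick sanity check on the slurmd -C side is to see whether the CPUs value agrees with the product of the topology fields on the same line. The Python sketch below assumes the usual key=value layout of that line (NodeName, CPUs, Boards, SocketsPerBoard, CoresPerSocket, ThreadsPerCore, RealMemory); the helper names and the sample values are invented for illustration and are not taken from the node in this thread.

```python
def parse_slurmd_c(line):
    """Split a `slurmd -C` style line into a dict of key=value strings."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields

def topology_product(fields):
    """Multiply the topology fields; missing fields count as 1."""
    product = 1
    for key in ("Boards", "SocketsPerBoard", "CoresPerSocket", "ThreadsPerCore"):
        product *= int(fields.get(key, 1))
    return product

# Hypothetical line with a deliberately inconsistent core count.
sample = ("NodeName=node01 CPUs=16 Boards=1 SocketsPerBoard=1 "
          "CoresPerSocket=10 ThreadsPerCore=1 RealMemory=64000")

fields = parse_slurmd_c(sample)
cpus = int(fields["CPUs"])
product = topology_product(fields)
print(cpus, product, "consistent" if cpus == product else "mismatch")
```

A mismatch is only a hint that the topology was detected oddly; it does not say where the wrong number came from.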