> are you sure, your 24 core nodes have 187 TERABYTES memory?
>
> As you yourself cited:
>
> Size of real memory on the node in megabytes
>
> The settings in your slurm.conf:
>
> NodeName=node[001-003] CoresPerSocket=12 RealMemory=196489092 Sockets=2
> Gres=gpu:1
>
> so, your machines should have 196489092 megabytes memory, that are ~191884
> gigabytes or ~187 terabytes
>
192 GB. What was also throwing me off was this error:

error: _slurm_rpc_node_registration node=node003: Invalid argument

"Invalid" in this case appears to mean "too high".

> It sees only 191840 megabytes, which is still less than the 191884. Since
> the available memory changes slightly from OS version to OS version, I
> would suggest to set RealMemory to less than 191840, e.g. 191800.
> But Brian already told you to reduce the RealMemory:
>
> I would suggest RealMemory=191879, where I suspect you have
> RealMemory=196489092
>

Thanks Marcus and Brian, that was indeed the culprit.
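
For anyone finding this thread later, the arithmetic above can be checked with a short Python sketch. It only uses the numbers quoted in this thread. The helper names are made up for illustration, the guess that 196489092 was really a kilobyte figure is an assumption (the thread never says where the number came from), and the 40 MB margin simply mirrors Marcus's 191840 -> 191800 suggestion rather than any Slurm rule.

```python
# Numbers quoted in the thread.
CONFIGURED_REAL_MEMORY = 196489092  # RealMemory in slurm.conf, read by Slurm as megabytes
DETECTED_MB = 191840                # memory the node actually reported, in megabytes


def mb_to_gb(mb):
    """Convert megabytes to gigabytes (1 GB = 1024 MB)."""
    return mb / 1024


def mb_to_tb(mb):
    """Convert megabytes to terabytes (1 TB = 1024 * 1024 MB)."""
    return mb / 1024 / 1024


def suggested_real_memory(detected_mb, margin_mb=40):
    """Hypothetical helper: stay a little below the detected value,
    since available memory can shift slightly between OS versions."""
    return detected_mb - margin_mb


# What the configured value means when read as megabytes.
print(round(mb_to_gb(CONFIGURED_REAL_MEMORY)))  # 191884 (GB)
print(round(mb_to_tb(CONFIGURED_REAL_MEMORY)))  # 187 (TB)

# Assumption: the value was a kilobyte count. Converted to megabytes it
# lands right next to the figures discussed above.
print(CONFIGURED_REAL_MEMORY // 1024)  # 191883 (MB)

# A conservative value to put in slurm.conf instead.
print(suggested_real_memory(DETECTED_MB))  # 191800
```

The point is that RealMemory has to be at or below what the node itself reports, so rounding down with a small margin is safer than copying an exact figure.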