Just to close the loop on this: it was not a Slurm issue; it was an AD configuration issue.
AD authentication needs to be set up on all nodes of the cluster so that Slurm can resolve the user ID. I had trouble with the sssd DB folders missing and the sssd.conf file not having the appropriate permissions, so look out for those. You can copy the appropriate permissions and files from other working nodes, and it should work fine.

For anyone curious, here is the link to the Bright Computing KB article that helped us with the configuration:

https://kb.brightcomputing.com/faq/index.php?action=artikel&cat=13&id=224&artlang=en

Thanks,
Yugi

> On Feb 13, 2019, at 3:07 PM, John Hearns <hear...@googlemail.com> wrote:
>
> Matthew, that deserves an explanation. Bright Computing Proof of Concept causes nightmares?
> That is a pretty strong assertion. Please give more details.
>
> On Wed, 13 Feb 2019 at 16:01, Matthew BETTINGER <matthew.bettin...@external.total.com> wrote:
>
> One of the main guys, Panos, left Bright, so no answer to your specific question, but I hope you can get some support with it. We dumped our BC PoC; the sysadmin working on the PoC still has nightmares.
>
> On 2/13/19, 6:54 AM, "slurm-users on behalf of John Hearns" <slurm-users-boun...@lists.schedmd.com on behalf of hear...@googlemail.com> wrote:
>
> Yugendra, the Bright support guys are excellent.
> Slurm is their default choice. I would ask again. Yes, Slurm is technically out of scope for them, but they should help a bit.
>
> By the way, I think your problem is that you have configured authentication using AD on your head node,
> BUT you have not configured it on the compute node images. You probably have to prepare a new compute node image, then push that out to the compute nodes.
>
> On Wed, 13 Feb 2019 at 12:35, Yugendra Guvvala <yguvv...@cambridgecomputer.com> wrote:
>
> Also reached out to Bright Computing support, and they say Slurm is out of scope for them.
>
> Thanks,
> Yugi
>
> On Feb 13, 2019, at 7:27 AM, Antony Cleave <antony.cle...@gmail.com> wrote:
>
> Can you ssh to the compute node that the job was trying to run on as the AD user in question?
>
> I've seen similar issues on AD-integrated systems where some nodes boot from a different image that has not yet been joined to the domain.
>
> Antony
>
> On Wed, 13 Feb 2019 at 04:58, Yugendra Guvvala <yguvv...@cambridgecomputer.com> wrote:
>
> Hi,
>
> We are bringing a new cluster online. We installed SLURM through Bright Cluster Manager; however, we are running into an issue.
>
> We are able to run jobs as the root user and as users created with Bright Cluster Manager (cmsh commands). However, we use AD authentication for all our users, and when we try to submit jobs to Slurm as AD users we get the following error message:
>
> srun: fatal: Invalid user id: 10952
> srun: fatal: Invalid user id: 10952
> srun: error: cnode001: task 0: Exited with exit code 1
>
> Attached is the slurm.conf file for reference. Please let us know if you have any insight into this.
>
> Thanks,
> Yugi
>
> Yugendra Guvvala | HPC Technologist | Cambridge Computer | "Artists in Data Storage"
> Direct: 781-250-3273 | Cell: 806-773-4464 | yguvv...@cambridgecomputer.com | www.cambridgecomputer.com
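
For anyone hitting the same "Invalid user id" error, here is a minimal sketch of a per-node check for the two things discussed in this thread: whether the node can resolve the AD user's UID, and whether sssd.conf has restrictive permissions. It is not from the Bright KB article. The helper names `uid_resolves` and `conf_is_private` are made up for illustration, and the path `/etc/sssd/sssd.conf` is an assumed default location that may differ on your image. It would probably need to be run as root on each compute node.

```python
import os
import pwd
import stat


def uid_resolves(uid: int) -> bool:
    """Return True if this node can map the numeric UID to a passwd entry.

    pwd.getpwuid goes through the system's name service lookup, so on a
    node using sssd for AD users this should exercise that path too.
    """
    try:
        pwd.getpwuid(uid)
        return True
    except KeyError:
        return False


def conf_is_private(path: str) -> bool:
    """Return True if the file grants no permissions to group or others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


if __name__ == "__main__":
    uid = 10952  # the UID from the srun error quoted in this thread
    conf = "/etc/sssd/sssd.conf"  # assumption: default location, adjust as needed

    print(f"UID {uid} resolves on this node: {uid_resolves(uid)}")
    if os.path.exists(conf):
        print(f"{conf} has no group/other permissions: {conf_is_private(conf)}")
    else:
        print(f"{conf} is missing (or not visible to this user)")
```

If the UID resolves on the head node but not on a compute node, that would point to the compute node image not being set up for AD, as suggested earlier in the thread.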