A user submitted an array job with a large number of tasks. For the past couple of days, I've been getting e-mails like the one below from SGE alerting me that the job failed. Have any of you seen an error like this before?

    Job 1252898 caused action: Job-array task 1252898.1 set to ERROR
     User       = user1
     Queue      = [email protected]
     Start Time = <unknown>
     End Time   = <unknown>
    failed assumedly before job:can't get password entry for user "user1". Either the user does not exist or NIS error!

Looking in /var/log/messages and /var/log/secure, I see no errors. I've been able to 'su - user1' on the nodes where the error occurred, and 'getent passwd user1' returns the correct answer on every cluster node. Other jobs belonging to the same user are currently running on the cluster.

The only place I can find errors is in my SGE logs, which show the same error as above (no surprise there), but no additional clues as to what may have caused it:

    main|node02|E|can't start job "1252898": can't get password entry for user "user1". Either the user does not exist or NIS error!

We use LDAP for account information, and there have been no outages that I know of, and I'd know since I'm the LDAP admin, too!

Any ideas? I'm not even sure if I can reproduce this error.

-- 
Prentice Bisbal
Linux Software Support Specialist/System Administrator
School of Natural Sciences
Institute for Advanced Study
Princeton, NJ
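
Since a one-off 'getent passwd user1' succeeds, one way to try to catch an intermittent lookup failure is to repeat the lookup many times on an affected node. The following is only a minimal sketch, assuming Python is available on the compute nodes and that the failing lookup goes through the same standard passwd/NSS path that Python's pwd module uses (I have not confirmed that for SGE). The function name probe_user and its parameters are hypothetical, made up for illustration.

```python
import pwd
import time


def probe_user(username, attempts=100, delay=0.0, lookup=pwd.getpwnam):
    """Look up `username` repeatedly and collect the attempts that fail.

    Returns a list of (attempt_index, error_text) tuples; an empty list
    means every lookup succeeded. `lookup` defaults to pwd.getpwnam, which
    raises KeyError when no passwd entry is found for the name.
    """
    failures = []
    for i in range(attempts):
        try:
            lookup(username)
        except KeyError as exc:
            # Record which attempt failed so failures can be matched
            # against timestamps in other logs if needed.
            failures.append((i, str(exc)))
        if delay:
            time.sleep(delay)
    return failures
```

For example, running probe_user("user1", attempts=1000, delay=0.1) on node02 would show whether any of the lookups fail over a period of a couple of minutes. An empty result does not rule out the problem; it only means no failure happened while the probe was running.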
