A user submitted an array job with a large number of tasks. For the past 
couple of days, I've been getting e-mails like the one below from SGE alerting 
me that the job failed. Have any of you seen an error like this before?

Job 1252898 caused action: Job-array task 1252898.1 set to ERROR
 User        = user1
 Queue       = [email protected]
 Start Time  = <unknown>
 End Time    = <unknown>
failed assumedly before job:can't get password entry for user "user1". Either 
the user does not exist or NIS error!

Looking in /var/log/messages and /var/log/secure, I see no errors. I've been 
able to 'su - user1' on the nodes where the error occurred, and I can do 'getent 
passwd user1' and get the correct answer on every cluster node. Other jobs for 
the same user are running on the cluster without problems. 
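
Since a single 'getent passwd user1' only checks one moment in time, one way 
to look for an intermittent lookup failure is to repeat the lookup many times 
on an affected node. The following is only a minimal sketch: the function name 
probe_passwd and its parameters are made up for illustration, and it assumes 
Python is available on the node. It goes through the same NSS lookup path as 
getent (via the standard pwd module), but local caching (nscd, sssd) may hide 
a problem that the SGE execution daemon would still hit.

```python
import pwd
import time


def probe_passwd(username, attempts=100, delay=0.0, lookup=pwd.getpwnam):
    """Look up `username` repeatedly; return the attempt numbers that failed.

    `lookup` defaults to pwd.getpwnam, which raises KeyError when no
    password entry is found. It is a parameter only so the loop can be
    exercised with a stand-in function.
    """
    failures = []
    for attempt in range(1, attempts + 1):
        try:
            lookup(username)
        except KeyError:
            # Same condition SGE reports as "can't get password entry".
            failures.append(attempt)
        if delay:
            time.sleep(delay)
    return failures
```

For example, probe_passwd("user1", attempts=1000, delay=0.1) run on node02 
would return an empty list if every lookup succeeded, or the numbers of the 
attempts that failed otherwise.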

The only place I can find errors is in my SGE logs, which show the same error 
as above (no surprise there), but no additional clues as to what may have 
caused this:

 main|node02|E|can't start job "1252898": can't get password entry for user 
"user1". Either the user does not exist or NIS error!

We use LDAP for account information, and there have been no outages that I know 
of, and I'd know since I'm the LDAP admin, too! 

Any ideas? I'm not even sure if I can reproduce this error. 

-- 
Prentice Bisbal
Linux Software Support Specialist/System Administrator
School of Natural Sciences
Institute for Advanced Study
Princeton, NJ

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users