So only some jobs/tasks fail for "user1", and it only happens randomly?
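
If it really is intermittent, a single 'getent passwd user1' will probably succeed every time you try it by hand. One way to catch a transient failure might be to repeat the lookup many times on an affected node (node02, for example) and count the misses. Below is a minimal sketch in Python, assuming that pwd.getpwnam goes through the same NSS/LDAP path that the execution daemon uses; the function name probe_lookups and its parameters are made up for illustration and are not part of SGE.

```python
import pwd
import time


def probe_lookups(username, attempts=1000, delay=0.0, lookup=pwd.getpwnam):
    """Repeat a passwd lookup and return the attempt numbers that failed.

    `lookup` defaults to pwd.getpwnam, which resolves the name through
    the system's name service configuration (files, LDAP, NIS, ...).
    It is a parameter only so the loop can be exercised without a real
    account; that is an illustrative choice, not anything SGE does.
    """
    failures = []
    for attempt in range(1, attempts + 1):
        try:
            lookup(username)
        except KeyError:
            # pwd.getpwnam raises KeyError when no entry is returned,
            # which is the condition the SGE error message describes.
            failures.append(attempt)
        if delay:
            time.sleep(delay)
    return failures
```

Calling probe_lookups("user1", attempts=1000, delay=0.01) on the node and getting back a non-empty list would suggest the directory lookup itself is flaky. Keep in mind that a caching daemon such as nscd, if one is running, may hide failures from this kind of test.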
Rayson

On Tue, Feb 7, 2012 at 4:31 PM, Prentice Bisbal <[email protected]> wrote:
> A user submitted an array job with a large number of tasks. For the past
> couple of days, I've been getting e-mails like the one below from SGE
> alerting me that the job failed. Have any of you seen an error like this before?
>
> Job 1252898 caused action: Job-array task 1252898.1 set to ERROR
> User = user1
> Queue = [email protected]
> Start Time = <unknown>
> End Time = <unknown>
> failed assumedly before job:can't get password entry for user "user1". Either
> the user does not exist or NIS error!
>
> Looking in /var/log/messages and /var/log/secure, I see no errors. I've been
> able to 'su - user1' on the nodes where the error occurred, and I can do
> 'getent passwd user1' and get the correct answer on every cluster node. There
> are jobs running on the cluster for the same user.
>
> The only place I can find errors is in my SGE logs, which show the same
> error as above (no surprise there), but no additional clues as to what may
> have caused this:
>
> main|node02|E|can't start job "1252898": can't get password entry for user
> "user1". Either the user does not exist or NIS error!
>
> We use LDAP for account information, and there have been no outages that I
> know of, and I'd know since I'm the LDAP admin, too!
>
> Any ideas? I'm not even sure if I can reproduce this error.
>
> --
> Prentice Bisbal
> Linux Software Support Specialist/System Administrator
> School of Natural Sciences
> Institute for Advanced Study
> Princeton, NJ
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
