Hi Todd,

I don't know the answer myself, but I see that Andreas from the open source Grid Engine users list (u...@gridengine.sunsource.net) is already looking into your problem. He should be able to help, since it appears to be related to the internals of qmaster.

http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=18773

If anyone else wants to follow this issue, which seems to be related to a file descriptor limit in the internals of SGE/N1GE, feel free to follow the discussion over there...

Heywood, Todd wrote:
I have sent the following experiences to the SGE mailing list, but I thought I would also try here…

I have been trying out version 1.2b2 for its integration with SGE. The simple “hello world” test program works fine by itself, but there are issues when submitting it to SGE.

For small numbers of tasks, for SOME runs, I get errors for each of the non-master tasks, and they are all one of the following:

error: commlib error: got read error (closing "blade27.bluehelix.cshl.edu/execd/1")

error: commlib error: can't read general message size header (GMSH) (closing "blade221.bluehelix.cshl.edu/execd/1")

When I repeat runs, these errors tend to go away, like the first time a node runs a job it coughs on it, but then it is OK for subsequent jobs. I do get the correct output.

Things change when I try a large job, say 400 tasks. I get loads of GMSH errors, but NO output, and SGE’s qstat command aborts:

[heywood@blade1 ompi]$ qsub -pe mpi 400 hello.sh

Your job 8239 ("hello.sh") has been submitted

[heywood@blade1 ompi]$ qstat -t

critical error: unrecoverable error - contact systems manager

Aborted

[heywood@blade1 ompi]$

I then have to qdel the job from another window.

If anyone has seen anything like this, I’d be interested in hearing. Since the errors are coming from SGE’s communication library, I did increase the file descriptor limit (ulimit -n 65536), but it made no difference.
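A limit set with ulimit -n applies to the shell it is run in and to processes started from that shell; daemons that were already running keep the limits they started with. Whether that is relevant here is not established in this thread. As a rough sketch for checking what a process actually sees, the Python below prints its own descriptor limits and parses the "Max open files" line of a Linux /proc/<pid>/limits file. The helper names are made up for illustration, and the /proc layout is a Linux assumption.

```python
import resource


def current_nofile_limits():
    """Return the (soft, hard) open-file-descriptor limits of this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def parse_max_open_files(limits_text):
    """Extract the (soft, hard) 'Max open files' values, as strings, from the
    text of a Linux /proc/<pid>/limits file. Returns None if the line is absent.
    Values are left as strings because they may read 'unlimited'."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            # fields: ['Max', 'open', 'files', <soft>, <hard>, 'files']
            return fields[3], fields[4]
    return None


# Limits as seen by this Python process (inherited from the launching shell).
soft, hard = current_nofile_limits()
print(f"this process: soft={soft} hard={hard}")
```

To inspect a running daemon on Linux, one could read /proc/<pid>/limits for its process ID (with sufficient permissions) and pass the text to parse_max_open_files.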

Thanks,

Todd Heywood


------------------------------------------------------------------------

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--

Thanks,

- Pak Lui
pak....@sun.com
