Hi Todd,
I don't know the answer myself, but I see that Andreas is already
addressing your issue on the open source Grid Engine users alias
(u...@gridengine.sunsource.net). He should be able to help, since the
problem is more closely related to the internals of qmaster.
http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=18773
So if anyone else wants to know more about what seems to be a file
descriptor limit issue in the internals of SGE/N1GE, feel free to
follow the comments over there.
Heywood, Todd wrote:
I have sent the following experiences to the SGE mailing list, but I
thought I would also try here…
I have been trying out version 1.2b2 for its integration with SGE. The
simple “hello world” test program works fine by itself, but there are
issues when submitting it to SGE.
For small numbers of tasks, for SOME runs, I get errors for each of the
non-master tasks, and they are all one of the following:
error: commlib error: got read error (closing "blade27.bluehelix.cshl.edu/execd/1")
error: commlib error: can't read general message size header (GMSH) (closing "blade221.bluehelix.cshl.edu/execd/1")
When I repeat runs, these errors tend to go away, like the first time a
node runs a job it coughs on it, but then it is OK for subsequent jobs.
I do get the correct output.
Things change when I try a large job, say 400 tasks. I get loads of GMSH
errors, but NO output, and SGE’s qstat command aborts:
[heywood@blade1 ompi]$ qsub -pe mpi 400 hello.sh
Your job 8239 ("hello.sh") has been submitted
[heywood@blade1 ompi]$ qstat -t
critical error: unrecoverable error - contact systems manager
Aborted
[heywood@blade1 ompi]$
I then have to qdel the job from another window.
If anyone has seen anything like this, I’d be interested in hearing.
Since the errors are coming from SGE’s communication library, I did
increase the file descriptor limit (ulimit -n 65536), but it made no
difference.
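One thing that may be worth checking here, offered only as a sketch: a
ulimit -n setting applies to the shell where it is run and to processes
started from that shell afterwards, not to daemons that were already
running. The snippet below prints the open-file limits of the current
shell and of a given process by reading /proc/<pid>/limits. It assumes a
Linux system; the helper name show_nofile_limit is made up for this
example, and the daemon name sge_qmaster in the commented-out line is an
assumption about the local installation.

```shell
#!/bin/sh
# Print the "Max open files" limits that apply to a process (Linux only).
# With no argument, inspects the current shell.
# show_nofile_limit is a hypothetical helper name used for illustration.
show_nofile_limit() {
    pid="${1:-$$}"
    grep 'Max open files' "/proc/${pid}/limits"
}

# Soft and hard limits of this shell, as seen by the ulimit builtin.
echo "soft: $(ulimit -Sn)  hard: $(ulimit -Hn)"

# The same information as recorded by the kernel for this shell.
show_nofile_limit

# For an already-running daemon, pass its PID instead, for example
# (assumption: the daemon is named sge_qmaster and pgrep is available):
# show_nofile_limit "$(pgrep -o sge_qmaster)"
```

If the daemon still reports the old limit, the new value would only take
effect once the daemon is restarted from an environment where the higher
limit is already set.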
Thanks,
Todd Heywood
--
Thanks,
- Pak Lui
pak....@sun.com