Pretty motivated to get this working.
Thanks for any insights.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
=0x7fff8588b2c8)
at mpihello-long.c:11
Thanks!
#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[]) {
    int node, i;
    float f;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &node);
    printf("Hello World from Node %d.\n", node);
    for (i = 0; i <= 1; i++)
        f = i*2.718281828*i + i + i*3.141592654;
    MPI_Finalize();
}
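For reference, a build and run along these lines would produce a debug binary like the one used elsewhere in the thread (the exact mpicc flags are my assumption, not quoted from the thread):

```shell
# Assumed compile/run steps; -g matches the fact that the backtraces in
# this thread show source line numbers. The binary name matches the one
# passed to mpirun later in the thread.
mpicc -g -o mpihello-long.ompi-1.4-debug mpihello-long.c
mpirun -np 4 ./mpihello-long.ompi-1.4-debug
```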
And my environment is a pretty standard CentOS-6.2 install.
MPI_Init ()
from /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0
#5 0x00400826 in main (argc=1, argv=0x7fff9fe113f8)
at mpihello-long.c:11
> Another question. How reproducible is this on your system?
In my testing today, it's been 100% reproducible.
mpirun -np $NSLOTS $HOME/mybin/mpihello-long.ompi-1.4-debug
where $NSLOTS is set by SGE based on how many slots in the PE one
requests.
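For context, a minimal SGE submit script for this kind of run might look something like the following (the script contents, PE name "orte", and slot count are my assumptions, not taken from the thread):

```shell
#!/bin/bash
# Hypothetical SGE submit script -- PE name and slot count are assumptions.
#$ -S /bin/bash
#$ -cwd
#$ -pe orte 8            # request 8 slots; SGE sets $NSLOTS to the grant
mpirun -np $NSLOTS $HOME/mybin/mpihello-long.ompi-1.4-debug
```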
> surprising.
Heh. You're telling me.
Thanks for taking an interest in this.
from /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0
#7 0x00400826 in main (argc=1, argv=0x7fff93634788)
at mpihello-long.c:11
On Tue, 13 Mar 2012 at 11:28pm, Gutierrez, Samuel K wrote:
> Can you rebuild without the "--enable-mpi-threads" option and try again?
I did and still got segfaults (although w/ slightly different backtraces).
See the response I just sent to Ralph.
when I run across multiple machines with all the threads un-niced,
but I haven't been able to reproduce that at will like I can for the other
case.
> t fail either.
> Do you face the same if you stay in one and the same queue across the
> machines?
Jobs don't crash if they either:
a) all run in the same queue, or
b) run in multiple queues all on one machine
t file and kept fully
up to date. And, yes, the application is compiled against the exact
library I'm running it with.
Thanks again to all for looking at this.
> e desired queue in `qrsh -inherit ...`, because then the $TMPDIR would be
> unique for each orted again (assuming it's using different ports for
> each).
Gotcha! I suspect getting the allocator to handle this cleanly is the
better solution, though.
If I can help (testing patches, e.g.), let me know. This has been one of the
most productive exchanges I've had on a mailing list in far too long.
The truth is our cluster is primarily used for, and thus SGE is tuned for,
large numbers of serial jobs. We do have *some* folks running parallel
code, and it *is* starting to get to the point where I need to reconfigure
things to make that part work better.
> the OP would be the smarter way IMO.
And I agree with that as well. I understand if the decision is made to
leave the parser the way it is, given that my setup is outside the norm.
On Thu, 15 Mar 2012 at 11:38am, Ralph Castain wrote:
> No, I'll fix the parser as we should be able to run anyway. Just can't
> guarantee which queue the job will end up in, but at least it -will-
> run.
Makes sense to me. Thanks!
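To illustrate what that parser has to cope with: in an SGE $PE_HOSTFILE, the same host can appear once per queue it contributes slots from, and the allocator effectively needs to total the slots per host. A rough sketch of that collapse (the column layout "host slots queue processor-range" is standard SGE; the sample data is made up):

```shell
# Made-up $PE_HOSTFILE-style data: the same host appears once per queue.
cat > pe_hostfile.sample <<'EOF'
node1 2 all.q@node1 UNDEFINED
node1 2 long.q@node1 UNDEFINED
node2 4 all.q@node2 UNDEFINED
EOF
# Collapse duplicate hosts, summing their slots -- roughly the behavior the
# fixed parser needs so that only one orted is launched per node.
awk '{slots[$1] += $2} END {for (h in slots) print h, slots[h]}' \
    pe_hostfile.sample | sort
```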
issue), but I downloaded it from
<https://svn.open-mpi.org/trac/ompi/changeset/26148> and applied that. My
test job ran just fine, and looking at the nodes verified a single orted
process per node despite SGE assigning slots in multiple queues.
In short, WORKSFORME.
Thanks!
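A per-node check along these lines could confirm the single-orted result (this is a sketch, not the command actually used in the thread; it assumes passwordless ssh to the allocated nodes, and $PE_HOSTFILE is the standard SGE variable):

```shell
# Hypothetical verification: count orted processes on each allocated node.
# After the fix, each host should report 1 even when it appears in the
# hostfile under multiple queues.
for host in $(awk '{print $1}' "$PE_HOSTFILE" | sort -u); do
    printf '%s: ' "$host"
    ssh "$host" pgrep -c orted
done
```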
ll the env variables properly set.
But I don't know what Fedora version that started with.