Hi, I have a question about --bynode and --byslot that I would like to
clarify.
Say, for example, I have a hostfile:
#Hostfile
__
node0
node1 slots=2 max_slots=2
node2 slots=2 max_slots=2
node3 slots=4 max_slots=4
___
There are 4 nodes and 9 slots in total.
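For concreteness, this is how I would expect the two mappings to place 9
ranks with that hostfile (a sketch based on my reading of the 1.2-series
mpirun behaviour; the hostfile name and ./a.out are just placeholders):

  # node0 has no slots= entry, so it defaults to 1 slot
  mpirun --hostfile myhostfile --byslot -np 9 ./a.out
  #   fills each node's slots before moving to the next node:
  #   node0: rank 0      node1: ranks 1,2
  #   node2: ranks 3,4   node3: ranks 5,6,7,8

  mpirun --hostfile myhostfile --bynode -np 9 ./a.out
  #   places one rank per node round-robin, skipping full nodes:
  #   node0: rank 0      node1: ranks 1,4
  #   node2: ranks 2,5   node3: ranks 3,6,7,8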
I've asked for verification, but I recall the original verbal
complaint claiming the wall time was random and sometimes as short as
2 minutes into a job.
They have said they've run more tests with more instrumentation in
their code, and it always fails in a random place. Same job,
different results.
Well, it turns out that the path where Open MPI looks for things seems at
least partially hard-coded. I've got some "weird pathing" here on my
Rocks cluster:
/opt is local;
/share/apps is exported from the head node and available on all nodes.
On the head node, /opt is symlinked to /share/apps.
I set my
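One thing that has worked for me in similar setups (a sketch only;
/share/apps/openmpi is a guess at where the installation actually lives)
is to point mpirun at the prefix explicitly, or to set the paths on
every node:

  # hypothetical install location; adjust to the real one
  OMPI=/share/apps/openmpi
  # tell mpirun where the installation lives on the remote nodes
  mpirun --prefix $OMPI -np 4 ./a.out
  # ...or make sure both paths are set in the shell startup on all nodes
  export PATH=$OMPI/bin:$PATH
  export LD_LIBRARY_PATH=$OMPI/lib:$LD_LIBRARY_PATH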
On May 22, 2008, at 12:52 PM, Jim Kusznir wrote:
I installed openmpi 1.2.6 on my system, but now my users are
complaining about even more errors. I'm getting this:
[compute-0-23.local:26164] [NO-NAME] ORTE_ERROR_LOG: Not found in file
runtime/orte_init_stage1.c at line 182
This may be a dumb question, but is there a chance that his job is
running beyond 30 minutes, and PBS/Torque/whatever is killing it?
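If it helps to rule that out, a quick check (assuming Torque; <jobid>
and <queuename> are placeholders) is to compare the job's walltime
limit against the point where it died:

  # limits and usage recorded for the job (Torque/PBS)
  qstat -f <jobid> | grep -i walltime
  # default/maximum walltime configured on the queue
  qmgr -c "print queue <queuename>" | grep -i walltime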
On May 20, 2008, at 4:23 PM, Jim Kusznir wrote:
Hello all:
I've got a user on our ROCKS 4.3 cluster who's having some strange
errors. I have other users using
I build on Debian 4.0 and run on Suse 10 and Fedora Core 6. The only
thing I had to ensure is that the corresponding libc (the one I built
against) is available on the target OS. Moreover, as my nodes have
different processors, I had to enforce strict x86 code.
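For illustration only (these flags are my guess at one way to do it
with GCC, not necessarily what was actually used), a
lowest-common-denominator x86 build could be configured like this:

  # target plain i686 so the same binary runs on every node's CPU
  ./configure --prefix=/opt/openmpi \
      CFLAGS="-O2 -march=i686" CXXFLAGS="-O2 -march=i686"
  make all install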
george.
On May 22