Here is my situation:

2 Dell R900s with 16 CPUs each and 64 GB RAM
OS: SuSE SLES 10 SP2 patched up to date
R version 2.9.1
Rmpi version 0.5-7
snow version 0.3-3
maanova library version 1.14.0
openmpi version 1.3.3
slurm version 2.0.3

With a given set of R code, we get abnormal exits when using 14 or fewer 
CPUs. When using 15 or more, the job completes normally. 
The error is a variation on: 

[pdp-dev-r01:22618] [[15549,1],0] routed:binomial: Connection to lifeline [[15549,0],0] lost

during the array permutations.

Increasing the number of permutations above 200 also produces similar 
results. 

The R code is executed with a typical command line for 14 cpus being:

sbatch -n 14 -i ./Rtest.txt --mail-type=ALL --mail-user=steven_d...@hc-sc.gc.ca /usr/local/bin/R --no-save
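
To narrow down where the threshold sits, the same submission could be swept over a range of CPU counts. The following is only a rough sketch: DRY_RUN, MAIL_USER, the placeholder address and the 12 to 16 range are assumptions made for illustration and are not part of the attached files. By default it only prints the commands rather than submitting them.

```shell
#!/bin/sh
# Sketch: print (or submit) the same job for several CPU counts.
# DRY_RUN, MAIL_USER and the 12..16 range are illustrative assumptions.

DRY_RUN=${DRY_RUN:-1}                      # 1 = only print the commands
MAIL_USER=${MAIL_USER:-user@example.org}   # placeholder address

build_cmd() {
    # $1 = number of CPUs to request with sbatch -n
    echo "sbatch -n $1 -i ./Rtest.txt --mail-type=ALL --mail-user=$MAIL_USER /usr/local/bin/R --no-save"
}

sweep() {
    for n in 12 13 14 15 16; do
        cmd=$(build_cmd "$n")
        if [ "$DRY_RUN" = "1" ]; then
            echo "$cmd"      # dry run: show what would be submitted
        else
            $cmd             # submit the job through slurm
        fi
    done
}

sweep
```

Running it with DRY_RUN=0 would submit the five jobs; comparing their slurm outputs should show whether the failure really flips between 14 and 15 CPUs.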


The config.log, ompi_info, Rscript.txt and slurm outputs are attached. The 
network is Gigabit Ethernet over copper, using TCP/IP.


I believe this is an Open MPI bug because of the routed:binomial 
message. The same failures also occurred with openmpi-1.3.2, R 2.9.0, 
maanova 1.12 and slurm 2.0.1.


No non-default MCA parameters are set.

LD_LIBRARY_PATH=/usr/local/lib.

Configuration done with defaults.

Any ideas are welcome.




____________________
Steve Dale

Attachment: bugrep.tar.bz2
Description: Binary data
