Here is my situation:

Hardware: 2 Dell R900s with 16 CPUs each and 64 GB RAM
OS: SuSE SLES 10 SP2, patched up to date
R version 2.9.1
Rmpi version 0.5-7
snow version 0.3-3
maanova library version 1.14.0
openmpi version 1.3.3
slurm version 2.0.3

With a given set of R code, we get abnormal exits when using 14 or fewer CPUs. When using 15 or more, the job completes normally. The error occurs during the array permutations and is a variation on:

    [pdp-dev-r01:22618] [[15549,1],0] routed:binomial: Connection to lifeline [[15549,0],0] lost

Increasing the number of permutations above 200 also produces similar results.

The R code is executed with a typical command line for 14 CPUs being:

    sbatch -n 14 -i ./Rtest.txt --mail-type=ALL --mail-user=steven_d...@hc-sc.gc.ca /usr/local/bin/R --no-save

Config.log, ompi_info, Rscript.txt and slurm outputs are attached. The network is Gigabit Ethernet over copper, TCP/IP.

I believe this is an openmpi error/bug because of the routed:binomial message. The same results occurred with openmpi-1.3.2, R 2.9.0, maanova 1.12 and slurm 2.0.1. No non-default MCA parameters are set. LD_LIBRARY_PATH=/usr/local/lib. Configuration was done with defaults.

Any ideas are welcome.

____________________
Steve Dale

Attachment: bugrep.tar.bz2 (binary data)