Dear Reuti,

Thank you very much!

The jobname.nodes file is not necessary for parallel ORCA, and my "mpivars.sh" is not the problem either.

ORCA 3.0.3 is compiled with openmpi-1.6.5 and runs normally on multiple nodes under gridengine. ORCA 4.0.0 is compiled with openmpi-2.0.2 and cannot run on multiple nodes under gridengine. This may be a bug in openmpi-2.0.x that affects ORCA runs on multiple nodes under gridengine. I downloaded the latest stable version of Open MPI, but the same error also appears with openmpi-2.1.0, so the bug may not be fixed in the latest stable version.
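For reference on the "mpivars.sh" point, a minimal sketch of a variant that always puts the Open MPI directories at the front of the search paths could look like the following. The prefix is the one from my installation; the helper name prepend_path is hypothetical, and the "lib" subdirectory is an assumption about the install layout.

```shell
#!/bin/sh
# Sketch: always prepend the Open MPI directories so that they take
# precedence over any other MPI found later in the search paths.
# OMPI_PREFIX defaults to the prefix used in this thread; adjust per site.
OMPI_PREFIX=${OMPI_PREFIX:-/share/apps/mpi/openmpi2.0.2-ifort}

# prepend_path DIR CURRENT_VALUE
# Prints DIR alone if CURRENT_VALUE is empty, otherwise DIR:CURRENT_VALUE.
# (hypothetical helper, not part of Open MPI)
prepend_path() {
    if [ -n "$2" ]; then
        printf '%s:%s\n' "$1" "$2"
    else
        printf '%s\n' "$1"
    fi
}

PATH=$(prepend_path "$OMPI_PREFIX/bin" "$PATH")
# Assumption: the shared libraries live in $OMPI_PREFIX/lib.
LD_LIBRARY_PATH=$(prepend_path "$OMPI_PREFIX/lib" "$LD_LIBRARY_PATH")
export PATH LD_LIBRARY_PATH
```

Sourcing such a file more than once adds the directories again, which is harmless for lookup order but makes the variables longer.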
>The Open MPI bug you checked already: https://www.mail-archive.com/ us...@lists.open-mpi.org/msg30824.html

Thanks for the information. I read it, but I have not been able to solve the problem. I modified the source file "orte/mca/plm/rsh/plm_rsh_component.c" following this commit:
https://github.com/open-mpi/ompi/commit/dee2d8646d2e2055e2c86db9c207403366a2453d#diff-f556f53efc98e71d3bd13ee9945949fe
and recompiled Open MPI, but it had no effect.

>Please don't use "&" in the job script to put the job in the background. The job script might end and SGE discovers this and kills all orphaned processes. Also with Torque this shouldn't be necessary.

Yes, I know. Thanks!

>Please change the line in your PeHostfile2MachineFile() subroutine:
>host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
>to:
>host=`echo $line|cut -f1 -d" "`
>This should leave the ".local" domain,

This is not the problem either, because my "/etc/hosts" lists both the long and the short name for every node:

10.1.1.1 cluster.local cluster
10.1.255.254 compute-0-0.local compute-0-0
10.1.255.253 compute-0-1.local compute-0-1
10.1.255.244 compute-0-10.local compute-0-10
10.1.255.243 compute-0-11.local compute-0-11
......

Best regards,
Yong Wu

2017-04-09 2:42 GMT+08:00 Reuti <re...@staff.uni-marburg.de>:

> Hi,
>
> On 07.04.2017 at 16:04, Yong Wu wrote:
>
> > Thanks for your reply.
> > First of all, I can run this job on multiple nodes without Torque/SGE resource manager, and also ok used with Torque.
> > But this job does not work on multiple nodes with gridengine.
> > I doubt that this is caused by the parallel environment of gridengine. However, orte, mpi, mpich, I got the same error for these PEs of gridengine.
> >
> > I answer your above mentioned question.
> > >Can you please post the output of the $PE_HOSTFILE and the converted test.nodes for a run, and the allocation you got: qstat -g t
> > The output of $PE_HOSTFILE:
> > compute-0-34.local 16 bgmnode.q@compute-0-34.local UNDEFINED
> > compute-0-67.local 8 bgmnode.q@compute-0-67.local UNDEFINED
> >
> > […]
>
> Okay.
>
> What does happen, what error message is generated when you don't create the "test.nodes" file at all?
>
> > > The "mpivars.sh" seems not to be in the default Open MPI compilation. Where is it coming from, what's inside?
> > The "mpivars.sh" is touched by me, and the content:
> > $ cat /share/apps/mpi/openmpi2.0.2-ifort/bin/mpivars.sh
> > # PATH
> > if test -z "`echo $PATH | grep /share/apps/mpi/openmpi2.0.2-ifort/bin`"; then
>
> Although I like that you scan for the existence of the paths in the environment variable, it's more safe to add some just in front in any case. Otherwise they could be at the end and overwritten by any path found earlier in the environment variable.
>
> > […]
> > $ source /share/apps/mpi/openmpi2.0.2-ifort/bin/mpivars.sh
> > $ ompi_info | grep gridengine
> > MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v2.0.2)
>
> Ok, this is compiled in then.
>
> > >Side note:
> > I create the same directory on each nodes and also use the NFS shared directory for scratch directory. And use the following environment:
> > source /usr/share/Modules/init/sh
> > module load intel/compiler/2011.7.256
> > source /share/apps/mpi/openmpi2.0.2-ifort/bin/mpivars.sh
> > export RSH_COMMAND="ssh"
> >
> > Use these environments, I can run this orca job normally on multiple nodes without gridengine by type the command:"/share/apps/orca4.0.0/orca test.inp &>test.log &"
>
> Please don't use "&" in the job script to put the job in the background. The job script might end and SGE discovers this and kills all orphaned processes. Also with Torque this shouldn't be necessary.
>
> -- Reuti
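
For reference, the hostfile conversion discussed in this thread can be sketched as below. This is only an illustration, not the stock PeHostfile2MachineFile() from the SGE templates: the function name pe_hostfile_to_machinefile is hypothetical, and it assumes the wanted output is one full host name (including the ".local" domain) per granted slot.

```shell
#!/bin/sh
# Sketch: convert an SGE $PE_HOSTFILE into a plain machine file.
# Each $PE_HOSTFILE line looks like:
#   compute-0-34.local 16 bgmnode.q@compute-0-34.local UNDEFINED
# i.e. host name, number of slots, queue instance, processor range.

# pe_hostfile_to_machinefile FILE
# Prints every host name once per slot, keeping the domain part.
# (hypothetical helper name, chosen for this example)
pe_hostfile_to_machinefile() {
    while read -r host nslots _rest; do
        [ -n "$host" ] || continue          # skip empty lines
        i=1
        while [ "$i" -le "$nslots" ]; do
            printf '%s\n' "$host"
            i=$((i + 1))
        done
    done < "$1"
}

# Inside a job script it could be called like this:
# pe_hostfile_to_machinefile "$PE_HOSTFILE" > test.nodes
```

Reading the first two fields with "read" avoids the two "cut" calls per line and leaves the host name exactly as SGE wrote it.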
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users