Hi

Reuti wrote:
>>
>> ./configure --prefix=/homes/kazi/glembek/share/openmpi-1.3.3-64
>> --with-sge --enable-shared --enable-static --host=x86_64-linux
>> --build=x86_64-linux NM=x86_64-linux-nm
>
> Is there any list of valid values for --host, --build and NM - and what
> is NM for? From the ./configure --help I would "assume" that one can
> tell Open MPI to prepare to BUILD on a PPC platform, although I'm
> issuing the command on a x86, and the result of the PPC compile should
> be to run on x86_64. Maybe you can leave it out, as it's the same in
> your case?
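(Side note: as far as I know, --build and --host simply take GNU config
triplets and only need to differ when cross-compiling; NM merely tells
configure/libtool which nm binary to run, and all three can be left out
entirely for a native build. A hedged sketch of an equivalent native
64-bit line, where the x86_64-pc-linux-gnu triplet is only an illustration:

  ./configure --prefix=/homes/kazi/glembek/share/openmpi-1.3.3-64 \
              --with-sge --enable-shared --enable-static \
              --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu \
              NM=nm
)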
This is not the problem... We have both 32-bit and 64-bit machines and the
problem occurs on both (i.e. omitting the --host, --build, etc.)...

>> Is there any way to force the ssh before the (...) term???
>
> Using SSH directly would bypass SGE's startup. What are your entries for
> qrsh_daemon and so on in SGE's configuration? Which version of SGE?

qstat reports the version number as "GE 6.2u4"... The qconf -sconf dump is
below.

> But I think the real problem is, that Open MPI assumes you are outside
> of SGE and so uses a different startup. Are you resetting any of SGE's
> environment variables in your custom starter method (like $JOB_ID)?

I don't think Open MPI is unaware of SGE when it calls the starter.sh...
The starter.sh looks like this:

$$$
#!/bin/sh

ulimit -S -c 0
ulimit -S -t unlimited

#echo "$@" >>/pub/tmp/starter.log

#start the job in this shell
exec "$@"

so no resetting of any kind. Also the ompi_info output looks OK:

$$$
ompi_info | grep gridengine
                 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3.3)

$$$ qconf -sconf:

qconf -sconf
#global:
execd_spool_dir              /usr/local/share/SGE/default/spool
mailer                       /bin/mail
xterm                        /usr/bin/xterm
load_sensor                  /usr/local/share/SGE/util/disk.sh
prolog                       none
epilog                       none
shell_start_mode             posix_compliant
login_shells                 sh,ksh,csh,tcsh,bash
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:30
max_unheard                  00:05:00
reschedule_unknown           00:00:00
loglevel                     log_warning
administrator_mail           li...@fit.vutbr.cz
set_token_cmd                none
pag_cmd                      none
token_extend_time            none
shepherd_cmd                 none
qmaster_params               none
reporting_params             accounting=true reporting=false \
                             flush_time=00:00:15 joblog=false sharelog=00:00:00
finished_jobs                20
gid_range                    20000-20100
qlogin_command               builtin
qlogin_daemon                builtin
rlogin_daemon                builtin
max_aj_instances             2000
max_aj_tasks                 90000
max_u_jobs                   0
max_jobs                     0
auto_user_oticket            0
auto_user_fshare             0
auto_user_default_project    STD
auto_user_delete_time        0
delegated_file_staging       false
rsh_daemon                   builtin
rsh_command                  builtin
rlogin_command               builtin
reprioritize                 0
jsv_url                      none
jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w

Thanx

> -- Reuti
>
>> Thanx
>> Ondrej
>>
>> Reuti wrote:
>>> On 30.11.2009, at 18:46, Ondrej Glembek wrote:
>>>> Hi, thanx for reply...
>>>>
>>>> I tried to dump the $@ before calling the exec and here it is:
>>>>
>>>> ( test ! -r ./.profile || . ./.profile;
>>>> PATH=/homes/kazi/glembek/share/openmpi-1.3.3-64/bin:$PATH ; export PATH ;
>>>> LD_LIBRARY_PATH=/homes/kazi/glembek/share/openmpi-1.3.3-64/lib:$LD_LIBRARY_PATH ;
>>>> export LD_LIBRARY_PATH ;
>>>> /homes/kazi/glembek/share/openmpi-1.3.3-64/bin/orted -mca ess env
>>>> -mca orte_ess_jobid 3870359552 -mca orte_ess_vpid 1 -mca
>>>> orte_ess_num_procs 2 --hnp-uri "3870359552.0;tcp://147.229.8.134:53727"
>>>> --mca pls_gridengine_verbose 1 --output-filename mpi.log )
>>>>
>>>> It looks like the line gets constructed in
>>>> orte/mca/plm/rsh/plm_rsh_module.c and depends on the shell...
>>>>
>>>> Still I wonder why mpiexec calls the starter.sh... I thought the
>>>> starter was supposed to call the script which wraps a call to
>>>> mpiexec...
>>>
>>> Correct. This will happen for the master node of this job, i.e. where
>>> the jobscript is executed. But it will also be used for the qrsh
>>> -inherit calls. I wonder about one thing: I see only a call to
>>> "orted" and not the above sub-shell on my machines. Did you compile
>>> Open MPI with --with-sge?
>>> The original call above would be "ssh node_xy ( test ! ....)" which
>>> seems to work for ssh and rsh.
>>>
>>> Just one note: with the starter script you will lose the set PATH and
>>> LD_LIBRARY_PATH, as a new shell is created. It might be necessary to
>>> set it again in your starter method.
>>>
>>> -- Reuti
>>>
>>>> Am I not right???
>>>> Ondrej
>>>>
>>>> Reuti wrote:
>>>>> Hi,
>>>>>
>>>>> On 30.11.2009, at 16:33, Ondrej Glembek wrote:
>>>>>
>>>>>> we are using a custom starter method in our SGE to launch our
>>>>>> jobs... It looks something like this:
>>>>>>
>>>>>> #!/bin/sh
>>>>>>
>>>>>> # ... we do whole bunch of stuff here
>>>>>>
>>>>>> #start the job in this shell
>>>>>> exec "$@"
>>>>>
>>>>> the "$@" should be replaced by the path to the jobscript (qsub) or
>>>>> command (qrsh) plus the given options.
>>>>>
>>>>> For the spread tasks to other nodes I get as argument: "orted -mca
>>>>> ess env -mca orte_ess_jobid ...". Also no . ./.profile.
>>>>>
>>>>> So I wonder where the . ./.profile is coming from. Can you put a
>>>>> `sleep 60` or alike before the `exec ...` and grep the built line
>>>>> from `ps -e f` before it crashes?
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>> The trouble is that mpiexec passes a command which looks like this:
>>>>>>
>>>>>> ( . ./.profile ..... )
>>>>>>
>>>>>> which, however, is not a valid exec argument...
>>>>>>
>>>>>> Is there any way to tell mpiexec to run it in a separate script???
>>>>>> Any idea how to solve this???
>>>>>>
>>>>>> Thanx
>>>>>> Ondrej Glembek
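For the record, one possible workaround might be to let the starter method
fall back to a shell whenever it is handed shell syntax instead of a plain
executable, which is exactly what happens when Open MPI's rsh/qrsh launcher
builds the "( test ! -r ./.profile || . ./.profile; ...; orted ... )" line
quoted above. This is only an untested sketch (the exported paths are the
ones from this thread and would need adjusting); the exports also address
Reuti's note about the starter losing PATH and LD_LIBRARY_PATH in the new
shell:

  #!/bin/sh
  # Hypothetical starter_method sketch, not tested; adjust paths as needed.
  ulimit -S -c 0
  ulimit -S -t unlimited

  # Re-export what a fresh shell would otherwise lose (cf. Reuti's note).
  PATH=/homes/kazi/glembek/share/openmpi-1.3.3-64/bin:$PATH; export PATH
  LD_LIBRARY_PATH=/homes/kazi/glembek/share/openmpi-1.3.3-64/lib:$LD_LIBRARY_PATH
  export LD_LIBRARY_PATH

  case "$1" in
      "("*)
          # Open MPI's rsh/qrsh launcher handed over a sub-shell command,
          # e.g. "( test ! -r ./.profile || . ./.profile; ...; orted ... )".
          # That is shell syntax, not an executable, so exec() on it fails;
          # run it through a shell instead.
          exec /bin/sh -c "$*"
          ;;
      *)
          # Normal case (job script or plain command): exec it directly.
          exec "$@"
          ;;
  esac

Whether qrsh -inherit delivers that command as a single argument or as
already-split words may differ, so this would need checking first with the
sleep/ps trick Reuti suggested.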
--
Ondrej Glembek, PhD student    E-mail: glem...@fit.vutbr.cz
UPGM FIT VUT Brno, L226        Web:    http://www.fit.vutbr.cz/~glembek
Bozetechova 2, 612 66          Phone:  +420 54114-1292
Brno, Czech Republic           Fax:    +420 54114-1290

ICQ: 93233896
GPG: C050 A6DC 7291 6776 9B69 BB11 C033 D756 6F33 DE3C