Re: [OMPI users] Open MPI and Torque error
Pak Lui wrote:

> Prakash,
>
>> tm_poll: protocol number dis error 11
>> ret is 17002 instead of 0: tm_init failed
>> 3 processes killed (possibly by Open MPI)
>
> I encountered a similar problem with OpenPBS before, which also uses the TM interfaces. It returned TM_ENOTCONNECTED (17002) when I tried to call tm_init a second time (which in turn calls tm_poll, and that returned the errno).
>
> I think what you did was start tm_init from another node and connect to another MOM, which I do not think is allowed. The TM module in Open MPI has already called tm_init once. I am curious why you need to call tm_init again.
>
> If you are curious about the implementation for PBS, you can download the source from openpbs.org.
>
> OpenPBS source: v2.3.16/src/lib/Libifl/tm.c

I am interested in getting this to work because I am implementing support for dynamic scheduling in Torque. I want any node in an MPI-2 job (basically the Open MPI implementation) to be able to request more nodes from the Torque/PBS server. I am doing a little study on that right now. Instead of nodes talking directly to the server, I want them to talk to Mother Superior (MS), and MS will in turn talk to the server.

Could you please explain why this does not work now? And why does it work when I call tm_init from MS, but not from any other MOM?

Thanks,
Prakash
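Editorial sketch: the double-initialization behavior Pak Lui describes can be illustrated with a small guard that calls an init routine at most once per process. This is only a sketch under stated assumptions. `stub_tm_init` is a hypothetical stand-in that imitates the reported behavior (success on the first call, 17002 on any later call); the real `tm_init` takes arguments and talks to a MOM. `tm_init_once` and `TM_ENOTCONNECTED_CODE` are likewise illustrative names, not part of the TM library or Open MPI. The sketch does not address Prakash's actual case of calling tm_init from a node other than Mother Superior.

```c
#include <stdio.h>

/* Value the thread reports for TM_ENOTCONNECTED. */
#define TM_ENOTCONNECTED_CODE 17002

/* HYPOTHETICAL stand-in for tm_init(): succeeds on the first call and
 * reports "not connected" on every later call, imitating the behavior
 * described in the message above. It is not the real TM library. */
static int stub_calls = 0;

static int stub_tm_init(void)
{
    stub_calls++;
    return (stub_calls == 1) ? 0 : TM_ENOTCONNECTED_CODE;
}

/* Guard state: whether init has been attempted, and what it returned. */
static int init_done = 0;
static int init_result = 0;

/* Call the init routine at most once per process and cache its result,
 * so later callers reuse the first outcome instead of re-initializing. */
int tm_init_once(void)
{
    if (!init_done) {
        init_result = stub_tm_init();
        init_done = 1;
    }
    return init_result;
}
```

With this guard, a second call to `tm_init_once()` returns the cached result of the first call, whereas calling the stub directly a second time returns 17002.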
[OMPI users] job running question
We are trying to build a new cluster running Open MPI. We were previously running LAM/MPI. To run jobs we would do the following:

$ lamboot lam-host-file
$ mpirun C program

I am not sure if this works more or less the same way with Open MPI. We were trying to run it like this:

[james.parker@Cent01 FORTRAN]$ mpirun --np 2 f_5x5 localhost
mpirun noticed that job rank 1 with PID 0 on node "localhost" exited on signal 11.
[Cent01.brooks.afmc.ds.af.mil:16124] ERROR: A daemon on node localhost failed to start as expected.
[Cent01.brooks.afmc.ds.af.mil:16124] ERROR: There may be more information available from
[Cent01.brooks.afmc.ds.af.mil:16124] ERROR: the remote shell (see above).
[Cent01.brooks.afmc.ds.af.mil:16124] The daemon received a signal 11.
1 additional process aborted (not shown)
[james.parker@Cent01 FORTRAN]$

We have Open MPI installed to /usr/local, and these are our environment variables:

[james.parker@Cent01 FORTRAN]$ export
declare -x COLORTERM="gnome-terminal"
declare -x DBUS_SESSION_BUS_ADDRESS="unix:abstract=/tmp/dbus-sfzFctmRFS"
declare -x DESKTOP_SESSION="default"
declare -x DISPLAY=":0.0"
declare -x GDMSESSION="default"
declare -x GNOME_DESKTOP_SESSION_ID="Default"
declare -x GNOME_KEYRING_SOCKET="/tmp/keyring-x8WQ1E/socket"
declare -x GTK_RC_FILES="/etc/gtk/gtkrc:/home/BROOKS-2K/james.parker/.gtkrc-1.2-gnome2"
declare -x G_BROKEN_FILENAMES="1"
declare -x HISTSIZE="1000"
declare -x HOME="/home/BROOKS-2K/james.parker"
declare -x HOSTNAME="Cent01"
declare -x INPUTRC="/etc/inputrc"
declare -x KDEDIR="/usr"
declare -x LANG="en_US.UTF-8"
declare -x LD_LIBRARY_PATH="/usr/local/lib:/usr/local/lib/openmpi"
declare -x LESSOPEN="|/usr/bin/lesspipe.sh %s"
declare -x LOGNAME="james.parker"
declare -x LS_COLORS="no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:"
declare -x MAIL="/var/spool/mail/james.parker"
declare -x OLDPWD="/home/BROOKS-2K/james.parker/build/SuperLU_DIST_2.0"
declare -x PATH="/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/BROOKS-2K/james.parker/bin:/usr/local/bin"
declare -x PERL5LIB="/usr/lib/perl5/site_perl/5.8.5/i386-linux-thread-multi:/usr/lib/perl5/site_perl/5.8.5"
declare -x PWD="/home/BROOKS-2K/james.parker/build/SuperLU_DIST_2.0/FORTRAN"
declare -x SESSION_MANAGER="local/Cent01.brooks.afmc.ds.af.mil:/tmp/.ICE-unix/14516"
declare -x SHELL="/bin/bash"
declare -x SHLVL="2"
declare -x SSH_AGENT_PID="14541"
declare -x SSH_ASKPASS="/usr/libexec/openssh/gnome-ssh-askpass"
declare -x SSH_AUTH_SOCK="/tmp/ssh-JUIxl14540/agent.14540"
declare -x TERM="xterm"
declare -x USER="james.parker"
declare -x WINDOWID="35651663"
declare -x XAUTHORITY="/home/BROOKS-2K/james.parker/.Xauthority"
[james.parker@Cent01 FORTRAN]$

Any ideas?
[OMPI users] any checkpoint/restart function in Open-MPI?
just like the BLCR in LAM/MPI.

thanks in advance
Lenjoy