I am converting an MPMD program of mine from LAM to OpenMPI. The old LAM setup used an application schema to launch the server and remote processes on all the nodes in the cluster, but I have run into an issue caused by a difference in how mpirun works in the two implementations.

Because mpirun routes STDIN and STDOUT of remote processes back to the STDIN and STDOUT of wherever mpirun was originally run, I use a shell to launch the remote processes on all the nodes. In other words, I have mpirun start a shell (/bin/sh) on each node and pass it a string containing the executable and the runtime variables to hand to that executable. With the shell's "-c" option, the shell starts the process for me, which lets me control the STDIN and STDOUT of the process.

Below is an example application schema that works in LAM but not in OpenMPI (obviously the --mca option doesn't exist in LAM, but its equivalent does). When I try to use it in OpenMPI, I get EOF parsing errors because OpenMPI does not know what to do with the variables listed inside the quotation marks. It parses the first quote and the program with its path, then errors out looking for a matching quote, when it should have kept reading the rest of the runtime variables in the string. How do I get this entire string passed through by mpirun, so that the shell can execute the corresponding process and pass the associated runtime variables to it?
#Example Application Schema
#server
-host node1 --mca btl tcp,self -np 1 /bin/sh -c "/usr/bin/SERVER_PROG --varTen blah" > myOwnLogfile_server.log 2>&1"
#node2
-host node2 --mca btl tcp,self -np 1 /bin/sh -c "/usr/bin/REMOTE_PROG --varOne 59339 --varTwo 65888" > myOwnLogfile_remote1.log 2>&1"
#node3
-host node3 --mca btl tcp,self -np 1 /bin/sh -c "/usr/bin/REMOTE_PROG --varOne 59339 --varTwo 65888" > myOwnLogfile_remote2.log 2>&1"
#node4
-host node4 --mca btl tcp,self -np 1 /bin/sh -c "/usr/bin/REMOTE_PROG --varOne 59339 --varTwo 65888" > myOwnLogfile_remote3.log 2>&1"
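
To make the "-c" mechanism concrete, here is a minimal sketch of what I expect each node's shell to do, run directly from a local shell with no mpirun involved. It is only an illustration: printf stands in for REMOTE_PROG, the temporary log file replaces myOwnLogfile_remote1.log, and I am assuming the redirection is meant to sit inside the single string handed to "-c".

```shell
#!/bin/sh
# Sketch only: printf is a stand-in for /usr/bin/REMOTE_PROG, and the
# temporary file is a stand-in for myOwnLogfile_remote1.log.
logfile=$(mktemp)

# Everything between the double quotes is ONE argument to /bin/sh -c.
# The inner shell splits it into the command, its arguments, and the
# redirection, so the output goes to the log file and not to the caller.
/bin/sh -c "printf 'varOne=%s varTwo=%s\n' 59339 65888 > '$logfile' 2>&1"

# Show what the inner shell wrote to the log.
cat "$logfile"
```

The point is that the launcher has to deliver the quoted string to /bin/sh as a single argument; if it is split or cut off at a quote, the inner shell never sees the redirection.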