> However, we're quite open to other approaches. Because of the nature of
> our integration with a variety of different run-time environments, our
> startup is not a shell script -- mpirun ("orterun" is its real name;
> "mpirun" is a sym link to orterun) is a compiled executable.

Sure, I saw that mpirun is the orterun executable :) And this means that to add some features I need to rebuild it (and probably some run-time libs) each time.

> What are the requirements of your debugger? Do you attempt to launch
> the MPI processes yourself, or do you attach to them after they are
> launched (which is what TotalView does)?

It is supposed to attach GDB to each process after it has launched, so the TotalView interface fits well, except that its details are hardcoded in the source of orte/tools/orterun (as you may guess, I don't have an executable named "totalview", etc.). I'd like to know when and where the functions from orterun/totalview.{h,c} get called, whether I need to write my own file like this, and so on. In other words, "the debugger adder reference manual" :)

Currently I launch gdb instances on remote processes via ssh (as MPICH does), but it would probably be better to use the orte framework capabilities for this. I don't know how yet.

In general, is there any ompi/orte architecture documentation other than the short schemes in your publications? Those are too general, while the sources and doxygen docs are too detailed. Some intermediate "how all this works together" document is needed to assemble the whole picture. I do not understand it completely myself.

> Open MPI uses orterun as its launcher, not the first MPI process.
> Hence, it is the one that TotalView gets it information from (in that
> sense, it's similar to the MPICH model -- there is one coordinator; it's
> just that it's orterun, not the first MPI process). Once orterun
> receives notification that all the MPI processes have started, it gives
> the nodename/PID information of each process to TotalView who then
> launches its own debugger processes on those nodes and attaches to the
> processes.

Hm.. with MPICH I use the first gdb copy to get the info from the 0-th process and then continue to use it as a node debugger; here I'll have to use one more gdb to get the process table out of the orterun process? And how can this be done safely?
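
To make the attach step concrete, here is a rough sketch of what I mean by launching gdb on remote processes via ssh once the nodename/PID table is known. It is only an illustration: the `ProcEntry` type and the `build_attach_command(s)` helpers are hypothetical names of mine, not anything from orterun or TotalView, and how the table is actually obtained from orterun is exactly the open question above. The only real tools assumed are `ssh -t` and `gdb -p PID`.

```python
import shlex
from typing import List, NamedTuple


class ProcEntry(NamedTuple):
    """One row of a (hypothetical) process table: where a rank runs."""
    host: str
    pid: int
    rank: int


def build_attach_command(entry: ProcEntry, gdb: str = "gdb", ssh: str = "ssh") -> List[str]:
    """Build the argv that attaches gdb to one remote process over ssh."""
    # "gdb -p PID" attaches to an already running process.
    remote = [gdb, "-p", str(entry.pid)]
    # ssh takes the remote command as a single string; -t asks for a tty
    # so that gdb can be driven interactively.
    remote_cmd = " ".join(shlex.quote(arg) for arg in remote)
    return [ssh, "-t", entry.host, remote_cmd]


def build_attach_commands(proctable: List[ProcEntry]) -> List[List[str]]:
    """Return one attach command per process, ordered by rank."""
    return [build_attach_command(e) for e in sorted(proctable, key=lambda e: e.rank)]


if __name__ == "__main__":
    # Example table; the hosts and PIDs are made up for illustration.
    table = [ProcEntry("node2", 4312, 1), ProcEntry("node1", 4310, 0)]
    for argv in build_attach_commands(table):
        print(" ".join(shlex.quote(a) for a in argv))
```

The commands are only built here, not executed, so each one can be handed to whatever terminal or front end drives the per-node debugger.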

> You probably get a "stopped" message when you try to bg orterun because
> the shell thinks that it is waiting for input from stdin, because we
> didn't close it.

Actually, this shouldn't matter. Many programs don't close stdin, yet nothing prevents them from running in the background until they try to read input. The same "Hello world" application runs well in the background with MPICH: "mpirun -np 3 a.out &"

Best regards,
Konstantin.