On 06.02.2012 at 22:28 Tom Bryan wrote:

> On 2/6/12 8:14 AM, "Reuti" <re...@staff.uni-marburg.de> wrote:
>
>>> If I need MPI_THREAD_MULTIPLE, and openmpi is compiled with thread support,
>>> it's not clear to me whether MPI::Init_thread() and
>>> MPI::Init_thread(MPI::THREAD_MULTIPLE) would give me the same behavior from
>>> Open MPI.
>>
>> If you need thread support, you will need MPI::Init_thread, and it needs one
>> argument (or three).
>
> Sorry, typo on my side. I meant to compare
> MPI::Init_thread(MPI::THREAD_MULTIPLE) and MPI::Init(). I think that your
> first reply mentioned replacing MPI::Init_thread by MPI::Init.

Yes, if you don't need threads, I don't see what requesting thread support would add to the environment that you could make use of.
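To make the difference concrete, here is a minimal sketch (using the C++ bindings discussed above; the file name and the error message are illustrative, not your actual mpitest program) that requests MPI_THREAD_MULTIPLE and then checks which level the library actually granted, since a build without thread support may provide less than requested:

// mpitest_threads.cpp (file name illustrative): request full thread
// support and verify what the library actually grants.
#include <mpi.h>
#include <iostream>

int main(int argc, char *argv[])
{
    // Ask for THREAD_MULTIPLE; the return value is the level provided.
    int provided = MPI::Init_thread(argc, argv, MPI::THREAD_MULTIPLE);

    if (provided < MPI::THREAD_MULTIPLE && MPI::COMM_WORLD.Get_rank() == 0) {
        // An Open MPI build without full thread support ends up here.
        std::cerr << "THREAD_MULTIPLE not available, provided level: "
                  << provided << std::endl;
    }

    MPI::Finalize();
    return 0;
}

Per the MPI standard, a plain MPI::Init() has the same effect as requesting only MPI_THREAD_SINGLE, so if you do need THREAD_MULTIPLE, calling Init_thread and checking the returned `provided` value is the portable way to find out whether the installed library supports it.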
>>> <snip>
>>
>> What is the setting in SGE for:
>>
>> $ qconf -sconf
>> ...
>> qlogin_command               builtin
>> qlogin_daemon                builtin
>> rlogin_command               builtin
>> rlogin_daemon                builtin
>> rsh_command                  builtin
>> rsh_daemon                   builtin
>>
>> If it's set to use ssh,
>
> Nope. My output is the same as yours.
>
> qlogin_command               builtin
> qlogin_daemon                builtin
> rlogin_command               builtin
> rlogin_daemon                builtin
> rsh_command                  builtin
> rsh_daemon                   builtin

Fine.

>> But I wonder why it's working for some nodes.
>
> I don't think that it's working on some nodes. In my other cases where it
> hangs, I don't always get those "connection refused" errors.

If "builtin" is used, there is no reason to get "connection refused". The error message from Open MPI should be different in the case of a closed firewall IIRC.

> I'm not sure, but the "connection refused" errors might be a red herring.
> The machines' primary NICs are on a different private network (172.28.*.*).
> The 192.168.122.1 address is actually the machine's own virbr0 device, which
> the documentation says is a "xen interface used by Virtualization guest and
> host OSes for network communication."

By default Open MPI is using the primary interface for its communication AFAIK.

>> Are there custom configurations per node, and are some faulty?
>
> I did a qconf -sconf <machine> for each host in my grid. I get identical
> output like this for each machine.
>
> $ qconf -sconf grid-03
> #grid-03.cisco.com:
> mailer                       /bin/mail
> xterm                        /usr/bin/xterm
>
> So, I think that the SGE config is the same across those machines.

Yes, ok. Then it's fine.

>>> <snip>
>>> 3. Experiment "d" was similar to "b", but mpi.sh uses "mpiexec -np 1
>>> mpitest" instead of running mpitest directly. Now both the single machine
>>> queue and the multiple machine queue work. So, mpiexec seems to make my
>>> multi-machine configuration happier. In this case, I'm still using "-pe
>>> orte 5-", and I'm still seeing the extra SLAVE slots granted in qstat -g t.
>>
>> Then case a) could show a bug in 1.5.4. For me both were working, but the
>
> OK. That helps to explain my confusion. Our previous experiments (where I
> was told that case (a) was working) were with Open MPI 1.4.x. Should I open
> a bug for this issue?

I'm not sure, as for me it's working. Maybe it really has something to do with the virtual machine setup.

>> Yes, this should work across multiple machines. And it's using `qrsh -inherit
>> ...`, so it's failing somewhere in Open MPI - is it working with 1.4.4?
>
> I'm not sure. We no longer have our 1.4 test environment, so I'm in the
> process of building that now. I'll let you know once I have a chance to run
> that experiment.

Ok.

-- 
Reuti