I’d be a little cautious here - I’m not sure that hetero operations are completely fixed. The README is probably a bit overstated (it reflects an earlier state), but I’m certain we haven’t extensively tested hetero operations and suspect there are still lingering issues.
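If someone gets a chance to poke at this after the holidays, it would also be worth running a test that actually moves typed data between the hosts - MPI_Init/MPI_Finalize alone never touches the datatype conversion path, so it mostly exercises the runtime rather than the hetero support itself. A minimal sketch along those lines (just an illustration; the check and any names here are mine, not from Siegmar's report):

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

/* Broadcast a known value from rank 0 and verify it on every rank; a
 * byte-order problem between the SPARC and x86_64 hosts would show up
 * as a mismatched value rather than (or in addition to) a hang. */
int main (int argc, char *argv[])
{
  int rank, size, value;

  MPI_Init (&argc, &argv);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);
  MPI_Comm_size (MPI_COMM_WORLD, &size);

  value = (rank == 0) ? 0x12345678 : 0;
  MPI_Bcast (&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
  printf ("rank %d of %d: value = 0x%x (%s)\n", rank, size, value,
          value == 0x12345678 ? "ok" : "MISMATCH");

  MPI_Finalize ();
  return EXIT_SUCCESS;
}

Run it the same way as init_finalize, e.g. "mpiexec -np 6 --host sunpc1,linpc1,tyr bcast_test".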
> On Dec 23, 2014, at 11:29 PM, Kawashima, Takahiro
> <t-kawash...@jp.fujitsu.com> wrote:
>
> Gilles,
>
> Ahh, I didn't know the current status. Thank you for the notice!
>
> Thanks,
> Takahiro Kawashima
>
>> Kawashima-san,
>>
>> I'd rather consider this a bug in the README (!)
>>
>> heterogeneous support has been broken for some time, but it was
>> eventually fixed.
>>
>> truth is there are *very* limited resources (both human and hardware)
>> maintaining heterogeneous support, but that does not mean heterogeneous
>> support should not be used, nor that bug reports will be ignored.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/12/24 9:26, Kawashima, Takahiro wrote:
>>> Hi Siegmar,
>>>
>>> Heterogeneous environment is not supported officially.
>>>
>>> README of Open MPI master says:
>>>
>>>   --enable-heterogeneous
>>>     Enable support for running on heterogeneous clusters (e.g., machines
>>>     with different endian representations). Heterogeneous support is
>>>     disabled by default because it imposes a minor performance penalty.
>>>
>>>     *** THIS FUNCTIONALITY IS CURRENTLY BROKEN - DO NOT USE ***
>>>
>>>> Hi,
>>>>
>>>> Today I installed openmpi-dev-602-g82c02b4 on my machines (Solaris 10 Sparc,
>>>> Solaris 10 x86_64, and openSUSE Linux 12.1 x86_64) with gcc-4.9.2 and the
>>>> new Solaris Studio 12.4 compilers. All build processes finished without
>>>> errors, but I have a problem running a very small program. It works for
>>>> three processes but hangs for six processes. I have the same behaviour
>>>> for both compilers.
>>>>
>>>> tyr small_prog 139 time; mpiexec -np 3 --host sunpc1,linpc1,tyr init_finalize; time
>>>> 827.161u 210.126s 30:51.08 56.0% 0+0k 4151+20io 2898pf+0w
>>>> Hello!
>>>> Hello!
>>>> Hello!
>>>> 827.886u 210.335s 30:54.68 55.9% 0+0k 4151+20io 2898pf+0w
>>>> tyr small_prog 140 time; mpiexec -np 6 --host sunpc1,linpc1,tyr init_finalize; time
>>>> 827.946u 210.370s 31:15.02 55.3% 0+0k 4151+20io 2898pf+0w
>>>> ^CKilled by signal 2.
>>>> Killed by signal 2.
>>>> 869.242u 221.644s 33:40.54 53.9% 0+0k 4151+20io 2898pf+0w
>>>> tyr small_prog 141
>>>>
>>>> tyr small_prog 145 ompi_info | grep -e "Open MPI repo revision:" -e "C compiler:"
>>>>   Open MPI repo revision: dev-602-g82c02b4
>>>>   C compiler: cc
>>>> tyr small_prog 146
>>>>
>>>> tyr small_prog 146 /usr/local/gdb-7.6.1_64_gcc/bin/gdb mpiexec
>>>> GNU gdb (GDB) 7.6.1
>>>> ...
>>>> (gdb) run -np 3 --host sunpc1,linpc1,tyr init_finalize
>>>> Starting program: /usr/local/openmpi-1.9.0_64_cc/bin/mpiexec -np 3 --host sunpc1,linpc1,tyr init_finalize
>>>> [Thread debugging using libthread_db enabled]
>>>> [New Thread 1 (LWP 1)]
>>>> [New LWP 2 ]
>>>> Hello!
>>>> Hello!
>>>> Hello!
>>>> [LWP 2 exited]
>>>> [New Thread 2 ]
>>>> [Switching to Thread 1 (LWP 1)]
>>>> sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to satisfy query
>>>> (gdb) run -np 6 --host sunpc1,linpc1,tyr init_finalize
>>>> The program being debugged has been started already.
>>>> Start it from the beginning? (y or n) y
>>>>
>>>> Starting program: /usr/local/openmpi-1.9.0_64_cc/bin/mpiexec -np 6 --host sunpc1,linpc1,tyr init_finalize
>>>> [Thread debugging using libthread_db enabled]
>>>> [New Thread 1 (LWP 1)]
>>>> [New LWP 2 ]
>>>> ^CKilled by signal 2.
>>>> Killed by signal 2.
>>>>
>>>> Program received signal SIGINT, Interrupt.
>>>> [Switching to Thread 1 (LWP 1)]
>>>> 0xffffffff7d1dc6b0 in __pollsys () from /lib/sparcv9/libc.so.1
>>>> (gdb) bt
>>>> #0  0xffffffff7d1dc6b0 in __pollsys () from /lib/sparcv9/libc.so.1
>>>> #1  0xffffffff7d1cb468 in _pollsys () from /lib/sparcv9/libc.so.1
>>>> #2  0xffffffff7d170ed8 in poll () from /lib/sparcv9/libc.so.1
>>>> #3  0xffffffff7e69a630 in poll_dispatch ()
>>>>     from /usr/local/openmpi-1.9.0_64_cc/lib64/libopen-pal.so.0
>>>> #4  0xffffffff7e6894ec in opal_libevent2021_event_base_loop ()
>>>>     from /usr/local/openmpi-1.9.0_64_cc/lib64/libopen-pal.so.0
>>>> #5  0x000000010000eb14 in orterun (argc=1757447168, argv=0xffffff7ed8550cff)
>>>>     at ../../../../openmpi-dev-602-g82c02b4/orte/tools/orterun/orterun.c:1090
>>>> #6  0x0000000100004e2c in main (argc=256, argv=0xffffff7ed8af5c00)
>>>>     at ../../../../openmpi-dev-602-g82c02b4/orte/tools/orterun/main.c:13
>>>> (gdb)
>>>>
>>>> Any ideas? Unfortunately I'm leaving for vacation, so I cannot test
>>>> any patches until the end of the year. Nevertheless I wanted to report the
>>>> problem. At the moment I cannot test whether I have the same behaviour in a
>>>> homogeneous environment with three machines, because the new version isn't
>>>> available before tomorrow on the other machines. I used the following
>>>> configure command.
>>>>
>>>> ../openmpi-dev-602-g82c02b4/configure --prefix=/usr/local/openmpi-1.9.0_64_cc \
>>>>   --libdir=/usr/local/openmpi-1.9.0_64_cc/lib64 \
>>>>   --with-jdk-bindir=/usr/local/jdk1.8.0/bin \
>>>>   --with-jdk-headers=/usr/local/jdk1.8.0/include \
>>>>   JAVA_HOME=/usr/local/jdk1.8.0 \
>>>>   LDFLAGS="-m64 -mt" \
>>>>   CC="cc" CXX="CC" FC="f95" \
>>>>   CFLAGS="-m64 -mt" CXXFLAGS="-m64 -library=stlport4" FCFLAGS="-m64" \
>>>>   CPP="cpp" CXXCPP="cpp" \
>>>>   CPPFLAGS="" CXXCPPFLAGS="" \
>>>>   --enable-mpi-cxx \
>>>>   --enable-cxx-exceptions \
>>>>   --enable-mpi-java \
>>>>   --enable-heterogeneous \
>>>>   --enable-mpi-thread-multiple \
>>>>   --with-threads=posix \
>>>>   --with-hwloc=internal \
>>>>   --without-verbs \
>>>>   --with-wrapper-cflags="-m64 -mt" \
>>>>   --with-wrapper-cxxflags="-m64 -library=stlport4" \
>>>>   --with-wrapper-ldflags="-mt" \
>>>>   --enable-debug \
>>>>   |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>>>>
>>>> Furthermore I used the following test program.
>>>>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> #include "mpi.h"
>>>>
>>>> int main (int argc, char *argv[])
>>>> {
>>>>   MPI_Init (&argc, &argv);
>>>>   printf ("Hello!\n");
>>>>   MPI_Finalize ();
>>>>   return EXIT_SUCCESS;
>>>> }
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/12/26070.php