[OMPI users] configure is too smart !
Dear developers,

I recently switched from LAM/MPI to Open MPI. I am using Mac OS X Server on small clusters, previously with XLF/XLC on G5s, now gfortran/gcc on Intel machines. Since our users are used to Unix file systems, and since most application/library builds are not aware of the case insensitivity of HFS+, I installed a UFS-formatted disk on our new cluster.

Being a careful administrator, I configured and compiled Open MPI as a user on the UFS partition, then installed it as root on an HFS+ system partition. When I tried to install ScaLAPACK, the BLACS compilation failed miserably:

BI_EmergencyBuff.c: In function 'void BI_EmergencyBuff(int)':
BI_EmergencyBuff.c:34: error: invalid conversion from 'void*' to 'char*'
make[2]: *** [BI_EmergencyBuff.o] Error 1
make[1]: *** [INTERN] Error 2
make: *** [MPI] Error 2

This is, I guess, due to confusion between the wrappers:

$ /usr/local/openmpi-1.1.4_32bits/bin/mpic++
i686-apple-darwin8-g++-4.0.1: no input files

seems OK, but:

$ /usr/local/openmpi-1.1.4_32bits/bin/mpicc
i686-apple-darwin8-g++-4.0.1: no input files

is wrong -- mpicc is invoking the C++ compiler. Re-compiling Open MPI on an HFS+ filesystem, I get:

$ /usr/local/openmpi-1.1.4_32bits_hfs/bin/mpic++
i686-apple-darwin8-g++-4.0.1: no input files

and

$ /usr/local/openmpi-1.1.4_32bits_hfs/bin/mpicc
i686-apple-darwin8-gcc-4.0.1: no input files

which is correct. Then BLACS/ScaLAPACK and the others compile without trouble. (I have not tested execution yet!)

Is my explanation right? If yes, then although the documentation is excellent and the FAQ already well detailed, could you please add a caveat somewhere: Open MPI's configure is smarter than the average -- it is aware of filesystem case sensitivity.

Anyway, many thanks for your great job!

--
Dr. Christian SIMON, Maitre de Conferences
Laboratoire LI2C-UMR7612, Bat. F74, piece 757
Universite Pierre et Marie Curie, Tel: +33.1.44.27.32.65
Case 51, Fax: +33.1.44.27.32.28
4 Place Jussieu, 75252 Paris Cedex 05, France/Europe
Re: [OMPI users] Current working directory issue
OMPI uses the getcwd() library call to determine the pwd, whereas the shell's $PWD variable contains the shell's point of view of what the pwd is (I *suspect* that the pwd(1) shell command also uses getcwd(), but I don't know that for sure).

From the OS X getcwd(3) man page:

    The getcwd() function copies the absolute pathname of the current working directory into the memory referenced by buf and returns a pointer to buf. The size argument is the size, in bytes, of the array referenced by buf.

From the Linux getcwd(3) man page:

    The getcwd() function shall place an absolute pathname of the current working directory in the array pointed to by buf, and return buf. The pathname copied to the array shall contain no components that are symbolic links. ...

So this at least explains why you're seeing that behavior. I'm trying to think of a good reason why we're not checking PWD -- I think the reasons are as follows:

1. LAM/MPI has used getcwd() for about 10 years (I can't speak for the other MPIs, though).
2. You're the first guy to ask in that time (or the frequency of asking is so low that I've forgotten).

But these are pretty wimpy reasons. :-) I'll have to check with the other developers to see if there are any "gotchas" to using PWD if it's defined and contains a valid alias for the current directory.

On Mar 2, 2007, at 1:12 PM, Grismer, Matthew J Civ AFRL/VAAC wrote:

I'm using Open MPI on an Xserve cluster running OS X Server 10.4.8. The user directories exist on an Xserve RAID connected to the master node via Fibre Channel. So, on the master node the full pathname for the user directories is /Volumes/RAID1/users1. The compute nodes of the cluster automount the user directories via NFS, so the full path to the user directories appears on the nodes as /xhome/users1.

I created a hostfile listing all the compute nodes on the cluster, not including the master node. When I attempt to run a program in my home directory matt from the master node with

mpirun -hostfile nodes -np 4 program

it fails because it cannot find program. If I add the -wdir option and specify the directory as /xhome/users1/matt, everything is fine.

My question is this: how does Open MPI determine your working directory, and is there a way to fix this without the -wdir option? For example, if you look at the PWD environment variable, it is always /xhome/users1/matt, even on the master. If you use the pwd command, however, you get two different results on the master and the nodes. Thanks.

Matthew Grismer

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
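To make the distinction concrete, here is a minimal C sketch (not Open MPI's actual code) contrasting what getcwd() returns with what the shell-maintained PWD environment variable holds; on an automounted home directory the two can legitimately differ:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <limits.h>

int main(void)
{
    char buf[PATH_MAX];

    /* Physically resolved path, as the C library / kernel sees it. */
    if (getcwd(buf, sizeof(buf)) != NULL) {
        printf("getcwd():      %s\n", buf);
    }

    /* Logical path maintained by the shell; it may preserve symlink or
       automount components (e.g. /xhome/users1/matt) and may be unset
       or stale if the program was not started from a shell. */
    const char *pwd = getenv("PWD");
    printf("getenv(\"PWD\"): %s\n", pwd ? pwd : "(not set)");

    return 0;
}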
Re: [OMPI users] Fortran90 interfaces--problem?
On Mar 5, 2007, at 9:50 AM, Michael wrote:

I have discovered a problem with the Fortran90 interfaces for all types of communication when one uses derived datatypes (I'm currently using openmpi-1.3a1r13918 [for testing] and openmpi-1.1.2 [for compatibility with an HPC system]), for example

call MPI_RECV(tsk, 1, MPI_TASKSTATE, src, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ier)

where tsk is a Fortran 90 structure and MPI_TASKSTATE has been created by MPI_TYPE_CREATE_STRUCT. At the moment I can't imagine a way to modify the Open MPI interface generation to work around this besides switching to --with-mpi-f90-size=small.

This is unfortunately a known problem -- not just with Open MPI, but with the F90 bindings specification in MPI. :-( Since there's no F90 equivalent of C's (void*), there's no way to pass a variable of arbitrary type through the MPI F90 bindings. Hence, all we can do is define bindings for all the known types (i.e., various dimension sizes of the MPI types).

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
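For readers less familiar with derived datatypes, here is a small illustrative C sketch of MPI_Type_create_struct; the struct and its field names are invented for illustration, not Michael's actual MPI_TASKSTATE. It also shows why the C bindings avoid the problem Jeff describes: the buffer argument of MPI_Recv is a void*, so any user type can be passed.

#include <stddef.h>
#include <mpi.h>

/* Hypothetical C analogue of a derived type; not the poster's actual struct. */
struct task_state {
    int    id;
    int    flags;
    double weight;
};

static MPI_Datatype build_task_state_type(void)
{
    struct task_state dummy;
    int          blocklens[3] = { 1, 1, 1 };
    MPI_Datatype types[3]     = { MPI_INT, MPI_INT, MPI_DOUBLE };
    MPI_Aint     base, disps[3];
    MPI_Datatype newtype;

    /* Displacements are measured from the start of the structure. */
    MPI_Get_address(&dummy,        &base);
    MPI_Get_address(&dummy.id,     &disps[0]);
    MPI_Get_address(&dummy.flags,  &disps[1]);
    MPI_Get_address(&dummy.weight, &disps[2]);
    for (int i = 0; i < 3; i++) disps[i] -= base;

    MPI_Type_create_struct(3, blocklens, disps, types, &newtype);
    MPI_Type_commit(&newtype);
    return newtype;
}

/* Usage: because the receive buffer is a void*, any type works in C:
 *   struct task_state tsk;
 *   MPI_Recv(&tsk, 1, build_task_state_type(), src, 1,
 *            MPI_COMM_WORLD, MPI_STATUS_IGNORE);
 */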
Re: [OMPI users] BLACS tests fails on IPF
Sorry for the delay in replying -- we've been quite busy trying to get OMPI v1.2 out the door!

Are you sure that you built BLACS properly with Open MPI? Check this FAQ item:

http://www.open-mpi.org/faq/?category=mpi-apps#blacs

In particular, note that there are items in Bmake.inc that you need to set properly or BLACS won't work properly with Open MPI.

On Feb 20, 2007, at 4:25 AM, Kobotov, Alexander V wrote:

Hello all,

I built BLACS on Itanium using Intel compilers under Linux (2.6.9-34.EL), but it fails the default BLACS Fortran tests (xFbtest); the C tests (xCbtest) are OK. I've tried different configurations combining OpenMPI-1.1.2 or OpenMPI-1.1.4, ICC 9.1.038 or ICC 8.1.38, IFORT 9.1.33 or IFORT 8.1.34, but all results were the same. Open MPI is built using the 9.1 compilers. I've also tried the same using the em64t compiler on an Intel Xeon -- all tests passed. MPICH2 on IPF also works fine.

Is that an Open MPI bug? Maybe some workaround exists? Bmake.inc is attached. Below is the output I've got (don't pay attention to the BLACS warnings, they are normal for MPI):

===[ begin of: xFbtest output ]=
-bash-3.00$ mpirun -np 4 xFbtest_MPI-LINUX-0
BLACS WARNING 'No need to set message ID range due to MPI communicator.' from {-1,-1}, pnum=1, Contxt=-1, on line 18 of file 'blacs_set_.c'.
BLACS WARNING 'No need to set message ID range due to MPI communicator.' from {-1,-1}, pnum=3, Contxt=-1, on line 18 of file 'blacs_set_.c'.
BLACS WARNING 'No need to set message ID range due to MPI communicator.' from {-1,-1}, pnum=0, Contxt=-1, on line 18 of file 'blacs_set_.c'.
BLACS WARNING 'No need to set message ID range due to MPI communicator.' from {-1,-1}, pnum=2, Contxt=-1, on line 18 of file 'blacs_set_.c'.
[comp-pvfs-0-7.local:30119] *** An error occurred in MPI_Comm_group
[comp-pvfs-0-7.local:30118] *** An error occurred in MPI_Comm_group
[comp-pvfs-0-7.local:30118] *** on communicator MPI_COMM_WORLD
[comp-pvfs-0-7.local:30118] *** MPI_ERR_COMM: invalid communicator
[comp-pvfs-0-7.local:30119] *** on communicator MPI_COMM_WORLD
[comp-pvfs-0-7.local:30119] *** MPI_ERR_COMM: invalid communicator
[comp-pvfs-0-7.local:30119] *** MPI_ERRORS_ARE_FATAL (goodbye)
[comp-pvfs-0-7.local:30116] *** An error occurred in MPI_Comm_group
[comp-pvfs-0-7.local:30116] *** on communicator MPI_COMM_WORLD
[comp-pvfs-0-7.local:30118] *** MPI_ERRORS_ARE_FATAL (goodbye)
[comp-pvfs-0-7.local:30116] *** MPI_ERR_COMM: invalid communicator
[comp-pvfs-0-7.local:30116] *** MPI_ERRORS_ARE_FATAL (goodbye)
[comp-pvfs-0-7.local:30117] *** An error occurred in MPI_Comm_group
[comp-pvfs-0-7.local:30117] *** on communicator MPI_COMM_WORLD
[comp-pvfs-0-7.local:30117] *** MPI_ERR_COMM: invalid communicator
[comp-pvfs-0-7.local:30117] *** MPI_ERRORS_ARE_FATAL (goodbye)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
===[ end of: xFbtest output ]=

W.B.R.,
Kobotov Alexander

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
Re: [OMPI users] Fortran90 interfaces--problem?
On Tue, 2007-03-06 at 09:51 -0500, Jeff Squyres wrote:
> This is unfortunately a known problem -- not just with Open MPI, but
> with the F90 bindings specification in MPI. :-( Since there's no
> F90 equivalent of C's (void*), there's no way to pass a variable of
> arbitrary type through the MPI F90 bindings. Hence, all we can do is
> define bindings for all the known types (i.e., various dimension
> sizes of the MPI types).

What about the "Fortran 2003 ISO_C_BINDING" -- couldn't a C_LOC be used here? (I probably don't know what I'm talking about, but I just saw a reference to it.)

--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se
Phone: +46 90 7866134  Fax: +46 90 7866126
Mobile: +46 70 7716134  WWW: http://www.hpc2n.umu.se
Re: [OMPI users] performance question
On Feb 19, 2007, at 1:53 PM, Mark Kosmowski wrote:

> [snipped good description of cluster]

Sorry for the delay in replying -- traveling for a week-long OMPI developer meeting and trying to get v1.2 out the door has sucked up all of our time recently. :-(

> For just the one system with two processors: CPU time: 32:43, Elapsed time: 36:52, Peak memory: 373 MB.
> For just the cluster: CPU time: 12:23, Elapsed time: 20:30, Peak memory: 131 MB.
> Is this a typical scaling or should I be thinking about doing some sort of tweaking to the [network / ompi] system at some point?

Unfortunately, there is no "typical" scaling -- every application is different. I'm unfortunately unfamiliar with the application you mentioned (CPMD), so I don't know how it runs (memory footprint, communication pattern, etc.).

> The CPU time is scaling about right, but elapsed time is getting hammered -- with the low memory overhead it has to be a communications issue rather than a swap issue, right?

Possibly. But even with low memory usage, there can be other factors that create low CPU utilization (e.g., other I/O, such as disk), processor/memory hierarchy issues (are your motherboards NUMA?), etc.

> Would it be helpful to see a serial time point using the same executable (if so, I'd probably repeat all the runs with a smaller job -- I don't know that I want to spend half a week just for benchmarking)?

I'm not sure what you mean -- see *what* at a serial point in time?

> I have included the appropriate btl_tcp_if_include configuration so that OMPI only uses the gigabit ports (and not the internet connections that some of the machines have).

Gotcha. OMPI's TCP support is "ok" -- it's not great (we've spent much more time optimizing the low latency / high bandwidth interconnects). We do intend to go back and optimize TCP, but it's one of those time-and-monkeys issues (don't have enough time or monkeys to do it...). But it shouldn't be a major slowdown, particularly over a 12 or 32 hour run. Do you have any idea what the communication pattern is for CPMD? Does it send a little data, or a lot? How often does it communicate between the MPI processes, and how big are the messages? Etc.

> I am already planning on doing some benchmark comparisons to determine the effect of compiler / math library on speed.

Depending on the app, this can have a big impact.

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
Re: [OMPI users] Fortran90 interfaces--problem?
On Mar 6, 2007, at 10:23 AM, Åke Sandgren wrote:

> What about the "Fortran 2003 ISO_C_BINDING" -- couldn't a C_LOC be used here? (I probably don't know what I'm talking about, but I just saw a reference to it.)

FWIW, we wrote a paper about proposed Fortran 2003 bindings that use the ISO_C_BINDING stuff:

http://www.open-mpi.org/papers/euro-pvmmpi-2005-fortran/

We haven't spent many cycles implementing it, but it's on the long-term to-do list. Contributions would be great! ;-)

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
Re: [OMPI users] configure is too smart !
Sure, we can add a FAQ entry on that :).

At present, configure decides whether Open MPI will be installed on a case-sensitive file system or not based on what the build file system does. Which is far from perfect, but covers 99.9% of the cases. You happen to be the 0.1%, but we do have an option for you. You can specify --with-cs-fs or --without-cs-fs to state whether the installation filesystem is case sensitive or not (overriding the auto-detection).

Of course, I suppose I could add a sanity check during "make install" to ensure that the installation filesystem really is case sensitive if we expect it to be. mmm... I'll add that to the long-term to-do list. For now, I think a FAQ entry will do.

Brian

On Mar 6, 2007, at 2:24 AM, Christian Simon wrote:

[original message snipped -- see above]
Re: [OMPI users] MPI_Comm_Spawn
Hi Tim, getting back to you:

"What kind of system is it?"
=> The system is a Debian Sarge.

"How many nodes are you running on?"
=> There is no cluster configured, so I guess I work with no node environment.

"Have you been able to try a more recent version of Open MPI?"
=> Today I tried with version 1.1.4, but the results are not better. I tested 2 cases:

Test 1: with the same configuration options (./configure --enable-mpi-threads --enable-progress-threads --with-threads=posix --enable-smp-locks). The program stopped in MPI_Init_thread, in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0.

Test 2: with the default configuration options (./configure --prefix=/usr/local/Mpi/openmpi-1.1.4-noBproc-noThread). The program stopped on the "node allocation" after spawn n°31. Maybe the problem comes from the lack of node definition?

Thanks for your help. Below are the log files for the 2 tests.

/**TEST 1***/
GNU gdb 6.3-debian
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-linux"...Using host libthread_db library "/lib/tls/libthread_db.so.1".

(gdb) run
Starting program: /home/workspace/test_spaw1/src/spawn
[Thread debugging using libthread_db enabled]
[New Thread 1076646560 (LWP 5178)]
main***
main : Lancement MPI*
[New Thread 1085225904 (LWP 5181)]
[New Thread 1094495152 (LWP 5182)]

Program received signal SIGINT, Interrupt.
[Switching to Thread 1076646560 (LWP 5178)]
0x4018a436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
(gdb) where
#0  0x4018a436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#1  0x40187893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
#2  0xb508 in ?? ()
#3  0x4000bcd0 in _dl_map_object_deps () from /lib/ld-linux.so.2
#4  0x40b9f8cb in mca_btl_tcp_component_create_listen () from /usr/local/Mpi/openmpi-1.1.4-noBproc/lib/openmpi/mca_btl_tcp.so
#5  0x40b9f8cb in mca_btl_tcp_component_create_listen () from /usr/local/Mpi/openmpi-1.1.4-noBproc/lib/openmpi/mca_btl_tcp.so
#6  0x40b9eef4 in mca_btl_tcp_component_init () from /usr/local/Mpi/openmpi-1.1.4-noBproc/lib/openmpi/mca_btl_tcp.so
#7  0x4008c652 in mca_btl_base_select () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#8  0x40b8dd28 in mca_bml_r2_component_init () from /usr/local/Mpi/openmpi-1.1.4-noBproc/lib/openmpi/mca_bml_r2.so
#9  0x4008bf54 in mca_bml_base_init () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#10 0x40b7e5c9 in mca_pml_ob1_component_init () from /usr/local/Mpi/openmpi-1.1.4-noBproc/lib/openmpi/mca_pml_ob1.so
#11 0x40094192 in mca_pml_base_select () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#12 0x4005742c in ompi_mpi_init () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#13 0x4007c182 in PMPI_Init_thread () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#14 0x080489f3 in main (argc=1, argv=0xb8a4) at spawn6.c:33

/**TEST 2***/
GNU gdb 6.3-debian
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-linux"...Using host libthread_db library "/lib/tls/libthread_db.so.1".

(gdb) run -np 1 --host myhost spawn6
Starting program: /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/bin/mpirun -np 1 --host myhost spawn6
[Thread debugging using libthread_db enabled]
[New Thread 1076121728 (LWP 4022)]
main***
main : Lancement MPI*
Exe : Lance
Exe: lRankExe = 1 lRankMain = 0
1 main***MPI_Comm_spawn return : 0
1 main***Rang main : 0 Rang exe : 1
Exe : Lance
Exe: Fin.
Exe: lRankExe = 1 lRankMain = 0
2 main***MPI_Comm_spawn return : 0
2 main***Rang main : 0 Rang exe : 1
Exe : Lance
Exe: Fin.
...
Exe: lRankExe = 1 lRankMain = 0
30 main***MPI_Comm_spawn return : 0
30 main***Rang main : 0 Rang exe : 1
Exe : Lance
Exe: Fin.
Exe: lRankExe = 1 lRankMain = 0
31 main***MPI_Comm_spawn return : 0
31 main***Rang main : 0 Rang exe : 1

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1076121728 (LWP 4022)]
0x4018833b in strlen () from /lib/tls/libc.so.6
(gdb) where
#0  0x4018833b in strlen () from /lib/tls/libc.so.6
#1  0x40297c5e in orte_gpr_replica_create_itag () from /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi/mca_gpr_replica.so
#2  0x4029d2df in orte_gpr_replica_put_fn () from /usr/local/Mpi/openmpi-1.1.4-noBproc-noThread/lib/openmpi
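For reference, a hedged C sketch of the kind of repeated-spawn loop the test program appears to exercise (the executable name, loop count, and output format are illustrative guesses, not the poster's actual spawn6.c):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Spawn a child executable repeatedly; each child runs, reports
       its rank across the intercommunicator, and exits.  The poster's
       program hangs or crashes around the 31st iteration. */
    for (int i = 1; i <= 40; i++) {
        MPI_Comm intercomm;
        int rc = MPI_Comm_spawn("exe", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                                0, MPI_COMM_SELF, &intercomm,
                                MPI_ERRCODES_IGNORE);
        printf("%d main: MPI_Comm_spawn returned %d\n", i, rc);

        /* Release the intercommunicator so the child can finish. */
        MPI_Comm_disconnect(&intercomm);
    }

    MPI_Finalize();
    return 0;
}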
[OMPI users] MPI_PACK very slow?
I have a section of code where I need to send 8 separate integers via BCAST. Initially I was just putting the 8 integers into an array and then sending that array. I just tried using MPI_PACK on those 8 integers and I'm seeing a massive slowdown in the code; I have a lot of other communication and this section is used only 5 times.

I went from 140 seconds to 277 seconds on 16 processors using TCP via a dual gigabit ethernet setup (I'm the only user working on this system today). This was run with OpenMPI 1.1.2 to maintain compatibility with a major HPC site.

Is there a known problem with MPI_PACK/UNPACK in OpenMPI?

Michael
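For context, a hedged C sketch of the two approaches being compared -- broadcasting the 8 integers as a plain array versus packing them with MPI_Pack and unpacking with MPI_Unpack. The poster's code is Fortran; this only illustrates the calls involved, not his program:

#include <mpi.h>

/* Approach 1: put the 8 integers in a contiguous array and broadcast it. */
static void bcast_plain(int vals[8], int root, MPI_Comm comm)
{
    MPI_Bcast(vals, 8, MPI_INT, root, comm);
}

/* Approach 2: MPI_Pack the 8 integers into a byte buffer, broadcast the
   buffer, and MPI_Unpack on the receivers.  Functionally equivalent,
   but adds pack/unpack overhead for such a small message. */
static void bcast_packed(int vals[8], int root, MPI_Comm comm)
{
    char buf[256];
    int  pos = 0, rank;

    MPI_Comm_rank(comm, &rank);
    if (rank == root) {
        MPI_Pack(vals, 8, MPI_INT, buf, sizeof(buf), &pos, comm);
    }
    MPI_Bcast(buf, sizeof(buf), MPI_PACKED, root, comm);
    if (rank != root) {
        pos = 0;
        MPI_Unpack(buf, sizeof(buf), &pos, vals, 8, MPI_INT, comm);
    }
}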
Re: [OMPI users] MPI_PACK very slow?
I doubt this comes from MPI_Pack/MPI_Unpack. The difference is 137 seconds for 5 calls. That's basically 27 seconds per call to MPI_Pack, for packing 8 integers. I know the code and I'm positive there is no way to spend 27 seconds over there.

Can you run your application under valgrind with the callgrind tool? This will give you some basic information about where the time is spent, which will give us additional information about where to look.

Thanks,
george.

On Mar 6, 2007, at 11:26 AM, Michael wrote:

[original message snipped -- see above]

"Half of what I say is meaningless; but I say it so that the other half may reach you" Kahlil Gibran
Re: [OMPI users] configure is too smart !
Brian Barrett wrote:
> specify --with-cs-fs or --without-cs-fs

Unbelievable! Thanks again.

--
Christian SIMON
Re: [OMPI users] MPI_Comm_Spawn
I believe I know what is happening here. My availability in the next week is pretty limited due to a family emergency, but I'll take a look when I get back. In brief, this is a resource starvation issue where the system thinks your node is unable to support any further processes, and so it blocks.

On a separate note, I never use threaded configurations due to the lack of any real thread-safety review or testing on Open MPI to date (per Tim's earlier comment). My "standard" configuration for development and testing is with --disable-progress-threads --without-threads.

I'll post something back to the list when I get it resolved.

Thanks
Ralph

On 3/6/07 9:00 AM, "rozzen.vinc...@fr.thalesgroup.com" wrote:

> [original message and gdb output snipped -- see above]
Re: [OMPI users] MPI_PACK very slow?
I discovered I made a minor change that cost me dearly (I had thought I had tested this single change, but perhaps I didn't track the timing data closely).

MPI_Type_create_struct performs well only when all the data is contiguous in memory (at least for OpenMPI 1.1.2). Is this normal or expected?

In my case the program has an f90 structure with 11 integers, 2 logicals, and five 50-element integer arrays, but at the first stage of the program only the first element of those arrays is used. Yet using MPI_Type_create_struct it is more efficient to send the entire 263 words of contiguous memory (58 seconds) than to try to send only the 18 words of non-contiguous memory (64 seconds). At the second stage it's 33 words, and at that stage it becomes 47 seconds vs. 163 seconds, an extra 116 seconds, which dominates the push of my overall wall clock time from 130 to 278 seconds. The third stage increases from 13 seconds to 37 seconds.

Because I need to send this block of data back and forth a lot, I was hoping to find a way to speed up the transfer of this odd block of data and a couple of other variables. I may try PACK and UNPACK on the structure, but calling those lots of times can't be more efficient. Previously I was equivalencing the structure to an integer array and sending the integer array as a quick and dirty solution to get started, and it worked -- not completely portable, no doubt.

Michael

ps. I don't currently have valgrind installed on this cluster and valgrind is not part of the Debian Linux 3.1r3 distribution. Without any experience with valgrind I'm not sure how useful it will be with an MPI program of 500+ subroutines and 50K+ lines running on 16 processes. It took us a bit to get profiling working for the OpenMP version of this code.

On Mar 6, 2007, at 11:28 AM, George Bosilca wrote:

[earlier messages in the thread snipped -- see above]
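A hedged C sketch of the trade-off described above: sending the whole structure as one contiguous run of integers versus building a struct datatype that picks out only the first element of each 50-element array. The C layout is invented for illustration (it mirrors the 263-word / 18-word counts mentioned), not the actual Fortran structure:

#include <stddef.h>
#include <mpi.h>

/* Invented C analogue: a few scalars plus fixed-size arrays, of which
   only the first element is needed early in the run. */
struct block {
    int scalars[11];
    int flags[2];      /* stands in for the two logicals */
    int arrays[5][50];
};

/* Option A: treat the whole thing as one contiguous run of ints
   (263 words).  More data on the wire, but a single contiguous copy. */
static void send_whole(struct block *b, int dest, MPI_Comm comm)
{
    MPI_Send(b, (int)(sizeof(*b) / sizeof(int)), MPI_INT, dest, 0, comm);
}

/* Option B: a struct datatype that sends the scalars, the flags, and
   only arrays[i][0] for each i (18 words) -- fewer words, but seven
   non-contiguous pieces that the datatype engine must gather. */
static MPI_Datatype build_sparse_type(void)
{
    int          blocklens[7] = { 11, 2, 1, 1, 1, 1, 1 };
    MPI_Aint     disps[7];
    MPI_Datatype types[7];
    MPI_Datatype newtype;

    disps[0] = offsetof(struct block, scalars);
    disps[1] = offsetof(struct block, flags);
    for (int i = 0; i < 5; i++) {
        disps[2 + i] = offsetof(struct block, arrays)
                     + (MPI_Aint)i * 50 * sizeof(int);
    }
    for (int i = 0; i < 7; i++) types[i] = MPI_INT;

    MPI_Type_create_struct(7, blocklens, disps, types, &newtype);
    MPI_Type_commit(&newtype);
    return newtype;   /* use: MPI_Send(b, 1, type, dest, 0, comm); */
}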
Re: [OMPI users] MPI_PACK very slow?
On Mar 6, 2007, at 4:51 PM, Michael wrote:

> MPI_Type_create_struct performs well only when all the data is contiguous in memory (at least for OpenMPI 1.1.2).

There are always benefits to sending contiguous data, especially when the message is small. Packing and unpacking are costly operations; even a highly optimized version cannot beat a user's hand-packing routine when the data is small. Increase the size of your message to over 64K and you will see another story.

> In my case the program has an f90 structure with 11 integers, 2 logicals, and five 50-element integer arrays. [...] I may try PACK and UNPACK on the structure, but calling those lots of times can't be more efficient.

Is there any way I can get access to your software? Or at least the datatype-related code?

> ps. I don't currently have valgrind installed on this cluster [...] Without any experience with valgrind I'm not sure how useful it will be with an MPI program of 500+ subroutines and 50K+ lines running on 16 processes.

It will be seamless. What I'm doing is the following: instead of

  mpirun -np 16 my_program my_args

I'm using

  mpirun -np 16 valgrind --tool=callgrind my_program my_args

Once the execution is completed (which will usually take about 20 times longer than without valgrind), I gather all the resulting files in a common location (if not already over NFS) and analyze them with kcachegrind (which comes by default with KDE).

george.

"Half of what I say is meaningless; but I say it so that the other half may reach you" Kahlil Gibran