Re: [OMPI users] ORTE errors
On Apr 10, 2006, at 6:31 PM, Ralph Castain wrote:

> Was this the only output you received? If so, then it looks like your parent process never gets to spawn and bcast - you should have seen your write statements first, yes?
>
> Ralph

I only listed the ORTE errors; I do get the correct output. The complete output follows:

parent: 0 of 1
parent: How many processes total? 2
parent: Calling MPI_Comm_spawn to start 1 subprocesses.
parent: Calling MPI_BCAST with btest = 17 . child = 3
child 0 of 1: Parent 3
[host:00258] [0,0,0] ORTE_ERROR_LOG: Not found in file base/soh_base_get_proc_soh.c at line 80
[host:00258] [0,0,0] ORTE_ERROR_LOG: Not found in file base/oob_base_xcast.c at line 108
[host:00258] [0,0,0] ORTE_ERROR_LOG: Not found in file base/rmgr_base_stage_gate.c at line 276
[host:00258] [0,0,0] ORTE_ERROR_LOG: Not found in file base/soh_base_get_proc_soh.c at line 80
[host:00258] [0,0,0] ORTE_ERROR_LOG: Not found in file base/oob_base_xcast.c at line 108
[host:00258] [0,0,0] ORTE_ERROR_LOG: Not found in file base/rmgr_base_stage_gate.c at line 276
child 0 of 1: Receiving 17 from parent
Maximum user memory allocated: 0

Michael

Michael Kluskens wrote:

> The ORTE errors again; these are new and different errors. Tested as of OpenMPI 1.1a1r9596.
>
> [host:10198] [0,0,0] ORTE_ERROR_LOG: Not found in file base/soh_base_get_proc_soh.c at line 80
> [host:10198] [0,0,0] ORTE_ERROR_LOG: Not found in file base/oob_base_xcast.c at line 108
> [host:10198] [0,0,0] ORTE_ERROR_LOG: Not found in file base/rmgr_base_stage_gate.c at line 276
> [host:10198] [0,0,0] ORTE_ERROR_LOG: Not found in file base/soh_base_get_proc_soh.c at line 80
> [host:10198] [0,0,0] ORTE_ERROR_LOG: Not found in file base/oob_base_xcast.c at line 108
> [host:10198] [0,0,0] ORTE_ERROR_LOG: Not found in file base/rmgr_base_stage_gate.c at line 276
>
> This test was run using OpenMPI 1.1 built on OS X 10.4.6 with g95 from 4/9/06. Past experience was that the ORTE errors were independent of OS and compiler. The attached sample codes generated these errors. They use MPI_COMM_SPAWN and MPI_BCAST (most vendors' MPIs can't run this test case).
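The attached Fortran test itself is not reproduced in the archive, but the pattern it exercises is the standard spawn-and-broadcast idiom. A minimal, self-contained C sketch of that idiom (the program name and process counts here are illustrative, not taken from the attached code):

/* Minimal sketch (C, not the attached Fortran) of the spawn-and-broadcast
 * pattern the test exercises: the parent spawns one child of the same binary,
 * then broadcasts an integer to it over the intercommunicator. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent, child;
    int rank, size, btest, errcode;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Parent: spawn one copy of this same executable. */
        printf("parent: %d of %d\n", rank, size);
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &child, &errcode);

        /* Intercommunicator broadcast: the root in the parent group passes
         * MPI_ROOT; any other parent ranks would pass MPI_PROC_NULL. */
        btest = 17;
        printf("parent: Calling MPI_Bcast with btest = %d\n", btest);
        MPI_Bcast(&btest, 1, MPI_INT,
                  (rank == 0) ? MPI_ROOT : MPI_PROC_NULL, child);
    } else {
        /* Child: receive the broadcast from rank 0 of the parent group. */
        MPI_Bcast(&btest, 1, MPI_INT, 0, parent);
        printf("child %d of %d: Receiving %d from parent\n", rank, size, btest);
    }

    MPI_Finalize();
    return 0;
}

Run with something like mpirun -np 1 ./spawn_bcast, this should print parent/child lines analogous to the output above; the ORTE_ERROR_LOG lines in the report come from the runtime layer (ORTE), not from the test program's own write statements.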
[OMPI users] Problem running code with OpenMPI-1.0.1
Good morning,

I'm trying to run one of the NAS Parallel Benchmarks (bt) with OpenMPI-1.0.1 that was built with PGI 6.0. The code never starts (at least I don't see any output) until I kill it. Then I get the following message:

[0,1,2][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,4][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,8][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
mpirun: killing job...

Any ideas on this one?

Thanks!

Jeff
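For reference, errno 113 on Linux is EHOSTUNREACH ("No route to host"), i.e. the connect() calls between the ranks are failing at the network level rather than inside the benchmark. A two-line check confirms the mapping on a given system:

/* Map the errno value from the btl_tcp_endpoint messages to a string.
 * On Linux, 113 is EHOSTUNREACH ("No route to host"). */
#include <stdio.h>
#include <string.h>

int main(void)
{
    printf("errno 113: %s\n", strerror(113));
    return 0;
}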
Re: [OMPI users] ORTE errors
Thanks Michael - we're looking into it and will get back to you shortly.

Ralph
Re: [OMPI users] Building 32-bit OpenMPI package for 64-bit Opteron platform
I suspect that to get this to work for bproc, we will have to build mpirun as 64-bit and the library as 32-bit. That's because a 32-bit compiled mpirun calls functions in the 32-bit /usr/lib/libbproc.so, which don't appear to function when the system is booted 64-bit. Of course that would mean we need heterogeneous support to run on a single homogeneous system! Will this work on the 1.0 branch?

An alternative worth thinking about is to bypass the library calls and start processes using a system() call to invoke the bpsh command. This is a 64-bit executable linked with /usr/lib64/libbproc.so which successfully launches both 32- and 64-bit executables. I'm currently trying to solve the same issue for LA-MPI :(

David

On Apr 10, 2006, at 9:18 AM, Brian Barrett wrote:

> On Apr 10, 2006, at 11:07 AM, David Gunter wrote:
>
>> (flashc 105%) mpiexec -n 4 ./send4
>> [flashc.lanl.gov:09921] mca: base: component_find: unable to open: /lib/libc.so.6: version `GLIBC_2.3.4' not found (required by /net/scratch1/dog/flash64/openmpi/openmpi-1.0.2-32b/lib/openmpi/mca_paffinity_linux.so) (ignored)
>> [flashc.lanl.gov:09921] mca: base: component_find: unable to open: libbproc.so.4: cannot open shared object file: No such file or directory (ignored)
>> [flashc.lanl.gov:09921] mca: base: component_find: unable to open: libbproc.so.4: cannot open shared object file: No such file or directory (ignored)
>> [flashc.lanl.gov:09921] mca: base: component_find: unable to open: libbproc.so.4: cannot open shared object file: No such file or directory (ignored)
>> [flashc.lanl.gov:09921] mca: base: component_find: unable to open: libbproc.so.4: cannot open shared object file: No such file or directory (ignored)
>> [flashc.lanl.gov:09921] mca: base: component_find: unable to open: libbproc.so.4: cannot open shared object file: No such file or directory (ignored)
>> mpiexec: relocation error: /net/scratch1/dog/flash64/openmpi/openmpi-1.0.2-32b/lib/openmpi/mca_soh_bproc.so: undefined symbol: bproc_nodelist
>>
>> The problem now looks like /lib/libc.so.6 is no longer available. Indeed, it is available on the compiler nodes but it cannot be found on the backend nodes - whoops!
>
> Well, that's interesting. Is this on a bproc platform? If so, you might be best off configuring with either --enable-static or --disable-dlopen. Either one will prevent components from loading, which doesn't seem to always work well. Also, it looks like at least one of the components has a different libc it's linked against than the others. This makes me think that perhaps you have some old components from a previous build in your tree. You might want to completely remove your installation prefix (or lib/openmpi in your installation prefix) and run make install again. Are you no longer seeing the errors about epoll?
>
>> The other problem is that the -m32 flag didn't make it into mpicc for some reason.
>
> This is expected behavior. There are going to be more and more cases where Open MPI provides one wrapper compiler that does the right thing whether the user passes in -m32 / -m64 (or any of the vendor options for doing the same thing). So it will be increasingly impossible for us to know what to add (but as Jeff said, you can tell configure to always add -m32 to the wrapper compilers if you want).
>
> Brian
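A quick way to check, on a given node, whether a particular component and its dependencies (the GLIBC_2.3.4 symbol version, libbproc.so.4, and so on) can actually be resolved is to try loading it with dlopen() and print the loader's error. A small sketch; the default path below is only a placeholder, so pass the real component path on the command line:

/* Sketch: try to dlopen a shared object and report why it fails, if it does.
 * The built-in path is only an example; pass the component you care about. */
#include <dlfcn.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    const char *path = (argc > 1) ? argv[1]
                                  : "/usr/lib/openmpi/mca_soh_bproc.so"; /* example path */
    void *handle = dlopen(path, RTLD_NOW | RTLD_GLOBAL);

    if (handle == NULL) {
        fprintf(stderr, "dlopen(%s) failed: %s\n", path, dlerror());
        return 1;
    }
    printf("dlopen(%s) succeeded\n", path);
    dlclose(handle);
    return 0;
}

Built with cc -o dlcheck dlcheck.c -ldl and run via bpsh on a backend node, this tells you whether the failure there is a missing library or a libc mismatch.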
Re: [OMPI users] Building 32-bit OpenMPI package for 64-bit Opteron platform
Heterogeneous operations are not supported on 1.0 - they are, however, on the new 1.1. :-)

Also, remember that you must configure for static operation for bproc - use the configuration options "--enable-static --disable-shared". Our current bproc launcher *really* dislikes shared libraries ;-)

Ralph
Re: [OMPI users] Building 32-bit OpenMPI package for 64-bit Opteron platform
Unfortunately static-only will create binaries that will overwhelm our machines. This is not a realistic option.

-david

On Apr 11, 2006, at 1:04 PM, Ralph Castain wrote:

> Also, remember that you must configure for static operation for bproc - use the configuration options "--enable-static --disable-shared". Our current bproc launcher *really* dislikes shared libraries ;-)
[OMPI users] Intel EM64T Compiler error on Opteron
I am trying to build OpenMPI v1.0.2 (stable) on an Opteron using the v8.1 Intel EM64T compilers:

Intel(R) C Compiler for Intel(R) EM64T-based applications, Version 8.1 Build 20041123 Package ID: l_cce_pc_8.1.024
Intel(R) Fortran Compiler for Intel(R) EM64T-based applications, Version 8.1 Build 20041123 Package ID: l_fce_pc_8.1.024

The compiler core dumps during make with:

icc -DHAVE_CONFIG_H -I. -I. -I../../include -I../../include -DOMPI_PKGDATADIR=\"/scratch/merz//share/openmpi\" -I../../include -I../.. -I../.. -I../../include -I../../opal -I../../orte -I../../ompi -O3 -DNDEBUG -fno-strict-aliasing -pthread -MT cmd_line.lo -MD -MP -MF .deps/cmd_line.Tpo -c cmd_line.c -fPIC -DPIC -o .libs/cmd_line.o
icc: error: /opt/intel_cce_80/bin/mcpcom: core dumped
icc: error: Fatal error in /opt/intel_cce_80/bin/mcpcom, terminated by unknown signal(139)

I couldn't find any other threads in the mailing list concerning usage of the Intel EM64T compilers - has anyone successfully compiled OpenMPI using this combination? It also occurs on the Athlon 64 processor. Logs attached (openmpi_1.0.2_logs.tar.bz2).

Thanks,
Hugh
Re: [OMPI users] Building 32-bit OpenMPI package for 64-bit Opteron platform
Unfortunately, that's all that is available at the moment. Future releases (post 1.1) may get around this problem.

The issue is that the bproc launcher actually does a binary memory image of the process, then replicates that across all the nodes. This is how we were told to implement it originally by the BProc folks. However, that means that shared libraries have problems, for obvious reasons. We have to reimplement the bproc launcher using a different approach - that will take a little time.

Ralph

David Gunter wrote:

> Unfortunately static-only will create binaries that will overwhelm our machines. This is not a realistic option.
>
> -david
Re: [OMPI users] Building 32-bit OpenMPI package for 64-bit Opteron platform
Ralph/all,

Ralph Castain wrote:

> Unfortunately, that's all that is available at the moment. Future releases (post 1.1) may get around this problem. The issue is that the bproc launcher actually does a binary memory image of the process, then replicates that across all the nodes. This is how we were told to implement it originally by the BProc folks. However, that means that shared libraries have problems, for obvious reasons. We have to reimplement the bproc launcher using a different approach - will take a little time.

The current launcher does work with shared libraries, if they are available on the backend nodes. So it's more convenient if they are linked statically, but not a requirement.

Tim
Re: [OMPI users] Building 32-bit OpenMPI package for 64-bit Opteron platform
Thanks Ralph. Was there a reason this functionality wasn't in from the start, then? LA-MPI works under bproc using shared libraries. I know the BProc folks like to kill the notion of shared libs, but they are a fact of life that we can't do without. Just my $0.02.

-david
Re: [OMPI users] Building 32-bit OpenMPI package for 64-bit Opteron platform
Nothing nefarious - just some bad advice. Fortunately, as my other note indicated, Tim and company already fixed this by revising the launcher. Sorry for the confusion.

Ralph
Re: [OMPI users] Intel EM64T Compiler error on Opteron
On Tue, 11 Apr 2006 13:19:43 -0600, Hugh Merz wrote:

> I couldn't find any other threads in the mailing list concerning usage of the Intel EM64T compilers - has anyone successfully compiled OpenMPI using this combination? It also occurs on the Athlon 64 processor. Logs attached.
>
> Thanks,
> Hugh

I have compiled Open MPI (on an Opteron) with the Intel 9 EM64T compilers; it's been a while since I've used the 8.1 series, but I'll give it a shot with Intel 8.1 and tell you what happens.
Re: [OMPI users] Intel EM64T Compiler error on Opteron
On Tue, 11 Apr 2006 13:48:43 -0600, Troy Telford wrote:

> I have compiled Open MPI (on an Opteron) with the Intel 9 EM64T compilers; it's been a while since I've used the 8.1 series, but I'll give it a shot with Intel 8.1 and tell you what happens.

I can confirm that I'm able to compile Open MPI 1.0.2 on my systems. Other info:

* Opteron 244 CPUs
* SLES 9 SP3 x86_64
* Intel(R) C Compiler for Intel(R) EM64T-based applications, Version 8.1 Build 20050628
* Intel(R) Fortran Compiler for Intel(R) EM64T-based applications, Version 8.1 Build 20050517

--
Troy Telford
Linux Networx
ttelf...@linuxnetworx.com
(801) 649-1356
Re: [OMPI users] Problem running code with OpenMPI-1.0.1
Do you, perchance, have multiple TCP interfaces on at least one of the nodes you're running on? We had a mistake in the TCP network matching code during startup -- this is fixed in v1.0.2. Can you give that a whirl?
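To check which IPv4 interfaces a given node actually exposes (i.e., whether it really has multiple TCP interfaces), here is a minimal sketch using getifaddrs(), assuming a Linux/glibc system:

/* List the IPv4 interfaces on this node, e.g. to check whether it has
 * more than one TCP-capable interface. */
#include <sys/types.h>
#include <sys/socket.h>
#include <ifaddrs.h>
#include <stdio.h>

int main(void)
{
    struct ifaddrs *ifap, *ifa;

    if (getifaddrs(&ifap) != 0) {
        perror("getifaddrs");
        return 1;
    }
    for (ifa = ifap; ifa != NULL; ifa = ifa->ifa_next) {
        if (ifa->ifa_addr != NULL && ifa->ifa_addr->sa_family == AF_INET)
            printf("IPv4 interface: %s\n", ifa->ifa_name);
    }
    freeifaddrs(ifap);
    return 0;
}

On versions that support it, the MCA parameter btl_tcp_if_include can also restrict the TCP BTL to a particular interface (e.g. mpirun --mca btl_tcp_if_include eth0 ..., where eth0 is just an example name), though upgrading to 1.0.2 is the fix suggested above.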
Re: [OMPI users] Problem running code with OpenMPI-1.0.1
Well, yes these nodes do have multiple TCP interfaces. I'll give 1.0.2 a whirl :)

Thanks!

Jeff