Re: [OMPI users] segmentation fault with openmpi-2.0.2rc2 on Linux
Hi Siegmar,

I've attempted to reproduce this using gnu compilers and the version of
this test program(s) you posted earlier in 2016, but I am unable to
reproduce the problem.

Could you double check that the slave program can be successfully run
when launched directly by mpirun/mpiexec? It might also help to use
--mca btl_base_verbose 10 when running the slave program standalone.

Thanks,

Howard


2016-12-28 7:06 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:

> Hi,
>
> I have installed openmpi-2.0.2rc2 on my "SUSE Linux Enterprise
> Server 12 (x86_64)" with Sun C 5.14 beta and gcc-6.2.0. Unfortunately,
> I get an error when I run one of my programs. Everything works as
> expected with openmpi-master-201612232109-67a08e8. The program
> gets a timeout with openmpi-v2.x-201612232156-5ce66b0.
>
> loki spawn 144 ompi_info | grep -e "Open MPI:" -e "C compiler absolute:"
>   Open MPI: 2.0.2rc2
>   C compiler absolute: /opt/solstudio12.5b/bin/cc
>
> loki spawn 145 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
>
> Parent process 0 running on loki
>   I create 4 slave processes
>
> --
> A system call failed during shared memory initialization that should
> not have. It is likely that your MPI job will now either abort or
> experience performance degradation.
>
>   Local host:  loki
>   System call: open(2)
>   Error:       No such file or directory (errno 2)
> --
> [loki:17855] *** Process received signal ***
> [loki:17855] Signal: Segmentation fault (11)
> [loki:17855] Signal code: Address not mapped (1)
> [loki:17855] Failing at address: 0x8
> [loki:17855] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f053d0e9870]
> [loki:17855] [ 1] /usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_pml_ob1.so(+0x990ae)[0x7f05325060ae]
> [loki:17855] [ 2] /usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_req_start+0x196)[0x7f053250cb16]
> [loki:17855] [ 3] /usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_irecv+0x2f8)[0x7f05324bd3d8]
> [loki:17855] [ 4] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_coll_base_bcast_intra_generic+0x34c)[0x7f053e52300c]
> [loki:17855] [ 5] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_coll_base_bcast_intra_binomial+0x1ed)[0x7f053e523eed]
> [loki:17855] [ 6] /usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x1a3)[0x7f0531ea7c03]
> [loki:17855] [ 7] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_dpm_connect_accept+0xab8)[0x7f053d484f38]
> [loki:17855] [ 8] [loki:17845] [[55817,0],0] ORTE_ERROR_LOG: Not found in file ../../openmpi-2.0.2rc2/orte/orted/pmix/pmix_server_fence.c at line 186
> /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_dpm_dyn_init+0xcd)[0x7f053d48aeed]
> [loki:17855] [ 9] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_mpi_init+0xf93)[0x7f053d53d5f3]
> [loki:17855] [10] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(PMPI_Init+0x8d)[0x7f053db209cd]
> [loki:17855] [11] spawn_slave[0x4009cf]
> [loki:17855] [12] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f053cd53b25]
> [loki:17855] [13] spawn_slave[0x400892]
> [loki:17855] *** End of error message ***
> [loki:17845] [[55817,0],0] ORTE_ERROR_LOG: Not found in file
> ../../openmpi-2.0.2rc2/orte/orted/pmix/pmix_server_fence.c at line 186
> --
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.
> This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[55817,2],0]) is on host: loki
>   Process 2 ([[55817,2],1]) is on host: unknown!
>   BTLs attempted: self sm tcp vader
>
> Your MPI job is now going to abort; sorry.
> --
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   ompi_dpm_dyn_init() failed
>   --> Returned "Unreachable" (-12) instead of "Success" (0)
> --
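[Editor's note: Siegmar's spawn_master.c is not attached to this thread. The sketch below is only a guess at the general shape of such a test, based on the output quoted above ("Parent process 0 running on loki", "I create 4 slave processes") and on the fact that the crash occurs inside the spawned slave's MPI_Init(); the file name spawn_master.c, the NUM_SLAVES constant, and the exact print format are assumptions, not Siegmar's actual code.]

/* spawn_master.c -- minimal sketch of a spawn test of this shape.
 * Rank 0 prints where it runs and spawns NUM_SLAVES copies of the
 * "spawn_slave" executable via MPI_Comm_spawn.                        */
#include <stdio.h>
#include <mpi.h>

#define NUM_SLAVES 4        /* matches "I create 4 slave processes"     */

int main(int argc, char *argv[])
{
    int      rank, len;
    char     host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm intercomm;     /* intercommunicator master <-> slaves      */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    if (rank == 0) {
        printf("Parent process %d running on %s\n", rank, host);
        printf("  I create %d slave processes\n", NUM_SLAVES);
    }

    /* the segfault in the trace above is raised while the spawned
     * slaves execute MPI_Init(), which synchronizes with this call     */
    MPI_Comm_spawn("spawn_slave", MPI_ARGV_NULL, NUM_SLAVES,
                   MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm,
                   MPI_ERRCODES_IGNORE);

    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}

Built with the wrapper shown later in the thread (e.g. "mpicc spawn_master.c -o spawn_master"), a program of this shape reproduces the scenario in the mpiexec command quoted above.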
Re: [OMPI users] segmentation fault with openmpi-2.0.2rc2 on Linux
Hi Howard,

thank you very much for trying to solve my problem. I haven't changed
the programs since 2013, so you are using the correct version. The
program works as expected with the master trunk, as you can see in the
output from my last mail quoted at the bottom of this email. The slave
program works when I launch it directly.

loki spawn 122 mpicc --showme
cc -I/usr/local/openmpi-2.0.2_64_cc/include -m64 -mt -mt -Wl,-rpath -Wl,/usr/local/openmpi-2.0.2_64_cc/lib64 -Wl,--enable-new-dtags -L/usr/local/openmpi-2.0.2_64_cc/lib64 -lmpi

loki spawn 123 ompi_info | grep -e "Open MPI:" -e "C compiler absolute:"
  Open MPI: 2.0.2rc2
  C compiler absolute: /opt/solstudio12.5b/bin/cc

loki spawn 124 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 --mca btl_base_verbose 10 spawn_slave
[loki:05572] mca: base: components_register: registering framework btl components
[loki:05572] mca: base: components_register: found loaded component self
[loki:05572] mca: base: components_register: component self register function successful
[loki:05572] mca: base: components_register: found loaded component sm
[loki:05572] mca: base: components_register: component sm register function successful
[loki:05572] mca: base: components_register: found loaded component tcp
[loki:05572] mca: base: components_register: component tcp register function successful
[loki:05572] mca: base: components_register: found loaded component vader
[loki:05572] mca: base: components_register: component vader register function successful
[loki:05572] mca: base: components_open: opening btl components
[loki:05572] mca: base: components_open: found loaded component self
[loki:05572] mca: base: components_open: component self open function successful
[loki:05572] mca: base: components_open: found loaded component sm
[loki:05572] mca: base: components_open: component sm open function successful
[loki:05572] mca: base: components_open: found loaded component tcp
[loki:05572] mca: base: components_open: component tcp open function successful
[loki:05572] mca: base: components_open: found loaded component vader
[loki:05572] mca: base: components_open: component vader open function successful
[loki:05572] select: initializing btl component self
[loki:05572] select: init of component self returned success
[loki:05572] select: initializing btl component sm
[loki:05572] select: init of component sm returned failure
[loki:05572] mca: base: close: component sm closed
[loki:05572] mca: base: close: unloading component sm
[loki:05572] select: initializing btl component tcp
[loki:05572] select: init of component tcp returned success
[loki:05572] select: initializing btl component vader
[loki][[35331,1],0][../../../../../openmpi-2.0.2rc2/opal/mca/btl/vader/btl_vader_component.c:454:mca_btl_vader_component_init] No peers to communicate with. Disabling vader.
[loki:05572] select: init of component vader returned failure
[loki:05572] mca: base: close: component vader closed
[loki:05572] mca: base: close: unloading component vader
[loki:05572] mca: bml: Using self btl for send to [[35331,1],0] on node loki
Slave process 0 of 1 running on loki
spawn_slave 0: argv[0]: spawn_slave
[loki:05572] mca: base: close: component self closed
[loki:05572] mca: base: close: unloading component self
[loki:05572] mca: base: close: component tcp closed
[loki:05572] mca: base: close: unloading component tcp
loki spawn 125

Kind regards and thank you very much once more,

Siegmar


On 03.01.2017 at 00:17, Howard Pritchard wrote:
> Hi Siegmar,
>
> I've attempted to reproduce this using gnu compilers and the version of
> this test program(s) you posted earlier in 2016, but I am unable to
> reproduce the problem.
>
> Could you double check that the slave program can be successfully run
> when launched directly by mpirun/mpiexec? It might also help to use
> --mca btl_base_verbose 10 when running the slave program standalone.
>
> Thanks,
>
> Howard
>
> 2016-12-28 7:06 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:
>
>> Hi,
>>
>> I have installed openmpi-2.0.2rc2 on my "SUSE Linux Enterprise
>> Server 12 (x86_64)" with Sun C 5.14 beta and gcc-6.2.0. Unfortunately,
>> I get an error when I run one of my programs. Everything works as
>> expected with openmpi-master-201612232109-67a08e8. The program
>> gets a timeout with openmpi-v2.x-201612232156-5ce66b0.
>>
>> loki spawn 144 ompi_info | grep -e "Open MPI:" -e "C compiler absolute:"
>>   Open MPI: 2.0.2rc2
>>   C compiler absolute: /opt/solstudio12.5b/bin/cc
>>
>> loki spawn 145 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
>>
>> Parent process 0 running on loki
>>   I create 4 slave processes
>>
>> --
>> A system call failed during shared memory initialization that should
>> not have. It is likely that your MPI job will now either abort or
>> experience performance degradation.
>>
>>   Local host:  loki
>>   System call: open(2)
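[Editor's note: the standalone run above prints "Slave process 0 of 1 running on loki" and "spawn_slave 0: argv[0]: spawn_slave". A slave of roughly the following shape would produce output of that form; this is only a sketch consistent with those lines, not Siegmar's actual spawn_slave.c, and the file name and print format are assumptions.]

/* spawn_slave.c -- minimal sketch of a slave that reports its rank,
 * the communicator size, the host it runs on, and its argv.           */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int  rank, size, len, i;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);   /* the reported segfault happens here when
                               * the slave is started via MPI_Comm_spawn */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    printf("Slave process %d of %d running on %s\n", rank, size, host);
    for (i = 0; i < argc; ++i) {
        printf("spawn_slave %d: argv[%d]: %s\n", rank, i, argv[i]);
    }

    MPI_Finalize();
    return 0;
}

Compiled with the wrapper shown above ("mpicc spawn_slave.c -o spawn_slave"), it runs cleanly when launched directly by mpiexec, which matches what Siegmar observes; the failure only appears when the same binary is started through MPI_Comm_spawn.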