Hi Tim,

I installed the openmpi-1.2.1a0r14178 tarball (and took this opportunity to use the Intel Fortran compiler instead of gfortran). With a simple test it seems to work, but note that the same messages as before appear.
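For reference, the simple test is essentially the classic MPI hello world. A minimal sketch of it, using the same C++ bindings that show up in the backtraces below (illustrative, not my exact source):

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv)
    {
        MPI::Init(argc, argv);
        const int rank = MPI::COMM_WORLD.Get_rank();
        const int size = MPI::COMM_WORLD.Get_size();
        // Prints the rank out of size-1, matching the "N of 7" output below.
        std::printf("Hello, world! I am %d of %d\n", rank, size - 1);
        MPI::Finalize();
        return 0;
    }

Running it on 8 slots across the two nodes: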
->mpirun -np 8 -machinefile mymachines a.out
[x1:25417] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x1:25418] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x2:31983] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x2:31982] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x2:31980] mca_btl_mx_init: mx_open_endpoint() failed with status=20
Hello, world! I am 4 of 7
Hello, world! I am 0 of 7
Hello, world! I am 1 of 7
Hello, world! I am 5 of 7
Hello, world! I am 2 of 7
Hello, world! I am 7 of 7
Hello, world! I am 6 of 7
Hello, world! I am 3 of 7

and the machinefile is

x1 slots=4 max_slots=4
x2 slots=4 max_slots=4

However, with a realistic code it starts fine (same messages as above) and then, somewhere later, it segfaults:

[x1:25947] *** Process received signal ***
[x1:25947] Signal: Segmentation fault (11)
[x1:25947] Signal code: Address not mapped (1)
[x1:25947] Failing at address: 0x14
[x1:25947] [ 0] [0xb7f00440]
[x1:25947] [ 1] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x13e) [0xb7a80e6e]
[x1:25947] [ 2] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_process_pending+0x1e3) [0xb7a82463]
[x1:25947] [ 3] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_pml_ob1.so [0xb7a7ebf8]
[x1:25947] [ 4] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x1813) [0xb7a41923]
[x1:25947] [ 5] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x36) [0xb7a4fdd6]
[x1:25947] [ 6] /opt/ompi/lib/libopen-pal.so.0(opal_progress+0x79) [0xb7dc41a9]
[x1:25947] [ 7] /opt/ompi/lib/libmpi.so.0(ompi_request_wait_all+0xb5) [0xb7e90145]
[x1:25947] [ 8] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_sendrecv_actual+0xc9) [0xb7a167a9]
[x1:25947] [ 9] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_recursivedoubling+0xe4) [0xb7a1bfb4]
[x1:25947] [10] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_dec_fixed+0x48) [0xb7a16a18]
[x1:25947] [11] /opt/ompi/lib/libmpi.so.0(PMPI_Barrier+0x69) [0xb7ea4059]
[x1:25947] [12] driver0(_ZNK3MPI4Comm7BarrierEv+0x20) [0x806baf4]
[x1:25947] [13] driver0(_ZN3gms12PartitionSet14ReadData_Case2Ev+0xc92) [0x808bb78]
[x1:25947] [14] driver0(_ZN3gms12PartitionSet8ReadDataESsSsSst+0xbc) [0x8086f96]
[x1:25947] [15] driver0(main+0x181) [0x8068c7f]
[x1:25947] [16] /lib/libc.so.6(__libc_start_main+0xdc) [0xb7b6a824]
[x1:25947] [17] driver0(__gxx_personality_v0+0xb9) [0x8068991]
[x1:25947] *** End of error message ***
mpirun noticed that job rank 0 with PID 25945 on node x1 exited on signal 15 (Terminated).
7 additional processes aborted (not shown)

This code does run to completion using ompi-1.2 if I use only 2 slots per machine.

Thanks for any help.

--
Valmor

> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Tim Prins
> Sent: Friday, March 30, 2007 10:49 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] mca_btl_mx_init: mx_open_endpoint() failed
> with status=20
>
> Hi Valmor,
>
> What is happening here is that when Open MPI tries to create an MX
> endpoint for communication, mx returns code 20, which is MX_BUSY.
>
> At this point we should gracefully move on, but there is a bug in Open
> MPI 1.2 which causes a segmentation fault in the case of this type of
> error. This will be fixed in 1.2.1, and the fix is available now in the
> 1.2 nightly tarballs.
>
> Hope this helps,
>
> Tim
>
> On Friday 30 March 2007 05:06 pm, de Almeida, Valmor F. wrote:
> > Hello,
> >
> > I am getting this error any time the number of processes requested per
> > machine is greater than the number of cpus. I suspect it is something
> > in the configuration of mx / ompi that I am missing, since another
> > machine I have without mx installed runs ompi correctly with
> > oversubscription.
> >
> > Thanks for any help.
> >
> > --
> > Valmor
> >
> > ->mpirun -np 3 --machinefile mymachines-1 a.out
> > [x1:23624] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> > [x1:23624] *** Process received signal ***
> > [x1:23624] Signal: Segmentation fault (11)
> > [x1:23624] Signal code: Address not mapped (1)
> > [x1:23624] Failing at address: 0x20
> > [x1:23624] [ 0] [0xb7f7f440]
> > [x1:23624] [ 1] /opt/openmpi-1.2/lib/openmpi/mca_btl_mx.so(mca_btl_mx_finalize+0x25) [0xb7aca825]
> > [x1:23624] [ 2] /opt/openmpi-1.2/lib/openmpi/mca_btl_mx.so(mca_btl_mx_component_init+0x6f8) [0xb7acc658]
> > [x1:23624] [ 3] /opt/ompi/lib/libmpi.so.0(mca_btl_base_select+0x1a0) [0xb7f41900]
> > [x1:23624] [ 4] /opt/openmpi-1.2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x26) [0xb7ad1006]
> > [x1:23624] [ 5] /opt/ompi/lib/libmpi.so.0(mca_bml_base_init+0x78) [0xb7f41198]
> > [x1:23624] [ 6] /opt/openmpi-1.2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_component_init+0x7d) [0xb7af866d]
> > [x1:23624] [ 7] /opt/ompi/lib/libmpi.so.0(mca_pml_base_select+0x176) [0xb7f49b56]
> > [x1:23624] [ 8] /opt/ompi/lib/libmpi.so.0(ompi_mpi_init+0x4cf) [0xb7f0fe2f]
> > [x1:23624] [ 9] /opt/ompi/lib/libmpi.so.0(MPI_Init+0xab) [0xb7f3204b]
> > [x1:23624] [10] a.out(_ZN3MPI4InitERiRPPc+0x18) [0x8052cbe]
> > [x1:23624] [11] a.out(main+0x21) [0x804f4a7]
> > [x1:23624] [12] /lib/libc.so.6(__libc_start_main+0xdc) [0xb7be9824]
> >
> > content of mymachines-1 file:
> >
> > x1 max_slots=4
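P.S. Until I can try a fixed tarball, I assume I can avoid the MX endpoint problem altogether by excluding the mx BTL at run time, e.g. (if I have the MCA component-selection syntax right):

->mpirun --mca btl ^mx -np 8 -machinefile mymachines a.out

i.e., letting Open MPI fall back on the remaining transports (sm/tcp/self) for this run.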