When you say "stuck", what actually happens?

On Aug 10, 2011, at 2:09 PM, CB wrote:
> Now I was able to run the MPI hello world example with up to 3096 processes
> across 129 nodes (24 cores per node).
> However, it seems to get stuck with 3097 processes.
>
> Any suggestions for troubleshooting?
>
> Thanks,
> - Chansup
>
>
> On Tue, Aug 9, 2011 at 2:02 PM, CB <cbalw...@gmail.com> wrote:
> Hi Ralph,
>
> Yes, you are right. Those nodes were still pointing to an old version.
> I'll check the installation on all nodes and try to run it again.
>
> Thanks,
> - Chansup
>
>
> On Tue, Aug 9, 2011 at 1:48 PM, Ralph Castain <r...@open-mpi.org> wrote:
> That error makes no sense - line 335 is just a variable declaration. Are you
> sure you are not picking up a different version on that node?
>
>
> On Aug 9, 2011, at 11:37 AM, CB wrote:
>
> > Hi,
> >
> > Currently I'm having trouble scaling an MPI job beyond a certain limit.
> > I'm running an MPI hello world example to test beyond 1024 processes, but it
> > failed with the following error at 2048 processes.
> > It worked fine with 1024 processes. I have a sufficient file descriptor
> > limit (65536) defined for each process.
> >
> > I would appreciate any suggestions.
> > I'm running Open MPI 1.4.3.
> >
> > [x-01-06-a:25989] [[37568,0],69] ORTE_ERROR_LOG: Data unpack had inadequate space in file base/odls_base_default_fns.c at line 335
> > [x-01-06-b:09532] [[37568,0],74] ORTE_ERROR_LOG: Data unpack had inadequate space in file base/odls_base_default_fns.c at line 335
> > --------------------------------------------------------------------------
> > mpirun noticed that the job aborted, but has no info as to the process
> > that caused that situation.
> > --------------------------------------------------------------------------
> > [x-03-20-b:23316] *** Process received signal ***
> > [x-03-20-b:23316] Signal: Segmentation fault (11)
> > [x-03-20-b:23316] Signal code: Address not mapped (1)
> > [x-03-20-b:23316] Failing at address: 0x6c
> > [x-03-20-b:23316] [ 0] /lib64/libpthread.so.0 [0x310860ee90]
> > [x-03-20-b:23316] [ 1] /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_plm_base_app_report_launch+0x230) [0x7f0dbe0c5010]
> > [x-03-20-b:23316] [ 2] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0 [0x7f0dbde5c8f8]
> > [x-03-20-b:23316] [ 3] mpirun [0x403bbe]
> > [x-03-20-b:23316] [ 4] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0 [0x7f0dbde5c8f8]
> > [x-03-20-b:23316] [ 5] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0(opal_progress+0x99) [0x7f0dbde50e49]
> > [x-03-20-b:23316] [ 6] /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_trigger_event+0x42) [0x7f0dbe0a7ca2]
> > [x-03-20-b:23316] [ 7] /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_plm_base_app_report_launch+0x22d) [0x7f0dbe0c500d]
> > [x-03-20-b:23316] [ 8] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0 [0x7f0dbde5c8f8]
> > [x-03-20-b:23316] [ 9] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0(opal_progress+0x99) [0x7f0dbde50e49]
> > [x-03-20-b:23316] [10] /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x23d) [0x7f0dbe0c5ddd]
> > [x-03-20-b:23316] [11] /usr/local/MPI/openmpi-1.4.3/lib/openmpi/mca_plm_rsh.so [0x7f0dbd41d679]
> > [x-03-20-b:23316] [12] mpirun [0x40373f]
> > [x-03-20-b:23316] [13] mpirun [0x402a1c]
> > [x-03-20-b:23316] [14] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3107e1ea2d]
> > [x-03-20-b:23316] [15] mpirun [0x402939]
> > [x-03-20-b:23316] *** End of error message ***
> > [x-01-06-a:25989] [[37568,0],69]-[[37568,0],0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> > [x-01-06-b:09532] [[37568,0],74]-[[37568,0],0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> > ./sge_jsb.sh: line 9: 23316 Segmentation fault (core dumped) mpirun -np $NSLOTS ./hello_openmpi.exe
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
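For reference, the source of hello_openmpi.exe is not shown in the thread; a minimal MPI hello world along the lines of what it presumably does (this is an assumption on my part, not the poster's actual program) would be:

    /* hello_openmpi.c - assumed equivalent of the hello_openmpi.exe test program.
     * Build: mpicc hello_openmpi.c -o hello_openmpi.exe
     * Run:   mpirun -np $NSLOTS ./hello_openmpi.exe
     */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
        MPI_Get_processor_name(host, &len);     /* node the rank landed on */
        printf("Hello from rank %d of %d on %s\n", rank, size, host);
        MPI_Finalize();
        return 0;
    }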
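Two things raised earlier in the thread, the per-process file descriptor limit and a possible Open MPI version mismatch across nodes, can be spot-checked from inside the job itself. Below is a minimal sketch; the program and its name are my own, and the OMPI_*_VERSION macros only report the headers the binary was compiled against, so a node that loads an older library at run time would still need to be checked with something like ompi_info on that node:

    /* check_env.c - hypothetical helper: each rank reports its file descriptor
     * limit and the Open MPI version its mpi.h was compiled against. */
    #include <stdio.h>
    #include <sys/resource.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, len;
        char host[MPI_MAX_PROCESSOR_NAME];
        struct rlimit rl;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(host, &len);
        getrlimit(RLIMIT_NOFILE, &rl);          /* per-process fd limit on this node */

        /* Compile-time Open MPI version from mpi.h, e.g. 1.4.3 */
        printf("rank %d on %s: compiled against Open MPI %d.%d.%d, "
               "RLIMIT_NOFILE soft=%lu hard=%lu\n",
               rank, host,
               OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION, OMPI_RELEASE_VERSION,
               (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max);

        MPI_Finalize();
        return 0;
    }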