Hello users,

I'm by no means an MPI expert, but I have successfully been using my own compiled version of OMPI 1.10.2 for some time without issue. Lately, however, I'm seeing a strange problem: when I try to run on more than 3 or 4 nodes, I get a hang during setup. My code (the Fortran MHD code FLASH, version 4.2.2) is attempting to call MPI_COMM_SPLIT:
!! first make a communicator for group of processors
!! that have the whole computational grid
!! The grid is duplicated on all communicators
countInComm = dr_globalNumProcs/dr_meshCopyCount
if ((countInComm*dr_meshCopyCount) /= dr_globalNumProcs) &
     call Driver_abortFlash("when duplicating mesh, numProcs should &
     &be a multiple of meshCopyCount")

color = dr_globalMe/countInComm
key   = mod(dr_globalMe, countInComm)

call MPI_Comm_split(dr_globalComm, color, key, dr_meshComm, error)
call MPI_Comm_split(dr_globalComm, key, color, dr_meshAcrossComm, error)

call MPI_COMM_RANK(dr_meshComm, dr_meshMe, error)
call MPI_COMM_SIZE(dr_meshComm, dr_meshNumProcs, error)
call MPI_COMM_RANK(dr_meshAcrossComm, dr_meshAcrossMe, error)
call MPI_COMM_SIZE(dr_meshAcrossComm, dr_meshAcrossNumProcs, error)
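For reference, here is a minimal standalone sketch of the same split pattern. This is my own test scaffolding with made-up names (it is not FLASH code, and meshCopyCount is just assumed to be 1 here):

program split_test
  use mpi
  implicit none
  integer :: ierr, globalMe, globalNumProcs
  integer :: meshCopyCount, countInComm, color, key
  integer :: meshComm, meshAcrossComm, meshMe, meshNumProcs

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, globalMe, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, globalNumProcs, ierr)

  meshCopyCount = 1                    ! stand-in for the FLASH runtime parameter
  countInComm   = globalNumProcs / meshCopyCount
  color = globalMe / countInComm       ! which mesh copy this rank belongs to
  key   = mod(globalMe, countInComm)   ! rank within that mesh copy

  call MPI_Comm_split(MPI_COMM_WORLD, color, key, meshComm, ierr)
  call MPI_Comm_split(MPI_COMM_WORLD, key, color, meshAcrossComm, ierr)

  call MPI_Comm_rank(meshComm, meshMe, ierr)
  call MPI_Comm_size(meshComm, meshNumProcs, ierr)
  if (globalMe == 0) print *, 'split OK, ranks per mesh copy: ', meshNumProcs

  call MPI_Finalize(ierr)
end program split_test

Running something like that on the same node counts should at least tell me whether the hang is in the split pattern itself or in something FLASH does around it.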
and it is hanging in that split call. Attaching GDB to the process on the local node, I see the following (CentOS is way behind on GDB updates, so unfortunately there aren't many symbols):
(gdb) bt full
#0 0x2aaab150facd in mca_btl_vader_component_progress () from
/home/draco/jwall/local_openmpi/lib/openmpi/mca_btl_vader.so
No symbol table info available.
#1 0x2d348e6a in opal_progress () from
/home/draco/jwall/local_openmpi/lib/libopen-pal.so.13
_mm_free_fn = 0
event_debug_map_PRIMES = {53, 97, 193, 389, 769, 1543, 3079,
6151, 12289, 24593, 49157, 98317, 196613, 393241, 786433, 1572869,
3145739, 6291469, 12582917, 25165843, 50331653,
100663319, 201326611, 402653189, 805306457, 1610612741}
_event_debug_map_lock = 0x3e9cc70
_mm_realloc_fn = 0
event_debug_mode_too_late = 1
global_debug_map = {hth_table = 0x0, hth_table_length = 0,
hth_n_entries = 0, hth_load_limit = 0, hth_prime_idx = -1}
warn_once = 0
use_monotonic = 1
eventops = {0x2d5ee860, 0x2d5ee8c0, 0x2d5ee900, 0x0}
_mm_malloc_fn = 0
event_global_current_base_ = 0x0
opal_libevent2021__event_debug_mode_on = 0
#2 0x2b635305 in ompi_request_default_wait_all () from
/home/draco/jwall/local_openmpi/lib/libmpi.so.12
No symbol table info available.
#3 0x2aaab1f60417 in ompi_coll_tuned_sendrecv_nonzero_actual ()
from /home/draco/jwall/local_openmpi/lib/openmpi/mca_coll_tuned.so
No symbol table info available.
#4 0x2aaab1f68074 in ompi_coll_tuned_allgather_intra_bruck () from
/home/draco/jwall/local_openmpi/lib/openmpi/mca_coll_tuned.so
No symbol table info available.
#5 0x2b621e4d in ompi_comm_split () from
/home/draco/jwall/local_openmpi/lib/libmpi.so.12
No symbol table info available.
#6 0x2b64f16d in PMPI_Comm_split () from
/home/draco/jwall/local_openmpi/lib/libmpi.so.12
No symbol table info available.
#7 0x2b3db70f in pmpi_comm_split__ () from
/home/draco/jwall/local_openmpi/lib/libmpi_mpifh.so.12
No symbol table info available.
#8 0x005e43ed in driver_setupparallelenv_ ()
No symbol table info available.
#9 0x005e1187 in driver_initflash_ ()
No symbol table info available.
#10 0x004451ce in __flash_run_MOD_initialize_code ()
No symbol table info available.
#11 0x004124e9 in handle_call.1908 ()
No symbol table info available.
#12 0x00422791 in run_loop_mpi.1914 ()
No symbol table info available.
#13 0x004169db in MAIN__ ()
No symbol table info available.
#14 0x00423c6f in main ()
No symbol table info available.
#15 0x2c507d1d in __libc_start_main () from /lib64/libc.so.6
No symbol table info available.
#16 0x00405509 in _start ()
No symbol table info available.
Does anyone have any ideas about what the issue might be?
Thanks so much.
--
Joshua Wall
Doctoral Candidate
Department of Physics
Drexel University
3141 Chestnut Street
Philadelphia, PA 19104