Hello users,

I'm by no means an MPI expert, but I have been successfully using my own compiled version of OMPI 1.10.2 for some time without issue. Lately, however, I'm seeing a strange problem: when I try to run on more than 3 or 4 nodes, I get a hang during setup. My code (the Fortran MHD code FLASH, version 4.2.2) is attempting to call MPI_COMM_SPLIT:
  !! first make a communicator for group of processors
  !! that have the whole computational grid
  !! The grid is duplicated on all communicators

  countInComm = dr_globalNumProcs/dr_meshCopyCount
  if((countInComm*dr_meshCopyCount) /= dr_globalNumProcs)&
       call Driver_abortFlash("when duplicating mesh, numProcs should be a multiple of meshCopyCount")

  color = dr_globalMe/countInComm
  key   = mod(dr_globalMe,countInComm)

  call MPI_Comm_split(dr_globalComm,color,key,dr_meshComm,error)
  call MPI_Comm_split(dr_globalComm,key,color,dr_meshAcrossComm,error)

  call MPI_COMM_RANK(dr_meshComm,dr_meshMe, error)
  call MPI_COMM_SIZE(dr_meshComm, dr_meshNumProcs,error)
  call MPI_COMM_RANK(dr_meshAcrossComm,dr_meshAcrossMe, error)
  call MPI_COMM_SIZE(dr_meshAcrossComm, dr_meshAcrossNumProcs,error)

and it hangs in the split call. Attaching GDB to the process on the local node, I see the following (CentOS is way behind on updating GDB, so unfortunately there aren't a lot of symbols):

(gdb) bt full
#0  0x00002aaab150facd in mca_btl_vader_component_progress () from /home/draco/jwall/local_openmpi/lib/openmpi/mca_btl_vader.so
No symbol table info available.
#1  0x00002aaaad348e6a in opal_progress () from /home/draco/jwall/local_openmpi/lib/libopen-pal.so.13
        _mm_free_fn = 0
        event_debug_map_PRIMES = {53, 97, 193, 389, 769, 1543, 3079, 6151, 12289, 24593, 49157, 98317, 196613, 393241, 786433, 1572869, 3145739, 6291469, 12582917, 25165843, 50331653, 100663319, 201326611, 402653189, 805306457, 1610612741}
        _event_debug_map_lock = 0x3e9cc70
        _mm_realloc_fn = 0
        event_debug_mode_too_late = 1
        global_debug_map = {hth_table = 0x0, hth_table_length = 0, hth_n_entries = 0, hth_load_limit = 0, hth_prime_idx = -1}
        warn_once = 0
        use_monotonic = 1
        eventops = {0x2aaaad5ee860, 0x2aaaad5ee8c0, 0x2aaaad5ee900, 0x0}
        _mm_malloc_fn = 0
        event_global_current_base_ = 0x0
        opal_libevent2021__event_debug_mode_on = 0
#2  0x00002aaaab635305 in ompi_request_default_wait_all () from /home/draco/jwall/local_openmpi/lib/libmpi.so.12
No symbol table info available.
#3  0x00002aaab1f60417 in ompi_coll_tuned_sendrecv_nonzero_actual () from /home/draco/jwall/local_openmpi/lib/openmpi/mca_coll_tuned.so
No symbol table info available.
#4  0x00002aaab1f68074 in ompi_coll_tuned_allgather_intra_bruck () from /home/draco/jwall/local_openmpi/lib/openmpi/mca_coll_tuned.so
No symbol table info available.
#5  0x00002aaaab621e4d in ompi_comm_split () from /home/draco/jwall/local_openmpi/lib/libmpi.so.12
No symbol table info available.
#6  0x00002aaaab64f16d in PMPI_Comm_split () from /home/draco/jwall/local_openmpi/lib/libmpi.so.12
No symbol table info available.
#7  0x00002aaaab3db70f in pmpi_comm_split__ () from /home/draco/jwall/local_openmpi/lib/libmpi_mpifh.so.12
No symbol table info available.
#8  0x00000000005e43ed in driver_setupparallelenv_ ()
No symbol table info available.
#9  0x00000000005e1187 in driver_initflash_ ()
No symbol table info available.
#10 0x00000000004451ce in __flash_run_MOD_initialize_code ()
No symbol table info available.
#11 0x00000000004124e9 in handle_call.1908 ()
No symbol table info available.
#12 0x0000000000422791 in run_loop_mpi.1914 ()
No symbol table info available.
#13 0x00000000004169db in MAIN__ ()
No symbol table info available.
#14 0x0000000000423c6f in main ()
No symbol table info available.
#15 0x00002aaaac507d1d in __libc_start_main () from /lib64/libc.so.6
No symbol table info available.
#16 0x0000000000405509 in _start ()
No symbol table info available.

Anyone have any ideas what the issue might be?
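In case it helps anyone trying to reproduce this outside of FLASH, a stripped-down test of the same split pattern would look roughly like the sketch below. This is my paraphrase, not the actual FLASH source: the dr_* module variables are replaced with locals, meshCopyCount is hard-coded to 1 just for illustration, and Driver_abortFlash is replaced with a plain MPI_Abort.

      program split_test
      ! Minimal sketch of the two communicator splits FLASH performs in
      ! Driver_setupParallelEnv (illustrative only, not the FLASH source).
      use mpi
      implicit none
      integer :: ierr, globalMe, globalNumProcs
      integer :: meshCopyCount, countInComm, color, key
      integer :: meshComm, meshAcrossComm
      integer :: meshMe, meshNumProcs, meshAcrossMe, meshAcrossNumProcs

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, globalMe, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, globalNumProcs, ierr)

      meshCopyCount = 1                    ! stand-in for dr_meshCopyCount
      countInComm   = globalNumProcs / meshCopyCount
      if (countInComm*meshCopyCount /= globalNumProcs) then
         if (globalMe == 0) print *, 'numProcs must be a multiple of meshCopyCount'
         call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
      end if

      color = globalMe / countInComm       ! which mesh copy this rank belongs to
      key   = mod(globalMe, countInComm)   ! rank ordering within that copy

      ! The same two splits that hang in my runs
      call MPI_Comm_split(MPI_COMM_WORLD, color, key, meshComm, ierr)
      call MPI_Comm_split(MPI_COMM_WORLD, key, color, meshAcrossComm, ierr)

      call MPI_Comm_rank(meshComm, meshMe, ierr)
      call MPI_Comm_size(meshComm, meshNumProcs, ierr)
      call MPI_Comm_rank(meshAcrossComm, meshAcrossMe, ierr)
      call MPI_Comm_size(meshAcrossComm, meshAcrossNumProcs, ierr)

      if (globalMe == 0) print *, 'splits done:', meshNumProcs, meshAcrossNumProcs

      call MPI_Finalize(ierr)
      end program split_test

Built with mpif90 against the same 1.10.2 install and launched with the same mpirun line and node count that trigger the hang, that should at least show whether a bare MPI_Comm_split hangs on its own or whether it's something FLASH does earlier in setup.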
Thanks so much,

Joshua Wall

--
Joshua Wall
Doctoral Candidate
Department of Physics
Drexel University
3141 Chestnut Street
Philadelphia, PA 19104