On Tue, 23 May 2023 at 19:51, Zongze Yang <[email protected]> wrote:
> Thank you for your suggestion. I solved the problem with SuperLU_DIST, and
> it works well.
> This is solved with four nodes, each equipped with 500G of memory.
>
> Best wishes,
> Zongze
>
> On Tue, 23 May 2023 at 18:00, Matthew Knepley <[email protected]> wrote:
>
>> On Mon, May 22, 2023 at 10:46 PM Zongze Yang <[email protected]> wrote:
>>
>>> I have an additional question to ask: Is it possible for the
>>> SuperLU_DIST library to encounter the same MPI problem (PMPI_Iprobe
>>> failed) as MUMPS?
>>
>> I do not know if they use that function. But it is easy to try it out,
>> so I would.
>>
>>   Thanks,
>>
>>      Matt
>>
>>> Best wishes,
>>> Zongze
>>>
>>> On Tue, 23 May 2023 at 10:41, Zongze Yang <[email protected]> wrote:
>>>
>>>> On Tue, 23 May 2023 at 05:31, Stefano Zampini
>>>> <[email protected]> wrote:
>>>>
>>>>> If I may add to the discussion, it may be that you are going OOM,
>>>>> since you are trying to factorize a 3 million dof problem; this
>>>>> problem goes undetected and then fails at a later stage.
>>>>
>>>> Thank you for your comment. I ran the problem with 90 processes
>>>> distributed across three nodes, each equipped with 500G of memory.
>>>> Is this amount of memory sufficient for solving the matrix with
>>>> approximately 3 million degrees of freedom?
>>>>
>>>> Thanks!
>>>> Zongze
>>>>
>>>>> On Mon, 22 May 2023 at 20:03, Zongze Yang <[email protected]> wrote:
>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Zongze
>>>>>>
>>>>>> On Tue, 23 May 2023 at 00:09, Matthew Knepley <[email protected]> wrote:
>>>>>>
>>>>>>> On Mon, May 22, 2023 at 11:07 AM Zongze Yang <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I hope this letter finds you well. I am writing to seek guidance
>>>>>>>> regarding an error I encountered while solving a matrix using
>>>>>>>> MUMPS on multiple nodes:
>>>>>>>
>>>>>>> Iprobe is buggy on several MPI implementations.
>>>>>>> PETSc has an option for shutting it off for this reason.
>>>>>>> I do not know how to shut it off inside MUMPS, however. I would
>>>>>>> mail their mailing list to see.
>>>>>>>
>>>>>>>   Thanks,
>>>>>>>
>>>>>>>      Matt
>>>>>>>
>>>>>>>> ```bash
>>>>>>>> Abort(1681039) on node 60 (rank 60 in comm 240): Fatal error in
>>>>>>>> PMPI_Iprobe: Other MPI error, error stack:
>>>>>>>> PMPI_Iprobe(124)..............: MPI_Iprobe(src=MPI_ANY_SOURCE,
>>>>>>>> tag=MPI_ANY_TAG, comm=0xc4000026, flag=0x7ffc130f9c4c,
>>>>>>>> status=0x7ffc130f9e80) failed
>>>>>>>> MPID_Iprobe(240)..............:
>>>>>>>> MPIDI_iprobe_safe(108)........:
>>>>>>>> MPIDI_iprobe_unsafe(35).......:
>>>>>>>> MPIDI_OFI_do_iprobe(69).......:
>>>>>>>> MPIDI_OFI_handle_cq_error(949): OFI poll failed
>>>>>>>> (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
>>>>>>>> Assertion failed in file src/mpid/ch4/netmod/ofi/ofi_events.c at
>>>>>>>> line 125: 0
>>>>>>>> ```
>>>>>>>>
>>>>>>>> The matrix in question has approximately 3.86e+06 degrees of
>>>>>>>> freedom (dofs). Interestingly, when solving smaller-scale
>>>>>>>> problems, everything functions perfectly without any issues.
>>>>>>>> However, when attempting to solve the larger matrix on multiple
>>>>>>>> nodes, I encounter the aforementioned error.
>>>>>>>>
>>>>>>>> The complete error message I received is as follows:
>>>>>>>> ```bash
>>>>>>>> Abort(1681039) on node 60 (rank 60 in comm 240): Fatal error in
>>>>>>>> PMPI_Iprobe: Other MPI error, error stack:
>>>>>>>> PMPI_Iprobe(124)..............: MPI_Iprobe(src=MPI_ANY_SOURCE,
>>>>>>>> tag=MPI_ANY_TAG, comm=0xc4000026, flag=0x7ffc130f9c4c,
>>>>>>>> status=0x7ffc130f9e80) failed
>>>>>>>> MPID_Iprobe(240)..............:
>>>>>>>> MPIDI_iprobe_safe(108)........:
>>>>>>>> MPIDI_iprobe_unsafe(35).......:
>>>>>>>> MPIDI_OFI_do_iprobe(69).......:
>>>>>>>> MPIDI_OFI_handle_cq_error(949): OFI poll failed
>>>>>>>> (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
>>>>>>>> Assertion failed in file src/mpid/ch4/netmod/ofi/ofi_events.c at
>>>>>>>> line 125: 0
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(MPL_backtrace_show+0x26) [0x7f6076063f2c]
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x41dc24) [0x7f6075fc5c24]
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x49cc51) [0x7f6076044c51]
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x49f799) [0x7f6076047799]
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x451e18) [0x7f6075ff9e18]
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x452272) [0x7f6075ffa272]
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x2ce836) [0x7f6075e76836]
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x2ce90d) [0x7f6075e7690d]
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x48137b) [0x7f607602937b]
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x44d471) [0x7f6075ff5471]
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x407acd) [0x7f6075fafacd]
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(MPIR_Err_return_comm+0x10a) [0x7f6075fafbea]
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(MPI_Iprobe+0x312) [0x7f6075ddd542]
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpifort.so.12(pmpi_iprobe+0x2f) [0x7f606e08f19f]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(__zmumps_load_MOD_zmumps_load_recv_msgs+0x142) [0x7f60737b194d]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_try_recvtreat_+0x34) [0x7f60738ab735]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(__zmumps_fac_par_m_MOD_zmumps_fac_par+0x991) [0x7f607378bcc8]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_fac_par_i_+0x240) [0x7f6073881d36]
>>>>>>>> Abort(805938831) on node 51 (rank 51 in comm 240): Fatal error in
>>>>>>>> PMPI_Iprobe: Other MPI error, error stack:
>>>>>>>> PMPI_Iprobe(124)..............: MPI_Iprobe(src=MPI_ANY_SOURCE,
>>>>>>>> tag=MPI_ANY_TAG, comm=0xc4000017, flag=0x7ffe20e1402c,
>>>>>>>> status=0x7ffe20e14260) failed
>>>>>>>> MPID_Iprobe(244)..............:
>>>>>>>> progress_test(100)............:
>>>>>>>> MPIDI_OFI_handle_cq_error(949): OFI poll failed
>>>>>>>> (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_fac_b_+0x1463) [0x7f60738831a1]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_fac_driver_+0x6969) [0x7f60738446c9]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_+0x2d83) [0x7f60738bf9cf]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_f77_+0x178c) [0x7f60738c33bc]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_c+0x8f8) [0x7f60738baacb]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(+0x894560) [0x7f6077297560]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(MatLUFactorNumeric+0x32e) [0x7f60773bb1e6]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(+0xf51665) [0x7f6077954665]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(PCSetUp+0x64b) [0x7f60779c77e0]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(KSPSetUp+0xfb6) [0x7f6077ac2d53]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(+0x10c1c28) [0x7f6077ac4c28]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(KSPSolve+0x13) [0x7f6077ac8070]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(+0x11249df) [0x7f6077b279df]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(SNESSolve+0x10df) [0x7f6077b676c6]
>>>>>>>> Abort(1) on node 60: Internal error
>>>>>>>> Abort(1007265423) on node 65 (rank 65 in comm 240): Fatal error in
>>>>>>>> PMPI_Iprobe: Other MPI error, error stack:
>>>>>>>> PMPI_Iprobe(124)..............: MPI_Iprobe(src=MPI_ANY_SOURCE,
>>>>>>>> tag=MPI_ANY_TAG, comm=0xc4000017, flag=0x7fff4d82827c,
>>>>>>>> status=0x7fff4d8284b0) failed
>>>>>>>> MPID_Iprobe(244)..............:
>>>>>>>> progress_test(100)............:
>>>>>>>> MPIDI_OFI_handle_cq_error(949): OFI poll failed
>>>>>>>> (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
>>>>>>>> Abort(941205135) on node 32 (rank 32 in comm 240): Fatal error in
>>>>>>>> PMPI_Iprobe: Other MPI error, error stack:
>>>>>>>> PMPI_Iprobe(124)..............: MPI_Iprobe(src=MPI_ANY_SOURCE,
>>>>>>>> tag=MPI_ANY_TAG, comm=0xc4000017, flag=0x7fff715ba3fc,
>>>>>>>> status=0x7fff715ba630) failed
>>>>>>>> MPID_Iprobe(240)..............:
>>>>>>>> MPIDI_iprobe_safe(108)........:
>>>>>>>> MPIDI_iprobe_unsafe(35).......:
>>>>>>>> MPIDI_OFI_do_iprobe(69).......:
>>>>>>>> MPIDI_OFI_handle_cq_error(949): OFI poll failed
>>>>>>>> (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
>>>>>>>> Abort(470941839) on node 75 (rank 75 in comm 0): Fatal error in
>>>>>>>> PMPI_Test: Other MPI error, error stack:
>>>>>>>> PMPI_Test(188)................: MPI_Test(request=0x7efe31e03014,
>>>>>>>> flag=0x7ffea65d673c, status=0x7ffea65d6760) failed
>>>>>>>> MPIR_Test(73).................:
>>>>>>>> MPIR_Test_state(33)...........:
>>>>>>>> progress_test(100)............:
>>>>>>>> MPIDI_OFI_handle_cq_error(949): OFI poll failed
>>>>>>>> (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
>>>>>>>> Abort(805946511) on node 31 (rank 31 in comm 256): Fatal error in
>>>>>>>> PMPI_Probe: Other MPI error, error stack:
>>>>>>>> PMPI_Probe(118)...............: MPI_Probe(src=MPI_ANY_SOURCE,
>>>>>>>> tag=7, comm=0xc4000015, status=0x7fff9538b7a0) failed
>>>>>>>> MPID_Probe(159)...............:
>>>>>>>> progress_test(100)............:
>>>>>>>> MPIDI_OFI_handle_cq_error(949): OFI poll failed
>>>>>>>> (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
>>>>>>>> Abort(1179791) on node 73 (rank 73 in comm 0): Fatal error in
>>>>>>>> PMPI_Test: Other MPI error, error stack:
>>>>>>>> PMPI_Test(188)................: MPI_Test(request=0x5b638d4,
>>>>>>>> flag=0x7ffd755119cc, status=0x7ffd755121b0) failed
>>>>>>>> MPIR_Test(73).................:
>>>>>>>> MPIR_Test_state(33)...........:
>>>>>>>> progress_test(100)............:
>>>>>>>> MPIDI_OFI_handle_cq_error(949): OFI poll failed
>>>>>>>> (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
>>>>>>>> ```
>>>>>>>>
>>>>>>>> Thank you very much for your time and consideration.
>>>>>>>>
>>>>>>>> Best wishes,
>>>>>>>> Zongze
>>>>>>>
>>>>>>> --
>>>>>>> What most experimenters take for granted before they begin their
>>>>>>> experiments is infinitely more interesting than any results to
>>>>>>> which their experiments lead.
>>>>>>> -- Norbert Wiener
>>>>>>>
>>>>>>> https://www.cse.buffalo.edu/~knepley/
>>>>>>
>>>>>> --
>>>>>> Best wishes,
>>>>>> Zongze
>>>>>
>>>>> --
>>>>> Stefano
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which
>> their experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
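
The resolution reported at the top of the thread, switching the factorization package from MUMPS to SuperLU_DIST, is a run-time choice in PETSc, which is why Matt could say "it is easy to try it out." A minimal sketch of the relevant options; the binary name `./app` and the process count are illustrative, not taken from the thread:

```shell
# Direct solve with LU factorization, delegating the factorization
# to SuperLU_DIST instead of MUMPS (whose internal MPI_Iprobe calls failed).
mpiexec -n 90 ./app \
    -ksp_type preonly \
    -pc_type lu \
    -pc_factor_mat_solver_type superlu_dist

# The failing configuration would instead have selected:
#   -pc_factor_mat_solver_type mumps
```

Because the solver package is chosen at run time, no recompilation is needed to compare the two; PETSc simply has to have been configured with SuperLU_DIST available (e.g. via `--download-superlu_dist`).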
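
Matt's reply mentions a PETSc option for avoiding MPI_Iprobe but does not name it. One candidate, an assumption on my part and not confirmed anywhere in the thread, is `-build_twosided`, which selects the algorithm `PetscCommBuildTwoSided` uses for PETSc's own sparse communication setup; its `allreduce` variant avoids the Issend/Iprobe pattern of the `ibarrier` variant. Note that, as Matt points out, this affects only PETSc's internal calls, not the Iprobe calls inside MUMPS itself:

```shell
# Assumption: steer PETSc's two-sided setup away from the
# Issend/Iprobe-based "ibarrier" algorithm.
mpiexec -n 90 ./app -build_twosided allreduce
```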
