On Tue, 23 May 2023 at 20:09, Yann Jobic <[email protected]> wrote:
> If I may, you can use the command line option "-mat_mumps_icntl_4 2".
> MUMPS then gives information about the factorization step, such as the
> estimated needed memory.

Thank you for your suggestion!

Best wishes,
Zongze

> Best regards,
>
> Yann
>
> On 5/23/2023 at 11:59 AM, Matthew Knepley wrote:
>
> > On Mon, May 22, 2023 at 10:42 PM Zongze Yang <[email protected]> wrote:
> >
> > > On Tue, 23 May 2023 at 05:31, Stefano Zampini
> > > <[email protected]> wrote:
> > >
> > > > If I may add to the discussion, it may be that you are going OOM,
> > > > since you are trying to factorize a problem with 3 million dofs.
> > > > This problem goes undetected and then fails at a later stage.
> > >
> > > Thank you for your comment. I ran the problem with 90 processes
> > > distributed across three nodes, each equipped with 500G of memory.
> > > Is this amount of memory sufficient for solving the matrix with
> > > approximately 3 million degrees of freedom?
> >
> > It really depends on the fill. Suppose that you get 1% fill; then
> >
> >     (3e6)^2 * 0.01 * 8 ≈ 1e12 B
> >
> > and you have 1.5e12 B, so I could easily see running out of memory.
> >
> >   Thanks,
> >
> >      Matt
> >
> > > Thanks!
> > >
> > > Zongze
> > >
> > > > On Mon, 22 May 2023 at 20:03, Zongze Yang <[email protected]> wrote:
> > > >
> > > > > Thanks!
> > > > >
> > > > > Zongze
> > > > >
> > > > > On Tue, May 23, 2023 at 00:09, Matthew Knepley
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > On Mon, May 22, 2023 at 11:07 AM Zongze Yang
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I hope this letter finds you well. I am writing to seek
> > > > > > > guidance regarding an error I encountered while solving a
> > > > > > > matrix using MUMPS on multiple nodes:
> > > > > >
> > > > > > Iprobe is buggy on several MPI implementations. PETSc has an
> > > > > > option for shutting it off for this reason. I do not know how
> > > > > > to shut it off inside MUMPS, however.
> > > > > > I would mail their mailing list to see.
> > > > > >
> > > > > >   Thanks,
> > > > > >
> > > > > >      Matt
> > > > > >
> > > > > > > ```bash
> > > > > > > Abort(1681039) on node 60 (rank 60 in comm 240): Fatal error in PMPI_Iprobe: Other MPI error, error stack:
> > > > > > > PMPI_Iprobe(124)..............: MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG, comm=0xc4000026, flag=0x7ffc130f9c4c, status=0x7ffc130f9e80) failed
> > > > > > > MPID_Iprobe(240)..............:
> > > > > > > MPIDI_iprobe_safe(108)........:
> > > > > > > MPIDI_iprobe_unsafe(35).......:
> > > > > > > MPIDI_OFI_do_iprobe(69).......:
> > > > > > > MPIDI_OFI_handle_cq_error(949): OFI poll failed (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
> > > > > > > Assertion failed in file src/mpid/ch4/netmod/ofi/ofi_events.c at line 125: 0
> > > > > > > ```
> > > > > > >
> > > > > > > The matrix in question has 3.86e+06 degrees of freedom
> > > > > > > (dofs). Interestingly, when solving smaller-scale problems,
> > > > > > > everything works without any issues. However, when
> > > > > > > attempting to solve the larger matrix on multiple nodes, I
> > > > > > > encounter the aforementioned error.
> > > > > > > The complete error message I received is as follows:
> > > > > > >
> > > > > > > ```bash
> > > > > > > Abort(1681039) on node 60 (rank 60 in comm 240): Fatal error in PMPI_Iprobe: Other MPI error, error stack:
> > > > > > > PMPI_Iprobe(124)..............: MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG, comm=0xc4000026, flag=0x7ffc130f9c4c, status=0x7ffc130f9e80) failed
> > > > > > > MPID_Iprobe(240)..............:
> > > > > > > MPIDI_iprobe_safe(108)........:
> > > > > > > MPIDI_iprobe_unsafe(35).......:
> > > > > > > MPIDI_OFI_do_iprobe(69).......:
> > > > > > > MPIDI_OFI_handle_cq_error(949): OFI poll failed (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
> > > > > > > Assertion failed in file src/mpid/ch4/netmod/ofi/ofi_events.c at line 125: 0
> > > > > > > /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(MPL_backtrace_show+0x26) [0x7f6076063f2c]
> > > > > > > /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x41dc24) [0x7f6075fc5c24]
> > > > > > > /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x49cc51) [0x7f6076044c51]
> > > > > > > /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x49f799) [0x7f6076047799]
> > > > > > > /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x451e18) [0x7f6075ff9e18]
> > > > > > > /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x452272) [0x7f6075ffa272]
> > > > > > > /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x2ce836) [0x7f6075e76836]
> > > > > > > /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x2ce90d) [0x7f6075e7690d]
> > > > > > > /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x48137b) [0x7f607602937b]
> > > > > > > /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x44d471) [0x7f6075ff5471]
> > > > > > > /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x407acd) [0x7f6075fafacd]
> > > > > > > /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(MPIR_Err_return_comm+0x10a) [0x7f6075fafbea]
> > > > > > > /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(MPI_Iprobe+0x312) [0x7f6075ddd542]
> > > > > > > /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpifort.so.12(pmpi_iprobe+0x2f) [0x7f606e08f19f]
> > > > > > > /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(__zmumps_load_MOD_zmumps_load_recv_msgs+0x142) [0x7f60737b194d]
> > > > > > > /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_try_recvtreat_+0x34) [0x7f60738ab735]
> > > > > > > /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(__zmumps_fac_par_m_MOD_zmumps_fac_par+0x991) [0x7f607378bcc8]
> > > > > > > /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_fac_par_i_+0x240) [0x7f6073881d36]
> > > > > > > Abort(805938831) on node 51 (rank 51 in comm 240): Fatal error in PMPI_Iprobe: Other MPI error, error stack:
> > > > > > > PMPI_Iprobe(124)..............: MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG, comm=0xc4000017, flag=0x7ffe20e1402c, status=0x7ffe20e14260) failed
> > > > > > > MPID_Iprobe(244)..............:
> > > > > > > progress_test(100)............:
> > > > > > > MPIDI_OFI_handle_cq_error(949): OFI poll failed (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
> > > > > > > /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_fac_b_+0x1463) [0x7f60738831a1]
> > > > > > > /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_fac_driver_+0x6969) [0x7f60738446c9]
> > > > > > > /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_+0x2d83) [0x7f60738bf9cf]
> > > > > > > /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_f77_+0x178c) [0x7f60738c33bc]
> > > > > > > /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_c+0x8f8) [0x7f60738baacb]
> > > > > > > /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(+0x894560) [0x7f6077297560]
> > > > > > > /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(MatLUFactorNumeric+0x32e) [0x7f60773bb1e6]
> > > > > > > /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(+0xf51665) [0x7f6077954665]
> > > > > > > /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(PCSetUp+0x64b) [0x7f60779c77e0]
> > > > > > > /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(KSPSetUp+0xfb6) [0x7f6077ac2d53]
> > > > > > > /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(+0x10c1c28) [0x7f6077ac4c28]
> > > > > > > /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(KSPSolve+0x13) [0x7f6077ac8070]
> > > > > > > /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(+0x11249df) [0x7f6077b279df]
> > > > > > > /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(SNESSolve+0x10df) [0x7f6077b676c6]
> > > > > > > Abort(1) on node 60: Internal error
> > > > > > > Abort(1007265423) on node 65 (rank 65 in comm 240): Fatal error in PMPI_Iprobe: Other MPI error, error stack:
> > > > > > > PMPI_Iprobe(124)..............: MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG, comm=0xc4000017, flag=0x7fff4d82827c, status=0x7fff4d8284b0) failed
> > > > > > > MPID_Iprobe(244)..............:
> > > > > > > progress_test(100)............:
> > > > > > > MPIDI_OFI_handle_cq_error(949): OFI poll failed (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
> > > > > > > Abort(941205135) on node 32 (rank 32 in comm 240): Fatal error in PMPI_Iprobe: Other MPI error, error stack:
> > > > > > > PMPI_Iprobe(124)..............: MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG, comm=0xc4000017, flag=0x7fff715ba3fc, status=0x7fff715ba630) failed
> > > > > > > MPID_Iprobe(240)..............:
> > > > > > > MPIDI_iprobe_safe(108)........:
> > > > > > > MPIDI_iprobe_unsafe(35).......:
> > > > > > > MPIDI_OFI_do_iprobe(69).......:
> > > > > > > MPIDI_OFI_handle_cq_error(949): OFI poll failed (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
> > > > > > > Abort(470941839) on node 75 (rank 75 in comm 0): Fatal error in PMPI_Test: Other MPI error, error stack:
> > > > > > > PMPI_Test(188)................: MPI_Test(request=0x7efe31e03014, flag=0x7ffea65d673c, status=0x7ffea65d6760) failed
> > > > > > > MPIR_Test(73).................:
> > > > > > > MPIR_Test_state(33)...........:
> > > > > > > progress_test(100)............:
> > > > > > > MPIDI_OFI_handle_cq_error(949): OFI poll failed (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
> > > > > > > Abort(805946511) on node 31 (rank 31 in comm 256): Fatal error in PMPI_Probe: Other MPI error, error stack:
> > > > > > > PMPI_Probe(118)...............: MPI_Probe(src=MPI_ANY_SOURCE, tag=7, comm=0xc4000015, status=0x7fff9538b7a0) failed
> > > > > > > MPID_Probe(159)...............:
> > > > > > > progress_test(100)............:
> > > > > > > MPIDI_OFI_handle_cq_error(949): OFI poll failed (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
> > > > > > > Abort(1179791) on node 73 (rank 73 in comm 0): Fatal error in PMPI_Test: Other MPI error, error stack:
> > > > > > > PMPI_Test(188)................: MPI_Test(request=0x5b638d4, flag=0x7ffd755119cc, status=0x7ffd755121b0) failed
> > > > > > > MPIR_Test(73).................:
> > > > > > > MPIR_Test_state(33)...........:
> > > > > > > progress_test(100)............:
> > > > > > > MPIDI_OFI_handle_cq_error(949): OFI poll failed (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
> > > > > > > ```
> > > > > > >
> > > > > > > Thank you very much for your time and consideration.
> > > > > > >
> > > > > > > Best wishes,
> > > > > > > Zongze
> > > > > >
> > > > > > --
> > > > > > What most experimenters take for granted before they begin their
> > > > > > experiments is infinitely more interesting than any results to
> > > > > > which their experiments lead.
> > > > > > -- Norbert Wiener
> > > > > >
> > > > > > https://www.cse.buffalo.edu/~knepley/
> > > > >
> > > > > --
> > > > > Best wishes,
> > > > > Zongze
> > > >
> > > > --
> > > > Stefano
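Matt's 1% fill estimate earlier in the thread can be reproduced with a few lines. This is an illustrative sketch only, not a PETSc or MUMPS API: the helper name `factor_memory_bytes` and the default fill fraction are assumptions introduced here, with 8-byte (double precision) entries.

```python
# Back-of-envelope sketch (not a PETSc/MUMPS API): estimate the memory
# needed to store the factors of an n x n sparse matrix, given an assumed
# fill fraction and 8 bytes per entry (double precision).

def factor_memory_bytes(n_dofs, fill=0.01, bytes_per_entry=8):
    """Estimated bytes to hold the factors of an n_dofs x n_dofs matrix."""
    return n_dofs * n_dofs * fill * bytes_per_entry

needed = factor_memory_bytes(3e6)  # (3e6)^2 * 0.01 * 8 ~ 7.2e11 B
available = 3 * 500e9              # three nodes with 500G each, ~1.5e12 B

# The 1% estimate is already within a factor of ~2 of the total memory,
# so a fill of a few percent is enough to run out.
print(f"estimated {needed:.1e} B of {available:.1e} B available")
```

Matt's "1e12 B" figure is this same quantity rounded to order of magnitude, which is all such a fill guess supports.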
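Yann's suggestion, spelled out as a command line. This is a sketch under assumptions: `./my_solver` is a placeholder for the actual application, and it assumes the program is driven through PETSc's options database with MUMPS selected as the LU solver. Only `-mat_mumps_icntl_4 2` is the option named in the thread; ICNTL(4) is MUMPS's print level, and level 2 reports errors, warnings, and main statistics, including the estimated memory needed for factorization.

```shell
# Sketch only: "./my_solver" is a hypothetical application driven by the
# PETSc options database. -pc_type lu with -pc_factor_mat_solver_type mumps
# selects MUMPS as the direct solver; -mat_mumps_icntl_4 2 raises MUMPS's
# print level so the factorization statistics (including estimated memory)
# are reported.
mpiexec -n 90 ./my_solver \
    -pc_type lu \
    -pc_factor_mat_solver_type mumps \
    -mat_mumps_icntl_4 2
```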
