If I may, you can use the command line option "-mat_mumps_icntl_4 2".
MUMPS then prints information about the factorization step, such as the estimated memory required.
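
As a rough illustration of how that option could be passed, here is a minimal sketch that renders a PETSc-style options dictionary as command-line arguments. The helper `petsc_cli_args` is hypothetical (it is not part of PETSc), and it is an assumption that the solve is configured with `-pc_type lu -pc_factor_mat_solver_type mumps`; only `mat_mumps_icntl_4` comes from the suggestion above.

```python
# Hypothetical helper, not a PETSc API: turn an options dict into CLI arguments.
def petsc_cli_args(options):
    args = []
    for name, value in options.items():
        args.append(f"-{name}")
        if value is not None:
            args.append(str(value))
    return args


# Assumed solver setup: a direct LU solve through MUMPS.
mumps_options = {
    "pc_type": "lu",
    "pc_factor_mat_solver_type": "mumps",
    # MUMPS ICNTL(4) print level; 2 also reports the main statistics.
    "mat_mumps_icntl_4": 2,
}

# Append the resulting string to the usual launch command of the application.
print(" ".join(petsc_cli_args(mumps_options)))
```

The same key/value pairs could presumably also be set through an options database or a solver-parameters dictionary, if the application exposes one.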

Best regards,

Yann

On 5/23/2023 at 11:59 AM, Matthew Knepley wrote:
On Mon, May 22, 2023 at 10:42 PM Zongze Yang <[email protected] <mailto:[email protected]>> wrote:

    On Tue, 23 May 2023 at 05:31, Stefano Zampini
    <[email protected] <mailto:[email protected]>> wrote:

        If I may add to the discussion, it may be that you are going OOM
        since you are trying to factorize a 3 million dof problem; the
        OOM goes undetected and the run then fails at a later stage.

    Thank you for your comment. I ran the problem with 90 processes
    distributed across three nodes, each equipped with 500G of memory.
    Is this amount of memory sufficient for solving the matrix with
    approximately 3 million degrees of freedom?


It really depends on the fill. Suppose that you get 1% fill, then

   (3e6)^2 * 0.01 * 8 ≈ 7e11 B, i.e. on the order of 1e12 B

and you have 1.5e12 B in total, so I could easily see running out of memory.
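
The back-of-the-envelope estimate above can be reproduced with a short script. This is only a sketch: the 1% fill is the illustrative figure from this message, "500G" is assumed to mean 500e9 bytes per node, and the function name is made up for the example. The 16-byte case is included on the assumption that the build uses double-complex scalars, which the `libzmumps` frames in the backtrace suggest.

```python
def factor_bytes(n_dofs, fill_fraction, bytes_per_entry=8):
    """Storage for a factor holding fill_fraction of the n_dofs x n_dofs entries."""
    return n_dofs * n_dofs * fill_fraction * bytes_per_entry


# Three nodes with 500G each, taken here as 500e9 bytes (assumption).
available_bytes = 3 * 500e9

cases = [
    (3_000_000, 8),   # the rounded problem size, real double precision
    (3_860_000, 8),   # the reported 3.86e6 dofs, real double precision
    (3_860_000, 16),  # the reported size with double-complex entries
]

for n_dofs, bytes_per_entry in cases:
    needed = factor_bytes(n_dofs, 0.01, bytes_per_entry)
    verdict = "fits" if needed < available_bytes else "exceeds memory"
    print(f"n={n_dofs:.2e}, {bytes_per_entry} B/entry: {needed:.2e} B ({verdict})")
```

This counts only the factor entries at the assumed fill; the input matrix, MUMPS work arrays, and MPI buffers come on top, so the real requirement would be larger.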

   Thanks,

      Matt

    Thanks!
    Zongze

        On Mon, 22 May 2023 at 20:03, Zongze Yang
        <[email protected] <mailto:[email protected]>> wrote:

            Thanks!

            Zongze

            On Tue, 23 May 2023 at 00:09, Matthew Knepley <[email protected]
            <mailto:[email protected]>> wrote:

                On Mon, May 22, 2023 at 11:07 AM Zongze Yang
                <[email protected] <mailto:[email protected]>> wrote:

                    Hi,

                    I hope this letter finds you well. I am writing to
                    seek guidance regarding an error I encountered while
                    solving a matrix using MUMPS on multiple nodes:


                Iprobe is buggy on several MPI implementations. PETSc
                has an option for shutting it off for this reason.
                I do not know how to shut it off inside MUMPS however. I
                would mail their mailing list to see.

                   Thanks,

                      Matt

                    ```bash
                    Abort(1681039) on node 60 (rank 60 in comm 240):
                    Fatal error in PMPI_Iprobe: Other MPI error, error
                    stack:
                    PMPI_Iprobe(124)..............:
                    MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG,
                    comm=0xc4000026, flag=0x7ffc130f9c4c,
                    status=0x7ffc130f9e80) failed
                    MPID_Iprobe(240)..............:
                    MPIDI_iprobe_safe(108)........:
                    MPIDI_iprobe_unsafe(35).......:
                    MPIDI_OFI_do_iprobe(69).......:
                    MPIDI_OFI_handle_cq_error(949): OFI poll failed
                    (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
                    Assertion failed in file
                    src/mpid/ch4/netmod/ofi/ofi_events.c at line 125: 0
                    ```

                    The matrix in question has 3.86e+06 degrees of
                    freedom (dofs). Smaller-scale problems solve
                    without any issues. However, when attempting to
                    solve the larger matrix on multiple nodes, I
                    encounter the error above.

                    The complete error message I received is as follows:
                    ```bash
                    Abort(1681039) on node 60 (rank 60 in comm 240):
                    Fatal error in PMPI_Iprobe: Other MPI error, error
                    stack:
                    PMPI_Iprobe(124)..............:
                    MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG,
                    comm=0xc4000026, flag=0x7ffc130f9c4c,
                    status=0x7ffc130f9e80) failed
                    MPID_Iprobe(240)..............:
                    MPIDI_iprobe_safe(108)........:
                    MPIDI_iprobe_unsafe(35).......:
                    MPIDI_OFI_do_iprobe(69).......:
                    MPIDI_OFI_handle_cq_error(949): OFI poll failed
                    (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
                    Assertion failed in file
                    src/mpid/ch4/netmod/ofi/ofi_events.c at line 125: 0
                    /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(MPL_backtrace_show+0x26) [0x7f6076063f2c]
                    /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x41dc24) [0x7f6075fc5c24]
                    /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x49cc51) [0x7f6076044c51]
                    /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x49f799) [0x7f6076047799]
                    /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x451e18) [0x7f6075ff9e18]
                    /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x452272) [0x7f6075ffa272]
                    /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x2ce836) [0x7f6075e76836]
                    /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x2ce90d) [0x7f6075e7690d]
                    /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x48137b) [0x7f607602937b]
                    /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x44d471) [0x7f6075ff5471]
                    /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x407acd) [0x7f6075fafacd]
                    /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(MPIR_Err_return_comm+0x10a) [0x7f6075fafbea]
                    /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(MPI_Iprobe+0x312) [0x7f6075ddd542]
                    /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpifort.so.12(pmpi_iprobe+0x2f) [0x7f606e08f19f]
                    /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(__zmumps_load_MOD_zmumps_load_recv_msgs+0x142) [0x7f60737b194d]
                    /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_try_recvtreat_+0x34) [0x7f60738ab735]
                    /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(__zmumps_fac_par_m_MOD_zmumps_fac_par+0x991) [0x7f607378bcc8]
                    /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_fac_par_i_+0x240) [0x7f6073881d36]
                    Abort(805938831) on node 51 (rank 51 in comm 240):
                    Fatal error in PMPI_Iprobe: Other MPI error, error
                    stack:
                    PMPI_Iprobe(124)..............:
                    MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG,
                    comm=0xc4000017, flag=0x7ffe20e1402c,
                    status=0x7ffe20e14260) failed
                    MPID_Iprobe(244)..............:
                    progress_test(100)............:
                    MPIDI_OFI_handle_cq_error(949): OFI poll failed
                    (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
                    /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_fac_b_+0x1463) [0x7f60738831a1]
                    /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_fac_driver_+0x6969) [0x7f60738446c9]
                    /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_+0x2d83) [0x7f60738bf9cf]
                    /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_f77_+0x178c) [0x7f60738c33bc]
                    /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_c+0x8f8) [0x7f60738baacb]
                    /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(+0x894560) [0x7f6077297560]
                    /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(MatLUFactorNumeric+0x32e) [0x7f60773bb1e6]
                    /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(+0xf51665) [0x7f6077954665]
                    /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(PCSetUp+0x64b) [0x7f60779c77e0]
                    /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(KSPSetUp+0xfb6) [0x7f6077ac2d53]
                    /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(+0x10c1c28) [0x7f6077ac4c28]
                    /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(KSPSolve+0x13) [0x7f6077ac8070]
                    /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(+0x11249df) [0x7f6077b279df]
                    /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(SNESSolve+0x10df) [0x7f6077b676c6]
                    Abort(1) on node 60: Internal error
                    Abort(1007265423) on node 65 (rank 65 in comm 240):
                    Fatal error in PMPI_Iprobe: Other MPI error, error
                    stack:
                    PMPI_Iprobe(124)..............:
                    MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG,
                    comm=0xc4000017, flag=0x7fff4d82827c,
                    status=0x7fff4d8284b0) failed
                    MPID_Iprobe(244)..............:
                    progress_test(100)............:
                    MPIDI_OFI_handle_cq_error(949): OFI poll failed
                    (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
                    Abort(941205135) on node 32 (rank 32 in comm 240):
                    Fatal error in PMPI_Iprobe: Other MPI error, error
                    stack:
                    PMPI_Iprobe(124)..............:
                    MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG,
                    comm=0xc4000017, flag=0x7fff715ba3fc,
                    status=0x7fff715ba630) failed
                    MPID_Iprobe(240)..............:
                    MPIDI_iprobe_safe(108)........:
                    MPIDI_iprobe_unsafe(35).......:
                    MPIDI_OFI_do_iprobe(69).......:
                    MPIDI_OFI_handle_cq_error(949): OFI poll failed
                    (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
                    Abort(470941839) on node 75 (rank 75 in comm 0):
                    Fatal error in PMPI_Test: Other MPI error, error stack:
                    PMPI_Test(188)................:
                    MPI_Test(request=0x7efe31e03014,
                    flag=0x7ffea65d673c, status=0x7ffea65d6760) failed
                    MPIR_Test(73).................:
                    MPIR_Test_state(33)...........:
                    progress_test(100)............:
                    MPIDI_OFI_handle_cq_error(949): OFI poll failed
                    (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
                    Abort(805946511) on node 31 (rank 31 in comm 256):
                    Fatal error in PMPI_Probe: Other MPI error, error stack:
                    PMPI_Probe(118)...............:
                    MPI_Probe(src=MPI_ANY_SOURCE, tag=7,
                    comm=0xc4000015, status=0x7fff9538b7a0) failed
                    MPID_Probe(159)...............:
                    progress_test(100)............:
                    MPIDI_OFI_handle_cq_error(949): OFI poll failed
                    (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
                    Abort(1179791) on node 73 (rank 73 in comm 0): Fatal
                    error in PMPI_Test: Other MPI error, error stack:
                    PMPI_Test(188)................:
                    MPI_Test(request=0x5b638d4, flag=0x7ffd755119cc,
                    status=0x7ffd755121b0) failed
                    MPIR_Test(73).................:
                    MPIR_Test_state(33)...........:
                    progress_test(100)............:
                    MPIDI_OFI_handle_cq_error(949): OFI poll failed
                    (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
                    ```

                    Thank you very much for your time and consideration.

                    Best wishes,
                    Zongze



                --
                What most experimenters take for granted before they
                begin their experiments is infinitely more interesting
                than any results to which their experiments lead.
                -- Norbert Wiener

                https://www.cse.buffalo.edu/~knepley/
                <http://www.cse.buffalo.edu/~knepley/>

-- Best wishes,
            Zongze



-- Stefano



