Hey Jeff,

George Bosilca already cleared it up in a previous answer. I tested everything again, and once I take the exit code modulo 256, everything behaves as expected.
BR Alex

From: Jeff Squyres (jsquyres) <jsquy...@cisco.com>
Sent: Wednesday, July 19, 2023 5:09 PM
To: George Bosilca <bosi...@icl.utk.edu>; Open MPI Users <users@lists.open-mpi.org>
Cc: Alexander Stadik <alexander.sta...@essteyr.com>
Subject: [EXT] Re: [OMPI users] [EXT] Re: Error handling

MPI_Allreduce should work just fine, even with negative numbers. If you are seeing something different, can you provide a small reproducer program that shows the problem? We can dig deeper into it if we can reproduce the problem.

mpirun's exit status can't distinguish between MPI processes that call MPI_Finalize and then return a non-zero exit status and those that invoked MPI_Abort. But if you have 1 process that invokes MPI_Abort with an exit status <255, it should be reflected in mpirun's exit status. For example:

$ cat abort.c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int i, rank, size;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == size - 1) {
        int err_code = 79;
        fprintf(stderr, "I am rank %d and am aborting with error code %d\n",
                rank, err_code);
        MPI_Abort(MPI_COMM_WORLD, err_code);
    }

    fprintf(stderr, "I am rank %d and am exiting with 0\n", rank);
    MPI_Finalize();
    return 0;
}

$ mpicc abort.c -o abort
$ mpirun --host mpi004:2,mpi005:2 -np 4 ./abort
I am rank 0 and am exiting with 0
I am rank 1 and am exiting with 0
I am rank 2 and am exiting with 0
I am rank 3 and am aborting with error code 79
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 79.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
$ echo $?
79

________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of Alexander Stadik via users <users@lists.open-mpi.org>
Sent: Wednesday, July 19, 2023 12:45 AM
To: George Bosilca <bosi...@icl.utk.edu>; Open MPI Users <users@lists.open-mpi.org>
Cc: Alexander Stadik <alexander.sta...@essteyr.com>
Subject: Re: [OMPI users] [EXT] Re: Error handling

Hey George,

I said "random" only because I do not see the method behind it. When I do exactly that, an Allreduce with MIN followed by returning a negative number, I usually get 248, 253, 11, or 6, so the number seems to come purely from the MPI side.

The problem with MPI_Abort is that it shows the correct number in its output in the logfile, but it does not communicate that value to the other processes or forward it as the exit code. So one always sees these "random" values there as well.

When using positive numbers in range it seems to work, so my question was how it works and how one can do it. Is there a way to let MPI_Abort communicate the value as the exit code? Why do negative numbers not work, or does one simply always have to use positive numbers?

I would prefer Abort because it seems safer.

BR Alex

________________________________
From: George Bosilca <bosi...@icl.utk.edu>
Sent: Tuesday, July 18, 2023 6:47 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Alexander Stadik <alexander.sta...@essteyr.com>
Subject: [EXT] Re: [OMPI users] Error handling

Alex,

How are your values "random" if you provide correct values? Even for negative values you could use MIN to pick one value and return it.

What is the problem with `MPI_Abort`? It does seem to do what you want.

George.
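The values reported above (248, 253, and so on) are consistent with how POSIX systems report exit statuses: the parent process only sees the low 8 bits of whatever the child passed to exit, which is the "modulo 256" behavior mentioned at the top of the thread. The following is a minimal sketch that shows this without MPI, using plain fork and waitpid. The helper name `observed_exit_status` is made up for illustration and is not part of MPI or Open MPI.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not an MPI or Open MPI function): start a child
 * process that exits with `code` and return the status the parent sees.
 * Returns -1 if fork/waitpid fails or the child did not exit normally. */
int observed_exit_status(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);              /* child: exit at once with the raw code */

    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;

    /* WEXITSTATUS yields only the low 8 bits of the child's exit code. */
    return WEXITSTATUS(status);
}
```

With this helper, `observed_exit_status(79)` gives 79, while `observed_exit_status(-8)` gives 248 and `observed_exit_status(-3)` gives 253. Which negative codes actually produced the values seen in the thread is not stated there, so these inputs are only examples of the arithmetic.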
On Tue, Jul 18, 2023 at 4:38 AM Alexander Stadik via users <users@lists.open-mpi.org> wrote:

Hey everyone,

I have been working with CUDA-aware Open MPI for quite some time now, and a while back I developed a small exception-handling framework covering MPI and CUDA exceptions.

Currently I am using MPI_Abort with custom error numbers to terminate everything cleanly. This works well as long as one just reads the logfile after a crash.

Now I was wondering how one can handle return / exit codes properly between processes, since we would like to filter non-zero exits by return code.

One way is a simple Allreduce (in my case) + exit instead of Abort. But the problem seems to be that the values are always "random" (since I was using negative codes). It only seems to work correctly when using MPI error codes, but their usefulness for this is limited.

Any suggestions on how to do this / how it can work properly?

BR Alex

DI Alexander Stadik
Head of Large Scale Solutions
Research & Development | Large Scale Solutions
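As a rough illustration of the "Allreduce + exit" idea from the original question, the sketch below first picks the smallest (most negative) code across ranks and then maps it into a range that survives as a process exit status. It is plain C so that it runs without MPI: `min_code` is only a stand-in for an `MPI_Allreduce` with `MPI_MIN`, and `to_exit_status` is a hypothetical mapping. The convention that 0 means success and negative values mean errors is an assumption, as is the clamp to 254, which is chosen to stay below the 255 bound mentioned earlier in the thread.

```c
#include <stddef.h>

/* Stand-in for MPI_Allreduce(..., MPI_MIN, ...): returns the smallest
 * code in the array, i.e. the most negative error across all "ranks".
 * Assumes n >= 1. */
int min_code(const int *codes, size_t n)
{
    int m = codes[0];
    for (size_t i = 1; i < n; ++i) {
        if (codes[i] < m)
            m = codes[i];
    }
    return m;
}

/* Hypothetical mapping (an assumption, not an Open MPI API): turn a
 * custom code (0 = success, nonzero = error) into a value in 0..254
 * so that it is not altered by the 8-bit exit-status truncation. */
int to_exit_status(int code)
{
    if (code == 0)
        return 0;

    /* Use the magnitude; long long avoids overflow for INT_MIN. */
    long long magnitude = code < 0 ? -(long long)code : (long long)code;
    return magnitude > 254 ? 254 : (int)magnitude;
}
```

For per-rank codes {0, -3, 0, -8}, `min_code` selects -8 and `to_exit_status(-8)` yields 8, a value that can be passed to exit or MPI_Abort and read back unchanged from the launcher's exit status.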