Hi George,

I studied the ULFM document before starting my tests with failure detection in Open MPI, and it seems like a good choice.
However, I'm having trouble with the ULFM-enabled version of Open MPI (openmpi-1.7ft_b3.tar.gz). I followed the ULFM setup instructions (http://fault-tolerance.org/ulfm/ulfm-setup/). The program seems to compile fine, but running it produces the error below. Now no MPI program runs anymore (with or without ft). Could you help me?

Thanks a lot!

Edson

Linux version 3.2.0-51-generic (buildd@allspice) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #77-Ubuntu SMP Wed Jul 24 20:18:19 UTC 2013

----------------
edson@edson:~/UFPR/MPI_Fault$ mpirun -np 8 -am ft-enable-mpi ./teste1
[edson:04372] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_errmgr_default: /usr/local/lib/openmpi/mca_errmgr_default.so: undefined symbol: orte_errmgr_base_error_abort (ignored)
[edson:04372] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_grpcomm_basic: /usr/local/lib/openmpi/mca_grpcomm_basic.so: undefined symbol: opal_profile_file (ignored)
[edson:04372] *** Process received signal ***
[edson:04372] Signal: Segmentation fault (11)
[edson:04372] Signal code: Address not mapped (1)
[edson:04372] Failing at address: 0x14
[edson:04372] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7f5d425bdcb0]
[edson:04372] [ 1] /usr/local/lib/openmpi/mca_rmaps_load_balance.so(+0xa88) [0x7f5d409bca88]
[edson:04372] [ 2] /usr/local/lib/libopen-rte.so.0(orte_rmaps_base_map_job+0x112) [0x7f5d42838132]
[edson:04372] [ 3] /usr/local/lib/libopen-rte.so.0(orte_plm_base_setup_job+0x11c) [0x7f5d4283362c]
[edson:04372] [ 4] /usr/local/lib/openmpi/mca_plm_rsh.so(+0x4ee7) [0x7f5d401a9ee7]
[edson:04372] [ 5] mpirun(orterun+0xeb0) [0x404420]
[edson:04372] [ 6] mpirun(main+0x20) [0x4033c4]
[edson:04372] [ 7] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f5d4221076d]
[edson:04372] [ 8] mpirun() [0x4032e9]
[edson:04372] *** End of error message ***
Segmentation fault (core dumped)
-----------

> Edson,
>
> Based on your questions I would suggest you take a look
> at the ULFM-enabled version of Open MPI. You can find it at
> http://fault-tolerance.org/.
>
> George.
>
>
> On Aug 11, 2013, at 15:33 , Edson Tavares de Camargo
> <etcama...@inf.ufpr.br> wrote:
>
>> Thanks a lot for your reply, Ralph!
>>
>> Could you tell me in what situation the error handler would be called in
>> the 1.6.5 version?
>>
>> I had thought that a failure in a process would be caught by the error
>> handler. Wouldn't killing or aborting the process lead to the same
>> behaviour?
>>
>> In the 1.7.4 release, if a process is killed, will the error handler be
>> called?
>>
>> Thanks,
>>
>> Edson
>> ---------------------
>>
>>> The error handler wouldn't be called in that situation - we simply
>>> abort the job. We expect to provide that integration in something like
>>> the 1.7.4 release milestone.
>>>
>>>
>>> On Aug 10, 2013, at 11:07 AM, Edson Tavares de Camargo
>>> <etcama...@inf.ufpr.br> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I was looking for posts about fault tolerance in MPI and found the
>>>> post below:
>>>>
>>>> http://www.open-mpi.org/community/lists/users/2012/06/19658.php
>>>>
>>>> I am trying to understand all the work on failure detection present
>>>> in Open MPI. So I began with a simple application, a ring application
>>>> (ring.c), to understand error handlers. But it seems that it didn't
>>>> work. Why not? (The code is below.)
>>>>
>>>> The application's processes were running on the same machine,
>>>> launched with the following command line:
>>>>
>>>> $ mpiexec -n 4 ring
>>>>
>>>> While the ring application was running, one of the processes was
>>>> killed. The entire application then stopped (fine so far), but no
>>>> error message was shown. Shouldn't the line if(error != MPI_SUCCESS)
>>>> have taken effect?
>>>>
>>>> I am using mpiexec (OpenRTE) 1.6.5.
>>>>
>>>> Thanks in advance,
>>>>
>>>> Edson
>>>>
>>>> -----------------------------------------------
>>>> #include <stdio.h>
>>>> #include <mpi.h>
>>>> #include <time.h>
>>>>
>>>> int main( int argc, char *argv[] )
>>>> {
>>>>     int rank, size;
>>>>     int n = 0;
>>>>     int tag = 0;
>>>>     int error;
>>>>     int root = 0;
>>>>     int next, previous;
>>>>     double start = 0;
>>>>     double finish = 0;
>>>>
>>>>     MPI_Status status;
>>>>
>>>>     MPI_Init( &argc, &argv );
>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>
>>>>     // error handler
>>>>     MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
>>>>
>>>>     do {
>>>>         next = (rank + 1) % (size);
>>>>         n++;
>>>>
>>>>         if(rank != 0){
>>>>             previous = (rank - 1);
>>>>         }else{
>>>>             previous = size - 1;
>>>>         }
>>>>
>>>>         if (rank =
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
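
For reference, the neighbour computation in the quoted ring.c can be isolated as shown below. This is only a minimal sketch in plain C with no MPI calls: the helper names ring_next and ring_prev are hypothetical and do not appear in the original program, and the modulo form of the "previous" rank is a substitute for the if/else in the quoted code, assumed to give the same result for 0 <= rank < size.

```c
#include <assert.h>

/* Rank that follows `rank` in a ring of `size` processes
 * (same expression as `next` in the quoted ring.c). */
int ring_next(int rank, int size)
{
    assert(size > 0 && rank >= 0 && rank < size);
    return (rank + 1) % size;
}

/* Rank that precedes `rank`; rank 0 wraps around to size - 1.
 * Adding `size` before the modulo keeps the operand non-negative. */
int ring_prev(int rank, int size)
{
    assert(size > 0 && rank >= 0 && rank < size);
    return (rank - 1 + size) % size;
}
```

In the quoted program these would take the values obtained from MPI_Comm_rank and MPI_Comm_size, for example next = ring_next(rank, size). Both helpers check their arguments with assert, so an out-of-range rank stops the program early.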