I have configured with the additional flags (--enable-ft-thread --enable-mpi-threads), but the behaviour is unchanged: ompi-restart still gives a segmentation fault.

Open MPI version: 1.3a1r19685
BLCR version: 0.7.3

The core file is attached. hello.c, the sample MPI program whose core was dumped, is also attached.

~]$ ompi-restart ompi_global_snapshot_11219.ckpt
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 11288 on node
acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
2 total processes killed (some possibly by mpirun during cleanup)

Best,

On Mon, Oct 6, 2008 at 6:44 PM, Josh Hursey <jjhur...@open-mpi.org> wrote:
> The installation looks ok, though I'm not sure what is causing the segfault
> of the restarted process. Two things to try. First, can you send me a
> backtrace from the core file that is generated from the segmentation fault?
> That will provide insight into what is causing it.
>
> Second, you may try to enable the C/R thread, which allows a checkpoint to
> progress when an application is in a computation loop instead of only when
> it is in the MPI library. To do so, configure with these additional flags:
> --enable-ft-thread --enable-mpi-threads
>
> What version of Open MPI are you using? What version of BLCR?
>
> Best,
> Josh
>
> On Oct 6, 2008, at 3:55 PM, arun dhakne wrote:
>
>> Hi all,
>>
>> This is the procedure I have followed to install openmpi. Is there
>> some installation or environment setting problem in here?
>> An openmpi program with 4 processes is run across 2 dual-core intel
>> machines, with 2 processes running on each machine.
>>
>> ompi-checkpoint is successful but ompi-restart fails with the following error
>>
>>
>> $:> ompi-restart ompi_global_snapshot_6045.ckpt
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 6372 on node
>> acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation
>> fault).
>> --------------------------------------------------------------------------
>>
>> Open-mpi installation steps:
>> ./configure --prefix=/home/csgrad/audhakne/.openmpi --with-ft=cr
>> --with-blcr=/usr/lib64 --enable-debug
>> make
>> make install
>>
>>
>>
>> export LD_LIBRARY_PATH=$HOME/.openmpi/lib/:$HOME/.openmpi/lib/openmpi:/usr/lib64
>> export PATH=$HOME/.openmpi/bin:$PATH
>>
>> NOTE: blcr is installed as a module
>> $:> lsmod | grep blcr
>>
>> blcr                  117892  0
>> blcr_vmadump           58264  1 blcr
>> blcr_imports           46080  2 blcr,blcr_vmadump
>>
>> Please let me know if there is a problem with the above procedure, thanks a
>> lot for your time.
>>
>> Best.
>>
>> ---------- Forwarded message ----------
>> From: arun dhakne <arundha...@gmail.com>
>> Date: Tue, Sep 30, 2008 at 12:52 AM
>> Subject: ompi-restart issue : ompi-restart doesn't work across nodes
>> To: Open MPI Users <us...@open-mpi.org>
>>
>>
>> Hi all,
>>
>> I had gone through some previous ompi-restart issues but I couldn't
>> find anything similar to this problem.
>>
>> I have installed blcr, and configured open-mpi 'openmpi-1.3a1r19645'
>>
>> i) If the sample mpi program (say np 4 on a single machine, that is,
>> without any hostfile) is run and I try to checkpoint it, it happens
>> successfully and even ompi-restart works in this case.
>>
>> ii) If the sample mpi program is run across say 2 different nodes,
>> the checkpoint happens successfully BUT ompi-restart throws the
>> following error:
>>
>> $ ompi-restart ompi_global_snapshot_7604.ckpt
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 3 with PID 9590 on node
>> acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation
>> fault).
>> --------------------------------------------------------------------------
>>
>> Please let me know if more information is needed.
>>
>> --
>> Thanks and Regards,
>> Arun U. Dhakne
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

--
Thanks and Regards,
Arun U. Dhakne
Graduate Student
Computer Science and Engineering Dept.
State University of New York at Buffalo
core.tar.gz
Description: GNU Zip compressed data
/* hello.c: sample MPI program attached to the message above */
#include <stdio.h>
#include <unistd.h>   /* sleep() */
#include <mpi.h>

int main (int argc, char *argv[])
{
  int rank, size;
  int i;
  int send, recv;

  MPI_Init (&argc, &argv);                /* starts MPI */
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);  /* get current process id */
  MPI_Comm_size (MPI_COMM_WORLD, &size);  /* get number of processes */
  printf( "Hello world from process %d of %d\n", rank, size );

  for (i=0; i < 100; i++){
    send = i;
    if (rank==0){
      /* rank 0 sends the loop counter to rank 1 */
      MPI_Send(&send, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
    }
    else{
      /* every other rank waits for a message from rank 0 */
      MPI_Recv(&recv, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      printf("Process %d says %d\n", rank, recv);
    }
    printf("Process %d says %d\n", rank, i);
    sleep(3);   /* time spent outside the MPI library */
  }

  MPI_Finalize();
  return 0;
}
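
A side note on the message pattern in hello.c, separate from the restart segfault and not claimed to be its cause: rank 0 only ever sends to rank 1, while every non-zero rank posts a receive from rank 0. The sketch below is plain C with no MPI calls; it only counts, for a given process count, how many ranks would sit in MPI_Recv waiting for a message that is never sent. The names `ranks_left_waiting` and `report_waiting` are made up for this illustration and are not part of hello.c or Open MPI.

```c
#include <stdio.h>

/* Hypothetical helper, not part of hello.c: models the posted message
 * pattern, where rank 0 sends only to rank 1 and every other rank
 * receives from rank 0. Returns how many ranks would wait in MPI_Recv
 * for a message that is never sent. */
int ranks_left_waiting(int size)
{
    int waiting = 0;
    int rank;

    for (rank = 1; rank < size; rank++) {
        int has_sender = (rank == 1);   /* rank 0 only ever targets rank 1 */
        if (!has_sender) {
            waiting++;
        }
    }
    return waiting;
}

/* Prints the result for one process count, i.e. the np value of the run. */
void report_waiting(int size)
{
    printf("np=%d: %d rank(s) left waiting in MPI_Recv\n",
           size, ranks_left_waiting(size));
}
```

Under this model, calling `report_waiting(4)` reports two waiting ranks (ranks 2 and 3), while a two-process run leaves none waiting. If the four-process runs described in the thread used this same program, it may be worth testing checkpoint/restart with a variant in which every rank takes part in the exchange, so that a hung receive is ruled out as a complicating factor.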