Hi All,

As of January 29, 2010, a new release (1.1.3) of DMTCP (Distributed MultiThreaded CheckPointing) is available. Its web page is at http://dmtcp.sourceforge.net/ . We (the developers of DMTCP) have tried to test this version of DMTCP carefully with OpenMPI 1.4.1, and we believe it to be working well. We would welcome feedback from any OpenMPI users who would care to test it on their own applications.

The DMTCP package provides an alternative solution for checkpoint-restart of OpenMPI computations. Using it is as simple as:

    dmtcp_checkpoint mpirun ./hello_mpi
    # Manually checkpoint from any other terminal
    dmtcp_command --checkpoint
    # Execute the restart script, which restarts from the ckpt images that were generated.
    ./dmtcp_restart_script.sh

DMTCP works by creating a separate, stateless checkpoint coordinator, independent of OpenMPI's orterun. All OpenMPI processes are then checkpointed, including orterun. At restart time, a new DMTCP checkpoint coordinator can be used.

DMTCP is transparent and runs entirely in user space. It requires no modification to the MPI application binary, to OpenMPI, or to the operating system kernel. DMTCP also supports a dmtcpaware interface (application-initiated checkpoints) and numerous other features.

At this time, DMTCP supports only Ethernet (TCP/IP) and shared memory for transport. We are looking at supporting the InfiniBand transport layer in the future.

Finally, a bit of history. DMTCP began with the goal of checkpointing distributed desktop applications. We recognize the fine checkpoint-restart solution that already exists in OpenMPI: the checkpoint-restart service on top of BLCR. We offer DMTCP as an alternative for some unusual situations, such as when the end user does not have the privilege to add the BLCR kernel module. We are eager to gain feedback from the OpenMPI community.

Thanks,
DMTCP Developers