On Mon, Mar 12, 2012 at 05:42:25PM +0100, Michael Hanke wrote:
> On Mon, Mar 12, 2012 at 11:24:09AM -0500, Alan De Smet wrote:
> > Michael Hanke <michael.ha...@gmail.com> wrote:
> > > I'm testing DMTCP-based checkpointing of vanilla jobs in our Condor
> > > pool (all version 7.7.5). I noticed that jobs, once evicted, leave
> > > files behind in SPOOL even after they get restarted on an exec node.
> > > Checkpoint files, the executable, the restart script and various
> > > other files remain -- I assume that is simply everything.
> >
> > I'm not clear what you're reporting. Files in SPOOL should
> > remain as long as the associated job is still in the queue. Are
> > you saying that the job in question left the queue (is no longer
> > visible in condor_q), but still has a subdirectory in SPOOL?
>
> Yes, that is the case -- sorry for having been vague. It doesn't seem
> to make a difference whether a job terminates normally or gets
> condor_rm'ed. Whenever such a job got checkpointed once, it leaves its
> remains in SPOOL.
Here are a few more bits of information. I currently have a vanilla job
(3009.9) running that got checkpointed yesterday (by DMTCP); the result
is in SPOOL:

% ls -l /var/spool/condor/3009/9/
total 12
drwxr-xr-x 2 mih mih 4096 Apr 11 15:14 cluster3009.proc9.subproc0/
drwxr-xr-x 2 mih mih 4096 Apr 11 15:14 cluster3009.proc9.subproc0.swap/
drwxr-xr-x 2 mih mih 4096 Apr 11 15:14 cluster3009.proc9.subproc0.tmp/

% condor_q -l 3009.10 | grep 'Owner'
Owner = "mih"

Right now the job is still running, so I wouldn't expect Condor to
remove the checkpoints. However, ShadowLog has this:

04/11/12 15:14:20 (3009.9) (11192): Failed to remove /var/spool/condor/3009/9/cluster3009.proc9.subproc0.swap: Permission denied (errno 13)
04/11/12 15:14:20 (3009.9) (11192): Job 3009.9 is being evicted from slot1@snake1
04/11/12 15:15:20 (3009.9) (11192): **** condor_shadow (condor_SHADOW) pid 11192 EXITING WITH STATUS 107

% ps aux | grep 3009.9
mih 7929 0.0 0.0 93700 6796 ? S Apr11 0:02 condor_shadow -f 3009.9 --schedd=<10.0.0.254:47584> --xfer-queue=limit=upload,download;addr=<10.0.0.254:42112> <10.0.0.254:42112> -

It seems like the shadow should have permission to remove this
directory. Maybe the directory was no longer present at that point and
the shadow tried to remove it unconditionally -- but then the error
should be "file not found", not "permission denied".
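(As an aside: rmdir(2) sets EACCES when the caller lacks write
permission on the directory *containing* the one being removed -- who
owns the target directory itself doesn't matter. So the message above
would also be consistent with the shadow doing its cleanup under a uid
that has no write permission on the parent /var/spool/condor/3009/9/,
whose ownership isn't shown above. A quick demonstration with throwaway
paths:

% mkdir -p /tmp/acltest/parent/child
% chmod a-w /tmp/acltest/parent
% rmdir /tmp/acltest/parent/child
rmdir: failed to remove '/tmp/acltest/parent/child': Permission denied

Here the caller owns both directories, yet removal still fails with
errno 13.)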
I tried using a simple 'sleep' test job to get a sense of what is going
on (submit file is attached).

First test: condor_submit -> condor_vacate_job -> condor_rm

# what the logs say
% grep 3029.0 *
MatchLog:04/12/12 10:52:53 Request 03029.00000:
MatchLog:04/12/12 10:52:53 Matched 3029.0 m...@example.com <10.0.0.254:47584> preempting none <10.0.0.5:37266> slot1@snake5
MatchLog:04/12/12 10:55:51 Request 03029.00000:
MatchLog:04/12/12 10:55:51 Matched 3029.0 m...@example.com <10.0.0.254:47584> preempting none <10.0.0.5:37266> slot1@snake5
NegotiatorLog:04/12/12 10:52:53 Request 03029.00000:
NegotiatorLog:04/12/12 10:52:53 Matched 3029.0 m...@example.com <10.0.0.254:47584> preempting none <10.0.0.5:37266> slot1@snake5
NegotiatorLog:04/12/12 10:55:51 Request 03029.00000:
NegotiatorLog:04/12/12 10:55:51 Matched 3029.0 m...@example.com <10.0.0.254:47584> preempting none <10.0.0.5:37266> slot1@snake5
SchedLog:04/12/12 10:52:53 (pid:29046) Starting add_shadow_birthdate(3029.0)
SchedLog:04/12/12 10:52:53 (pid:29046) Started shadow for job 3029.0 on slot1@snake5 <10.0.0.5:37266> for mih, (shadow pid = 11445)
SchedLog:04/12/12 10:54:03 (pid:29046) Shadow pid 11445 for job 3029.0 exited with status 107
SchedLog:04/12/12 10:54:03 (pid:29046) Match record (slot1@snake5 <10.0.0.5:37266> for mih, 3029.0) deleted
SchedLog:04/12/12 10:55:51 (pid:29046) Starting add_shadow_birthdate(3029.0)
SchedLog:04/12/12 10:55:51 (pid:29046) Started shadow for job 3029.0 on slot1@snake5 <10.0.0.5:37266> for mih, (shadow pid = 14533)
SchedLog:04/12/12 10:58:28 (pid:29046) Shadow pid 14533 for job 3029.0 exited with status 102
SchedLog:04/12/12 10:58:28 (pid:29046) Failed to remove /var/spool/condor/3029/0/cluster3029.proc0.subproc0: Directory not empty (errno 39)
ShadowLog:04/12/12 10:52:53 Initializing a VANILLA shadow for job 3029.0
ShadowLog:04/12/12 10:52:53 (3029.0) (11445): Request to run on slot1@snake5 <10.0.0.5:37266> was ACCEPTED
ShadowLog:04/12/12 10:53:55 (3029.0) (11445): Got SIGTERM. Performing graceful shutdown.
ShadowLog:04/12/12 10:54:03 (3029.0) (11445): Failed to remove /var/spool/condor/3029/0/cluster3029.proc0.subproc0.swap: Permission denied (errno 13)
ShadowLog:04/12/12 10:54:03 (3029.0) (11445): Job 3029.0 is being evicted from slot1@snake5
ShadowLog:04/12/12 10:54:03 (3029.0) (11445): **** condor_shadow (condor_SHADOW) pid 11445 EXITING WITH STATUS 107
ShadowLog:04/12/12 10:55:51 Initializing a VANILLA shadow for job 3029.0
ShadowLog:04/12/12 10:55:51 (3029.0) (14533): Request to run on slot1@snake5 <10.0.0.5:37266> was ACCEPTED
ShadowLog:04/12/12 10:58:28 (3029.0) (14533): Requesting fast removal of job.
ShadowLog:04/12/12 10:58:28 (3029.0) (14533): Job 3029.0 is being evicted from slot1@snake5
ShadowLog:04/12/12 10:58:28 (3029.0) (14533): **** condor_shadow (condor_SHADOW) pid 14533 EXITING WITH STATUS 102

# SPOOL content AFTER restart
% ll /var/spool/condor/3029/0
total 12K
drwxr-xr-x 2 mih mih 4.0K Apr 12 10:54 cluster3029.proc0.subproc0/
drwxr-xr-x 2 mih mih 4.0K Apr 12 10:54 cluster3029.proc0.subproc0.swap/
drwxr-xr-x 2 mih mih 4.0K Apr 12 10:54 cluster3029.proc0.subproc0.tmp/
% ll /var/spool/condor/3029/0/cluster3029.proc0.subproc0
total 2.0M
-rw-r--r-- 1 mih mih    0 Apr 12 10:54 3029.0.err
-rw-r--r-- 1 mih mih    0 Apr 12 10:54 3029.0.out
-rw-r--r-- 1 mih mih 3.0K Apr 12 10:54 3029.0.shimlog
-rw------- 1 mih mih 2.0M Apr 12 10:54 ckpt_sleep_3029_0-1a0c4a68dcce1-13241-4f8697e8.dmtcp
-rw-r--r-- 1 mih mih    0 Apr 12 10:54 _condor_stderr
-rw-r--r-- 1 mih mih   91 Apr 12 10:54 _condor_stdout
-rwxr--r-- 1 mih mih 3.9K Apr 12 10:54 dmtcp_restart_script.sh*

# SPOOL content AFTER condor_rm
% ll /var/spool/condor/3029/0
total 4.0K
drwxr-xr-x 2 mih mih 4.0K Apr 12 10:54 cluster3029.proc0.subproc0/
% ll /var/spool/condor/3029/0/cluster3029.proc0.subproc0
total 2.0M
-rw-r--r-- 1 mih mih    0 Apr 12 10:54 3029.0.err
-rw-r--r-- 1 mih mih    0 Apr 12 10:54 3029.0.out
-rw-r--r-- 1 mih mih 3.0K Apr 12 10:54 3029.0.shimlog
-rw------- 1 mih mih 2.0M Apr 12 10:54 ckpt_sleep_3029_0-1a0c4a68dcce1-13241-4f8697e8.dmtcp
-rw-r--r-- 1 mih mih    0 Apr 12 10:54 _condor_stderr
-rw-r--r-- 1 mih mih   91 Apr 12 10:54 _condor_stdout
-rwxr--r-- 1 mih mih 3.9K Apr 12 10:54 dmtcp_restart_script.sh*

And now the same thing again, the only difference being that the job
exits with status 0 on its own -- no condor_rm:

# what the logs say
% grep 3030.0 *
MatchLog:04/12/12 11:01:12 Request 03030.00000:
MatchLog:04/12/12 11:01:12 Matched 3030.0 m...@example.com <10.0.0.254:47584> preempting none <10.0.0.5:37266> slot1@snake5
MatchLog:04/12/12 11:03:07 Request 03030.00000:
MatchLog:04/12/12 11:03:07 Matched 3030.0 m...@example.com <10.0.0.254:47584> preempting none <10.0.0.5:37266> slot1@snake5
NegotiatorLog:04/12/12 11:01:12 Request 03030.00000:
NegotiatorLog:04/12/12 11:01:12 Matched 3030.0 m...@example.com <10.0.0.254:47584> preempting none <10.0.0.5:37266> slot1@snake5
NegotiatorLog:04/12/12 11:03:07 Request 03030.00000:
NegotiatorLog:04/12/12 11:03:07 Matched 3030.0 m...@example.com <10.0.0.254:47584> preempting none <10.0.0.5:37266> slot1@snake5
SchedLog:04/12/12 11:01:12 (pid:29046) Starting add_shadow_birthdate(3030.0)
SchedLog:04/12/12 11:01:12 (pid:29046) Started shadow for job 3030.0 on slot1@snake5 <10.0.0.5:37266> for mih, (shadow pid = 20197)
SchedLog:04/12/12 11:02:19 (pid:29046) Shadow pid 20197 for job 3030.0 exited with status 107
SchedLog:04/12/12 11:02:19 (pid:29046) Match record (slot1@snake5 <10.0.0.5:37266> for mih, 3030.0) deleted
SchedLog:04/12/12 11:03:07 (pid:29046) Starting add_shadow_birthdate(3030.0)
SchedLog:04/12/12 11:03:07 (pid:29046) Started shadow for job 3030.0 on slot1@snake5 <10.0.0.5:37266> for mih, (shadow pid = 22184)
SchedLog:04/12/12 11:07:11 (pid:29046) Shadow pid 22184 for job 3030.0 reports job exit reason 100.
SchedLog:04/12/12 11:07:11 (pid:29046) Match record (slot1@snake5 <10.0.0.5:37266> for mih, 3030.0) deleted
SchedLog:04/12/12 11:07:11 (pid:29046) Failed to remove /var/spool/condor/3030/0/cluster3030.proc0.subproc0: Directory not empty (errno 39)
ShadowLog:04/12/12 11:01:12 Initializing a VANILLA shadow for job 3030.0
ShadowLog:04/12/12 11:01:12 (3030.0) (20197): Request to run on slot1@snake5 <10.0.0.5:37266> was ACCEPTED
ShadowLog:04/12/12 11:02:11 (3030.0) (20197): Got SIGTERM. Performing graceful shutdown.
ShadowLog:04/12/12 11:02:19 (3030.0) (20197): Failed to remove /var/spool/condor/3030/0/cluster3030.proc0.subproc0.swap: Permission denied (errno 13)
ShadowLog:04/12/12 11:02:19 (3030.0) (20197): Job 3030.0 is being evicted from slot1@snake5
ShadowLog:04/12/12 11:02:19 (3030.0) (20197): **** condor_shadow (condor_SHADOW) pid 20197 EXITING WITH STATUS 107
ShadowLog:04/12/12 11:03:07 Initializing a VANILLA shadow for job 3030.0
ShadowLog:04/12/12 11:03:07 (3030.0) (22184): Request to run on slot1@snake5 <10.0.0.5:37266> was ACCEPTED
ShadowLog:04/12/12 11:07:11 (3030.0) (22184): Job 3030.0 terminated: exited with status 0
ShadowLog:04/12/12 11:07:11 (3030.0) (22184): Reporting job exit reason 100 and attempting to fetch new job.
ShadowLog:04/12/12 11:07:11 (3030.0) (22184): **** condor_shadow (condor_SHADOW) pid 22184 EXITING WITH STATUS 100

% ll /var/spool/condor/3030/0
total 4.0K
drwxr-xr-x 2 mih mih 4.0K Apr 12 11:02 cluster3030.proc0.subproc0/
% ll /var/spool/condor/3030/0/cluster3030.proc0.subproc0
total 2.0M
-rw-r--r-- 1 mih mih    0 Apr 12 11:02 3030.0.err
-rw-r--r-- 1 mih mih    0 Apr 12 11:02 3030.0.out
-rw-r--r-- 1 mih mih 3.0K Apr 12 11:02 3030.0.shimlog
-rw------- 1 mih mih 2.0M Apr 12 11:02 ckpt_sleep_3030_0-1a0c4a68dcce1-13381-4f8699dc.dmtcp
-rw-r--r-- 1 mih mih    0 Apr 12 11:02 _condor_stderr
-rw-r--r-- 1 mih mih   91 Apr 12 11:02 _condor_stdout
-rwxr--r-- 1 mih mih 3.9K Apr 12 11:02 dmtcp_restart_script.sh*

Conclusion: despite the "permission denied" messages, some SPOOL data
does get removed -- both with condor_rm and at "natural" job exit.
However, ALL DMTCP-related files remain in SPOOL until condor_preen
catches them:

% sudo condor_preen -remove
/var/spool/condor/3030/0/cluster3030.proc0.subproc0 - Removed
/var/spool/condor/3029/0/cluster3029.proc0.subproc0 - Removed

But even condor_preen doesn't clean everything:

% ll /var/spool/condor/3030/0
total 0
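Until this is fixed, something along these lines could be cron'ed to
keep SPOOL tidy. This is only a minimal, untested sketch: it assumes
SPOOL = /var/spool/condor, the per-job directory layout shown above,
and that it runs as a user that can both query the queue and remove the
files (e.g. root). The condor_q check is there so that checkpoints of
jobs still in the queue are never touched:

#!/bin/sh
SPOOL=/var/spool/condor
# one "cluster.proc" per line for every job still in the queue
queued=$(condor_q -format "%d." ClusterId -format "%d\n" ProcId)
for dir in "$SPOOL"/*/*/cluster*.proc*.subproc*; do
    [ -d "$dir" ] || continue
    # recover "cluster.proc" from the directory name,
    # e.g. cluster3029.proc0.subproc0.swap -> 3029.0
    id=$(basename "$dir" | sed 's/^cluster\([0-9]*\)\.proc\([0-9]*\).*$/\1.\2/')
    # remove only directories whose job has left the queue
    echo "$queued" | grep -qx "$id" || rm -rf "$dir"
done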
I hope this helps. Thanks,

Michael

--
Michael Hanke
http://mih.voxindeserto.de


# DMTCP checkpointing related boilerplate (should be applicable in most cases)
universe = vanilla
executable = /usr/lib/condor/shim_dmtcp
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
kill_sig = 2
output = $(CLUSTER).$(PROCESS).shimout
error = $(CLUSTER).$(PROCESS).shimerr
log = $(CLUSTER).$(PROCESS).log
dmtcp_args = --log $(CLUSTER).$(PROCESS).shimlog --stdout $(CLUSTER).$(PROCESS).out --stderr $(CLUSTER).$(PROCESS).err
dmtcp_env = DMTCP_TMPDIR=./;JALIB_STDERR_PATH=/dev/null;DMTCP_PREFIX_ID=$(CLUSTER)_$(PROCESS)

# job-specific setup, uses arguments and environment defined above
arguments = $(dmtcp_args) sleep 240
environment = $(dmtcp_env)
queue 1
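For reference, after macro expansion the job started on the worker node
for, say, job 3030.0 (a hypothetical concrete job id) effectively looks
like this:

% /usr/lib/condor/shim_dmtcp --log 3030.0.shimlog --stdout 3030.0.out --stderr 3030.0.err sleep 240

with DMTCP_TMPDIR=./, JALIB_STDERR_PATH=/dev/null and
DMTCP_PREFIX_ID=3030_0 in its environment. The ckpt_*.dmtcp image and
dmtcp_restart_script.sh seen in SPOOL above are presumably written by
the shim into the job's scratch directory (DMTCP_TMPDIR=./) and then
transferred back on eviction via when_to_transfer_output =
ON_EXIT_OR_EVICT.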