I think you are right, good catch. It could be that this user deleted the files manually, but I wonder if there isn't some way this is a Kafka bug--e.g. if multiple types of retention policies kick in at the same time, do we synchronize that properly?
-Jay

On Sat, Jan 24, 2015 at 9:26 PM, Jaikiran Pai <jai.forums2...@gmail.com> wrote:

> Hi Jay,
>
> I spent some more time over this today and went back to the original
> thread which brought up the issue with file leaks [1]. I think the output
> of lsof in those logs has a very important hint:
>
> java 8446 root 725u REG 253,2 536910838 26087364 /home/work/data/soft/kafka-0.8/data/_oakbay_v2_search_topic_ypgsearch_yellowpageV2-0/00000000000068818668.log (deleted)
> java 8446 root 726u REG 253,2 536917902 26087368 /home/work/data/soft/kafka-0.8/data/_oakbay_v2_search_topic_ypgsearch_yellowpageV2-0/00000000000069457098.log (deleted)
>
> Notice the "(deleted)" text in that output. The last time I looked at that
> output, I thought it was the user who had added that "deleted" text to help
> us understand the problem. But today I read up on the output format of
> lsof, and it turns out that it's lsof itself which adds that hint whenever a
> file has already been deleted (possibly by a different process) but some
> other process is still holding on to open resources of that deleted file [2].
>
> So in the context of the issue that we are discussing, and the way Kafka
> deals with async deletes (i.e. first by attempting a rename of the
> log/index files), I think this all makes sense now. What I think is
> happening is that some (other?) process (not sure what/why) has already deleted
> the log file that Kafka is using for the LogSegment. The LogSegment, however,
> still has an open FileChannel on that deleted file (and that's why
> the open file descriptor is held on to and shows up in that output). Now
> Kafka, at some point in time, triggers an async delete of the LogSegment,
> which involves a file rename of that (already deleted) log file. The rename
> fails (because the original file path isn't there anymore).
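[Editor's note: to make the failure mode concrete, here is a minimal standalone sketch of the fallback discussed in this thread. The class and method names are made up for illustration; this is not the actual Kafka code. The idea: when the rename step fails, close the channel and attempt an immediate delete instead of leaving the FileChannel open forever.]

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;

public class RenameFallbackSketch {

    /**
     * Sketch of the fallback: try the rename that the async delete normally
     * does, and if it fails, close the channel and delete immediately so the
     * file descriptor is not leaked.
     */
    static void renameOrDeleteNow(File logFile, FileChannel channel) throws IOException {
        File renamed = new File(logFile.getPath() + ".deleted");
        if (!logFile.renameTo(renamed)) {
            // Rename failed (e.g. the file was already deleted externally):
            // release the open descriptor, then best-effort delete.
            channel.close();
            logFile.delete(); // returns false if the file is already gone, which is fine
        }
    }

    public static void main(String[] args) throws Exception {
        File logFile = File.createTempFile("segment", ".log");
        FileChannel channel = new RandomAccessFile(logFile, "rw").getChannel();
        logFile.delete(); // simulate the external "rm" of a live segment file
        renameOrDeleteNow(logFile, channel);
        // Without the fallback, the channel (and its descriptor) would stay
        // open until the JVM exits; with it, the channel is closed.
        System.out.println("channel open? " + channel.isOpen());
    }
}
```

On Linux this prints "channel open? false": the unlinked file keeps its descriptor alive while the channel is open, the rename of the missing path returns false, and the fallback closes the channel.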
As a result, we
> end up throwing that "failed to rename" KafkaStorageException and thus
> leave behind the open FileChannel, which continues to be open forever (till the
> Kafka process exits).
>
> So I think we should:
>
> 1) Find out what deletes the underlying log file(s), and why. I'll add a reply to
> the original mail discussion asking the user if he can provide more details.
> 2) Handle this case and close the FileChannel. The patch that's been
> uploaded to review board https://reviews.apache.org/r/29755/ does that.
> The "immediate delete" on failure to rename involves (safely) closing the
> open FileChannel and (safely) deleting the (possibly non-existent) file.
>
> By the way, this entire thing can be easily reproduced by running the
> following program, which first creates a file and opens a FileChannel to that
> file, then waits for the user to delete that file externally (I used the
> rm command), and then tries to rename that deleted file, which
> fails. In between each of these steps, you can run the lsof command
> externally to see the open file resources (I used 'lsof | grep test.log'):
>
> public static void main(String[] args) throws Exception {
>     // Open a file and file channel for read/write
>     final File originalLogFile = new File("/home/jaikiran/deleteme/test.log");
>     // change this path relevantly if you plan to run it
>     final FileChannel fileChannel = new RandomAccessFile(originalLogFile, "rw").getChannel();
>     System.out.println("Opened file channel to " + originalLogFile);
>     // wait for underlying file to be deleted externally
>     System.out.println("Waiting for the " + originalLogFile
>             + " to be deleted externally. Press any key after the file is deleted");
>     System.in.read();
>     // wait for the user to check the lsof output
>     System.out.println(originalLogFile + " seems to have been deleted externally,"
>             + " check lsof command output to see open file resources.");
>     System.out.println("Press any key to try renaming this already deleted file, from the program");
>     System.in.read();
>     // try rename
>     final File fileToRenameTo = new File(originalLogFile.getPath() + ".deleted");
>     System.out.println("Trying to rename " + originalLogFile + " to " + fileToRenameTo);
>     final boolean renamedSucceeded = originalLogFile.renameTo(fileToRenameTo);
>     if (renamedSucceeded) {
>         System.out.println("Rename SUCCEEDED. Renamed file exists? " + fileToRenameTo.exists());
>     } else {
>         System.out.println("FAILED to rename file " + originalLogFile + " to " + fileToRenameTo);
>     }
>     // wait for the user to check the lsof output, after our rename failed
>     System.out.println("Check the lsof output and press any key to close the open file channel to a deleted file");
>     System.in.read();
>     // close the file channel
>     fileChannel.close();
>     // let user check the lsof output one final time. This time he
>     // won't see open file resources from this program
>     System.out.println("File channel closed. Check the lsof output and press any key to terminate the program");
>     System.in.read();
>     // all done, exit
>     System.out.println("Program will terminate");
> }
>
>
> [1] http://mail-archives.apache.org/mod_mbox/kafka-users/201501.mbox/%3CCAA4R6b-7gSbPp5_ebGpwYyNibDAwE_%2BwoE%2BKbiMuU27-2j%2BLkg%40mail.gmail.com%3E
> [2] http://unixhelp.ed.ac.uk/CGI/man-cgi?lsof+8
>
>
> -Jaikiran
>
> On Saturday 24 January 2015 11:12 PM, Jay Kreps wrote:
>
>> Hey guys,
>>
>> Jaikiran posted a patch on KAFKA-1853 to improve the handling of failures
>> during delete.
>> https://issues.apache.org/jira/browse/KAFKA-1853
>>
>> The core problem here is that we are doing File.rename() as part of the
>> delete sequence, which returns false if the rename failed. Our file delete
>> sequence is something like the following:
>> 1. Remove the file from the index so no new reads can begin on it.
>> 2. Rename the file to xyz.deleted so that if we crash it will get cleaned up.
>> 3. Schedule a task to delete the file in 30 seconds or so, when any
>> in-progress reads have likely completed. The goal here is to avoid errors
>> on in-progress reads but also avoid locking on all reads.
>>
>> The question is what to do when rename fails. Previously, if this happened,
>> we actually didn't pay attention and would fail to delete the file
>> entirely. This patch changes it so that if the rename fails we log an error
>> and force an immediate delete.
>>
>> I think this is the right thing to do, but I guess the real question is
>> why would rename fail? Some possibilities:
>> http://stackoverflow.com/questions/2372374/why-would-a-file-rename-fail-in-java
>>
>> An alternative would be to treat this as a filesystem error and shut down
>> as we do elsewhere.
>>
>> Thoughts?
>>
>> -Jay
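[Editor's note: the three-step delete sequence Jay describes can be sketched roughly as follows. This is illustrative only, not the actual LogManager/LogSegment code; the class name, method signature, and configurable delay are assumptions. Step 1 (removing the segment from the index) is taken as already done by the caller.]

```java
import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class AsyncDeleteSketch {

    static final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    /**
     * Rough shape of the delete sequence:
     *  - rename the file to *.deleted so a crash leaves it recognizable as garbage
     *  - schedule the physical delete for later so in-progress reads can finish
     * If the rename fails, log an error and delete immediately (the behavior
     * the patch adds) instead of silently leaking the file.
     */
    static void asyncDelete(File segmentFile, long delayMs) {
        File marked = new File(segmentFile.getPath() + ".deleted");
        if (!segmentFile.renameTo(marked)) {
            System.err.println("Failed to rename " + segmentFile + "; deleting immediately");
            segmentFile.delete();
            return;
        }
        scheduler.schedule(() -> { marked.delete(); }, delayMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        File segment = File.createTempFile("00000000000000000000", ".log");
        asyncDelete(segment, 100); // Kafka would use ~30 seconds; shortened here
        Thread.sleep(500);
        File marked = new File(segment.getPath() + ".deleted");
        System.out.println("original exists? " + segment.exists()
                + ", marked exists? " + marked.exists());
        scheduler.shutdown();
    }
}
```

The point of the delayed step 3 is that readers holding the renamed file open keep working (on POSIX filesystems the data stays reachable through open descriptors), so no read needs to take a lock against the delete.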