I think you're right, good catch. It could be that this user deleted the
files manually, but I wonder if there's some way this is a Kafka
bug--e.g. if multiple types of retention policy kick in at the same time,
do we synchronize that properly?

-Jay

On Sat, Jan 24, 2015 at 9:26 PM, Jaikiran Pai <jai.forums2...@gmail.com>
wrote:

> Hi Jay,
>
> I spent some more time on this today and went back to the original
> thread that brought up the file leak issue [1]. I think the lsof output
> in those logs has a very important hint:
>
> /home/work/data/soft/kafka-0.8/data/_oakbay_v2_search_topic_ypgsearch_yellowpageV2-0/00000000000068818668.log (deleted) java 8446 root 725u REG 253,2 536910838 26087364
>
> /home/work/data/soft/kafka-0.8/data/_oakbay_v2_search_topic_ypgsearch_yellowpageV2-0/00000000000069457098.log (deleted) java 8446 root 726u REG 253,2 536917902 26087368
>
> Notice the "(deleted)" text in that output. The last time I looked at
> it, I thought the user had added "(deleted)" to help us understand the
> problem. But today I read up on the output format of lsof, and it turns
> out lsof itself adds that hint whenever a file has already been deleted
> (possibly by a different process) while some other process still holds
> open resources on that (deleted) file [2].
>
> So in the context of the issue we're discussing and the way Kafka deals
> with async deletes (i.e. first attempting a rename of the log/index
> files), I think this all makes sense now. What I think is happening is:
> some (other?) process (not sure what or why) has already deleted the
> log file that Kafka is using for the LogSegment. The LogSegment,
> however, still has an open FileChannel on that deleted file (which is
> why the open file descriptor is still held and shows up in that
> output). Then at some point Kafka triggers an async delete of the
> LogSegment, which involves renaming that (already deleted) log file.
> The rename fails (because the original file path isn't there anymore).
> As a result, we end up throwing that "failed to rename"
> KafkaStorageException and leave the open FileChannel open forever (till
> the Kafka process exits).
>
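> To make that failure path concrete, here's a rough sketch of how an
> async delete that starts with a rename can leak the channel. The class
> and method names are made up for illustration; this is not the actual
> Kafka code:
>
>     import java.io.File;
>     import java.io.IOException;
>     import java.nio.channels.FileChannel;
>
>     // Hypothetical sketch of the leaky failure path described above.
>     public class LeakySegmentDelete {
>         static void asyncDelete(File logFile, FileChannel channel) throws IOException {
>             final File renamed = new File(logFile.getPath() + ".deleted");
>             if (!logFile.renameTo(renamed)) {
>                 // The rename fails because the original path was already
>                 // deleted externally. Throwing here skips channel.close(),
>                 // so the descriptor to the (deleted) file stays open till
>                 // the JVM exits -- exactly what the lsof output shows.
>                 throw new IOException("Failed to rename " + logFile + " to " + renamed);
>             }
>             // ...otherwise the actual delete would be scheduled to run later
>         }
>     }
>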
> So I think we should:
>
> 1) Find out what deletes the underlying log file(s), and why. I'll add
> a reply to that original mail discussion asking the user if he can
> provide more details.
> 2) Handle this case and close the FileChannel. The patch that's been
> uploaded to review board https://reviews.apache.org/r/29755/ does that:
> on a failed rename, the "immediate delete" (safely) closes the open
> FileChannel and (safely) deletes the (possibly non-existent) file
> (sketched below).
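>
> A minimal sketch of that fallback (the names are illustrative only; the
> real change is in the review board patch):
>
>     import java.io.File;
>     import java.nio.channels.FileChannel;
>
>     // Hypothetical sketch of the "immediate delete" fallback in (2).
>     public class ImmediateDelete {
>         static void deleteImmediately(File logFile, FileChannel channel) {
>             try {
>                 // Closing the channel releases the file descriptor even
>                 // if the underlying file was already deleted externally.
>                 channel.close();
>             } catch (Exception e) {
>                 // Log and continue; a failure to close must not
>                 // prevent the delete attempt below.
>             }
>             // File.delete() just returns false if the path no longer
>             // exists, so this is safe for an already-deleted file.
>             logFile.delete();
>         }
>     }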
>
> By the way, this entire thing can be easily reproduced by running the
> following program, which first creates a file and opens a FileChannel
> to it, then waits for you to delete that file externally (I used the rm
> command), and then tries to rename the deleted file, which fails.
> Between each of these steps, you can run the lsof command externally to
> see the open file resources (I used 'lsof | grep test.log'):
>
>     import java.io.File;
>     import java.io.RandomAccessFile;
>     import java.nio.channels.FileChannel;
>
>     public class DeletedFileRenameTest {
>
>         public static void main(String[] args) throws Exception {
>             // Open a file and a file channel for read/write.
>             // Change this path appropriately if you plan to run it.
>             final File originalLogFile = new File("/home/jaikiran/deleteme/test.log");
>             final FileChannel fileChannel =
>                     new RandomAccessFile(originalLogFile, "rw").getChannel();
>             System.out.println("Opened file channel to " + originalLogFile);
>             // wait for the underlying file to be deleted externally
>             System.out.println("Waiting for " + originalLogFile
>                     + " to be deleted externally. Press any key after the file is deleted");
>             System.in.read();
>             // wait for the user to check the lsof output
>             System.out.println(originalLogFile
>                     + " seems to have been deleted externally, check the lsof output to see open file resources.");
>             System.out.println("Press any key to try renaming this already deleted file, from the program");
>             System.in.read();
>             // try the rename
>             final File fileToRenameTo = new File(originalLogFile.getPath() + ".deleted");
>             System.out.println("Trying to rename " + originalLogFile + " to " + fileToRenameTo);
>             final boolean renameSucceeded = originalLogFile.renameTo(fileToRenameTo);
>             if (renameSucceeded) {
>                 System.out.println("Rename SUCCEEDED. Renamed file exists? " + fileToRenameTo.exists());
>             } else {
>                 System.out.println("FAILED to rename file " + originalLogFile + " to " + fileToRenameTo);
>             }
>             // wait for the user to check the lsof output, after the rename failed
>             System.out.println("Check the lsof output and press any key to close the open file channel to the deleted file");
>             System.in.read();
>             // close the file channel
>             fileChannel.close();
>             // let the user check the lsof output one final time; this time there
>             // won't be open file resources from this program
>             System.out.println("File channel closed. Check the lsof output and press any key to terminate the program");
>             System.in.read();
>             // all done, exit
>             System.out.println("Program will terminate");
>         }
>     }
>
>
>
> [1] http://mail-archives.apache.org/mod_mbox/kafka-users/201501.mbox/%3CCAA4R6b-7gSbPp5_ebGpwYyNibDAwE_%2BwoE%2BKbiMuU27-2j%2BLkg%40mail.gmail.com%3E
> [2] http://unixhelp.ed.ac.uk/CGI/man-cgi?lsof+8
>
>
> -Jaikiran
>
> On Saturday 24 January 2015 11:12 PM, Jay Kreps wrote:
>
>> Hey guys,
>>
>> Jaikiran posted a patch on KAFKA-1853 to improve the handling of failures
>> during delete.
>> https://issues.apache.org/jira/browse/KAFKA-1853
>>
>> The core problem here is that we are doing a File.renameTo() as part of
>> the delete sequence, which returns false if the rename fails. Our file
>> delete sequence is something like the following:
>> 1. Remove the file from the index so no new reads can begin on it
>> 2. Rename the file to xyz.deleted so that if we crash it will get cleaned
>> up
>> 3. Schedule a task to delete the file in 30 seconds or so, when any
>> in-progress reads have likely completed. The goal here is to avoid errors
>> on in-progress reads while also avoiding locking on all reads. (This
>> sequence is sketched in code below.)
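>>
>> Roughly, in code (a sketch only -- the names and the exact scheduling
>> are illustrative, not the actual Kafka implementation):
>>
>>     import java.io.File;
>>     import java.util.concurrent.Executors;
>>     import java.util.concurrent.ScheduledExecutorService;
>>     import java.util.concurrent.TimeUnit;
>>
>>     // Hypothetical sketch of the three-step delete sequence above.
>>     public class DeferredSegmentDelete {
>>         static final ScheduledExecutorService scheduler =
>>                 Executors.newSingleThreadScheduledExecutor();
>>
>>         static void deleteSegment(File logFile) {
>>             // 1. (done elsewhere) remove the segment from the index so
>>             //    no new reads can begin on it
>>             // 2. rename, so that a crash before step 3 leaves a file
>>             //    that is obviously dead and can be cleaned up on restart
>>             final File renamed = new File(logFile.getPath() + ".deleted");
>>             if (!logFile.renameTo(renamed)) {
>>                 // this is the failure case under discussion
>>                 throw new RuntimeException("Failed to rename " + logFile);
>>             }
>>             // 3. delete later, once in-progress reads have likely
>>             //    completed: no errors on those reads, and no locking
>>             //    on every read
>>             scheduler.schedule(new Runnable() {
>>                 public void run() {
>>                     renamed.delete();
>>                 }
>>             }, 30, TimeUnit.SECONDS);
>>         }
>>     }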
>>
>> The question is what to do when the rename fails. Previously, if this
>> happened, we didn't pay attention and would fail to delete the file
>> entirely. This patch changes it so that if the rename fails we log an
>> error and force an immediate delete.
>>
>> I think this is the right thing to do, but I guess the real question is
>> why would rename fail? Some possibilities:
>> http://stackoverflow.com/questions/2372374/why-would-a-file-rename-fail-in-java
>>
>> An alternative would be to treat this as a filesystem error and shut
>> down as we do elsewhere.
>>
>> Thoughts?
>>
>> -Jay
>>
>>
>
