This feels like another symptom of people using /tmp/ for their
logs.  Personally, I would rather default to /mnt/data or something
similar, and if that doesn't exist on their machine we can throw an
exception, or ship with no default and force users to set it explicitly.
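For example, something like this in server.properties (path illustrative),
instead of pointing log.dirs at /tmp:

    log.dirs=/mnt/data/kafka-logs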

/*******************************************
Joe Stein
Founder, Principal Consultant
Big Data Open Source Security LLC
http://www.stealth.ly
Twitter: @allthingshadoop
********************************************/
On Jan 25, 2015 11:37 AM, "Jay Kreps" <jay.kr...@gmail.com> wrote:

> I think you are right, good catch. It could be that this user deleted the
> files manually, but I wonder if there isn't some way that this is a Kafka
> bug--e.g. if multiple types of retention policies kick in at the same time,
> do we synchronize that properly?
>
> -Jay
>
> On Sat, Jan 24, 2015 at 9:26 PM, Jaikiran Pai <jai.forums2...@gmail.com>
> wrote:
>
> > Hi Jay,
> >
> > I spent some more time on this today and went back to the original
> > thread which brought up the issue with file leaks [1]. I think the
> > output of lsof in those logs has a very important hint:
> >
> > /home/work/data/soft/kafka-0.8/data/_oakbay_v2_search_topic_ypgsearch_yellowpageV2-0/00000000000068818668.log (deleted) java 8446 root 725u REG 253,2 536910838 26087364
> >
> > /home/work/data/soft/kafka-0.8/data/_oakbay_v2_search_topic_ypgsearch_yellowpageV2-0/00000000000069457098.log (deleted) java 8446 root 726u REG 253,2 536917902 26087368
> >
> > Notice the "(deleted)" text in that output. The last time I looked at
> > that output, I thought it was the user who had added that "deleted"
> > text to help us understand the problem. But today I read up on the
> > output format of lsof, and it turns out that lsof itself adds that hint
> > whenever a file has already been deleted (possibly by a different
> > process) while some other process is still holding open resources of
> > that (deleted) file [2].
> >
> > So in the context of the issue we are discussing, and the way Kafka
> > deals with async deletes (i.e. first by attempting a rename of the
> > log/index files), I think this all makes sense now. What I think is
> > happening is that some (other?) process (not sure what/why) has already
> > deleted the log file that Kafka is using for the LogSegment. The
> > LogSegment however still has an open FileChannel on that deleted file
> > (which is why the open file descriptor is held and shows up in that
> > output). Now Kafka, at some point in time, triggers an async delete of
> > the LogSegment, which involves a rename of that (already deleted) log
> > file. The rename fails (because the original file path isn't there
> > anymore). As a result, we end up throwing that "failed to rename"
> > KafkaStorageException and leave behind the open FileChannel, which
> > stays open forever (till the Kafka process exits).
> >
> > So I think we should:
> >
> > 1) Find out what deletes the underlying log file(s), and why. I'll add
> > a reply to the original mail discussion asking the user if he can
> > provide more details.
> > 2) Handle this case and close the FileChannel. The patch that's been
> > uploaded to the review board https://reviews.apache.org/r/29755/ does
> > that. The "immediate delete" on failure to rename involves (safely)
> > closing the open FileChannel and (safely) deleting the (possibly
> > non-existent) file; a rough sketch of that handling follows.
> >
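> > In rough Java terms, the intended handling looks something like this
> > (a simplified sketch, not the actual patch code; the class and method
> > names here are illustrative):
> >
> >     import java.io.File;
> >     import java.io.IOException;
> >     import java.nio.channels.FileChannel;
> >
> >     // Sketch: on rename failure, release the file descriptor and
> >     // attempt an immediate delete instead of leaving the channel
> >     // open forever.
> >     public class RenameFailureHandlingSketch {
> >         static void deleteSegment(File logFile, FileChannel channel) {
> >             File renamed = new File(logFile.getPath() + ".deleted");
> >             if (logFile.renameTo(renamed)) {
> >                 // normal path: schedule the async delete of 'renamed'
> >             } else {
> >                 // rename failed (e.g. the file was already deleted
> >                 // externally): close the channel so the descriptor is
> >                 // released, then try an immediate delete
> >                 try {
> >                     channel.close();
> >                 } catch (IOException e) {
> >                     // log and continue; we still attempt the delete
> >                 }
> >                 logFile.delete(); // false if already gone, which is fine
> >             }
> >         }
> >     }
> >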
> > By the way, this entire thing can be easily reproduced by running the
> > following program, which first creates a file and opens a FileChannel
> > to it, then waits for the user to delete that file externally (I used
> > the rm command), and then tries to rename the deleted file, which
> > fails. In between each of these steps, you can run the lsof command
> > externally to see the open file resources (I used 'lsof | grep
> > test.log'):
> >
> >     import java.io.File;
> >     import java.io.RandomAccessFile;
> >     import java.nio.channels.FileChannel;
> >
> >     public class DeletedFileRenameDemo {
> >
> >         public static void main(String[] args) throws Exception {
> >             // open a file and a file channel for read/write
> >             // (change this path appropriately if you plan to run it)
> >             final File originalLogFile = new File("/home/jaikiran/deleteme/test.log");
> >             final FileChannel fileChannel =
> >                     new RandomAccessFile(originalLogFile, "rw").getChannel();
> >             System.out.println("Opened file channel to " + originalLogFile);
> >
> >             // wait for the underlying file to be deleted externally
> >             System.out.println("Waiting for the " + originalLogFile
> >                     + " to be deleted externally. Press any key after the file is deleted");
> >             System.in.read();
> >
> >             // let the user check the lsof output
> >             System.out.println(originalLogFile + " seems to have been deleted"
> >                     + " externally, check lsof command output to see open file resources.");
> >             System.out.println("Press any key to try renaming this already"
> >                     + " deleted file, from the program");
> >             System.in.read();
> >
> >             // try the rename
> >             final File fileToRenameTo = new File(originalLogFile.getPath() + ".deleted");
> >             System.out.println("Trying to rename " + originalLogFile
> >                     + " to " + fileToRenameTo);
> >             final boolean renameSucceeded = originalLogFile.renameTo(fileToRenameTo);
> >             if (renameSucceeded) {
> >                 System.out.println("Rename SUCCEEDED. Renamed file exists? "
> >                         + fileToRenameTo.exists());
> >             } else {
> >                 System.out.println("FAILED to rename file " + originalLogFile
> >                         + " to " + fileToRenameTo);
> >             }
> >
> >             // let the user check the lsof output, after our rename failed
> >             System.out.println("Check the lsof output and press any key to"
> >                     + " close the open file channel to a deleted file");
> >             System.in.read();
> >
> >             // close the file channel
> >             fileChannel.close();
> >
> >             // let the user check the lsof output one final time; this time
> >             // there won't be open file resources from this program
> >             System.out.println("File channel closed. Check the lsof output"
> >                     + " and press any key to terminate the program");
> >             System.in.read();
> >
> >             // all done, exit
> >             System.out.println("Program will terminate");
> >         }
> >     }
> >
> >
> >
> > [1] http://mail-archives.apache.org/mod_mbox/kafka-users/201501.mbox/%3CCAA4R6b-7gSbPp5_ebGpwYyNibDAwE_%2BwoE%2BKbiMuU27-2j%2BLkg%40mail.gmail.com%3E
> > [2] http://unixhelp.ed.ac.uk/CGI/man-cgi?lsof+8
> >
> >
> > -Jaikiran
> >
> > On Saturday 24 January 2015 11:12 PM, Jay Kreps wrote:
> >
> >> Hey guys,
> >>
> >> Jaikiran posted a patch on KAFKA-1853 to improve the handling of
> >> failures during delete:
> >> https://issues.apache.org/jira/browse/KAFKA-1853
> >>
> >> The core problem here is that we are doing File.rename() as part of
> >> the delete sequence, which returns false if the rename failed. Our
> >> file delete sequence is something like the following (see the sketch
> >> after the list):
> >> 1. Remove the file from the index so no new reads can begin on it
> >> 2. Rename the file to xyz.deleted so that if we crash it will get
> >> cleaned up
> >> 3. Schedule a task to delete the file in 30 seconds or so, when any
> >> in-progress reads have likely completed. The goal here is to avoid
> >> errors on in-progress reads but also avoid locking on all reads.
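> >>
> >> In rough Java terms (a simplified, self-contained sketch; the class,
> >> scheduler, and exception here are illustrative, not the actual
> >> LogSegment code):
> >>
> >>     import java.io.File;
> >>     import java.util.concurrent.Executors;
> >>     import java.util.concurrent.ScheduledExecutorService;
> >>     import java.util.concurrent.TimeUnit;
> >>
> >>     class DeleteSequenceSketch {
> >>         private final ScheduledExecutorService scheduler =
> >>                 Executors.newSingleThreadScheduledExecutor();
> >>
> >>         void delete(File segmentFile) {
> >>             // step 1 happens elsewhere: the segment is removed from
> >>             // the index so no new reads can begin on it
> >>             // step 2: rename to *.deleted so that a crash leaves a
> >>             // marker that gets cleaned up; File.renameTo() signals
> >>             // failure by returning false
> >>             File marked = new File(segmentFile.getPath() + ".deleted");
> >>             if (!segmentFile.renameTo(marked)) {
> >>                 throw new RuntimeException("failed to rename " + segmentFile);
> >>             }
> >>             // step 3: delete after ~30 seconds, when in-progress reads
> >>             // have likely completed, avoiding read errors without
> >>             // locking every read
> >>             scheduler.schedule(() -> {
> >>                 if (!marked.delete()) {
> >>                     System.err.println("scheduled delete failed for " + marked);
> >>                 }
> >>             }, 30, TimeUnit.SECONDS);
> >>         }
> >>     }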
> >>
> >> The question is what to do when the rename fails. Previously, if this
> >> happened we didn't pay attention and would fail to delete the file
> >> entirely. This patch changes it so that if the rename fails we log an
> >> error and force an immediate delete.
> >>
> >> I think this is the right thing to do, but I guess the real question is
> >> why would rename fail? Some possibilities:
> >> http://stackoverflow.com/questions/2372374/why-would-a-file-rename-fail-in-java
> >>
> >> An alternative would be to treat this as a filesystem error and shut
> >> down, as we do elsewhere.
> >>
> >> Thoughts?
> >>
> >> -Jay
> >>
> >>
> >
>
