Agree with Sriram / Jun, I think the error should be treated as fatal and
we should shut down the broker gracefully.

On Mon, Jan 26, 2015 at 8:41 AM, Jun Rao <j...@confluent.io> wrote:

> We probably can default the log dir to a relative path, something like
> ../kafka-logs.
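>
> For example, a minimal sketch of what that could look like in
> config/server.properties (assuming the relative path is resolved against
> the directory the broker is started from):
>
>     # sketch only: keeps data next to the install instead of /tmp
>     log.dirs=../kafka-logs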
>
> As for I/O errors on rename, I agree that we probably should just shut down
> the broker since it's not expected to happen.
>
> Thanks,
>
> Jun
>
> On Mon, Jan 26, 2015 at 5:54 AM, Jaikiran Pai <jai.forums2...@gmail.com>
> wrote:
>
> > Having looked at the logs the user posted, I don't think this specific
> > issue has to do with the /tmp path.
> >
> > However, now that the /tmp path is being discussed, I think it's a good
> > idea that we default the Kafka logs to a certain folder. As Jay notes, it
> > makes it very easy to just download and start the servers without having
> > to fiddle with the configs when you are just starting out. Having said
> > that, when I started out with Kafka, I found /tmp to be an odd place to
> > default the path to. I expected them to default to a folder within the
> > Kafka install, somewhere like a KAFKA_INSTALL_FOLDER/data/kafka-logs/
> > folder. Is that something we should do?
> >
> > -Jaikiran
> >
> > On Monday 26 January 2015 12:23 AM, Jay Kreps wrote:
> >
> >> Hmm, but I don't think /tmp gets cleaned while the server is running...
> >>
> >> The reason for using /tmp was that we don't know which directory they
> >> will use and we don't want them to have to edit configuration for the
> >> simple "out of the box" getting started tutorial. I actually do think
> >> that is important. Maybe an intermediate step we could do is just call
> >> out this setting in the quickstart so people know where data is going
> >> and know they need to configure it later...
> >>
> >> -Jay
> >>
> >> On Sun, Jan 25, 2015 at 9:32 AM, Joe Stein <joe.st...@stealth.ly>
> >> wrote:
> >>
> >>> This feels like another type of symptom from people using /tmp/ for
> >>> their logs. Personally, I would rather use /mnt/data or something, and
> >>> if that doesn't exist on their machine we can throw an exception, or
> >>> have no default and force users to set it.
> >>>
> >>> /*******************************************
> >>> Joe Stein
> >>> Founder, Principal Consultant
> >>> Big Data Open Source Security LLC
> >>> http://www.stealth.ly
> >>> Twitter: @allthingshadoop
> >>> ********************************************/
> >>> On Jan 25, 2015 11:37 AM, "Jay Kreps" <jay.kr...@gmail.com> wrote:
> >>>
> >>>> I think you are right, good catch. It could be that this user deleted
> >>>> the files manually, but I wonder if there isn't some way this is a
> >>>> Kafka bug--e.g. if multiple types of retention policies kick in at the
> >>>> same time, do we synchronize that properly?
> >>>>
> >>>> -Jay
> >>>>
> >>>> On Sat, Jan 24, 2015 at 9:26 PM, Jaikiran Pai <jai.forums2...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi Jay,
> >>>>>
> >>>>> I spent some more time on this today and went back to the original
> >>>>> thread which brought up the issue with file leaks [1]. I think the
> >>>>> output of lsof in those logs has a very important hint:
> >>>>>
> >>>>> java 8446 root 725u REG 253,2 536910838 26087364
> >>>>> /home/work/data/soft/kafka-0.8/data/_oakbay_v2_search_topic_ypgsearch_yellowpageV2-0/00000000000068818668.log (deleted)
> >>>>>
> >>>>> java 8446 root 726u REG 253,2 536917902 26087368
> >>>>> /home/work/data/soft/kafka-0.8/data/_oakbay_v2_search_topic_ypgsearch_yellowpageV2-0/00000000000069457098.log (deleted)
> >>>>>
> >>>>> Notice the "(deleted)" text in that output. The last time I looked at
> >>>>> that output, I thought it was the user who had added the "deleted"
> >>>>> text to help us understand the problem. But today I read up on the
> >>>>> output format of lsof, and it turns out it's lsof itself that adds
> >>>>> that hint whenever a file has already been deleted (possibly by a
> >>>>> different process) while some other process is still holding open
> >>>>> resources of that (deleted) file [2].
> >>>>>
> >>>>> So in the context of the issue that we are discussing, and the way
> >>>>> Kafka deals with async deletes (i.e. first by attempting a rename of
> >>>>> the log/index files), I think this all makes sense now. What I think
> >>>>> is happening is: some (other?) process (not sure what/why) has already
> >>>>> deleted the log file that Kafka is using for the LogSegment. The
> >>>>> LogSegment however still holds an open FileChannel on that deleted
> >>>>> file (and that's why the open file descriptor is held on to and shows
> >>>>> up in that output). Now Kafka, at some point in time, triggers an
> >>>>> async delete of the LogSegment, which involves a file rename of that
> >>>>> (already deleted) log file. The rename fails (because the original
> >>>>> file path isn't there anymore). As a result, we end up throwing that
> >>>>> "failed to rename" KafkaStorageException and thus leave behind the
> >>>>> open FileChannel, which stays open forever (till the Kafka process
> >>>>> exits).
> >>>>>
> >>>>> So I think we should:
> >>>>>
> >>>>> 1) Find out what/why deletes the underlying log file(s). I'll add a
> >>>>> reply to that original mail discussion asking the user if he can
> >>>>> provide more details.
> >>>>> 2) Handle this case and close the FileChannel. The patch that's been
> >>>>> uploaded to review board https://reviews.apache.org/r/29755/ does
> >>>>> that. The "immediate delete" on failure to rename involves (safely)
> >>>>> closing the open FileChannel and (safely) deleting the (possibly
> >>>>> non-existent) file, along the lines of the sketch below.
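> >>>>>
> >>>>> A rough sketch of that fallback (names here are mine for illustration,
> >>>>> not the actual code in the patch):
> >>>>>
> >>>>>     import java.io.File;
> >>>>>     import java.io.IOException;
> >>>>>     import java.nio.channels.FileChannel;
> >>>>>
> >>>>>     public class ImmediateDeleteSketch {
> >>>>>         // If the rename fails (e.g. the file is already gone), don't
> >>>>>         // leak the FileChannel: close it and attempt an immediate delete.
> >>>>>         static void renameOrDeleteNow(File logFile, FileChannel channel)
> >>>>>                 throws IOException {
> >>>>>             File renamed = new File(logFile.getPath() + ".deleted");
> >>>>>             if (!logFile.renameTo(renamed)) {
> >>>>>                 channel.close();  // releases the "(deleted)" fd seen in lsof
> >>>>>                 logFile.delete(); // returns false if already gone, which is fine
> >>>>>             }
> >>>>>         }
> >>>>>     }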
> >>>>> By the way, this entire thing can be easily reproduced by running the
> >>>>> following program, which first creates a file and opens a FileChannel
> >>>>> to that file, then waits for the user to delete that file externally
> >>>>> (I used the rm command), and then tries to rename that deleted file,
> >>>>> which fails. In between each of these steps, you can run the lsof
> >>>>> command externally to see the open file resources (I used 'lsof |
> >>>>> grep test.log'):
> >>>>>
> >>>>>     import java.io.File;
> >>>>>     import java.io.RandomAccessFile;
> >>>>>     import java.nio.channels.FileChannel;
> >>>>>
> >>>>>     public class DeletedFileRenameDemo {
> >>>>>
> >>>>>         public static void main(String[] args) throws Exception {
> >>>>>             // Open a file and a file channel for read/write.
> >>>>>             // Change this path relevantly if you plan to run it.
> >>>>>             final File originalLogFile = new File("/home/jaikiran/deleteme/test.log");
> >>>>>             final FileChannel fileChannel =
> >>>>>                     new RandomAccessFile(originalLogFile, "rw").getChannel();
> >>>>>             System.out.println("Opened file channel to " + originalLogFile);
> >>>>>             // Wait for the underlying file to be deleted externally
> >>>>>             System.out.println("Waiting for " + originalLogFile + " to be"
> >>>>>                     + " deleted externally. Press any key after the file is deleted");
> >>>>>             System.in.read();
> >>>>>             // Let the user check the lsof output
> >>>>>             System.out.println(originalLogFile + " seems to have been deleted"
> >>>>>                     + " externally, check lsof command output to see open file resources.");
> >>>>>             System.out.println("Press any key to try renaming this already"
> >>>>>                     + " deleted file, from the program");
> >>>>>             System.in.read();
> >>>>>             // Try the rename, like Kafka's async delete does
> >>>>>             final File fileToRenameTo = new File(originalLogFile.getPath() + ".deleted");
> >>>>>             System.out.println("Trying to rename " + originalLogFile
> >>>>>                     + " to " + fileToRenameTo);
> >>>>>             final boolean renameSucceeded = originalLogFile.renameTo(fileToRenameTo);
> >>>>>             if (renameSucceeded) {
> >>>>>                 System.out.println("Rename SUCCEEDED. Renamed file exists? "
> >>>>>                         + fileToRenameTo.exists());
> >>>>>             } else {
> >>>>>                 System.out.println("FAILED to rename file " + originalLogFile
> >>>>>                         + " to " + fileToRenameTo);
> >>>>>             }
> >>>>>             // Let the user check the lsof output again, after the rename failed
> >>>>>             System.out.println("Check the lsof output and press any key to"
> >>>>>                     + " close the open file channel to the deleted file");
> >>>>>             System.in.read();
> >>>>>             // Close the file channel, releasing the open file descriptor
> >>>>>             fileChannel.close();
> >>>>>             // Let the user check the lsof output one final time; this time
> >>>>>             // there will be no open file resources from this program
> >>>>>             System.out.println("File channel closed. Check the lsof output"
> >>>>>                     + " and press any key to terminate the program");
> >>>>>             System.in.read();
> >>>>>             // All done, exit
> >>>>>             System.out.println("Program will terminate");
> >>>>>         }
> >>>>>     }
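> >>>>>
> >>>>> (To try it, assuming you save it as DeletedFileRenameDemo.java: compile
> >>>>> with javac, run it, delete test.log with rm from another terminal, and
> >>>>> run 'lsof | grep test.log' between each step.)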
> >>>>>
> >>>>>
> >>>>>
> >>>>> [1] http://mail-archives.apache.org/mod_mbox/kafka-users/201501.mbox/%3CCAA4R6b-7gSbPp5_ebGpwYyNibDAwE_%2BwoE%2BKbiMuU27-2j%2BLkg%40mail.gmail.com%3E
> >>>>> [2] http://unixhelp.ed.ac.uk/CGI/man-cgi?lsof+8
> >>>>>
> >>>>>
> >>>>> -Jaikiran
> >>>>>
> >>>>> On Saturday 24 January 2015 11:12 PM, Jay Kreps wrote:
> >>>>>
> >>>>>> Hey guys,
> >>>>>>
> >>>>>> Jaikiran posted a patch on KAFKA-1853 to improve the handling of
> >>>>>> failures during delete.
> >>>>>> https://issues.apache.org/jira/browse/KAFKA-1853
> >>>>>>
> >>>>>> The core problem here is that we are doing File.rename() as part of
> >>>>>> the delete sequence, and rename returns false if it failed. Our file
> >>>>>> delete sequence is something like the following (a sketch is below):
> >>>>>> 1. Remove the file from the index so no new reads can begin on it
> >>>>>> 2. Rename the file to xyz.deleted so that if we crash it will get
> >>>>>> cleaned up
> >>>>>> 3. Schedule a task to delete the file in 30 seconds or so, when any
> >>>>>> in-progress reads have likely completed. The goal here is to avoid
> >>>>>> errors on in-progress reads but also avoid locking on all reads.
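> >>>>>>
> >>>>>> A minimal, hypothetical sketch of that sequence (the map stands in for
> >>>>>> the segment index; the names are illustrative, not Kafka's actual code):
> >>>>>>
> >>>>>>     import java.io.File;
> >>>>>>     import java.util.concurrent.ConcurrentHashMap;
> >>>>>>     import java.util.concurrent.ConcurrentMap;
> >>>>>>     import java.util.concurrent.Executors;
> >>>>>>     import java.util.concurrent.ScheduledExecutorService;
> >>>>>>     import java.util.concurrent.TimeUnit;
> >>>>>>
> >>>>>>     public class AsyncDeleteSketch {
> >>>>>>         private final ConcurrentMap<Long, File> segments = new ConcurrentHashMap<>();
> >>>>>>         private final ScheduledExecutorService scheduler =
> >>>>>>                 Executors.newSingleThreadScheduledExecutor();
> >>>>>>
> >>>>>>         void asyncDelete(long baseOffset) {
> >>>>>>             // 1. Remove the segment from the index so no new reads begin on it
> >>>>>>             File logFile = segments.remove(baseOffset);
> >>>>>>             // 2. Rename to *.deleted so a crash before step 3 still gets cleaned up
> >>>>>>             File renamed = new File(logFile.getPath() + ".deleted");
> >>>>>>             if (!logFile.renameTo(renamed)) {
> >>>>>>                 // this is the failure case under discussion
> >>>>>>                 throw new RuntimeException("Failed to rename " + logFile);
> >>>>>>             }
> >>>>>>             // 3. Delete later, once in-progress reads have likely completed
> >>>>>>             Runnable deleteTask = () -> renamed.delete();
> >>>>>>             scheduler.schedule(deleteTask, 30, TimeUnit.SECONDS);
> >>>>>>         }
> >>>>>>     }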
> >>>>>>
> >>>>>> The question is what to do when the rename fails. Previously, if this
> >>>>>> happened, we actually didn't pay attention and would fail to delete
> >>>>>> the file entirely. This patch changes it so that if the rename fails,
> >>>>>> we log an error and force an immediate delete.
> >>>>>>
> >>>>>> I think this is the right thing to do, but I guess the real question
> >>>>>> is: why would the rename fail? Some possibilities:
> >>>>>> http://stackoverflow.com/questions/2372374/why-would-a-file-rename-fail-in-java
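> >>>>>>
> >>>>>> As a side note, one way to surface the actual reason rather than just
> >>>>>> a boolean would be java.nio.file.Files.move (Java 7+), which throws a
> >>>>>> descriptive IOException on failure. A minimal sketch:
> >>>>>>
> >>>>>>     import java.io.IOException;
> >>>>>>     import java.nio.file.Files;
> >>>>>>     import java.nio.file.Paths;
> >>>>>>
> >>>>>>     public class MoveSketch {
> >>>>>>         public static void main(String[] args) {
> >>>>>>             try {
> >>>>>>                 // Unlike File.renameTo(), this reports why it failed,
> >>>>>>                 // e.g. NoSuchFileException if the source is already gone
> >>>>>>                 Files.move(Paths.get("a.log"), Paths.get("a.log.deleted"));
> >>>>>>             } catch (IOException e) {
> >>>>>>                 System.out.println("Rename failed: " + e);
> >>>>>>             }
> >>>>>>         }
> >>>>>>     }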
> >>>>>>
> >>>>>> An alternative would be to treat this as a filesystem error and shut
> >>>>>> down as we do elsewhere.
> >>>>>>
> >>>>>> Thoughts?
> >>>>>>
> >>>>>> -Jay
> >>>>>>
> >>>>>>
> >>>>>>
> >
>



-- 
-- Guozhang
