We can probably default the log dir to a relative path, something like
../kafka-logs.
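For illustration, that default might look like this in config/server.properties (the relative path here is just the suggestion above, not a decided value):

```properties
# Hypothetical default: keep data next to the Kafka install instead of /tmp
# (a relative path resolves against the broker's working directory)
log.dirs=../kafka-logs
```

One caveat worth calling out in the quickstart: a relative path depends on which directory the broker is started from.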

As for I/O errors on rename, I agree that we probably should just shut down
the broker since it's not expected to happen.

Thanks,

Jun

On Mon, Jan 26, 2015 at 5:54 AM, Jaikiran Pai <jai.forums2...@gmail.com>
wrote:

> Having looked at the logs the user posted, I don't think this specific
> issue has to do with /tmp path.
>
> However, now that the /tmp path is being discussed, I think it's a good
> idea that we default the Kafka logs to a certain folder. As Jay notes, it
> makes it very easy to just download and start the servers without having to
> fiddle with the configs when you are just starting out. Having said that,
> when I started out with Kafka, I found /tmp to be an odd place to default
> the path to. I expected them to be defaulted to a folder within the Kafka
> install. Somewhere like KAFKA_INSTALL_FOLDER/data/kafka-logs/ folder. Is
> that something we should do?
>
> -Jaikiran
>
> On Monday 26 January 2015 12:23 AM, Jay Kreps wrote:
>
>> Hmm, but I don't think tmp gets cleaned while the server is running...
>>
>> The reason for using tmp was that we don't know which directory they
>> will use, and we don't want them to have to edit configuration for the
>> simple "out of the box" getting started tutorial. I actually do think that
>> is important. Maybe an intermediate step we could do is just call out this
>> setting in the quickstart so people know where data is going and know they
>> need to configure it later...
>>
>> -Jay
>>
>> On Sun, Jan 25, 2015 at 9:32 AM, Joe Stein <joe.st...@stealth.ly> wrote:
>>
>>> This feels like another type of symptom from people using /tmp/ for their
>>> logs. Personally, I would rather use /mnt/data or something, and if that
>>> doesn't exist on their machine we can throw an exception, or have no
>>> default and force it to be set.
>>>
>>> /*******************************************
>>> Joe Stein
>>> Founder, Principal Consultant
>>> Big Data Open Source Security LLC
>>> http://www.stealth.ly
>>> Twitter: @allthingshadoop
>>> ********************************************/
>>> On Jan 25, 2015 11:37 AM, "Jay Kreps" <jay.kr...@gmail.com> wrote:
>>>
>>>> I think you are right, good catch. It could be that this user deleted
>>>> the files manually, but I wonder if there isn't some way that this is a
>>>> Kafka bug--e.g. if multiple types of retention policies kick in at the
>>>> same time, do we synchronize that properly?
>>>>
>>>> -Jay
>>>>
>>>> On Sat, Jan 24, 2015 at 9:26 PM, Jaikiran Pai <jai.forums2...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Jay,
>>>>>
>>>>> I spent some more time on this today and went back to the original
>>>>> thread which brought up the issue with file leaks [1]. I think the
>>>>> output of lsof in those logs has a very important hint:
>>>>>
>>>>> /home/work/data/soft/kafka-0.8/data/_oakbay_v2_search_topic_ypgsearch_yellowpageV2-0/00000000000068818668.log (deleted) java 8446 root 725u REG 253,2 536910838 26087364
>>>>>
>>>>> /home/work/data/soft/kafka-0.8/data/_oakbay_v2_search_topic_ypgsearch_yellowpageV2-0/00000000000069457098.log (deleted) java 8446 root 726u REG 253,2 536917902 26087368
>>>>>
>>>>> Notice the "(deleted)" text in that output. The last time I looked at
>>>>> that output, I thought it was the user who had added the "deleted" text
>>>>> to help us understand the problem. But today I read up on the output
>>>>> format of lsof, and it turns out that it's lsof itself which adds that
>>>>> hint whenever a file has already been deleted, possibly by a different
>>>>> process, while some other process is still holding on to open resources
>>>>> of that (deleted) file [2].
>>>>>
>>>>> So in the context of the issue that we are discussing and the way
>>>>> Kafka deals with async deletes (i.e. first by attempting a rename of
>>>>> the log/index files), I think this all makes sense now. What I think is
>>>>> happening is: some (other?) process (not sure what/why) has already
>>>>> deleted the log file that Kafka is using for the LogSegment. The
>>>>> LogSegment, however, still has an open FileChannel on that deleted file
>>>>> (and that's why the open file descriptor is held on to and shows up in
>>>>> that output). Now Kafka, at some point in time, triggers an async
>>>>> delete of the LogSegment, which involves a file rename of that (already
>>>>> deleted) log file. The rename fails (because the original file path
>>>>> isn't there anymore). As a result, we end up throwing that "failed to
>>>>> rename" KafkaStorageException and thus leave behind the open
>>>>> FileChannel, which remains open forever (till the Kafka process exits).
>>>>>
>>>>> So I think we should:
>>>>>
>>>>> 1) Find out what deletes the underlying log file(s), and why. I'll add
>>>>> a reply to that original mail discussion asking the user if he can
>>>>> provide more details.
>>>>> 2) Handle this case and close the FileChannel. The patch that's been
>>>>> uploaded to review board https://reviews.apache.org/r/29755/ does that.
>>>>> The "immediate delete" on failure to rename involves (safely) closing
>>>>> the open FileChannel and (safely) deleting the (possibly non-existent)
>>>>> file.
>>>
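That fallback could be sketched roughly like this (a standalone illustration with made-up names, not the actual patch; it simulates the external deletion and then exercises the rename-failure path, assuming a POSIX filesystem where an open channel keeps the deleted inode alive):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;

public class RenameFallbackSketch {

    // Hypothetical helper (not the actual patch): attempt the rename that the
    // async-delete sequence performs; if it fails, close the channel and
    // delete whatever is left so the file descriptor is not leaked.
    static boolean renameOrForceDelete(File logFile, FileChannel channel) throws IOException {
        File renamed = new File(logFile.getPath() + ".deleted");
        if (logFile.renameTo(renamed)) {
            return true; // normal path: a scheduled task deletes it later
        }
        // Fallback: release the open FileChannel, then delete the file if it
        // still exists (delete() simply returns false if it is already gone).
        channel.close();
        logFile.delete();
        return false;
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("segment", ".log");
        FileChannel ch = new RandomAccessFile(f, "rw").getChannel();
        // Simulate the external deletion, as in the lsof output above: the
        // directory entry goes away while the channel stays open.
        f.delete();
        boolean renamed = renameOrForceDelete(f, ch);
        System.out.println("renamed=" + renamed + " open=" + ch.isOpen());
    }
}
```

The key point is that the FileChannel gets closed on every path, so the "(deleted)" entries can no longer accumulate in lsof.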
>>>>> By the way, this entire thing can be easily reproduced by running the
>>>>> following program, which first creates a file and opens a FileChannel
>>>>> to that file, then waits for the user to delete that file externally (I
>>>>> used the rm command), and then tries to rename the deleted file, which
>>>>> fails. In between each of these steps, you can run the lsof command
>>>>> externally to see the open file resources (I used 'lsof | grep
>>>>> test.log'):
>>>>
>>>>> import java.io.File;
>>>>> import java.io.RandomAccessFile;
>>>>> import java.nio.channels.FileChannel;
>>>>>
>>>>> public class DeletedFileRenameTest {
>>>>>     public static void main(String[] args) throws Exception {
>>>>>         // Open a file and a file channel for read/write; change this
>>>>>         // path relevantly if you plan to run it
>>>>>         final File originalLogFile = new File("/home/jaikiran/deleteme/test.log");
>>>>>         final FileChannel fileChannel =
>>>>>                 new RandomAccessFile(originalLogFile, "rw").getChannel();
>>>>>         System.out.println("Opened file channel to " + originalLogFile);
>>>>>         // wait for the underlying file to be deleted externally
>>>>>         System.out.println("Waiting for " + originalLogFile + " to be deleted"
>>>>>                 + " externally. Press any key after the file is deleted");
>>>>>         System.in.read();
>>>>>         // let the user check the lsof output
>>>>>         System.out.println(originalLogFile + " seems to have been deleted"
>>>>>                 + " externally; check the lsof output to see open file resources.");
>>>>>         System.out.println("Press any key to try renaming this already"
>>>>>                 + " deleted file, from the program");
>>>>>         System.in.read();
>>>>>         // try the rename, which is expected to fail
>>>>>         final File fileToRenameTo = new File(originalLogFile.getPath() + ".deleted");
>>>>>         System.out.println("Trying to rename " + originalLogFile + " to " + fileToRenameTo);
>>>>>         final boolean renameSucceeded = originalLogFile.renameTo(fileToRenameTo);
>>>>>         if (renameSucceeded) {
>>>>>             System.out.println("Rename SUCCEEDED. Renamed file exists? "
>>>>>                     + fileToRenameTo.exists());
>>>>>         } else {
>>>>>             System.out.println("FAILED to rename file " + originalLogFile
>>>>>                     + " to " + fileToRenameTo);
>>>>>         }
>>>>>         // let the user check the lsof output after the failed rename
>>>>>         System.out.println("Check the lsof output and press any key to close"
>>>>>                 + " the open file channel to the deleted file");
>>>>>         System.in.read();
>>>>>         // close the file channel
>>>>>         fileChannel.close();
>>>>>         // one final lsof check; this time there are no open file
>>>>>         // resources from this program
>>>>>         System.out.println("File channel closed. Check the lsof output and"
>>>>>                 + " press any key to terminate the program");
>>>>>         System.in.read();
>>>>>         // all done, exit
>>>>>         System.out.println("Program will terminate");
>>>>>     }
>>>>> }
>>>>>
>>>>>
>>>>>
>>>>> [1] http://mail-archives.apache.org/mod_mbox/kafka-users/201501.mbox/%3CCAA4R6b-7gSbPp5_ebGpwYyNibDAwE_%2BwoE%2BKbiMuU27-2j%2BLkg%40mail.gmail.com%3E
>>>>> [2] http://unixhelp.ed.ac.uk/CGI/man-cgi?lsof+8
>>>>>
>>>>>
>>>>> -Jaikiran
>>>>>
>>>>> On Saturday 24 January 2015 11:12 PM, Jay Kreps wrote:
>>>>>
>>>>>  Hey guys,
>>>>>>
>>>>>> Jaikiran posted a patch on KAFKA-1853 to improve the handling of
>>>>>> failures during delete:
>>>>>> https://issues.apache.org/jira/browse/KAFKA-1853
>>>>>>
>>>>>> The core problem here is that we are doing File.renameTo() as part of
>>>>>> the delete sequence, which returns false if the rename failed. Our
>>>>>> file delete sequence is something like the following:
>>>>>>
>>>>>> 1. Remove the file from the index so no new reads can begin on it
>>>>>> 2. Rename the file to xyz.deleted so that if we crash it will get
>>>>>> cleaned up
>>>>>> 3. Schedule a task to delete the file in 30 seconds or so, when any
>>>>>> in-progress reads have likely completed. The goal here is to avoid
>>>>>> errors on in-progress reads but also avoid locking on all reads.
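A rough sketch of that sequence (hypothetical names, not the actual Kafka code; step 1, removing the segment from the index, is assumed to have already happened):

```java
import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class AsyncDeleteSketch {

    private static final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Hypothetical shape of steps 2 and 3 described above.
    static void asyncDelete(File segmentFile) {
        File marked = new File(segmentFile.getPath() + ".deleted");
        // Step 2: rename to *.deleted so a crash leaves a recognizable
        // leftover that can be cleaned up on restart.
        if (!segmentFile.renameTo(marked)) {
            // The failure case under discussion: log an error and delete
            // immediately rather than silently leaking the file.
            System.err.println("Failed to rename " + segmentFile + ", deleting immediately");
            segmentFile.delete();
            return;
        }
        // Step 3: delete later, once in-progress reads have likely finished.
        scheduler.schedule(() -> { marked.delete(); }, 30, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("segment", ".log");
        asyncDelete(f);
        File marked = new File(f.getPath() + ".deleted");
        System.out.println("renamed exists=" + marked.exists());
        marked.delete(); // clean up the demo file
        scheduler.shutdownNow();
    }
}
```

The 30-second grace period trades a brief window of extra disk usage for not having to lock every read path during deletion.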
>>>>>>
>>>>>> The question is what to do when the rename fails? Previously, if this
>>>>>> happened, we actually didn't pay attention and would fail to delete
>>>>>> the file entirely. This patch changes it so that if the rename fails
>>>>>> we log an error and force an immediate delete.
>>>>>>
>>>>>> I think this is the right thing to do, but I guess the real question
>>>>>> is why would the rename fail? Some possibilities:
>>>>>> http://stackoverflow.com/questions/2372374/why-would-a-file-rename-fail-in-java
>>>>>>
>>>>>> An alternative would be to treat this as a filesystem error and shut
>>>>>> down as we do elsewhere.
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>> -Jay
>>>>>>
>>>>>>
>>>>>>
>
