I have created a JIRA for this proposed change https://issues.apache.org/jira/browse/KAFKA-1906 and uploaded a patch for review https://reviews.apache.org/r/30403/

-Jaikiran

On Tuesday 27 January 2015 02:36 AM, Jay Kreps wrote:
Having a relative path and keeping data under /data in the kafka distro
would make sense. This would require some reworking of the shell scripts,
though, as I think right now you can actually run Kafka from any directory
and the cwd of the process will be whatever directory you start from. If we
have a relative path in the config then the working directory will HAVE to
be the kafka directory. This works for the simple download case but may
make some packaging stuff harder for other use cases.
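
For example (just a sketch; the relative value shown is hypothetical), the
quickstart default in config/server.properties might go from the current
absolute /tmp path to something resolved against the working directory of
the process:

    # default shipped in config/server.properties today
    log.dirs=/tmp/kafka-logs

    # hypothetical relative default, resolved against the process's working directory
    log.dirs=data/kafka-logs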

-Jay

On Mon, Jan 26, 2015 at 5:54 AM, Jaikiran Pai <jai.forums2...@gmail.com>
wrote:

Having looked at the logs the user posted, I don't think this specific
issue has to do with the /tmp path.

However, now that the /tmp path is being discussed, I think it's a good
idea that we default the Kafka logs to a certain folder. As Jay notes, it
makes it very easy to just download and start the servers without having to
fiddle with the configs when you are just starting out. Having said that,
when I started out with Kafka, I found /tmp to be an odd place to default
the path to. I expected it to default to a folder within the Kafka
install, somewhere like a KAFKA_INSTALL_FOLDER/data/kafka-logs/ folder. Is
that something we should do?

-Jaikiran

On Monday 26 January 2015 12:23 AM, Jay Kreps wrote:

Hmm, but I don't think tmp gets cleaned while the server is running...

The reason for using tmp was that we don't know which directory they
will use and we don't want them to have to edit configuration for the
simple "out of the box" getting started tutorial. I actually do think that
is important. Maybe an intermediate step we could take is to just call out
this setting in the quickstart so people know where data is going and know
they need to configure it later...

-Jay

On Sun, Jan 25, 2015 at 9:32 AM, Joe Stein <joe.st...@stealth.ly> wrote:

This feels like another type of symptom from people using /tmp/ for their
logs. Personally, I would rather use /mnt/data or something, and if that
doesn't exist on their machine we can raise an exception, or have no
default and force them to set it.
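
As a rough illustration of the "no default, force it to be set" option (a
minimal sketch only; the names are made up, this is not Kafka code):

    import java.io.File;

    // Minimal sketch of a fail-fast startup check along the lines described above.
    // Names are hypothetical; this is not Kafka's actual configuration code.
    class LogDirCheckSketch {
        static void validateLogDir(String configuredLogDir) {
            if (configuredLogDir == null || configuredLogDir.isEmpty()) {
                throw new IllegalArgumentException("log.dirs must be set explicitly");
            }
            final File dir = new File(configuredLogDir);
            if (!dir.isDirectory()) {
                throw new IllegalArgumentException(
                        "log.dirs " + dir.getAbsolutePath() + " does not exist on this machine");
            }
        }
    }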

/*******************************************
Joe Stein
Founder, Principal Consultant
Big Data Open Source Security LLC
http://www.stealth.ly
Twitter: @allthingshadoop
********************************************/
On Jan 25, 2015 11:37 AM, "Jay Kreps" <jay.kr...@gmail.com> wrote:

I think you are right, good catch. It could be that this user deleted the
files manually, but I wonder if there isn't some way that this is a Kafka
bug, e.g. if multiple types of retention policies kick in at the same time,
do we synchronize that properly?

-Jay

On Sat, Jan 24, 2015 at 9:26 PM, Jaikiran Pai <jai.forums2...@gmail.com>
wrote:

Hi Jay,

I spent some more time over this today and went back to the original
thread which brought up the issue with file leaks [1]. I think the output
of lsof in those logs has a very important hint:

/home/work/data/soft/kafka-0.8/data/_oakbay_v2_search_topic_ypgsearch_yellowpageV2-0/00000000000068818668.log (deleted) java 8446 root 725u REG 253,2 536910838 26087364

/home/work/data/soft/kafka-0.8/data/_oakbay_v2_search_topic_ypgsearch_yellowpageV2-0/00000000000069457098.log (deleted) java 8446 root 726u REG 253,2 536917902 26087368

Notice the "(deleted)" text in that output. The last time I looked at that
output, I thought it was the user who had added that "deleted" text to help
us understand the problem. But today I read up on the output format of
lsof, and it turns out that it's lsof itself which adds that hint whenever
a file has already been deleted (possibly by a different process) but some
other process is still holding open resources on that (deleted) file [2].

So in the context of the issue that we are discussing, and the way Kafka
deals with async deletes (i.e. first by attempting a rename of the
log/index files), I think this all makes sense now. What I think is
happening is that some (other?) process (not sure what/why) has already
deleted the log file that Kafka is using for the LogSegment. The
LogSegment, however, still has an open FileChannel on that deleted file
(and that's why the open file descriptor is held on to and shows up in
that output). Now Kafka, at some point in time, triggers an async delete
of the LogSegment, which involves a file rename of that (already deleted)
log file. The rename fails (because the original file path isn't there
anymore). As a result, we end up throwing that "failed to rename"
KafkaStorageException and thus leave behind the open FileChannel, which
stays open forever (till the Kafka process exits).

So I think we should:

1) Find out what/why deletes the underlying log file(s). I'll add a reply
to that original mail discussion asking the user if he can provide more
details.
2) Handle this case and close the FileChannel. The patch that's been
uploaded to the review board https://reviews.apache.org/r/29755/ does that.
The "immediate delete" on failure to rename involves (safely) closing the
open FileChannel and (safely) deleting the (possibly non-existent) file. A
rough sketch of that handling follows this list.
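
Something along these lines (a minimal sketch with hypothetical names, not
the actual patch code):

    import java.io.File;
    import java.io.IOException;
    import java.nio.channels.FileChannel;

    // Minimal sketch of the "immediate delete" fallback: if the rename fails,
    // close the open FileChannel and attempt the delete right away so the open
    // file descriptor isn't leaked until the broker exits. Names are hypothetical.
    class ImmediateDeleteSketch {
        static void renameOrDeleteNow(File logFile, File renamedFile,
                                      FileChannel channel) throws IOException {
            if (!logFile.renameTo(renamedFile)) {
                // rename failed, e.g. because the file was already deleted externally
                try {
                    channel.close(); // release the descriptor held on the (deleted) file
                } finally {
                    logFile.delete(); // may return false if the file is already gone; that's fine
                }
            }
        }
    }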
By the way, this entire thing can be easily reproduced by running the
following program, which first creates a file and opens a FileChannel to
that file, then waits for the user to delete that file externally (I used
the rm command), and then tries to rename that deleted file, which then
fails. In between each of these steps, you can run the lsof command
externally to see the open file resources (I used 'lsof | grep test.log'):

import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;

public class DeletedFileRenameDemo {

    public static void main(String[] args) throws Exception {
        // Open a file and a file channel for read/write
        // (change this path relevantly if you plan to run it)
        final File originalLogFile = new File("/home/jaikiran/deleteme/test.log");
        final FileChannel fileChannel =
                new RandomAccessFile(originalLogFile, "rw").getChannel();
        System.out.println("Opened file channel to " + originalLogFile);
        // wait for the underlying file to be deleted externally
        System.out.println("Waiting for the " + originalLogFile
                + " to be deleted externally. Press any key after the file is deleted");
        System.in.read();
        // let the user check the lsof output
        System.out.println(originalLogFile + " seems to have been deleted externally,"
                + " check lsof command output to see open file resources.");
        System.out.println("Press any key to try renaming this already deleted file,"
                + " from the program");
        System.in.read();
        // try the rename
        final File fileToRenameTo = new File(originalLogFile.getPath() + ".deleted");
        System.out.println("Trying to rename " + originalLogFile + " to " + fileToRenameTo);
        final boolean renamedSucceeded = originalLogFile.renameTo(fileToRenameTo);
        if (renamedSucceeded) {
            System.out.println("Rename SUCCEEDED. Renamed file exists? "
                    + fileToRenameTo.exists());
        } else {
            System.out.println("FAILED to rename file " + originalLogFile
                    + " to " + fileToRenameTo);
        }
        // wait for the user to check the lsof output, after our rename failed
        System.out.println("Check the lsof output and press any key to close the"
                + " open file channel to the deleted file");
        System.in.read();
        // close the file channel
        fileChannel.close();
        // let the user check the lsof output one final time; this time they won't
        // see open file resources from this program
        System.out.println("File channel closed. Check the lsof output and press"
                + " any key to terminate the program");
        System.in.read();
        // all done, exit
        System.out.println("Program will terminate");
    }
}



[1] http://mail-archives.apache.org/mod_mbox/kafka-users/201501.mbox/%3CCAA4R6b-7gSbPp5_ebGpwYyNibDAwE_%2BwoE%2BKbiMuU27-2j%2BLkg%40mail.gmail.com%3E
[2] http://unixhelp.ed.ac.uk/CGI/man-cgi?lsof+8


-Jaikiran

On Saturday 24 January 2015 11:12 PM, Jay Kreps wrote:

Hey guys,

Jaikiran posted a patch on KAFKA-1853 to improve the handling of failures
during delete:
https://issues.apache.org/jira/browse/KAFKA-1853

The core problem here is that we are doing File.renameTo() as part of the
delete sequence, which returns false if the rename failed. Our file delete
sequence is something like the following:

1. Remove the file from the index so no new reads can begin on it.
2. Rename the file to xyz.deleted so that if we crash it will get cleaned
up.
3. Schedule a task to delete the file in 30 seconds or so, when any
in-progress reads have likely completed. The goal here is to avoid errors
on in-progress reads but also avoid locking on all reads.
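
For illustration, that sequence looks roughly like this (a sketch only; the
names and the 30-second delay here are placeholders, not Kafka's actual
code):

    import java.io.File;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Rough illustration of the delete sequence described above.
    class AsyncDeleteSketch {
        private static final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        static void asyncDelete(File segmentFile) {
            // step 1 happens elsewhere: the segment is removed from the index,
            // so no new reads can start against it
            // step 2: rename to *.deleted so a crash leaves an obviously dead file behind
            final File renamed = new File(segmentFile.getPath() + ".deleted");
            if (!segmentFile.renameTo(renamed)) {
                // this is the failure case under discussion: renameTo() returned false
                throw new RuntimeException("Failed to rename " + segmentFile + " to " + renamed);
            }
            // step 3: delete later, once in-progress reads have likely completed
            scheduler.schedule(() -> { renamed.delete(); }, 30, TimeUnit.SECONDS);
        }
    }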
The question is what to do when the rename fails? Previously, if this
happened we actually didn't pay attention and would fail to delete the file
entirely. This patch changes it so that if the rename fails we log an error
and force an immediate delete.

I think this is the right thing to do, but I guess the real question is why
would a rename fail? Some possibilities:
http://stackoverflow.com/questions/2372374/why-would-a-file-rename-fail-in-java

An alternative would be to treat this as a filesystem error and shut down
as we do elsewhere.

Thoughts?

-Jay



