We could probably default the log dir to a relative path, something like ../kafka-logs.
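For what it's worth, a sketch of what that might look like in the quickstart config/server.properties (the `../kafka-logs` path is just the proposal above, resolved relative to wherever the broker is started; not a settled default):

```properties
# Proposed default: keep data next to the Kafka install instead of /tmp
# (path is relative to the broker's working directory)
log.dirs=../kafka-logs
```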
As for I/O errors on rename, I agree that we probably should just shut down the broker, since it's not expected to happen.

Thanks,

Jun

On Mon, Jan 26, 2015 at 5:54 AM, Jaikiran Pai <jai.forums2...@gmail.com> wrote:

> Having looked at the logs the user posted, I don't think this specific
> issue has to do with the /tmp path.
>
> However, now that the /tmp path is being discussed, I think it's a good
> idea that we default the Kafka logs to a certain folder. As Jay notes, it
> makes it very easy to just download and start the servers without having
> to fiddle with the configs when you are just starting out. Having said
> that, when I started out with Kafka, I found /tmp to be an odd place to
> default the path to. I expected it to default to a folder within the
> Kafka install, somewhere like KAFKA_INSTALL_FOLDER/data/kafka-logs/. Is
> that something we should do?
>
> -Jaikiran
>
> On Monday 26 January 2015 12:23 AM, Jay Kreps wrote:
>
>> Hmm, but I don't think tmp gets cleaned while the server is running...
>>
>> The reason for using tmp was because we don't know which directory they
>> will use and we don't want them to have to edit configuration for the
>> simple "out of the box" getting started tutorial. I actually do think
>> that is important. Maybe an intermediate step we could do is just call
>> out this setting in the quickstart so people know where data is going
>> and know they need to configure it later...
>>
>> -Jay
>>
>> On Sun, Jan 25, 2015 at 9:32 AM, Joe Stein <joe.st...@stealth.ly> wrote:
>>
>>> This feels like another type of symptom from people using /tmp/ for
>>> their logs. Personally, I would rather use /mnt/data or something, and
>>> if that doesn't exist on their machine we can raise an exception, or
>>> have no default and force it to be set.
>>>
>>> /*******************************************
>>>  Joe Stein
>>>  Founder, Principal Consultant
>>>  Big Data Open Source Security LLC
>>>  http://www.stealth.ly
>>>  Twitter: @allthingshadoop
>>> ********************************************/
>>>
>>> On Jan 25, 2015 11:37 AM, "Jay Kreps" <jay.kr...@gmail.com> wrote:
>>>
>>>> I think you are right, good catch. It could be that this user deleted
>>>> the files manually, but I wonder if there isn't some way that this is
>>>> a Kafka bug--e.g. if multiple types of retention policies kick in at
>>>> the same time, do we synchronize that properly?
>>>>
>>>> -Jay
>>>>
>>>> On Sat, Jan 24, 2015 at 9:26 PM, Jaikiran Pai
>>>> <jai.forums2...@gmail.com> wrote:
>>>>
>>>>> Hi Jay,
>>>>>
>>>>> I spent some more time on this today and went back to the original
>>>>> thread which brought up the issue with file leaks [1]. I think that
>>>>> the output of lsof in those logs has a very important hint:
>>>>>
>>>>> java 8446 root 725u REG 253,2 536910838 26087364
>>>>> /home/work/data/soft/kafka-0.8/data/_oakbay_v2_search_topic_ypgsearch_yellowpageV2-0/00000000000068818668.log
>>>>> (deleted)
>>>>>
>>>>> java 8446 root 726u REG 253,2 536917902 26087368
>>>>> /home/work/data/soft/kafka-0.8/data/_oakbay_v2_search_topic_ypgsearch_yellowpageV2-0/00000000000069457098.log
>>>>> (deleted)
>>>>>
>>>>> Notice the "(deleted)" text in that output. The last time I looked at
>>>>> that output, I thought it was the user who had added the "deleted"
>>>>> text to help us understand the problem. But today I read up on the
>>>>> output format of lsof, and it turns out that lsof itself adds that
>>>>> hint whenever a file has already been deleted (possibly by a
>>>>> different process) while some other process is still holding open
>>>>> resources on that (deleted) file [2].
>>>>>
>>>>> So in the context of the issue that we are discussing, and the way
>>>>> Kafka deals with async deletes (i.e. first by attempting a rename of
>>>>> the log/index files), I think this all makes sense now. What I think
>>>>> is happening is that some (other?) process (not sure what or why) has
>>>>> already deleted the log file that Kafka is using for the LogSegment.
>>>>> The LogSegment however still has an open FileChannel on that deleted
>>>>> file (which is why the open file descriptor is held on to and shows
>>>>> up in that output). Kafka, at some point in time, triggers an async
>>>>> delete of the LogSegment, which involves a file rename of that
>>>>> (already deleted) log file. The rename fails (because the original
>>>>> file path isn't there anymore). As a result, we end up throwing that
>>>>> "failed to rename" KafkaStorageException and thus leave behind the
>>>>> open FileChannel, which stays open forever (till the Kafka process
>>>>> exits).
>>>>>
>>>>> So I think we should:
>>>>>
>>>>> 1) Find out what deletes the underlying log file(s), and why. I'll
>>>>> add a reply to that original mail discussion asking the user if he
>>>>> can provide more details.
>>>>>
>>>>> 2) Handle this case and close the FileChannel. The patch that's been
>>>>> uploaded to the review board (https://reviews.apache.org/r/29755/)
>>>>> does that. The "immediate delete" on failure to rename involves
>>>>> (safely) closing the open FileChannel and (safely) deleting the
>>>>> (possibly non-existent) file.
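For illustration, the fallback described in (2) might look roughly like this. This is a sketch with hypothetical names (`RenameFallbackSketch`, `renameOrDeleteNow`), not the actual code in the review-board patch:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;

public class RenameFallbackSketch {

    // Try the rename that precedes Kafka's async delete. If it fails
    // (e.g. because the file was already deleted externally, so the
    // source path no longer exists), close the still-open FileChannel
    // and attempt an immediate delete instead of leaking the file
    // descriptor until the process exits.
    static void renameOrDeleteNow(File logFile, FileChannel channel) throws IOException {
        final File renamed = new File(logFile.getPath() + ".deleted");
        if (!logFile.renameTo(renamed)) {
            channel.close();   // releases the "(deleted)" entry seen in lsof
            logFile.delete();  // safe no-op if the path no longer exists
        }
    }
}
```

Run against a file that was deleted out from under an open channel (as in the reproducer below), this closes the descriptor instead of leaving it pinned.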
>>>>>
>>>>> By the way, this entire thing can be easily reproduced by running the
>>>>> following program, which first creates a file and opens a FileChannel
>>>>> to that file, then waits for the user to delete that file externally
>>>>> (I used the rm command), and then tries to rename the deleted file,
>>>>> which fails. In between each of these steps, you can run the lsof
>>>>> command externally to see the open file resources (I used
>>>>> 'lsof | grep test.log'):
>>>>>
>>>>> import java.io.File;
>>>>> import java.io.RandomAccessFile;
>>>>> import java.nio.channels.FileChannel;
>>>>>
>>>>> public class DeletedFileRenameRepro {
>>>>>
>>>>>     public static void main(String[] args) throws Exception {
>>>>>         // Open a file and file channel for read/write
>>>>>         // (change this path relevantly if you plan to run it)
>>>>>         final File originalLogFile = new File("/home/jaikiran/deleteme/test.log");
>>>>>         final FileChannel fileChannel =
>>>>>                 new RandomAccessFile(originalLogFile, "rw").getChannel();
>>>>>         System.out.println("Opened file channel to " + originalLogFile);
>>>>>
>>>>>         // wait for the underlying file to be deleted externally
>>>>>         System.out.println("Waiting for " + originalLogFile
>>>>>                 + " to be deleted externally. Press any key after the file is deleted");
>>>>>         System.in.read();
>>>>>
>>>>>         // wait for the user to check the lsof output
>>>>>         System.out.println(originalLogFile + " seems to have been deleted"
>>>>>                 + " externally, check lsof command output to see open file resources.");
>>>>>         System.out.println("Press any key to try renaming this already"
>>>>>                 + " deleted file, from the program");
>>>>>         System.in.read();
>>>>>
>>>>>         // try the rename
>>>>>         final File fileToRenameTo = new File(originalLogFile.getPath() + ".deleted");
>>>>>         System.out.println("Trying to rename " + originalLogFile
>>>>>                 + " to " + fileToRenameTo);
>>>>>         final boolean renameSucceeded = originalLogFile.renameTo(fileToRenameTo);
>>>>>         if (renameSucceeded) {
>>>>>             System.out.println("Rename SUCCEEDED. Renamed file exists? "
>>>>>                     + fileToRenameTo.exists());
>>>>>         } else {
>>>>>             System.out.println("FAILED to rename file " + originalLogFile
>>>>>                     + " to " + fileToRenameTo);
>>>>>         }
>>>>>
>>>>>         // wait for the user to check the lsof output after the failed rename
>>>>>         System.out.println("Check the lsof output and press any key to"
>>>>>                 + " close the open file channel to the deleted file");
>>>>>         System.in.read();
>>>>>
>>>>>         // close the file channel
>>>>>         fileChannel.close();
>>>>>
>>>>>         // let the user check the lsof output one final time; this time
>>>>>         // there will be no open file resources from this program
>>>>>         System.out.println("File channel closed. Check the lsof output"
>>>>>                 + " and press any key to terminate the program");
>>>>>         System.in.read();
>>>>>
>>>>>         // all done, exit
>>>>>         System.out.println("Program will terminate");
>>>>>     }
>>>>> }
>>>>>
>>>>> [1] http://mail-archives.apache.org/mod_mbox/kafka-users/201501.mbox/%3CCAA4R6b-7gSbPp5_ebGpwYyNibDAwE_%2BwoE%2BKbiMuU27-2j%2BLkg%40mail.gmail.com%3E
>>>>> [2] http://unixhelp.ed.ac.uk/CGI/man-cgi?lsof+8
>>>>>
>>>>> -Jaikiran
>>>>>
>>>>> On Saturday 24 January 2015 11:12 PM, Jay Kreps wrote:
>>>>>
>>>>>> Hey guys,
>>>>>>
>>>>>> Jaikiran posted a patch on KAFKA-1853 to improve the handling of
>>>>>> failures during delete:
>>>>>> https://issues.apache.org/jira/browse/KAFKA-1853
>>>>>>
>>>>>> The core problem here is that we are doing File.rename() as part of
>>>>>> the delete sequence, which returns false if the rename failed. Our
>>>>>> file delete sequence is something like the following:
>>>>>>
>>>>>> 1. Remove the file from the index so no new reads can begin on it.
>>>>>> 2. Rename the file to xyz.deleted so that if we crash it will get
>>>>>> cleaned up.
>>>>>> 3. Schedule a task to delete the file in 30 seconds or so, when any
>>>>>> in-progress reads have likely completed.
>>>>>> The goal here is to avoid errors on in-progress reads but also to
>>>>>> avoid locking on all reads.
>>>>>>
>>>>>> The question is what to do when rename fails. Previously if this
>>>>>> happened we actually didn't pay attention and would fail to delete
>>>>>> the file entirely. This patch changes it so that if the rename fails
>>>>>> we log an error and force an immediate delete.
>>>>>>
>>>>>> I think this is the right thing to do, but I guess the real question
>>>>>> is why would rename fail? Some possibilities:
>>>>>> http://stackoverflow.com/questions/2372374/why-would-a-file-rename-fail-in-java
>>>>>>
>>>>>> An alternative would be to treat this as a filesystem error and shut
>>>>>> down as we do elsewhere.
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>> -Jay
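Jay's three-step sequence, minus the index bookkeeping of step 1, can be sketched as follows. Class and method names are hypothetical, and the 30-second delay is a parameter here purely for illustration:

```java
import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class AsyncDeleteSketch {

    // Daemon thread so a pending delete never keeps the JVM alive.
    private static final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "segment-delete-sketch");
                t.setDaemon(true);
                return t;
            });

    // Step 2: rename the segment file to xyz.deleted so a crash leaves a
    // recognizable file to clean up. Step 3: schedule the real delete for
    // later, when in-progress reads have likely completed. Returns false
    // when the rename fails -- the case this thread is debating.
    static boolean asyncDelete(File segmentFile, long delayMs) {
        final File marked = new File(segmentFile.getPath() + ".deleted");
        if (!segmentFile.renameTo(marked)) {
            return false;
        }
        scheduler.schedule(() -> { marked.delete(); }, delayMs, TimeUnit.MILLISECONDS);
        return true;
    }
}
```

With this structure, the open question in the thread is exactly the `return false` branch: log and delete immediately, or treat it as a fatal filesystem error and shut down.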