Agree with Sriram / Jun, I think the error should be treated as fatal and we should shut down the broker gracefully.
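For illustration, here is a minimal sketch of what "treat the rename failure as fatal" could look like. The class and method names are hypothetical and this is not the actual broker code (which would go through its own controlled-shutdown path in the Scala log layer):

import java.io.File;

// Illustrative sketch only: if the rename step of the async delete fails,
// treat it as a fatal storage error and bring the process down instead of
// leaving the open FileChannel (and its file descriptor) behind.
public class FatalRenameSketch {

    static void asyncDeleteSegment(File segmentFile) {
        final File renamed = new File(segmentFile.getPath() + ".deleted");
        if (!segmentFile.renameTo(renamed)) {
            System.err.println("FATAL: failed to rename " + segmentFile
                    + " during delete; shutting down");
            // In the real broker this would trigger a graceful/controlled shutdown;
            // halting here just illustrates the "fatal" decision.
            Runtime.getRuntime().halt(1);
        }
        // Otherwise the renamed file would be scheduled for deletion later,
        // after in-progress reads have likely completed.
    }

    public static void main(String[] args) {
        // A nonexistent path makes the rename fail, exercising the fatal branch.
        asyncDeleteSegment(new File("/tmp/does-not-exist.log"));
    }
}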
On Mon, Jan 26, 2015 at 8:41 AM, Jun Rao <j...@confluent.io> wrote:

We probably can default the log dir to a relative path, something like ../kafka-logs.

As for I/O errors on rename, I agree that we probably should just shut down the broker since it's not expected to happen.

Thanks,

Jun

On Mon, Jan 26, 2015 at 5:54 AM, Jaikiran Pai <jai.forums2...@gmail.com> wrote:

Having looked at the logs the user posted, I don't think this specific issue has to do with the /tmp path.

However, now that the /tmp path is being discussed, I think it's a good idea that we default the Kafka logs to a certain folder. As Jay notes, it makes it very easy to just download and start the servers without having to fiddle with the configs when you are just starting out. Having said that, when I started out with Kafka, I found /tmp to be an odd place to default the path to. I expected it to default to a folder within the Kafka install, somewhere like a KAFKA_INSTALL_FOLDER/data/kafka-logs/ folder. Is that something we should do?

-Jaikiran

On Monday 26 January 2015 12:23 AM, Jay Kreps wrote:

Hmm, but I don't think tmp gets cleaned while the server is running...

The reason for using tmp was because we don't know which directory they will use and we don't want them to have to edit configuration for the simple "out of the box" getting started tutorial. I actually do think that is important. Maybe an intermediate step we could do is just call out this setting in the quickstart so people know where data is going and know they need to configure it later...

-Jay

On Sun, Jan 25, 2015 at 9:32 AM, Joe Stein <joe.st...@stealth.ly> wrote:

This feels like another type of symptom from people using /tmp/ for their logs. Personally, I would rather use /mnt/data or something, and if that doesn't exist on their machine we can raise an exception, or have no default and force the user to set it.

/*******************************************
Joe Stein
Founder, Principal Consultant
Big Data Open Source Security LLC
http://www.stealth.ly
Twitter: @allthingshadoop
********************************************/

On Jan 25, 2015 11:37 AM, "Jay Kreps" <jay.kr...@gmail.com> wrote:

I think you are right, good catch. It could be that this user deleted the files manually, but I wonder if there isn't some way that this is a Kafka bug--e.g. if multiple types of retention policies kick in at the same time, do we synchronize that properly?

-Jay

On Sat, Jan 24, 2015 at 9:26 PM, Jaikiran Pai <jai.forums2...@gmail.com> wrote:

Hi Jay,

I spent some more time on this today and went back to the original thread which brought up the issue with file leaks [1]. I think the lsof output in those logs has a very important hint:

java 8446 root 725u REG 253,2 536910838 26087364 /home/work/data/soft/kafka-0.8/data/_oakbay_v2_search_topic_ypgsearch_yellowpageV2-0/00000000000068818668.log (deleted)

java 8446 root 726u REG 253,2 536917902 26087368 /home/work/data/soft/kafka-0.8/data/_oakbay_v2_search_topic_ypgsearch_yellowpageV2-0/00000000000069457098.log (deleted)

Notice the "(deleted)" text in that output.
The last time I looked at that output, I thought it was the user who had added that "deleted" text to help us understand the problem. But today I read up on the output format of lsof, and it turns out that lsof itself adds that hint whenever a file has already been deleted (possibly by a different process) while some other process is still holding open resources on that deleted file [2].

So in the context of the issue we are discussing, and the way Kafka deals with async deletes (i.e. first by attempting a rename of the log/index files), I think this all makes sense now. What I think is happening is: some (other?) process (not sure what/why) has already deleted the log file that Kafka is using for the LogSegment. The LogSegment, however, still has an open FileChannel on that deleted file (which is why the open file descriptor is still held and shows up in that output). Now Kafka, at some point in time, triggers an async delete of the LogSegment, which involves a file rename of that (already deleted) log file. The rename fails (because the original file path isn't there anymore). As a result, we end up throwing that "failed to rename" KafkaStorageException and thus leave behind the open FileChannel, which stays open forever (until the Kafka process exits).

So I think we should:

1) Find out what/why deletes the underlying log file(s). I'll add a reply to that original mail discussion asking the user if he can provide more details.

2) Handle this case and close the FileChannel. The patch that's been uploaded to the review board https://reviews.apache.org/r/29755/ does that. The "immediate delete" on failure to rename involves (safely) closing the open FileChannel and (safely) deleting the (possibly non-existent) file. A rough sketch of that fallback is shown below.
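To make that fallback concrete, here is a minimal, self-contained sketch, assuming hypothetical class and method names; the actual patch lives in Kafka's Scala log layer and is on the review board link above:

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;

// Illustrative sketch of the "immediate delete" fallback: if renaming a segment
// file to *.deleted fails, close the open FileChannel (releasing the file
// descriptor) and delete whatever may still be on disk, instead of leaking the
// channel.
public class ImmediateDeleteSketch {

    static void deleteSegment(File logFile, FileChannel channel) throws IOException {
        final File renamed = new File(logFile.getPath() + ".deleted");
        if (logFile.renameTo(renamed)) {
            // Normal path: rename succeeded; the renamed file would be
            // scheduled for deletion later (not shown here).
            return;
        }
        // Fallback path: rename failed (e.g. the file was already deleted externally).
        // Close the channel so the descriptor of the (possibly deleted) file is released...
        channel.close();
        // ...and attempt to delete the file, ignoring the result since it may not exist.
        logFile.delete();
    }

    public static void main(String[] args) throws IOException {
        final File logFile = new File("test-segment.log"); // hypothetical path
        final FileChannel channel = new RandomAccessFile(logFile, "rw").getChannel();
        // Delete the file out from under the open channel to simulate the reported scenario,
        // then run the fallback.
        logFile.delete();
        deleteSegment(logFile, channel);
        System.out.println("Fallback completed; no file descriptor left open.");
    }
}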
By the way, this entire thing can be easily reproduced by running the following program, which first creates a file and opens a FileChannel to it, then waits for the user to delete that file externally (I used the rm command), and then tries to rename the deleted file, which fails. In between each of these steps, you can run the lsof command externally to see the open file resources (I used 'lsof | grep test.log'):

import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;

public class DeletedFileRenameTest {

    public static void main(String[] args) throws Exception {
        // Open a file and a file channel for read/write
        final File originalLogFile = new File("/home/jaikiran/deleteme/test.log"); // change this path if you plan to run it
        final FileChannel fileChannel = new RandomAccessFile(originalLogFile, "rw").getChannel();
        System.out.println("Opened file channel to " + originalLogFile);
        // wait for the underlying file to be deleted externally
        System.out.println("Waiting for " + originalLogFile + " to be deleted externally. Press any key after the file is deleted");
        System.in.read();
        // wait for the user to check the lsof output
        System.out.println(originalLogFile + " seems to have been deleted externally, check the lsof command output to see open file resources.");
        System.out.println("Press any key to try renaming this already deleted file, from the program");
        System.in.read();
        // try the rename
        final File fileToRenameTo = new File(originalLogFile.getPath() + ".deleted");
        System.out.println("Trying to rename " + originalLogFile + " to " + fileToRenameTo);
        final boolean renamedSucceeded = originalLogFile.renameTo(fileToRenameTo);
        if (renamedSucceeded) {
            System.out.println("Rename SUCCEEDED. Renamed file exists? " + fileToRenameTo.exists());
        } else {
            System.out.println("FAILED to rename file " + originalLogFile + " to " + fileToRenameTo);
        }
        // wait for the user to check the lsof output after the rename failed
        System.out.println("Check the lsof output and press any key to close the open file channel to the deleted file");
        System.in.read();
        // close the file channel
        fileChannel.close();
        // let the user check the lsof output one final time; this time there will be no open file resources from this program
        System.out.println("File channel closed. Check the lsof output and press any key to terminate the program");
        System.in.read();
        // all done, exit
        System.out.println("Program will terminate");
    }
}

[1] http://mail-archives.apache.org/mod_mbox/kafka-users/201501.mbox/%3CCAA4R6b-7gSbPp5_ebGpwYyNibDAwE_%2BwoE%2BKbiMuU27-2j%2BLkg%40mail.gmail.com%3E
[2] http://unixhelp.ed.ac.uk/CGI/man-cgi?lsof+8

-Jaikiran

On Saturday 24 January 2015 11:12 PM, Jay Kreps wrote:

Hey guys,

Jaikiran posted a patch on KAFKA-1853 to improve the handling of failures during delete.
https://issues.apache.org/jira/browse/KAFKA-1853

The core problem here is that we are doing File.rename() as part of the delete sequence, which returns false if the rename failed. Our file delete sequence is something like the following (a rough sketch follows the list):

1. Remove the file from the index so no new reads can begin on it
2. Rename the file to xyz.deleted so that if we crash it will get cleaned up
3. Schedule a task to delete the file in 30 seconds or so, when any in-progress reads have likely completed. The goal here is to avoid errors on in-progress reads but also avoid locking on all reads.
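As a rough illustration of that three-step sequence, here is a plain Java sketch; the index map, names, and scheduler are placeholders, not Kafka's actual log code:

import java.io.File;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of the async delete sequence described above.
public class AsyncDeleteSequenceSketch {

    private static final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    static void asyncDelete(Map<String, File> index, String segmentName) {
        // 1. Remove the segment from the index so no new reads can begin on it.
        final File segmentFile = index.remove(segmentName);
        // 2. Rename it to *.deleted so that if we crash before step 3 it still gets cleaned up.
        final File deleted = new File(segmentFile.getPath() + ".deleted");
        if (!segmentFile.renameTo(deleted)) {
            // This is the failure case the thread is about; see the fallback sketch earlier.
            throw new IllegalStateException("Failed to rename " + segmentFile + " to " + deleted);
        }
        // 3. Physically delete after a delay, once any in-progress reads have likely completed.
        scheduler.schedule(() -> { deleted.delete(); }, 30, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        final Map<String, File> index = new HashMap<>();
        final File segment = new File("segment-000.log"); // hypothetical segment file
        new RandomAccessFile(segment, "rw").close();      // create it
        index.put("segment-000", segment);
        asyncDelete(index, "segment-000");
        scheduler.shutdown(); // the already-scheduled delete still runs; the program exits after ~30s
    }
}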
The question is what to do when the rename fails? Previously if this happened we actually didn't pay attention and would fail to delete the file entirely. This patch changes it so that if the rename fails we log an error and force an immediate delete.

I think this is the right thing to do, but I guess the real question is why would the rename fail? Some possibilities:
http://stackoverflow.com/questions/2372374/why-would-a-file-rename-fail-in-java

An alternative would be to treat this as a filesystem error and shut down as we do elsewhere.

Thoughts?

-Jay

--
Guozhang