I have been using it and it's a great feature to have. One question I have, though: what happens when Flume dies unexpectedly? Does it leave .tmp files behind, and if so, how do you clean those up and close the files gracefully?
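
For now I'm cleaning them up by hand with something like the below (untested guesswork on my part; the /flume/events path and the .tmp suffix are just the ones from the thread below, so adjust them to your config):

    # find leftover temp files from crashed agents
    hdfs dfs -ls -R /flume/events | grep '\.tmp$'
    # once sure no agent is still writing to a file, either delete it...
    hdfs dfs -rm /flume/events/2013-01-17/flume_XXXXXXXXX.tmp
    # ...or rename it into place to keep the data
    hdfs dfs -mv /flume/events/2013-01-17/flume_XXXXXXXXX.tmp /flume/events/2013-01-17/flume_XXXXXXXXX

But I'd be happy to hear of a cleaner approach.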
On Thu, Jan 17, 2013 at 6:08 PM, Juhani Connolly <[email protected]> wrote:
> It's also useful if you want files to get promptly closed and renamed
> from the .tmp or whatever.
>
> We use it with something like a 30-second setting (we have a constant
> stream of data) and hourly bucketing.
>
> There is also the issue that files closed by rollInterval are never
> removed from the internal LinkedList, so it actually causes a small
> memory leak (which can get big in the long term if you have a lot of
> files and hourly renames). I believe this is what is causing the OOM
> Mohit is getting in FLUME-1850.
>
> So I personally would recommend using it (with a setting that will
> close files before rollInterval does).
>
> On 01/18/2013 06:38 AM, Bhaskar V. Karambelkar wrote:
>
>> Ah, I see. Again, something useful to have in the Flume user guide.
>>
>> On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson <[email protected]>
>> wrote:
>>
>>> The rollInterval will still cause the last 01-17 file to be closed
>>> eventually. The way the HDFS sink works with the different files is
>>> that each unique path is handled by a different BucketWriter object.
>>> The sink can hold as many of these objects as specified by
>>> hdfs.maxOpenFiles (default: 5000), and BucketWriters are only removed
>>> when you create the 5001st writer (5001st unique path). However,
>>> generally once a writer is closed it is never used again (all of your
>>> 01-17 writers will never be used again). To avoid keeping them in the
>>> sink's internal list of writers, the idleTimeout is a specified number
>>> of seconds during which no data is received by the BucketWriter. After
>>> this time, the writer will try to close itself and will then tell the
>>> sink to remove it, thus freeing up everything used by the
>>> BucketWriter.
>>>
>>> So the idleTimeout is just a setting to help limit memory usage by
>>> the HDFS sink. The ideal time for it is longer than the maximum time
>>> between events (capped at the rollInterval): if you know you'll
>>> receive a constant stream of events, you might just set it to a minute
>>> or so. Or, if you are fine with having multiple files open per hour,
>>> you can set it to a lower number, maybe just over the average time
>>> between events. For me, in just testing, I set it >= rollInterval for
>>> the cases when no events are received in a given hour (I'd rather keep
>>> the object alive for an extra hour than create files every 30 minutes
>>> or something).
>>>
>>> Hope that was helpful,
>>>
>>> - Connor
>>>
>>>
>>> On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V. Karambelkar
>>> <[email protected]> wrote:
>>>
>>>> Say I have
>>>>
>>>> a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/
>>>> a1.sinks.k1.hdfs.rollInterval = 60
>>>>
>>>> Now, suppose there is a file
>>>> /flume/events/2013-01-17/flume_XXXXXXXXX.tmp
>>>> This file is not ready to be rolled over yet, i.e. 60 seconds are
>>>> not up, and now it's past 12 midnight, i.e. a new day, and events
>>>> start to be written to
>>>> /flume/events/2013-01-18/flume_XXXXXXXX.tmp
>>>>
>>>> Will the 2013-01-17 file never be rolled over unless I have
>>>> something like hdfs.idleTimeout = 60?
>>>> If so, how do Flume sinks keep track of files they need to roll
>>>> over after idleTimeout?
>>>>
>>>> In short, what's the exact use of the idleTimeout parameter?
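
Putting the thread's advice together, my sink config ended up along these lines (just a sketch of my own setup, not a tested recommendation; hdfs.maxOpenFiles is shown with its documented default of 5000):

    a1.sinks.k1.type = hdfs
    # hourly bucketing, as in Juhani's setup
    a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H
    # hard cap on how long a file can stay open
    a1.sinks.k1.hdfs.rollInterval = 3600
    # close idle files well before rollInterval does,
    # so their BucketWriters get evicted from the sink
    a1.sinks.k1.hdfs.idleTimeout = 30
    # upper bound on cached BucketWriters
    a1.sinks.k1.hdfs.maxOpenFiles = 5000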
