Ah, I see. Again, something useful to have in the Flume user guide.
On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson <[email protected]> wrote:
> The rollInterval will still cause the last 01-17 file to be closed
> eventually. The way the HDFS sink works with the different files is that
> each unique path is handled by a separate BucketWriter object. The sink can
> hold as many writers as specified by hdfs.maxOpenFiles (default: 5000), and
> BucketWriters are only evicted when you create the 5,001st writer (the
> 5,001st unique path). However, once a writer is closed it is generally
> never used again (none of your 01-17 writers will ever be used again). To
> avoid keeping them in the sink's internal list of writers, idleTimeout
> specifies a number of seconds during which no data is received by a
> BucketWriter. After this time, the writer tries to close itself and then
> tells the sink to remove it, freeing everything held by that BucketWriter.
>
> So idleTimeout is just a setting to help limit the HDFS sink's memory
> usage. The ideal value is longer than the maximum time between events
> (capped at the rollInterval) - if you know you'll receive a constant stream
> of events, you might just set it to a minute or so. If you are fine with
> having multiple files open per hour, you can set it lower, perhaps just
> over the average time between events. In my own testing I set it
> >= rollInterval to cover the case where no events are received in a given
> hour (I'd rather keep the object alive for an extra hour than create new
> files every 30 minutes or so).
>
> Hope that was helpful,
>
> - Connor
>
>
> On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V. Karambelkar
> <[email protected]> wrote:
>>
>> Say if I have
>>
>> a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
>> hdfs.rollInterval = 60
>>
>> Now, if there is a file
>> /flume/events/2013-01-17/flume_XXXXXXXXX.tmp
>> and this file is not ready to be rolled over yet, i.e. 60 seconds are not
>> up, and now it's past midnight, i.e.
>> a new day, and events start to be written to
>> /flume/events/2013-01-18/flume_XXXXXXXX.tmp,
>>
>> will the 2013-01-17 file never be rolled over, unless I have something
>> like hdfs.idleTimeout = 60?
>> If so, how do Flume sinks keep track of the files they need to roll over
>> after idleTimeout?
>>
>> In short, what is the exact use of the idleTimeout parameter?
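Pulling the settings from the thread together, a sketch of an HDFS sink configuration that pairs rollInterval with idleTimeout might look like this (the agent/sink names a1/k1 and the path come from the question; the channel name c1 and the value 3600 are illustrative, not from the thread):

```properties
# Time-bucketed path: a new BucketWriter per day
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
# Roll the open file every 60 seconds
a1.sinks.k1.hdfs.rollInterval = 60
# Close and evict a writer that has seen no events for an hour,
# so yesterday's .tmp file doesn't linger open after midnight
a1.sinks.k1.hdfs.idleTimeout = 3600
```

With idleTimeout >= rollInterval, as Connor suggests, a day's last file is closed by rollInterval and the idle writer object is cleaned up an hour later.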
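As an illustration of the lifecycle Connor describes (a toy model, not Flume's actual code), here is a small Python sketch: one writer per unique path, an oldest-first eviction cap standing in for hdfs.maxOpenFiles, and an idle-timeout sweep that closes and removes writers that have stopped receiving events.

```python
import time
from collections import OrderedDict


class BucketWriter:
    """Toy stand-in for Flume's per-path writer object."""

    def __init__(self, path):
        self.path = path
        self.last_event = time.monotonic()
        self.closed = False

    def append(self, event):
        # Receiving data resets the idle clock.
        self.last_event = time.monotonic()

    def close(self):
        self.closed = True


class HdfsSinkModel:
    """Models the sink's internal list of writers."""

    def __init__(self, max_open_files=5000, idle_timeout=None):
        self.max_open_files = max_open_files
        self.idle_timeout = idle_timeout   # seconds, or None to disable
        self.writers = OrderedDict()       # path -> BucketWriter

    def write(self, path, event):
        writer = self.writers.get(path)
        if writer is None:
            # Creating one writer past the cap evicts the oldest writer.
            if len(self.writers) >= self.max_open_files:
                _, oldest = self.writers.popitem(last=False)
                oldest.close()
            writer = self.writers[path] = BucketWriter(path)
        writer.append(event)

    def sweep_idle(self, now=None):
        """Close and remove writers idle for at least idle_timeout seconds."""
        if self.idle_timeout is None:
            return []
        now = time.monotonic() if now is None else now
        expired = [p for p, w in self.writers.items()
                   if now - w.last_event >= self.idle_timeout]
        for path in expired:
            self.writers.pop(path).close()
        return expired
```

Without the sweep (idle_timeout=None), a closed day's writer would simply sit in the OrderedDict until thousands of newer unique paths pushed it out, which is exactly the memory cost idleTimeout avoids.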
