And regarding my aside: I hadn't realized that the idleTimeout is canceled when the rollInterval fires. That's annoying. So setting a lower idleTimeout, and drastically decreasing maxOpenFiles to at most 2x the number of files you could actually have open at once, is probably necessary.
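A minimal sketch of what that tuning might look like in the agent config, using the daily-bucketed path from the thread; the agent/sink names (a1/k1) and the numbers are placeholders, not recommendations:

  a1.sinks.k1.type = hdfs
  a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
  # roll each bucket's file on a timer
  a1.sinks.k1.hdfs.rollInterval = 3600
  # close writers that have gone quiet well before the next roll,
  # so stale buckets (e.g. yesterday's) get renamed from .tmp promptly
  a1.sinks.k1.hdfs.idleTimeout = 300
  # cap the writer cache at roughly 2x the buckets expected open at once
  a1.sinks.k1.hdfs.maxOpenFiles = 50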
On Thu, Jan 17, 2013 at 6:20 PM, Connor Woodson <[email protected]> wrote:

> @Mohit:
>
> For the HDFS Sink, the tmp files are placed based on the hadoop.tmp.dir
> property. The default location is /tmp/hadoop-${user.name}. To change this
> you can add -Dhadoop.tmp.dir=<path> to your Flume command line call, or you
> can specify the property in the core-site.xml of wherever your HADOOP_HOME
> environment variable points to.
>
> - Connor
>
>
> On Thu, Jan 17, 2013 at 6:19 PM, Connor Woodson <[email protected]> wrote:
>
>> Whether idleTimeout is lower or higher than rollInterval is a preference.
>> Set it lower and, assuming you get one message right at the turn of the
>> hour, you will have some part of that hour without any bucket writers;
>> but if you get another message at the end of the hour, you will end up
>> with two files instead of one. Set idleTimeout to be longer and you will
>> get just one file, but also (in the worst case) you will have twice as
>> many bucketwriters open; so it all depends on how many files you want /
>> how much memory you have to spare.
>>
>> - Connor
>>
>> An aside:
>> bucketwriters, after being closed by rollInterval, aren't really a memory
>> leak; they are just very rarely useful to keep around (your path could
>> rely on hostname, and you could use a rollInterval, and then those
>> bucketwriters would still remain useful). And they will get removed
>> eventually; by default, after you've created your 5001st bucketwriter,
>> the first (or whichever was used longest ago) will be removed.
>>
>> And I don't think that's the cause behind FLUME-1850, as he did have an
>> idleTimeout set at 15 minutes.
>>
>>
>> On Thu, Jan 17, 2013 at 6:08 PM, Juhani Connolly <[email protected]> wrote:
>>
>>> It's also useful if you want files to get promptly closed and renamed
>>> from the .tmp or whatever.
>>>
>>> We use it with something like a 30-second setting (we have a constant
>>> stream of data) and hourly bucketing.
>>>
>>> There is also the issue that files closed by rollInterval are never
>>> removed from the internal linked list, so it actually causes a small
>>> memory leak (which can get big in the long term if you have a lot of
>>> files and hourly renames). I believe this is what is causing the OOM
>>> Mohit is getting in FLUME-1850.
>>>
>>> So I personally would recommend using it (with a setting that will close
>>> files before rollInterval does).
>>>
>>>
>>> On 01/18/2013 06:38 AM, Bhaskar V. Karambelkar wrote:
>>>
>>>> Ah I see. Again something useful to have in the flume user guide.
>>>>
>>>> On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson <[email protected]> wrote:
>>>>
>>>>> The rollInterval will still cause the last 01-17 file to be closed
>>>>> eventually. The way the HDFS sink works with the different files is
>>>>> that each unique path is handled by a different BucketWriter object.
>>>>> The sink can hold as many of these objects as specified by
>>>>> hdfs.maxOpenFiles (default: 5000), and bucketwriters are only removed
>>>>> when you create the 5001st writer (5001st unique path). However,
>>>>> generally once a writer is closed it is never used again (all of your
>>>>> 01-17 writers will never be used again). To avoid keeping them in the
>>>>> sink's internal list of writers, the idleTimeout is a specified number
>>>>> of seconds during which no data is received by the BucketWriter. After
>>>>> this time, the writer will try to close itself and will then tell the
>>>>> sink to remove it, thus freeing up everything used by the bucketwriter.
>>>>>
>>>>> So the idleTimeout is just a setting to help limit memory usage by the
>>>>> HDFS sink. The ideal time for it is longer than the maximum time
>>>>> between events (capped at the rollInterval) - if you know you'll
>>>>> receive a constant stream of events you might just set it to a minute
>>>>> or something. Or if you are fine with having multiple files open per
>>>>> hour, you can set it to a lower number, maybe just over the average
>>>>> time between events. For me, in just testing, I set it >= rollInterval
>>>>> for the cases when no events are received in a given hour (I'd rather
>>>>> keep the object alive for an extra hour than create files every 30
>>>>> minutes or something).
>>>>>
>>>>> Hope that was helpful,
>>>>>
>>>>> - Connor
>>>>>
>>>>>
>>>>> On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V. Karambelkar <[email protected]> wrote:
>>>>>
>>>>>> Say if I have
>>>>>>
>>>>>> a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
>>>>>> hdfs.rollInterval = 60
>>>>>>
>>>>>> Now, suppose there is a file /flume/events/2013-01-17/flume_XXXXXXXXX.tmp
>>>>>> that is not ready to be rolled over yet, i.e. its 60 seconds are not
>>>>>> up, and it's now past 12 midnight, i.e. a new day, so events start to
>>>>>> be written to /flume/events/2013-01-18/flume_XXXXXXXX.tmp.
>>>>>>
>>>>>> Will the 2013-01-17 file never be rolled over unless I have something
>>>>>> like hdfs.idleTimeout = 60? If so, how do Flume sinks keep track of
>>>>>> files they need to roll over after idleTimeout?
>>>>>>
>>>>>> In short, what's the exact use of the idleTimeout parameter?
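On the hadoop.tmp.dir point quoted above, a hedged example of the command-line form; the path and agent name are placeholders, and the same hadoop.tmp.dir property could instead be set in the core-site.xml under your HADOOP_HOME:

  # placeholder path; -D options are passed through to the agent's JVM
  bin/flume-ng agent --conf conf --conf-file flume.conf --name a1 \
      -Dhadoop.tmp.dir=/data/flume-tmp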
