And regarding my aside: I hadn't realized that the idleTimeout is canceled when the rollInterval fires. That's annoying. So setting a lower idleTimeout, and drastically decreasing maxOpenFiles to at most 2x the number of files you could actually have open at once, is probably necessary.
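A minimal sketch of what that tuning might look like in the agent config, using the daily-bucketed path from the thread; the agent/sink names (a1/k1) and the numbers are placeholders, not recommendations:

  a1.sinks.k1.type = hdfs
  a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
  # roll each bucket's file on a timer
  a1.sinks.k1.hdfs.rollInterval = 3600
  # close writers that have gone quiet well before the next roll,
  # so stale buckets (e.g. yesterday's) get renamed from .tmp promptly
  a1.sinks.k1.hdfs.idleTimeout = 300
  # cap the writer cache at roughly 2x the buckets expected open at once
  a1.sinks.k1.hdfs.maxOpenFiles = 50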
On Thu, Jan 17, 2013 at 6:20 PM, Connor Woodson <[email protected]> wrote:

> @Mohit:
>
> For the HDFS Sink, the tmp files are placed based on the hadoop.tmp.dir
> property. The default location is /tmp/hadoop-${user.name}. To change this
> you can add -Dhadoop.tmp.dir=<path> to your Flume command line call, or you
> can specify the property in the core-site.xml of wherever your HADOOP_HOME
> environment variable points to.
>
> - Connor
>
>
> On Thu, Jan 17, 2013 at 6:19 PM, Connor Woodson <[email protected]> wrote:
>
>> Whether idleTimeout is lower or higher than rollInterval is a preference.
>> Set it lower and, assuming you get one message right at the turn of the
>> hour, you will have some part of that hour without any bucket writers;
>> but if you get another message at the end of the hour, you will end up
>> with two files instead of one. Set idleTimeout to be longer and you will
>> get just one file, but also (in the worst case) you will have twice as
>> many bucketwriters open; so it all depends on how many files you want /
>> how much memory you have to spare.
>>
>> - Connor
>>
>> An aside:
>> bucketwriters, after being closed by rollInterval, aren't really a memory
>> leak; they are just very rarely useful to keep around (your path could
>> rely on hostname, and you could use a rollInterval, and then those
>> bucketwriters would still remain useful). And they will get removed
>> eventually; by default, after you've created your 5001st bucketwriter,
>> the first (or whichever was used longest ago) will be removed.
>>
>> And I don't think that's the cause behind FLUME-1850, as he did have an
>> idleTimeout set at 15 minutes.
>>
>>
>> On Thu, Jan 17, 2013 at 6:08 PM, Juhani Connolly <[email protected]> wrote:
>>
>>> It's also useful if you want files to get promptly closed and renamed
>>> from the .tmp or whatever.
>>>
>>> We use it with something like a 30-second setting (we have a constant
>>> stream of data) and hourly bucketing.
>>>
>>> There is also the issue that files closed by rollInterval are never
>>> removed from the internal linked list, so it actually causes a small
>>> memory leak (which can get big in the long term if you have a lot of
>>> files and hourly renames). I believe this is what is causing the OOM
>>> Mohit is getting in FLUME-1850.
>>>
>>> So I personally would recommend using it (with a setting that will close
>>> files before rollInterval does).
>>>
>>>
>>> On 01/18/2013 06:38 AM, Bhaskar V. Karambelkar wrote:
>>>
>>>> Ah I see. Again something useful to have in the flume user guide.
>>>>
>>>> On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson <[email protected]> wrote:
>>>>
>>>>> The rollInterval will still cause the last 01-17 file to be closed
>>>>> eventually. The way the HDFS sink works with the different files is
>>>>> that each unique path is handled by a different BucketWriter object.
>>>>> The sink can hold as many of these objects as specified by
>>>>> hdfs.maxOpenFiles (default: 5000), and bucketwriters are only removed
>>>>> when you create the 5001st writer (5001st unique path). However,
>>>>> generally once a writer is closed it is never used again (all of your
>>>>> 01-17 writers will never be used again). To avoid keeping them in the
>>>>> sink's internal list of writers, the idleTimeout is a specified number
>>>>> of seconds during which no data is received by the BucketWriter. After
>>>>> this time, the writer will try to close itself and will then tell the
>>>>> sink to remove it, thus freeing up everything used by the bucketwriter.
>>>>>
>>>>> So the idleTimeout is just a setting to help limit memory usage by the
>>>>> HDFS sink. The ideal time for it is longer than the maximum time
>>>>> between events (capped at the rollInterval) - if you know you'll
>>>>> receive a constant stream of events you might just set it to a minute
>>>>> or something. Or if you are fine with having multiple files open per
>>>>> hour, you can set it to a lower number, maybe just over the average
>>>>> time between events. For me, in just testing, I set it >= rollInterval
>>>>> for the cases when no events are received in a given hour (I'd rather
>>>>> keep the object alive for an extra hour than create files every 30
>>>>> minutes or something).
>>>>>
>>>>> Hope that was helpful,
>>>>>
>>>>> - Connor
>>>>>
>>>>>
>>>>> On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V. Karambelkar <[email protected]> wrote:
>>>>>
>>>>>> Say if I have
>>>>>>
>>>>>> a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
>>>>>> hdfs.rollInterval = 60
>>>>>>
>>>>>> Now, suppose there is a file /flume/events/2013-01-17/flume_XXXXXXXXX.tmp
>>>>>> that is not ready to be rolled over yet, i.e. its 60 seconds are not
>>>>>> up, and it's now past 12 midnight, i.e. a new day, so events start to
>>>>>> be written to /flume/events/2013-01-18/flume_XXXXXXXX.tmp.
>>>>>>
>>>>>> Will the 2013-01-17 file never be rolled over unless I have something
>>>>>> like hdfs.idleTimeout = 60? If so, how do Flume sinks keep track of
>>>>>> files they need to roll over after idleTimeout?
>>>>>>
>>>>>> In short, what's the exact use of the idleTimeout parameter?
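On the hadoop.tmp.dir point quoted above, a hedged example of the command-line form; the path and agent name are placeholders, and the same hadoop.tmp.dir property could instead be set in the core-site.xml under your HADOOP_HOME:

  # placeholder path; -D options are passed through to the agent's JVM
  bin/flume-ng agent --conf conf --conf-file flume.conf --name a1 \
      -Dhadoop.tmp.dir=/data/flume-tmp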
