That breaks the use case idleTimeout was originally made for: making sure the file is closed promptly after data stops arriving. We use this to make sure the files are ready for our batch jobs, which run soon after. The time at which rollInterval triggers is unpredictable, as it resets every time any other type of roll is triggered (event count or size).

By making rollInterval behave properly, all of this is a non-issue. My recommendation to users would be not to use rollInterval if they're bucketing by time (it's redundant behavior).
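
For example, bucketing by time and leaving the time-based roll off might look something like this (just a sketch; the agent/sink names and numbers are illustrative, but the hdfs.* property names are the sink's):

a1.sinks.k1.type = hdfs
# bucket by day and hour; a new path means a new file anyway
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H
# no time-based roll (0 disables it); bucketing already splits by time
a1.sinks.k1.hdfs.rollInterval = 0
# still roll reliably on size or event count
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 100000
# close the .tmp file promptly once data stops arriving
a1.sinks.k1.hdfs.idleTimeout = 30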

Documentation could definitely be improved. Once we sort out the approach we want to take, I can write it up to make the difference and usage clearer.

On 01/18/2013 12:24 PM, Connor Woodson wrote:
The way idleTimeout works right now is that it's effectively another rollInterval; it works best when rollInterval is not set. So its best use is when you don't want a rollInterval and just want your bucketwriters to close when no events are coming through (caused by a path change or something else; you can still roll reliably with either count or size).

As such, perhaps it would be clearer if idleTimeout were renamed to idleRoll or similar?

And then change idleTimeout to count only the seconds since the writer was closed; if a bucketwriter has been closed for long enough, it will automatically remove itself. This type of idle will work well with rollInterval, while the other one doesn't (idleRoll + rollInterval creates two time-based rollers; there are certainly times for that, but not all of the time).

- Connor


On Thu, Jan 17, 2013 at 6:46 PM, Juhani Connolly <[email protected]> wrote:

    It seemed neater at the time. It's only an issue because
    rollInterval doesn't remove the entry in sfWriters. We could
    change it so that close doesn't cancel the idle timeout, and have
    the timeout check whether the writer is already closed, but that
    would be kind of ugly.

    @Mohit:

    When Flume dies unexpectedly, the .tmp file remains. When it
    restarts, there is some logic in the HDFS sink to recover it (and
    continue writing from there). I'm not actually sure of the
    specifics. You may want to just kill -9 a running Flume process
    on a test machine, then start it up, look at the logs, and see
    what happens with the output.

    If Flume dies cleanly, the file is properly closed.


    On 01/18/2013 11:23 AM, Connor Woodson wrote:
    And @ my aside: I hadn't realized that the idleTimeout is
    canceled when the rollInterval fires. That's annoying. So
    setting a lower idleTimeout, and drastically decreasing
    maxOpenFiles to at most 2 * the number of possibly-open files, is
    probably necessary.
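
    Something like this, say (numbers purely illustrative):

    a1.sinks.k1.hdfs.idleTimeout = 60
    # at most 2 * the number of paths that can be open at once,
    # e.g. with ~50 concurrently-written paths:
    a1.sinks.k1.hdfs.maxOpenFiles = 100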


    On Thu, Jan 17, 2013 at 6:20 PM, Connor Woodson
    <[email protected]> wrote:

        @Mohit:

        For the HDFS sink, the tmp files are placed based on the
        hadoop.tmp.dir property. The default location is
        /tmp/hadoop-${user.name}. To change this you can add
        -Dhadoop.tmp.dir=<path> to your Flume command line, or you
        can specify the property in the core-site.xml under wherever
        your HADOOP_HOME environment variable points.

        - Connor


        On Thu, Jan 17, 2013 at 6:19 PM, Connor Woodson
        <[email protected]> wrote:

            Whether idleTimeout is lower or higher than rollInterval
            is a preference. Set it lower and, assuming you get one
            message right on the turn of the hour, you will have some
            part of that hour without any bucketwriters; but if you
            get another message at the end of the hour, you will end
            up with two files instead of one. Set idleTimeout to be
            longer and you will get just one file, but also (in the
            worst case) twice as many bucketwriters open; so it all
            depends on how many files you want / how much memory you
            have to spare.
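
            For example, with hourly bucketing (numbers illustrative):

            # close quickly: fewer writers held open, but a quiet
            # stretch mid-hour can mean two files for that hour
            a1.sinks.k1.hdfs.rollInterval = 3600
            a1.sinks.k1.hdfs.idleTimeout = 300

            # or close lazily: one file per hour, but in the worst
            # case twice as many bucketwriters stay open
            # a1.sinks.k1.hdfs.idleTimeout = 3600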

            - Connor

            An aside:
            bucketwriters, after being closed by rollInterval, aren't
            really a memory leak; they are just very rarely useful to
            keep around (your path could rely on hostname, and you
            could use a rollInterval, and then those bucketwriters
            would still remain useful). And they will get removed
            eventually; by default, after you've created your 5001st
            bucketwriter, the first (or whichever was least recently
            used) will be removed.

            And I don't think that's the cause behind FLUME-1850, as
            he did have an idleTimeout set at 15 minutes.


            On Thu, Jan 17, 2013 at 6:08 PM, Juhani Connolly
            <[email protected]> wrote:

                It's also useful if you want files to get promptly
                closed and renamed from the .tmp or whatever.

                We use it with something like a 30-second setting (we
                have a constant stream of data) and hourly bucketing.
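
                Roughly like this (a sketch; the path and values are
                illustrative):

                a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H
                # with a constant stream, 30s of silence means the
                # hour has rolled over; close and rename the .tmp
                a1.sinks.k1.hdfs.idleTimeout = 30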

                There is also the issue that files closed by
                rollInterval are never removed from the internal
                linked list, so it actually causes a small memory
                leak (which can get big in the long term if you have
                a lot of files and hourly renames). I believe this is
                what is causing the OOM Mohit is getting in
                FLUME-1850.

                So I personally would recommend using it (with a
                setting that will close files before rollInterval
                does).


                On 01/18/2013 06:38 AM, Bhaskar V. Karambelkar wrote:

                    Ah, I see. Again, something useful to have in
                    the Flume user guide.

                    On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson
                    <[email protected]> wrote:

                        The rollInterval will still cause the last
                        01-17 file to be closed eventually. The way
                        the HDFS sink works with the different files
                        is that each unique path is handled by a
                        different BucketWriter object. The sink can
                        hold as many of these objects as specified by
                        hdfs.maxOpenFiles (default: 5000), and
                        bucketwriters are only removed when you
                        create the 5001st writer (5001st unique
                        path). However, generally once a writer is
                        closed it is never used again (all of your
                        01-17 writers will never be used again). To
                        avoid keeping them in the sink's internal
                        list of writers, idleTimeout specifies a
                        number of seconds during which no data is
                        received by the BucketWriter. After this
                        time, the writer will try to close itself and
                        will then tell the sink to remove it, thus
                        freeing up everything used by the
                        bucketwriter.

                        So the idleTimeout is just a setting to help
                        limit memory usage by the HDFS sink. The
                        ideal time for it is longer than the maximum
                        time between events (capped at the
                        rollInterval) - if you know you'll receive a
                        constant stream of events, you might just set
                        it to a minute or so. Or if you are fine with
                        having multiple files open per hour, you can
                        set it to a lower number, maybe just over the
                        average time between events. For me, in
                        testing, I set it >= rollInterval for the
                        cases when no events are received in a given
                        hour (I'd rather keep the object alive for an
                        extra hour than create files every 30 minutes
                        or something).
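
                        In config terms, something like this (values
                        illustrative):

                        a1.sinks.k1.hdfs.rollInterval = 3600
                        # >= rollInterval, so an eventless hour
                        # doesn't leave extra files behind
                        a1.sinks.k1.hdfs.idleTimeout = 3600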

                        Hope that was helpful,

                        - Connor


                        On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V.
                        Karambelkar
                        <[email protected]> wrote:

                            Say if I have

                            a1.sinks.k1.hdfs.path =
                            /flume/events/%Y-%m-%d/

                            hdfs.rollInterval=60

                            Now, if there is a file
                            /flume/events/2013-01-17/flume_XXXXXXXXX.tmp
                            that is not ready to be rolled over yet
                            (i.e. 60 seconds are not up), and it's
                            now past 12 midnight, i.e. a new day, and
                            events start to be written to
                            /flume/events/2013-01-18/flume_XXXXXXXX.tmp

                            Will the 2013-01-17 file never be rolled
                            over, unless I have something like
                            hdfs.idleTimeout=60 ?
                            If so, how do Flume sinks keep track of
                            files they need to roll over after
                            idleTimeout?

                            In short, what's the exact use of the
                            idleTimeout parameter?








