We are an ad tech company that buys and sells digital media. To date, we have been using Apache Flume 1.4.x to ingest all of our bid request, response, impression, and attribution data.
The logs currently 'roll' hourly for each data type, meaning that at some point during each hour (if Flume is behaving) the tmp file in HDFS is closed and renamed, and a new one is opened. This happens on each of our 5 running Flume instances. One ongoing challenge has been bounding our queries tightly enough to capture all of the data for a given interval without reading in far more than we need. Our directory structure (all in UTC) for each data type is:

/datatype/yr=2014/mo=06/d=15/{files}

The problem is that Flume is not perfect:

1) It can and often will write data that arrived on the new UTC day into the previous day's partition if that log file has not rolled yet.

2) Since files do not roll exactly at the top of each hour, we are having trouble determining the best way to tightly bound a query for data within a small [3-6] hour window.

3) When we do data rollups in timezones other than UTC, we end up reading all of the data for both UTC days containing that data, just to be on the safe side. It would be nice to bound this as described in (2).

A major contributor to the first two cases is that Flume sometimes gets 'stuck': data hangs out in the file channel for longer than we anticipate.

Anyway, I was just wondering how others have approached these problems. If not for the edge cases where data can get stuck in Flume, I think this would be straightforward.
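To make (2) and (3) concrete, here is a minimal sketch of the kind of bounded, safe-side scan I have in mind, written in Python purely for illustration. The function names and the 2-hour slop value are mine, not anything from our actual stack: the idea is to widen the requested window by a slop interval (to cover late-rolling files and data stuck in the file channel), enumerate only the UTC day partitions that the widened window touches, and then filter each record on its own event timestamp rather than on the partition it happened to land in.

from datetime import datetime, timedelta, timezone

# Hypothetical helper: given an aware [start, end) window in any timezone,
# return the UTC day partitions to scan. The window is padded by 'slop'
# in both directions: backwards because a not-yet-rolled file can hold
# new-day events in the previous day's partition, and forwards because
# events stuck in the file channel can be flushed into a later one.
def partitions_for_window(start, end, slop=timedelta(hours=2)):
    lo = (start - slop).astimezone(timezone.utc)
    hi = (end + slop).astimezone(timezone.utc)
    day, paths = lo.date(), []
    while day <= hi.date():
        paths.append("/datatype/yr=%04d/mo=%02d/d=%02d"
                     % (day.year, day.month, day.day))
        day += timedelta(days=1)
    return paths

# After reading the padded partitions, keep only the records whose own
# event timestamp falls inside the original (unpadded) window.
def in_window(event_ts, start, end):
    return start <= event_ts < end

# Example: a 6-hour rollup in a non-UTC timezone (a fixed -4:00 offset
# stands in for US Eastern here).
eastern = timezone(timedelta(hours=-4))
start = datetime(2014, 6, 15, 0, 0, tzinfo=eastern)
end = start + timedelta(hours=6)
print(partitions_for_window(start, end))

The obvious weakness, and really the heart of my question, is picking 'slop': too small and a stuck file channel silently drops data from the rollup; too large and we are back to reading both UTC days every time.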