Can you send your config? There are a couple of params that allow the files to be rolled faster - idleTimeout and rollInterval. I am assuming you are using rollInterval already. idleTimeout will close a file when it is not written to for the configured time. That might help with the rolling. Remember though that if events arrive "late" for a bucket due to failures or network issues, new files will be opened in that bucket if none are currently open.
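For reference, a roll/idle setup on the HDFS sink looks roughly like the sketch below. This is only an illustration, not your actual config - the agent, sink, and channel names (agent1, hdfs-sink, file-ch), the namenode host, and the values are placeholders:

    # HDFS sink with time-based rolling (values are examples only)
    agent1.sinks.hdfs-sink.type = hdfs
    agent1.sinks.hdfs-sink.channel = file-ch

    # Bucket by the event's timestamp header, matching a yr=/mo=/d= layout
    agent1.sinks.hdfs-sink.hdfs.path = hdfs://namenode/datatype/yr=%Y/mo=%m/d=%d

    # Roll the open .tmp file at most every 15 minutes...
    agent1.sinks.hdfs-sink.hdfs.rollInterval = 900
    # ...and close it sooner if nothing is written to it for 5 minutes,
    # so files in a finished bucket do not stay open indefinitely
    agent1.sinks.hdfs-sink.hdfs.idleTimeout = 300

    # Disable size- and count-based rolling so only the time settings apply
    agent1.sinks.hdfs-sink.hdfs.rollSize = 0
    agent1.sinks.hdfs-sink.hdfs.rollCount = 0

Both rollInterval and idleTimeout are in seconds, and idleTimeout is disabled (0) by default. The %Y/%m/%d escapes bucket by the event timestamp, which is also why late-arriving events will reopen a file in an older bucket.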
On Tue, Jul 29, 2014 at 7:29 AM, Gary Malouf <malouf.g...@gmail.com> wrote:
> We are an ad tech company that buys and sells digital media. To date, we
> have been using Apache Flume 1.4.x to ingest all of our bid request,
> response, impression and attribution data.
>
> The logs currently 'roll' hourly for each data type, meaning that at some
> point during each hour (if Flume is behaving) the tmp file in HDFS is
> closed/renamed with a new one being opened. This is done for each of 5
> running Flume instances.
>
> One problem that has been a challenge to date is effectively bounding our
> data queries to make sure we capture all of the data for a given interval
> without pulling in the world. To date, our structure (all in UTC) for each
> data type is:
>
> /datatype/yr=2014/mo=06/d=15/{files}
>
> The challenge for us is that Flume is not perfect.
>
> 1) It can and will often write data that came in on the new UTC day into
> the previous one if that log file has not rolled yet.
>
> 2) Since it does not roll perfectly at the top of each hour, we are having
> trouble determining the best way to tightly bound a query for data that is
> within a few [3-6] hour window.
>
> 3) When we are doing data rollups in timezones other than UTC, we end up
> reading in all of the data for both UTC days containing that data to be on
> the safe side. It would be nice to bound this as described in (2).
>
> One of the major problems affecting the first two cases is that Flume
> sometimes gets 'stuck' - that is, the data will hang out in the file
> channel for longer than we anticipate.
>
> Anyway, I was just wondering how others have approached these problems to
> date. If not for the edge cases when data can get stuck in Flume, I think
> this would be straightforward.