We are an ad tech company that buys and sells digital media. To date, we have been using Apache Flume 1.4.x to ingest all of our bid request, response, impression, and attribution data.
The logs currently 'roll' hourly for each data type, meaning that at some point during each hour (if Flume is behaving) the tmp file in HDFS is closed and renamed, and a new one is opened. This happens on each of our 5 running Flume instances. One ongoing challenge has been bounding our queries tightly enough to capture all of the data for a given interval without reading in far more than we need. Our directory structure (all in UTC) for each data type is:

/datatype/yr=2014/mo=06/d=15/{files}

The problem is that Flume is not perfect:

1) It can and often will write data that arrived on the new UTC day into the previous day's partition if that log file has not rolled yet.

2) Since files do not roll exactly at the top of each hour, we are having trouble determining the best way to tightly bound a query for data within a small [3-6] hour window.

3) When we do data rollups in timezones other than UTC, we end up reading all of the data for both UTC days containing that data, just to be on the safe side. It would be nice to bound this as described in (2).

A major contributor to the first two cases is that Flume sometimes gets 'stuck': data hangs out in the file channel for longer than we anticipate.

Anyway, I was just wondering how others have approached these problems. If not for the edge cases where data can get stuck in Flume, I think this would be straightforward.
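To make (2) and (3) concrete, here is a minimal sketch of the kind of bounded, safe-side scan I have in mind, written in Python purely for illustration. The function names and the 2-hour slop value are mine, not anything from our actual stack: the idea is to widen the requested window by a slop interval (to cover late-rolling files and data stuck in the file channel), enumerate only the UTC day partitions that the widened window touches, and then filter each record on its own event timestamp rather than on the partition it happened to land in.

from datetime import datetime, timedelta, timezone

# Hypothetical helper: given an aware [start, end) window in any timezone,
# return the UTC day partitions to scan. The window is padded by 'slop'
# in both directions: backwards because a not-yet-rolled file can hold
# new-day events in the previous day's partition, and forwards because
# events stuck in the file channel can be flushed into a later one.
def partitions_for_window(start, end, slop=timedelta(hours=2)):
    lo = (start - slop).astimezone(timezone.utc)
    hi = (end + slop).astimezone(timezone.utc)
    day, paths = lo.date(), []
    while day <= hi.date():
        paths.append("/datatype/yr=%04d/mo=%02d/d=%02d"
                     % (day.year, day.month, day.day))
        day += timedelta(days=1)
    return paths

# After reading the padded partitions, keep only the records whose own
# event timestamp falls inside the original (unpadded) window.
def in_window(event_ts, start, end):
    return start <= event_ts < end

# Example: a 6-hour rollup in a non-UTC timezone (a fixed -4:00 offset
# stands in for US Eastern here).
eastern = timezone(timedelta(hours=-4))
start = datetime(2014, 6, 15, 0, 0, tzinfo=eastern)
end = start + timedelta(hours=6)
print(partitions_for_window(start, end))

The obvious weakness, and really the heart of my question, is picking 'slop': too small and a stuck file channel silently drops data from the rollup; too large and we are back to reading both UTC days every time.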