I've been using Hive in production for two months now.  We're mainly using
it for processing server logs, about 1-2GB per day (2-2.5 million
requests).  Typically we import a day's worth of logs at once.  That said,
sometimes we decide to tweak a calculated column.  When that happens, we
modify our transformation script and re-import the entire set of logs (~200
days) into ~600 partitions.

A few days ago I noticed that simple queries, such as a count of page views
over a given week, were giving results up to 10% higher than they yielded
just a week before.  I suspected that we may have "found" unprocessed log
files, so I set up a script to re-import the entire inventory of logs and
re-run the queries.  I got identical results for some weeks, but
different results for others.  I repeated the experiment and got yet
another set of results.

In the course of this I found that sometimes Hive will create all of the
partitions but write no data to them, without reporting any errors in
the job tracker.  Other times the import fails outright and leaves a
stack trace blaming a broken pipe.

Does anyone have any ideas what I may be doing wrong?  I'm happy to
change our practices in whatever way is needed; all I want is confidence
that all of my data has been properly imported.
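In the meantime, one cross-check I'm considering is tallying raw request
counts per day straight from the log files and comparing them against a
per-partition COUNT(*) from Hive.  Here is a rough sketch; the file
layout, the .log extension, and the date-in-filename convention are
assumptions about our setup, so adjust to taste:

```python
"""Tally raw request counts per day so they can be checked against
Hive's per-partition counts after an import."""
import os
import re
from collections import defaultdict

def count_requests(log_dir):
    """Return {day: line_count} for every *.log file under log_dir.

    Assumes one request per line and a YYYY-MM-DD stamp somewhere in
    the file name (e.g. access-2010-06-01.log).
    """
    day_pat = re.compile(r'(\d{4}-\d{2}-\d{2})')
    counts = defaultdict(int)
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            if not name.endswith('.log'):
                continue
            m = day_pat.search(name)
            if not m:
                continue
            with open(os.path.join(root, name)) as f:
                counts[m.group(1)] += sum(1 for _ in f)
    return dict(counts)
```

The idea would be to diff that dictionary against the output of
something like "SELECT ds, COUNT(*) FROM requests GROUP BY ds" and flag
any day where the two disagree (empty partitions would show up as a
missing day on the Hive side).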
Thanks,
Tim