I've been using Hive in production for two months now. We're mainly using it for processing server logs, about 1-2GB per day (2-2.5 million requests). Typically we import a day's worth of logs at once. That said, sometimes we decide to tweak a calculated column. When that happens, we modify our transformation script and re-import the entire set of logs (~200 days) into ~600 partitions.
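One way to sanity-check an import like this is to count the raw log lines per day and compare them with the row counts Hive reports for the same days. The following is only a minimal sketch under assumptions: the function names are hypothetical, the day keys are whatever the partitions are keyed on, and the imported counts are assumed to come from a per-day count query against the Hive table, exported by some other means.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_source_rows(lines_by_day: Iterable[Tuple[str, Iterable[str]]]) -> Dict[str, int]:
    """Count non-empty raw log lines per day.

    `lines_by_day` yields (day, lines) pairs; a day may appear more than once
    (for example, several log files per day) and its counts are summed.
    """
    counts: Counter = Counter()
    for day, lines in lines_by_day:
        counts[day] += sum(1 for line in lines if line.strip())
    return dict(counts)


def find_mismatches(
    source: Dict[str, int], imported: Dict[str, int]
) -> Dict[str, Tuple[int, int]]:
    """Return {day: (source_count, imported_count)} for days whose counts differ.

    A day missing from either side is treated as having a count of 0, so
    partitions that were created but left empty show up as mismatches.
    """
    days = set(source) | set(imported)
    return {
        day: (source.get(day, 0), imported.get(day, 0))
        for day in sorted(days)
        if source.get(day, 0) != imported.get(day, 0)
    }


if __name__ == "__main__":
    # Hypothetical data: two days of raw logs, one of which imported as empty.
    source = count_source_rows([
        ("day-001", ["GET /a", "GET /b", ""]),
        ("day-002", ["GET /c"]),
    ])
    imported = {"day-001": 2, "day-002": 0}  # assumed output of a per-day count in Hive
    for day, (expected, actual) in find_mismatches(source, imported).items():
        print(f"{day}: expected {expected} rows, Hive has {actual}")
```

A check like this only confirms that row counts match; it does not show that the calculated columns were computed correctly.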
A few days ago I noticed that simple queries, such as a count of page views over a given week, were returning results up to 10% higher than the same queries had returned just a week earlier. I suspected that we might have "found" unprocessed log files, so I set up a script to re-import the entire inventory of logs and re-run the queries. I got identical results for some weeks, but different results for others. When I repeated the experiment, the results changed yet again.
In the course of this I found that sometimes Hive will create all of the partitions but write no data to them, without reporting any errors in the job tracker. Other times it will fail and leave a stack trace blaming a broken pipe.
Does anyone have any idea what I may be doing wrong? I'm happy to change our practices in whatever way is needed; all I want is confidence that all of my data has been properly imported.
Thanks,
Tim