We use com.facebook.hive.udf.UDFNumberRows to rank rows by time in some of
our queries. You could do that, then run another SELECT that keeps only the
rows where the row number/rank is 1 to get all the "unique" rows.
There are probably a bunch of other ways to do this, but this is the one that
first came to mind.
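As a rough sketch of that pattern (table and column names here — events, id, event_time — are invented for illustration, the jar path is an assumption, and I'm assuming the UDF restarts its count whenever the value of its argument changes, which is why the DISTRIBUTE BY / SORT BY clauses are needed):

```sql
-- Register the Facebook UDF (jar path is hypothetical)
ADD JAR /path/to/facebook-udfs.jar;
CREATE TEMPORARY FUNCTION numrows AS 'com.facebook.hive.udf.UDFNumberRows';

-- Keep only the newest row per id: rank within each id by time,
-- then filter to rank 1 in the outer query.
SELECT id, event_time
FROM (
  SELECT id, event_time,
         numrows(id) AS rn      -- counter restarts when id changes
  FROM events
  DISTRIBUTE BY id              -- send each id to one reducer
  SORT BY id, event_time DESC   -- newest row first within each id
) ranked
WHERE rn = 1;
```

This is the usual workaround from before Hive had built-in windowing functions; on a recent Hive you could use ROW_NUMBER() OVER (...) instead.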
Hive has no UPDATE or DELETE statements.
You can drop a table, and that is as close to a delete as you get.
The only "update" you get is to write more data to a table. There is INSERT
OVERWRITE & INSERT. The first replaces the table's existing rows with a new
set, the second will append more data.
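For example (table names here are made up; note that plain INSERT INTO only exists in Hive 0.8 and later — older versions only have the OVERWRITE form):

```sql
-- Replace whatever is currently in daily_summary with a fresh result set:
INSERT OVERWRITE TABLE daily_summary
SELECT dt, COUNT(*) FROM raw_events GROUP BY dt;

-- Append additional rows, leaving the existing data in place:
INSERT INTO TABLE daily_summary
SELECT dt, COUNT(*) FROM late_events GROUP BY dt;
```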
We do a similar process with our log files in Hive. We only handle 30 to 60
files (similar structure) at a time, but it sounds like it would fit your
model.
We create an external table, then do hdfs puts to add the files to the table:
CREATE EXTERNAL TABLE log_import(
date STRING,
time STRING,