> If I have a series of entries that look like ...
> { "update", {"baz" : "bar" }}

Due to the way the split distribution works, you need a global ordering key for each operation.

0, "ADD", "baz", ""
1, "SET", "baz", "bar"
2, "DEL", "baz", null

If no two updates arrive within the same second, you could store a timestamp and use it as that ordering key. Then you can write a windowing function for Hive to merge/order them.

select flatten_txns(op, key, value) over (partition by key order by ts) from txns;

At this point, you're nearly reinventing what Hive's own insert/update/delete statements do. Except that, compared to those, these updates are faster (since each one is really an unconditional SET).

Cheers,
Gopal
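
For what it's worth, the merge step described above can be sketched outside Hive. This is only a rough illustration in Python, not the actual windowing function: the semantics assumed here for flatten_txns (replay the operations in ordering-key order, with ADD/SET as unconditional overwrites and DEL as a removal) are a guess at the intent, and the "foo" row is made-up sample data.

```python
from operator import itemgetter

# Rows are (seq, op, key, value); seq is the global ordering key.
# The "baz" rows mirror the example above (deliberately out of order);
# the "foo" row is hypothetical.
TXNS = [
    (2, "DEL", "baz", None),
    (0, "ADD", "baz", ""),
    (1, "SET", "baz", "bar"),
    (3, "SET", "foo", "x"),
]


def flatten_txns(rows):
    """Replay operations in seq order and return the surviving key/value pairs.

    Assumed semantics: ADD and SET both overwrite unconditionally,
    DEL removes the key (and is a no-op if the key is absent).
    """
    state = {}
    # Sorting on the global key also orders each key's operations,
    # which is what "partition by key order by ts" would give per key.
    for seq, op, key, value in sorted(rows, key=itemgetter(0)):
        if op in ("ADD", "SET"):
            state[key] = value
        elif op == "DEL":
            state.pop(key, None)
        else:
            raise ValueError(f"unknown op: {op!r}")
    return state


if __name__ == "__main__":
    print(flatten_txns(TXNS))
```

Because every operation is an unconditional write, only the ordering matters: the last operation per key decides the outcome, and no read of the previous value is needed.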