Re: MapReduce Output File Names

Gopal Vijayaraghavan Fri, 27 Jul 2018 12:35:09 -0700

>    Would it be problematic to simply prefix a random number, or
>    timestamp, on the front of the file name to make it unique?


Bucketed tables rely on the file-name prefix to determine which bucket a file 
belongs to.

So if you have a bucketed table and insert into it twice, you end up with 

0000_0 + 0000_0_Copy_1

which are both logically the 1st bucket (if this is a sorted table, reading them 
back is a sort-merge, not one file after the other).
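
To make the prefix dependency concrete, here is a minimal sketch of mapping a 
file name to its bucket. The regex and the bucket_of helper are illustrative 
assumptions, not Hive's actual parsing code; they only mirror the names used in 
this thread.

```python
import re

# Assumed naming scheme: <bucket>_<attempt>, optionally followed by _Copy_<n>.
# This pattern is hypothetical and only models the names shown in this thread.
_BUCKET_FILE = re.compile(r"^(\d+)_(\d+)(?:_copy_(\d+))?$", re.IGNORECASE)


def bucket_of(file_name):
    """Return the bucket index encoded in the file-name prefix, or None."""
    match = _BUCKET_FILE.match(file_name)
    return int(match.group(1)) if match else None


# Both files from the two inserts resolve to the same bucket.
same_bucket = bucket_of("0000_0") == bucket_of("0000_0_Copy_1")

# A timestamp stuck on the front no longer matches the scheme at all.
prefixed = bucket_of("1532720109_0000_0")
```

Under these assumptions the two inserted files map to the same bucket, while 
the timestamp-prefixed name cannot be attributed to any bucket.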

There's a set of race conditions with the loop that picks the next free 
_Copy_<n> name when it comes to something with weak consistency like S3, which 
is why Hive managed tables have switched to delta_<id>/0000_0 instead of 
_Copy_<n>, starting in Hive 3.0.
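
A rough sketch of why a check-then-name loop is fragile. The next_copy_name 
function is a hypothetical simplification, not the Hive code: it probes a 
directory listing for the first unused name, so two writers that see the same 
(possibly stale) listing pick the same target.

```python
def next_copy_name(base, visible_names):
    """Pick the first unused name among base, base_Copy_1, base_Copy_2, ..."""
    if base not in visible_names:
        return base
    n = 1
    while f"{base}_Copy_{n}" in visible_names:
        n += 1
    return f"{base}_Copy_{n}"


# Two concurrent writers list the directory before either one has written.
# With a weakly consistent store, the listing may also lag behind real writes.
listing_seen_by_a = {"0000_0"}
listing_seen_by_b = {"0000_0"}

name_a = next_copy_name("0000_0", listing_seen_by_a)
name_b = next_copy_name("0000_0", listing_seen_by_b)

# Both writers chose the same file name, so one would clobber the other.
collision = name_a == name_b
```

Nothing in the loop itself prevents the collision; it depends entirely on the 
listing being current and on no one else writing between the check and the 
rename.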

The "id" is actually stored in the table metadata, so that no two queries 
will use the same delta_<id> dir.
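
As a minimal sketch of that idea, assume the id comes from a single counter 
that only the metadata store can advance. TableMetadata and delta_path are 
hypothetical names, and the path follows the simplified delta_<id>/0000_0 form 
used in this thread rather than Hive's exact on-disk layout.

```python
import threading


class TableMetadata:
    """Hypothetical stand-in for the table metadata that hands out ids."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next_id = 1

    def allocate_id(self):
        # The lock makes read-and-increment atomic, so no id is issued twice.
        with self._lock:
            allocated = self._next_id
            self._next_id += 1
            return allocated


def delta_path(meta, bucket_file="0000_0"):
    """Build the output path for one query's write into one bucket file."""
    return f"delta_{meta.allocate_id()}/{bucket_file}"


meta = TableMetadata()
first_insert = delta_path(meta)
second_insert = delta_path(meta)
```

Each insert gets its own directory, so the bucket file name 0000_0 stays 
unchanged and no probing of the file system is needed to avoid a clash.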

Cheers,
Gopal
