[ https://issues.apache.org/jira/browse/HIVE-16223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15926851#comment-15926851 ]
Sergey Shelukhin commented on HIVE-16223:
-----------------------------------------

cc [~ekoifman] [~ashutoshc]

> deterministic file naming for bucketing in Hive
> -----------------------------------------------
>
>                 Key: HIVE-16223
>                 URL: https://issues.apache.org/jira/browse/HIVE-16223
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>
> Bucketing in Hive is currently very fragile.
> 1) Some places determine bucket number from file name.
> 2) Some places determine bucket number from a file's "index" in a sorted list
> of files in the directory.
> 3) It is possible to import files into a bucketed table without any regard
> for either.
> On top of that, weird rename paths (like _copy_1), subdirectories (e.g. from
> Tez union, or just tables read with recursive input enabled), repeated
> inserts into the same table, etc. can mess with either scheme.
> Therefore I propose we include bucket index and count explicitly in the file
> name (e.g. 000003_0_bucket_3of32). It will alleviate the above, and also may
> simplify some pieces of code that try to account for missing bucket files,
> multiple files, etc.
> This will require changes to load table logic that is used in ctas, insert,
> load, import etc.; change in logic when getting buckets, as well as when
> altering table bucketing (to rename the files).
> Users will still be able to use old-style bucketing by specifying a
> non-strict config setting (not on by default).
> The conversion of existing tables is the biggest issue. Perhaps the existing
> tables can be "grandfathered" into the non-strict bucketing, with some
> warnings asking the users to convert, and a command to do so in alter
> table/analyze table.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
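
To make the proposed naming concrete, here is a rough sketch of how a reader of the directory might recover the bucket index and count from a name like 000003_0_bucket_3of32. This is not Hive code: the function name parse_bucket, the regular expression, and the 0-based index check are all assumptions drawn only from the single example in the issue, and the issue does not say how renamed copies (such as _copy_1) would be named under the new scheme.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern, inferred only from the example name in the issue
# (000003_0_bucket_3of32). It is not an existing Hive file-name format.
_BUCKET_SUFFIX = re.compile(r"_bucket_(\d+)of(\d+)$")


def parse_bucket(file_name: str) -> Optional[Tuple[int, int]]:
    """Return (bucket_index, bucket_count) from an explicitly named file.

    Returns None for old-style names that carry no explicit suffix, so a
    caller could fall back to non-strict handling for those files.
    """
    match = _BUCKET_SUFFIX.search(file_name)
    if match is None:
        return None
    index, count = int(match.group(1)), int(match.group(2))
    # Assumption: bucket indexes are 0-based, so a valid index is < count.
    if count <= 0 or index >= count:
        raise ValueError(f"inconsistent bucket suffix in {file_name!r}")
    return index, count


if __name__ == "__main__":
    print(parse_bucket("000003_0_bucket_3of32"))  # explicit name
    print(parse_bucket("000003_0"))               # old-style name
```

With the index and count both in the name, a reader no longer needs the file's position in a sorted listing, and a file whose suffix disagrees with the table's bucket count can be rejected instead of being silently misread.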