Re: does anyone care about list bucketing stored as directories?

Sergey Shelukhin Mon, 09 Oct 2017 13:07:26 -0700

Ok, here’s a synopsis that is hopefully clearer.
1) LB, when stored as directories, adds a lot of low-level complexity to Hive 
tables that has to be accounted for in many places in the code where the files 
are written or modified - from FSOP to ACID/replication/export.
2) While working on some FSOP code I noticed that some of that logic appears 
to be broken - e.g. the removal of duplicate files from tasks, a pretty 
fundamental correctness feature in Hive, may not work. LB also doesn’t appear 
to be compatible with e.g. regular bucketing.
3) The feature hasn’t seen development activity in a while; it also doesn’t 
appear to be used a lot.


Keeping with the theme of cleaning up “legacy” code for 3.0, I was proposing we 
remove it.

(2) also suggested that, if needed, it might be easier to implement similar 
functionality by adding some flexibility to partitions (which LB directories 
look like anyway); that would also keep the logic on a higher level of 
abstraction (split generation, partition pruning) as opposed to many low-level 
places like FSOP, etc.



From: Xuefu Zhang <xu...@apache.org>
Date: Sunday, October 8, 2017 at 20:56
To: "dev@hive.apache.org" <dev@hive.apache.org>
Cc: "u...@hive.apache.org" <u...@hive.apache.org>, Sergey Shelukhin 
<ser...@hortonworks.com>
Subject: Re: does anyone care about list bucketing stored as directories?

Lack of a response doesn't necessarily mean "don't care". Maybe you can 
provide a good description of the problem and the proposed solution. Frankly, 
I cannot make much sense of the previous email.

Thanks,
Xuefu

On Fri, Oct 6, 2017 at 5:05 PM, Sergey Shelukhin <ser...@hortonworks.com> 
wrote:
Looks like nobody does… I’ll file a ticket to remove it shortly.

From: Sergey Shelukhin <ser...@hortonworks.com>
Date: Tuesday, October 3, 2017 at 12:59
To: "u...@hive.apache.org" <u...@hive.apache.org>, "dev@hive.apache.org" 
<dev@hive.apache.org>
Subject: does anyone care about list bucketing stored as directories?

1) There seem to be some bugs and limitations in LB (e.g. incorrect cleanup - 
https://issues.apache.org/jira/browse/HIVE-14886) and nobody appears to so much 
as watch the JIRAs ;) Does anyone actually use this stuff? Should we nuke it in 
3.0, and by 3.0 I mean I’ll remove it from master in a few weeks? :)

2) I actually wonder, on top of the same SQL syntax, wouldn’t it be much easier 
to add logic to partitioning to write skew values into partitions and non-skew 
values into a new type of default partition? It won’t affect nearly as many low 
level codepaths in obscure and unobvious ways, instead keeping all the logic in 
metastore and split generation, and would integrate with Hive features like PPD 
automatically.
Esp. if we are ok with the same limitations - e.g. if you add a new skew value 
right now, I’m not sure what happens to the rows with that value already 
sitting in the non-skew directories, but I don’t expect anything reasonable...
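
To make (2) a bit more concrete, here is a minimal Python sketch of the routing 
idea. It is an illustration only, not Hive code: the names 
DEFAULT_SKEW_PARTITION, partition_for, write_rows and partitions_to_scan are 
all hypothetical, and the pruning behavior shown is an assumption about a naive 
implementation, not a description of what Hive does today.

```python
from collections import defaultdict

# Hypothetical name for the "new type of default partition" holding non-skew rows.
DEFAULT_SKEW_PARTITION = "__default_skew__"


def partition_for(value, skew_values):
    """Pick the partition a row with this key value would be written to."""
    if value in skew_values:
        return f"key={value}"
    return DEFAULT_SKEW_PARTITION


def write_rows(rows, skew_values, storage=None):
    """Route each row into its partition; storage maps partition name -> rows."""
    if storage is None:
        storage = defaultdict(list)
    for row in rows:
        storage[partition_for(row["key"], skew_values)].append(row)
    return storage


def partitions_to_scan(wanted_value, skew_values):
    """Naive pruning for `key = wanted_value` (assumed, not Hive's actual logic)."""
    return [partition_for(wanted_value, skew_values)]
```

With this sketch, writing keys 1, 7, 1 while the skew list is {1}, then adding 
7 to the skew list and writing another 7, leaves the first 7 row in the default 
partition while partitions_to_scan(7, {1, 7}) points only at key=7. That is the 
same kind of limitation mentioned above for rows already sitting in the 
non-skew directories.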
