[ https://issues.apache.org/jira/browse/HIVE-7956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122735#comment-14122735 ]
Rui Li commented on HIVE-7956:
------------------------------

Hi [~brocknoland] & [~xuefuz],

The problem is that {{RowContainer}} spills records to disk as we collect the output from the mappers. Since we're using {{BytesWritable}} as the key type, the key is read back as {{BytesWritable}} (when it should be {{HiveKey}}, which holds the proper hash code) by the time Spark tries to partition the data to the reducers. And if I copy the key as {{HiveKey}} instead, then when Spark does the partitioning, some keys are {{BytesWritable}} (those spilled to disk) while others are {{HiveKey}} (those still cached in {{RowContainer}}), so the partitioning really gets messed up.

Changing our key type to {{HiveKey}} requires changing {{SparkTran}}, which means a lot of refactoring of the current code. Even if we could use {{HiveKey}} as the key, we may also need to add {{write}} and {{readFields}} to {{HiveKey}} in order to keep the extra hash code (see the sketch at the end of this message). I doubt such a change is acceptable.

Lazy computation (HIVE-7873) may help mitigate the issue by reducing the chance that a record is spilled to disk, so that Spark can (possibly) partition the records by {{HiveKey}}.

Any ideas on this?

> When inserting into a bucketed table, all data goes to a single bucket [Spark Branch]
> -------------------------------------------------------------------------------------
>
>                 Key: HIVE-7956
>                 URL: https://issues.apache.org/jira/browse/HIVE-7956
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>            Reporter: Rui Li
>            Assignee: Rui Li
>
> I created a bucketed table:
> {code}
> create table testBucket(x int,y string) clustered by(x) into 10 buckets;
> {code}
> Then I run a query like:
> {code}
> set hive.enforce.bucketing = true;
> insert overwrite table testBucket select intCol,stringCol from src;
> {code}
> Here {{src}} is a simple textfile-based table containing 40,000,000 records
> (not bucketed). The query launches 10 reduce tasks, but all the data goes to
> only one of them.
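
For illustration, here is a minimal, hypothetical sketch of the {{write}}/{{readFields}} idea mentioned above. {{SpillableHiveKey}} is a made-up name, not the real {{HiveKey}} (whose serialization, inherited from {{BytesWritable}}, carries only the key bytes); it just shows how the hash code could be written next to the key bytes so it survives a spill:

{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;

/**
 * Hypothetical sketch only -- not the actual HiveKey. It shows how
 * write()/readFields() could persist the partitioning hash code along
 * with the key bytes, so the hash survives a spill to disk.
 */
public class SpillableHiveKey extends BytesWritable {

  private int hashCode;           // hash used to pick the reducer
  private boolean hashCodeValid;  // false until set or restored

  public void setHashCode(int hashCode) {
    this.hashCode = hashCode;
    this.hashCodeValid = true;
  }

  @Override
  public int hashCode() {
    if (!hashCodeValid) {
      throw new IllegalStateException("Hash code not set or not restored");
    }
    return hashCode;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    super.write(out);        // BytesWritable writes length + key bytes
    out.writeInt(hashCode);  // extra 4 bytes: persist the hash code too
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    super.readFields(in);    // restore the key bytes
    hashCode = in.readInt(); // restore the hash code after a spill
    hashCodeValid = true;
  }
}
{code}

The cost is four extra bytes per spilled key, and anything that reads {{HiveKey}}'s current on-disk layout would have to understand the new format, which is part of why I doubt such a change would be acceptable.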