[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark

Hive QA (JIRA) Wed, 11 Jan 2017 03:00:32 -0800

    [ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15817998#comment-15817998 ]


Hive QA commented on HIVE-15527:
--------------------------------



Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12846750/HIVE-15527.7.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 19 failed/errored test(s), 10766 tests executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=233)
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=101)
        
[limit_pushdown2.q,skewjoin_noskew.q,leftsemijoin_mr.q,bucket3.q,skewjoinopt13.q,bucketmapjoin9.q,auto_join15.q,ptf.q,join22.q,vectorized_nested_mapjoin.q,sample4.q,union18.q,multi_insert_gby.q,join33.q,join_cond_pushdown_unqual2.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=108)
        
[union_remove_1.q,ppd_outer_join2.q,date_udf.q,groupby1_noskew.q,join20.q,smb_mapjoin_13.q,groupby_rollup1.q,temp_table_gb1.q,vector_string_concat.q,smb_mapjoin_6.q,metadata_only_queries.q,auto_sortmerge_join_12.q,groupby_bigdata.q,groupby3_map_multi_distinct.q,innerjoin.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=114)
        
[escape_distributeby1.q,join9.q,groupby2.q,groupby4_map.q,udf_max.q,vectorization_pushdown.q,cbo_gby_empty.q,join_cond_pushdown_unqual3.q,vectorization_short_regress.q,join8.q,sample10.q,cross_product_check_1.q,auto_join_stats.q,input_part2.q,groupby_multi_single_reducer3.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=115)
        
[groupby_map_ppr_multi_distinct.q,vectorization_13.q,mapjoin_mapjoin.q,union2.q,join41.q,groupby8_map.q,cbo_subq_not_in.q,identity_project_remove_skip.q,stats5.q,groupby8_map_skew.q,nullgroup2.q,mapjoin_subquery.q,bucket2.q,smb_mapjoin_1.q,union_remove_8.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=117)
        
[timestamp_lazy.q,union29.q,runtime_skewjoin_mapjoin_spark.q,auto_join22.q,union8.q,groupby5_map.q,dynamic_rdd_cache.q,auto_join29.q,groupby6.q,merge1.q,mapjoin_distinct.q,vector_decimal_mapjoin.q,sample5.q,multi_insert_move_tasks_share_dependencies.q,join_array.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=119)
        
[groupby4_noskew.q,groupby3_map_skew.q,join_cond_pushdown_2.q,union19.q,union24.q,union_remove_5.q,groupby7_noskew_multi_single_reducer.q,vectorization_1.q,index_auto_self_join.q,auto_smb_mapjoin_14.q,script_env_var2.q,pcr.q,auto_join_filters.q,join0.q,join37.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=120)
        
[stats12.q,groupby4.q,union_top_level.q,stats2.q,groupby10.q,mapjoin_filter_on_outerjoin.q,auto_sortmerge_join_4.q,limit_partition_metadataonly.q,load_dyn_part4.q,union3.q,groupby_multi_single_reducer.q,smb_mapjoin_14.q,groupby3_noskew_multi_distinct.q,stats18.q,union_remove_21.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=123)
        
[groupby_complex_types.q,multigroupby_singlemr.q,union11.q,groupby7.q,join5.q,bucketmapjoin_negative2.q,vectorization_div0.q,union_script.q,add_part_multiple.q,limit_pushdown.q,union_remove_17.q,uniquejoin.q,metadata_only_queries_with_filters.q,union25.q,load_dyn_part13.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=126)
        
[smb_mapjoin_15.q,script_pipe.q,auto_join24.q,filter_join_breaktask.q,bucket4.q,ppd_multi_insert.q,skewjoinopt20.q,join_thrift.q,multi_insert_gby3.q,groupby8.q,join_map_ppr.q,auto_sortmerge_join_8.q,escape_clusterby1.q,groupby_multi_insert_common_distinct.q,join6.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=131)
        
[join2.q,join36.q,avro_joins_native.q,join18.q,smb_mapjoin_10.q,temp_table.q,union_remove_13.q,auto_sortmerge_join_5.q,groupby5_noskew.q,auto_join0.q,vectorization_17.q,auto_join_stats2.q,skewjoin_union_remove_1.q,union16.q,join_literals.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=94)
        
[bucketmapjoin4.q,bucket_map_join_spark4.q,union21.q,groupby2_noskew.q,timestamp_2.q,date_join1.q,mergejoins.q,smb_mapjoin_11.q,auto_sortmerge_join_3.q,mapjoin_test_outer.q,vectorization_9.q,merge2.q,groupby6_noskew.q,auto_join_without_localtask.q,multi_join_union.q]
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=98)
        
[groupby_map_ppr.q,nullgroup4_multi_distinct.q,join_rc.q,union14.q,smb_mapjoin_12.q,vector_cast_constant.q,union_remove_4.q,auto_join11.q,load_dyn_part7.q,udaf_collect_set.q,vectorization_12.q,groupby_sort_skew_1.q,groupby_sort_skew_1_23.q,smb_mapjoin_25.q,skewjoinopt12.q]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[case_sensitivity] 
(batchId=61)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input_testxpath] 
(batchId=28)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_coalesce] 
(batchId=75)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[schema_evol_text_vec_part]
 (batchId=148)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_varchar_simple]
 (batchId=151)
org.apache.hadoop.hive.cli.TestSparkNegativeCliDriver.org.apache.hadoop.hive.cli.TestSparkNegativeCliDriver
 (batchId=228)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/2880/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/2880/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-2880/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 19 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12846750 - PreCommit-HIVE-Build

> Memory usage is unbound in SortByShuffler for Spark
> ---------------------------------------------------
>
>                 Key: HIVE-15527
>                 URL: https://issues.apache.org/jira/browse/HIVE-15527
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>    Affects Versions: 1.1.0
>            Reporter: Xuefu Zhang
>            Assignee: Chao Sun
>         Attachments: HIVE-15527.0.patch, HIVE-15527.0.patch, 
> HIVE-15527.1.patch, HIVE-15527.2.patch, HIVE-15527.3.patch, 
> HIVE-15527.4.patch, HIVE-15527.5.patch, HIVE-15527.6.patch, 
> HIVE-15527.7.patch, HIVE-15527.patch
>
>
> In SortByShuffler.java, an ArrayList backs the iterator over the values that 
> share a key in the shuffled result produced by the Spark transformation 
> sortByKey. Because every value of a key group is held in that list at once, 
> a large key group can exhaust memory.
> {code}
>             @Override
>             public Tuple2<HiveKey, Iterable<BytesWritable>> next() {
>               // TODO: implement this by accumulating rows with the same key into a list.
>               // Note that this list needs to be improved to prevent excessive memory usage,
>               // but this can be done in a later phase.
>               while (it.hasNext()) {
>                 Tuple2<HiveKey, BytesWritable> pair = it.next();
>                 if (curKey != null && !curKey.equals(pair._1())) {
>                   HiveKey key = curKey;
>                   List<BytesWritable> values = curValues;
>                   curKey = pair._1();
>                   curValues = new ArrayList<BytesWritable>();
>                   curValues.add(pair._2());
>                   return new Tuple2<HiveKey, Iterable<BytesWritable>>(key, values);
>                 }
>                 curKey = pair._1();
>                 curValues.add(pair._2());
>               }
>               if (curKey == null) {
>                 throw new NoSuchElementException();
>               }
>               // if we get here, this should be the last element we have
>               HiveKey key = curKey;
>               curKey = null;
>               return new Tuple2<HiveKey, Iterable<BytesWritable>>(key, curValues);
>             }
> {code}
> Since the output from sortByKey is already sorted on key, it's possible to 
> back the value iterable with the same input iterator instead of a list.
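
The idea in the issue description, serving each key's values straight from the shared sorted input iterator rather than copying them into an ArrayList, can be sketched roughly as below. This is only an illustration and not the code in the attached patches: the class name LazyGroupIterator and the helper describe are hypothetical, and plain java.util.Map.Entry pairs stand in for Tuple2<HiveKey, BytesWritable> so the example runs without Spark or Hive on the classpath. It also assumes non-null keys and input that is already sorted by key.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical sketch: groups key-sorted pairs without buffering a whole group.
class LazyGroupIterator<K, V> implements Iterator<Map.Entry<K, Iterator<V>>> {
  private final Iterator<Map.Entry<K, V>> input;
  private Map.Entry<K, V> pending;   // one-element lookahead, null when input is exhausted
  private K curKey;                  // key of the group handed out most recently
  private Iterator<V> activeGroup;   // value iterator that may still read from input

  LazyGroupIterator(Iterator<Map.Entry<K, V>> sortedInput) {
    this.input = sortedInput;
    advance();
  }

  private void advance() {
    pending = input.hasNext() ? input.next() : null;
  }

  // Skip values of the current group that the caller did not consume.
  private void drainCurrentGroup() {
    if (activeGroup != null) {
      while (pending != null && pending.getKey().equals(curKey)) {
        advance();
      }
      activeGroup = null;
    }
  }

  @Override
  public boolean hasNext() {
    drainCurrentGroup();
    return pending != null;
  }

  @Override
  public Map.Entry<K, Iterator<V>> next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    final K key = pending.getKey();
    curKey = key;
    Iterator<V> values = new Iterator<V>() {
      @Override
      public boolean hasNext() {
        // Valid only while this is the active group and the lookahead has the same key.
        return activeGroup == this && pending != null && pending.getKey().equals(key);
      }

      @Override
      public V next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        V value = pending.getValue();
        advance();
        return value;
      }
    };
    activeGroup = values;
    return new SimpleImmutableEntry<>(key, values);
  }

  // Demo helper: renders every group as key=[values].
  static <K, V> String describe(Iterator<Map.Entry<K, V>> sortedPairs) {
    StringBuilder sb = new StringBuilder();
    Iterator<Map.Entry<K, Iterator<V>>> groups = new LazyGroupIterator<>(sortedPairs);
    while (groups.hasNext()) {
      Map.Entry<K, Iterator<V>> group = groups.next();
      List<V> shown = new ArrayList<>();  // buffered here for display only
      group.getValue().forEachRemaining(shown::add);
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(group.getKey()).append('=').append(shown);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    List<Map.Entry<String, Integer>> pairs = Arrays.asList(
        new SimpleImmutableEntry<>("a", 1),
        new SimpleImmutableEntry<>("a", 2),
        new SimpleImmutableEntry<>("b", 3));
    System.out.println(describe(pairs.iterator()));
  }
}
```

The trade-off in this sketch is that each group's values can be read only once and only until the outer iterator moves on: calling the outer hasNext() or next() discards whatever is left of the current group. Memory use is bounded by a single lookahead pair rather than by the size of the largest key group.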



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
