[ https://issues.apache.org/jira/browse/HIVE-15682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857172#comment-15857172 ]
Xuefu Zhang edited comment on HIVE-15682 at 2/8/17 1:23 AM:
------------------------------------------------------------

Hi [~Ferd], when I ran the query, I had two days' data, which is about 25m rows. I just ran the query again with about 10 days' data; the runtime is about 600s with 130m rows. I have 32 executors, each having 4 cores. The query spends most of its time in the second stage, where sorting via a single reducer occurs. I don't think the scale matters much as long as the query runs for some time (at least a few minutes). Thus, you should be able to use TPC-DS (or its alternatives) data for this exercise.


was (Author: xuefuz):
Hi [~Ferd], when I ran the query, I had two days' data, which is about 25m rows. I just ran the query again with about 10 days' data; the runtime is about 600s with 130m rows. I have 32 executors, each having 4 cores. The query spends most of its time in the second stage, where sorting via a single reducer occurs. I don't think the scale matters much as long as the query runs for some time (at least a few minutes). Thus, you should be able to use TPC-DS data for this exercise.

> Eliminate per-row based dummy iterator creation
> -----------------------------------------------
>
>                 Key: HIVE-15682
>                 URL: https://issues.apache.org/jira/browse/HIVE-15682
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>    Affects Versions: 2.2.0
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>             Fix For: 2.2.0
>
>         Attachments: HIVE-15682.patch
>
>
> HIVE-15580 introduced a dummy iterator per input row, which can be eliminated.
> This is because {{SparkReduceRecordHandler}} is able to handle single key-value
> pairs. We can refactor this part of the code 1. to remove the need for an
> iterator and 2. to optimize the code path for per-(key, value) processing
> (instead of (key, value iterator) processing). It would also be great if we can
> measure the performance after the optimizations and compare it to the
> performance prior to HIVE-15580.
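
For illustration only, the refactoring described in the issue could be sketched roughly as below. This is not the code in HIVE-15682.patch and it does not use Hive's real {{SparkReduceRecordHandler}} API; the class and method names (RecordHandler, processValues, processValue) are hypothetical stand-ins. It only contrasts wrapping each value in a single-element iterator with handing the (key, value) pair to the handler directly.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch; names do not correspond to Hive's actual classes.
public class PerRowProcessingSketch {

    /** Stand-in for a reduce-side handler that accepts either input shape. */
    static class RecordHandler {
        final List<String> seen = new ArrayList<>();

        // (key, value iterator) path: drains the iterator one value at a time.
        void processValues(String key, Iterator<String> values) {
            while (values.hasNext()) {
                processValue(key, values.next());
            }
        }

        // (key, value) path: handles a single pair, no iterator involved.
        void processValue(String key, String value) {
            seen.add(key + "=" + value);
        }
    }

    // "Before" shape: every row allocates a one-element list and an iterator
    // just to satisfy the iterator-based signature.
    static List<String> withDummyIterator(List<String[]> rows) {
        RecordHandler handler = new RecordHandler();
        for (String[] kv : rows) {
            handler.processValues(kv[0], Collections.singletonList(kv[1]).iterator());
        }
        return handler.seen;
    }

    // "After" shape: each pair goes straight to the per-(key, value) method.
    static List<String> direct(List<String[]> rows) {
        RecordHandler handler = new RecordHandler();
        for (String[] kv : rows) {
            handler.processValue(kv[0], kv[1]);
        }
        return handler.seen;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"a", "1"});
        rows.add(new String[] {"b", "2"});
        // Both paths should observe the same records in the same order.
        System.out.println(withDummyIterator(rows).equals(direct(rows)));
    }
}
```

Both paths produce the same records; the direct one simply skips two short-lived allocations per row. Whether that is measurable on a real query is exactly what the benchmark discussed above is meant to show.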