[ https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13746530#comment-13746530 ]

Phabricator commented on HIVE-4963:
-----------------------------------

ashutoshc has commented on the revision "HIVE-4963 [jira] Support in memory PTF 
partitions".

  It seems like there are more opportunities to make this efficient, but those 
can be dug into later. This patch is a step in the right direction by reusing 
existing infra. Any improvements we make now may benefit other spilling 
operators like join too. Really makes me happy : )
  Apart from the code comments, I will also request that you add a test case 
which sets the config value (cachesize) to zero, so that it spills for every 
record and exercises all these new code paths.
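
  A minimal sketch of the idea, assuming the patch reuses the existing join 
cache-size setting and reads it through Hadoop's Configuration; the property 
key below is an assumption, not necessarily what the final patch uses:

{noformat}
// Hedged sketch: force the spill path on every record by setting the
// in-memory cache size to zero. The key "hive.join.cache.size" is assumed
// here because the patch reuses the existing RowContainer infrastructure;
// substitute whatever key the committed patch actually reads.
import org.apache.hadoop.conf.Configuration;

public class SpillEveryRowConfSketch {
  public static Configuration zeroCacheConf() {
    Configuration conf = new Configuration();
    conf.setInt("hive.join.cache.size", 0); // 0 rows in memory => spill on each add
    return conf;
  }
}
{noformat}

  Running the existing windowing queries with this setting and diffing the 
output against the in-memory run should exercise the new spill/read-back code 
paths.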

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java:89 I think you 
need to do this because the current RowContainers can only hold plain Java 
objects. It seems we could improve on this by writing a RowContainer that can 
hold writables, thus avoiding the unnecessary deserialization and memory copy 
here. Something worth exploring as a follow-up issue (see the writable-buffer 
sketch after these inline comments).
  ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java:57 This config 
should really govern how much memory we are willing to allocate (in bytes), 
not the number of rows, but that's a topic for another jira since you are 
reusing existing code (see the byte-budget sketch after these inline comments).
  ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java:148 This sanity 
check is in a tight loop. Ideally we should not have such checks in an inner 
loop, but let's leave it here until we get more confidence in the code. It 
would be good to add a note about what the assumption will be if we get rid of 
this check in the future (see the assertion sketch after these inline comments).
  ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java:137 Instead of 
try-catch-rethrow, shall we just add throws to the method signature? That makes 
the code more readable and arguably faster (see the throws-vs-rethrow sketch 
after these inline comments).
  ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java:160 Similar 
comment about try-catch-rethrow.
  
ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/PTFRowContainer.java:80 
Awesome comments!
  
ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/PTFRowContainer.java:94 
If I get this right, this function will again do serialization before spilling, 
so under memory pressure we are doing a round trip of serialize-deserialize 
without performing useful work. This ties back to my earlier comment on eager 
deserialization (the writable-buffer sketch below also covers this).
  This whole mechanism is worth exploring later.
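
  A hedged sketch of the "hold writables, spill bytes" idea from the comments 
on PTFPartition.java:89 and PTFRowContainer.java:94. This is not the 
PTFRowContainer API; the class and method names are made up for illustration. 
The point is to keep each row in its serialized form and deserialize lazily on 
read, so a spill can write bytes straight to disk without a 
deserialize/re-serialize round trip:

{noformat}
// Illustrative only: a row buffer that stores serialized rows (BytesWritable)
// and spills the raw bytes, avoiding the ser-deser round trip under memory
// pressure. Names here are hypothetical, not the patch's API.
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.BytesWritable;

public class WritableRowBufferSketch {
  private final List<BytesWritable> rows = new ArrayList<BytesWritable>();

  // Rows arrive already serialized; nothing is eagerly deserialized on add.
  public void addRow(BytesWritable serializedRow) {
    rows.add(serializedRow);
  }

  // Under memory pressure, dump the raw bytes; no SerDe work is done here.
  public void spillTo(String path) throws IOException {
    DataOutputStream out = new DataOutputStream(new FileOutputStream(path));
    try {
      for (BytesWritable row : rows) {
        row.write(out); // a Writable serializes itself: length, then bytes
      }
    } finally {
      out.close();
    }
    rows.clear();
  }
}
{noformat}

  Deserialization would then happen only when a reader actually needs a row, 
which is the lazy behaviour the comments above suggest exploring.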
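
  A hedged sketch of the byte-budget idea from the PTFPartition.java:57 
comment, i.e. capping memory use in bytes instead of rows; names are 
illustrative only:

{noformat}
// Illustrative only: track an estimated byte budget and tell the caller when
// buffering one more row would exceed it, instead of counting rows.
public class ByteBudgetSketch {
  private final long maxBytes;
  private long usedBytes;

  public ByteBudgetSketch(long maxBytes) {
    this.maxBytes = maxBytes;
  }

  // True when the caller should spill before buffering this row.
  public boolean wouldExceed(long estimatedRowBytes) {
    return usedBytes + estimatedRowBytes > maxBytes;
  }

  // Called after the row has been buffered in memory.
  public void account(long estimatedRowBytes) {
    usedBytes += estimatedRowBytes;
  }

  // Called when the in-memory block is flushed to disk.
  public void reset() {
    usedBytes = 0;
  }
}
{noformat}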
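
  A hedged sketch for the PTFPartition.java:148 comment: one way to keep the 
invariant documented while taking it off the hot path is to turn the explicit 
check into an assert. The invariant text is an assumption about what such a 
check would verify, not a description of the actual code:

{noformat}
// Illustrative only: an assert documents the assumption and costs nothing in
// production (assertions run only when the JVM is started with -ea), while
// the comment records what callers must guarantee if the check is removed.
public class RangeCheckSketch {
  // Assumption to document for the future: callers never ask for an index
  // outside [0, size), because the iterator driving the loop stops at size.
  static int checkedIndex(int i, int size) {
    assert i >= 0 && i < size : "row index " + i + " out of range [0, " + size + ")";
    return i;
  }
}
{noformat}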
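
  A hedged sketch of the try-catch-rethrow point from the PTFPartition.java:137 
and :160 comments. Method names are illustrative, and IOException stands in for 
whatever checked exception the real methods deal with:

{noformat}
// Illustrative only: declaring the checked exception keeps the method body to
// the useful work and drops a layer of wrapping at every call.
import java.io.IOException;

public class ThrowsVsRethrowSketch {
  // Before: wrap-and-rethrow inside the method body.
  Object elemAtWrapped(int idx) {
    try {
      return readRow(idx);
    } catch (IOException e) {
      throw new RuntimeException(e); // extra wrapping, real cause one level down
    }
  }

  // After: declare the exception and let the caller decide how to handle it.
  Object elemAt(int idx) throws IOException {
    return readRow(idx);
  }

  // Stand-in for the real row lookup; purely illustrative.
  private Object readRow(int idx) throws IOException {
    return Integer.valueOf(idx);
  }
}
{noformat}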

REVISION DETAIL
  https://reviews.facebook.net/D12279

To: JIRA, ashutoshc, hbutani

                
> Support in memory PTF partitions
> --------------------------------
>
>                 Key: HIVE-4963
>                 URL: https://issues.apache.org/jira/browse/HIVE-4963
>             Project: Hive
>          Issue Type: Bug
>          Components: PTF-Windowing
>            Reporter: Harish Butani
>         Attachments: HIVE-4963.D11955.1.patch, HIVE-4963.D12279.1.patch, 
> HIVE-4963.D12279.2.patch, PTFRowContainer.patch
>
>
> PTF partitions apply the defensive mode of assuming that partitions will not 
> fit in memory. Because of this there is a significant deserialization 
> overhead when accessing elements. 
> Allow the user to specify that there is enough memory to hold partitions 
> through a 'hive.ptf.partition.fits.in.mem' option.  
> Savings depend on the partition size and, in the case of windowing, on the 
> number of UDAFs and the window ranges. For example, for the following 
> (admittedly extreme) case the PTFOperator exec time went from 39 secs to 8 secs.
>  
> {noformat}
> select t, s, i, b, f, d,
> min(t) over(partition by 1 rows between unbounded preceding and current row), 
> min(s) over(partition by 1 rows between unbounded preceding and current row), 
> min(i) over(partition by 1 rows between unbounded preceding and current row), 
> min(b) over(partition by 1 rows between unbounded preceding and current row) 
> from over10k
> {noformat}

