[ https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13746530#comment-13746530 ]
Phabricator commented on HIVE-4963:
-----------------------------------

ashutoshc has commented on the revision "HIVE-4963 [jira] Support in memory PTF partitions".

  It seems like there are more opportunities to make this efficient, but those can be dug into later. This patch is a step in the right direction because it reuses existing infrastructure; any improvements we make now may also benefit other spilling operators, like join. Really makes me happy : )
  Apart from the code comments, I also request that you add a test case which sets the config value (cache size) to zero, so that it spills for every record and exercises all of these new code paths (a config sketch follows the inline comments).

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java:89 I think you need to do this because the current RowContainers can only hold plain Java objects. It seems we could improve this by writing a RowContainer that holds writables, thus avoiding the unnecessary deserialization and memory copy here. Worth exploring as a follow-up issue (see the writable-container sketch below).
  ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java:57 This config should really govern how much memory we are willing to allocate (in bytes), not a number of rows, but that's a topic for another jira since you are reusing existing code.
  ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java:148 This sanity check sits in a tight loop, and ideally we should not have such checks in an inner loop. But let's leave it here until we have more confidence in the code. It would be good to add a note stating the assumption we would rely on if this check is removed in the future (a sketch follows below).
  ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java:137 Instead of try-catch-rethrow, shall we just add throws to the method signature? That makes the code more readable and arguably faster (both styles are sketched below).
  ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java:160 Similar comment about try-catch-rethrow.
  ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/PTFRowContainer.java:80 Awesome comments!
  ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/PTFRowContainer.java:94 If I read this right, this function serializes again before spilling, so under memory pressure we pay a serialize-deserialize round trip without performing useful work. This ties back to my earlier comment on eager deserialization. The whole mechanism is worth exploring later.
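To make the spill-for-every-record test concrete, here is a minimal sketch, assuming the cache size the patch reuses is RowContainer's existing hive.join.cache.size row cap; the class name is made up, and a real test would live in a .q file rather than a main method.

{noformat}
import org.apache.hadoop.hive.conf.HiveConf;

public class SpillEveryRecordSketch {
  public static void main(String[] args) {
    HiveConf conf = new HiveConf();
    // Force the row container to overflow immediately so every record
    // takes the spill path and the new code is exercised on each row.
    conf.setIntVar(HiveConf.ConfVars.HIVEJOINCACHESIZE, 0);
    // Running a windowing query (e.g. the over10k query quoted in the
    // issue description) under this conf would then spill per record.
  }
}
{noformat}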
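On the eager-deserialization comments (PTFPartition.java:89 and PTFRowContainer.java:94), a rough sketch of the follow-up idea: keep rows as writables so neither the read path nor the spill path pays a serialize-deserialize round trip. The class and method names below are hypothetical, not part of the patch.

{noformat}
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.BytesWritable;

/** Hypothetical writable-backed container; names are illustrative. */
class WritableRowContainer {
  private final List<BytesWritable> rows = new ArrayList<BytesWritable>();

  /** Keep the serialized bytes as-is; no deserialization or copy on add. */
  void addRow(BytesWritable serializedRow) {
    rows.add(serializedRow);
  }

  /** The spill path can write these bytes out verbatim. */
  BytesWritable serializedRowAt(int i) {
    return rows.get(i);
  }
}
{noformat}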
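For the sanity check in the tight loop (PTFPartition.java:148), one way to hoist the check out of the scan and record the assumption, again with made-up names:

{noformat}
/** Hypothetical sketch; the method name and contract are illustrative. */
final class RangeChecks {
  /**
   * Validate the range once before the scan instead of on every row.
   * The assumption to document if the per-row check is removed:
   * callers only ever request indexes in [0, size).
   */
  static void checkRangeOnce(int start, int end, int size) {
    if (start < 0 || end > size || start > end) {
      throw new IndexOutOfBoundsException(
          "range " + start + ".." + end + " outside 0.." + size);
    }
  }
}
{noformat}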
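And for the try-catch-rethrow comments (PTFPartition.java:137 and :160), the two styles side by side; RowSource and rowAt are stand-ins for whatever PTFPartition actually wraps.

{noformat}
import org.apache.hadoop.hive.ql.metadata.HiveException;

/** Hypothetical stand-in for the container the accessors wrap. */
interface RowSource {
  Object rowAt(int i) throws HiveException;
}

class AccessorStyles {
  private final RowSource elems;

  AccessorStyles(RowSource elems) {
    this.elems = elems;
  }

  // Current style: catch the checked exception and rethrow unchecked.
  Object rowAtWrapped(int i) {
    try {
      return elems.rowAt(i);
    } catch (HiveException e) {
      throw new RuntimeException(e);
    }
  }

  // Suggested style: declare the exception and let it propagate.
  Object rowAtDeclared(int i) throws HiveException {
    return elems.rowAt(i);
  }
}
{noformat}

The declared-throws version also keeps the original exception type visible to callers instead of burying it inside a RuntimeException.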
REVISION DETAIL
  https://reviews.facebook.net/D12279

To: JIRA, ashutoshc, hbutani


> Support in memory PTF partitions
> --------------------------------
>
>                 Key: HIVE-4963
>                 URL: https://issues.apache.org/jira/browse/HIVE-4963
>             Project: Hive
>          Issue Type: Bug
>          Components: PTF-Windowing
>            Reporter: Harish Butani
>         Attachments: HIVE-4963.D11955.1.patch, HIVE-4963.D12279.1.patch, HIVE-4963.D12279.2.patch, PTFRowContainer.patch
>
> PTF partitions apply the defensive mode of assuming that partitions will not fit in memory. Because of this there is a significant deserialization overhead when accessing elements.
> Allow the user to specify that there is enough memory to hold partitions through a 'hive.ptf.partition.fits.in.mem' option.
> Savings depend on the partition size and, in the case of windowing, on the number of UDAFs and the window ranges. For example, in the following (admittedly extreme) case the PTFOperator exec times went from 39 secs to 8 secs.
>
> {noformat}
> select t, s, i, b, f, d,
> min(t) over(partition by 1 rows between unbounded preceding and current row),
> min(s) over(partition by 1 rows between unbounded preceding and current row),
> min(i) over(partition by 1 rows between unbounded preceding and current row),
> min(b) over(partition by 1 rows between unbounded preceding and current row)
> from over10k
> {noformat}