[ 
https://issues.apache.org/jira/browse/HIVE-15882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879349#comment-15879349
 ] 

Misha Dmitriev commented on HIVE-15882:
---------------------------------------

Yes, I did take a heap dump and rerun the tool after applying the patch 
(actually, two patches: the one mentioned in this ticket so far and another 
one that interns some data within PartitionDesc objects). The results are 
as expected: the sources of memory waste that I fixed went away, though some 
others remain.
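
For illustration, interning of this kind can be sketched as follows. This is 
not the code from either patch: the class and method names are hypothetical, 
and the property key is only an example. It shows that two Properties objects 
built independently end up sharing the same String instances once their keys 
and values go through String.intern().

```java
import java.util.Map;
import java.util.Properties;

/** Hypothetical helper for illustration; not taken from the HIVE-15882 patch. */
final class PropertiesInterner {
    private PropertiesInterner() {}

    /** Returns a copy of src in which every String key and value is interned. */
    static Properties internStrings(Properties src) {
        Properties out = new Properties();
        for (Map.Entry<Object, Object> e : src.entrySet()) {
            Object k = e.getKey();
            Object v = e.getValue();
            out.put(k instanceof String ? ((String) k).intern() : k,
                    v instanceof String ? ((String) v).intern() : v);
        }
        return out;
    }

    public static void main(String[] args) {
        // new String(...) forces distinct instances, as happens when each
        // query builds its own copy of the partition metadata.
        Properties p1 = new Properties();
        p1.setProperty("serialization.format", new String("1"));
        Properties p2 = new Properties();
        p2.setProperty("serialization.format", new String("1"));

        Properties q1 = internStrings(p1);
        Properties q2 = internStrings(p2);

        // After interning, both maps reference one shared String instance.
        System.out.println(q1.getProperty("serialization.format")
                == q2.getProperty("serialization.format"));
    }
}
```

Note that this only deduplicates the strings; each Properties object still has 
its own hash table, so sharing whole Properties instances would need a 
separate cache.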

As for how much memory is saved: this question is trickier than it sounds. 
In general, the more memory you give a Java app (the higher -Xmx is), the 
more it will use, because it simply runs the GC less frequently. So you have 
to measure the size of the "live set" (what remains after a full GC), which is 
more difficult. In my benchmark, with its high concurrency, I suspect this is 
further complicated by the fact that a higher -Xmx allows more concurrency, 
so even the live set will probably grow. So far, I think a reasonably 
accurate metric is how many more concurrent requests I can run in this 
benchmark before it OOMs again. Any other suggestions are welcome.
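
As a rough sketch of what measuring the live set could look like from inside 
a JVM: request a GC, then read the used heap, and repeat a few times keeping 
the smallest value. The class and method names are hypothetical, and this is 
not how the numbers in this ticket were obtained. It also assumes explicit GC 
is not disabled (e.g. via -XX:+DisableExplicitGC), since gc() is only a 
request to the JVM.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

/** Hypothetical probe for illustration; approximates the live set of this JVM. */
final class LiveSetProbe {
    private LiveSetProbe() {}

    /** Requests a GC and returns the used heap in bytes right afterwards. */
    static long usedHeapAfterGc() {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        mem.gc(); // effectively System.gc(); a request, not a guarantee
        return mem.getHeapMemoryUsage().getUsed();
    }

    /** Takes several samples and keeps the smallest one. */
    static long minUsedHeapAfterGc(int samples) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < samples; i++) {
            min = Math.min(min, usedHeapAfterGc());
        }
        return min;
    }

    public static void main(String[] args) {
        long bytes = minUsedHeapAfterGc(3);
        System.out.println("approx. live set: " + (bytes / (1024 * 1024)) + " MB");
    }
}
```

With queries in flight the value still depends on how many requests are active 
at the moment of sampling, so it is only comparable between runs at the same 
concurrency level.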

> HS2 generating high memory pressure with many partitions and concurrent 
> queries
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-15882
>                 URL: https://issues.apache.org/jira/browse/HIVE-15882
>             Project: Hive
>          Issue Type: Improvement
>          Components: HiveServer2
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>         Attachments: HIVE-15882.01.patch, hs2-crash-2000p-500m-50q.txt
>
>
> I've created a Hive table with 2000 partitions, each backed by two files, 
> with one row in each file. When I execute some number of concurrent queries 
> against this table, e.g. as follows
> {code}
> for i in `seq 1 50`; do beeline -u jdbc:hive2://localhost:10000 -n admin \
>   -p admin -e "select count(i_f_1) from misha_table;" & done
> {code}
> it results in a big memory spike. With 20 queries I caused an OOM in an HS2 
> server with -Xmx200m, and with 50 queries in one with -Xmx500m.
> I am attaching the results of jxray (www.jxray.com) analysis of a heap dump 
> that was generated in the 50queries/500m heap scenario. It suggests that 
> there are several opportunities to reduce memory pressure with not very 
> invasive changes to the code:
> 1. 24.5% of memory is wasted by duplicate strings (see section 6). With 
> String.intern() calls added in the ~10 relevant places in the code, this 
> overhead can be greatly reduced.
> 2. Almost 20% of memory is wasted due to various suboptimally used 
> collections (see section 8). There are many maps and lists that are either 
> empty or have just 1 element. By modifying the code that creates and 
> populates these collections, we can likely save 5-10% of memory.
> 3. Almost 20% of memory is used by instances of java.util.Properties. It 
> looks like these objects are heavily duplicated, since for each partition 
> each concurrently running query creates its own copy of Partition, 
> PartitionDesc and Properties. Thus we have nearly 100,000 (50 queries * 
> 2,000 partitions) Properties in memory. By interning/deduplicating these 
> objects we may be able to save perhaps 15% of memory.
> So overall, I think there is a good chance to reduce HS2 memory consumption 
> in this scenario by ~40%.



