Misha Dmitriev created HDFS-11383:
-------------------------------------

             Summary: String duplication in org.apache.hadoop.fs.BlockLocation
                 Key: HDFS-11383
                 URL: https://issues.apache.org/jira/browse/HDFS-11383
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: Misha Dmitriev


I am working on Hive performance, investigating the high memory pressure that 
occurs when (a) a table consists of a large number (thousands) of partitions 
and (b) multiple queries run against it concurrently. It turns out that a lot 
of memory is wasted due to data duplication. One source of duplicate strings is 
the class org.apache.hadoop.fs.BlockLocation. Its fields, such as storageIds, 
topologyPaths, hosts, and names, may collectively use up to 6% of memory in my 
benchmark, causing (together with other problematic classes) a huge memory 
spike. Of this 6% of memory taken by BlockLocation strings, more than 5% is 
wasted due to duplication.

I think we need to add calls to String.intern() in the BlockLocation 
constructor, like:

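{code}
this.hosts = internStringsInArray(hosts);
...

// Replaces each element with its interned (canonical) copy, in place,
// and returns the same array so the result can be assigned to a field.
private static String[] internStringsInArray(String[] sar) {
  for (int i = 0; i < sar.length; i++) {
    sar[i] = sar[i].intern();
  }
  return sar;
}
{code}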

String.intern() performs very well starting from JDK 7. I've found some 
articles explaining the progress that was made by the HotSpot JVM developers in 
this area, verified that with benchmarks myself, and finally added quite a bit 
of interning to one of the Cloudera products without any issues.
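
To illustrate the effect, here is a minimal, self-contained sketch. It is not 
the actual BlockLocation code: the class name InternDemo, the helper 
countDistinctInstances, and the sample host names are made up for this 
example, and the null checks are an extra precaution I am assuming would be 
wanted. It shows that two arrays holding equal but separately allocated 
strings end up sharing the same String instances after interning.

```java
import java.util.IdentityHashMap;
import java.util.Map;

public final class InternDemo {

  // Interns every non-null element in place and returns the same array.
  static String[] internStringsInArray(String[] sar) {
    if (sar == null) {
      return null;
    }
    for (int i = 0; i < sar.length; i++) {
      if (sar[i] != null) {
        sar[i] = sar[i].intern();
      }
    }
    return sar;
  }

  // Counts distinct String objects by identity (==), not by equals().
  static int countDistinctInstances(String[]... arrays) {
    Map<String, Boolean> seen = new IdentityHashMap<>();
    for (String[] arr : arrays) {
      for (String s : arr) {
        seen.put(s, Boolean.TRUE);
      }
    }
    return seen.size();
  }

  public static void main(String[] args) {
    // new String(...) forces separate heap copies, mimicking strings that
    // were deserialized independently for each block location.
    String[] hostsA = { new String("host1.example.com"), new String("host2.example.com") };
    String[] hostsB = { new String("host1.example.com"), new String("host2.example.com") };

    System.out.println("before: " + countDistinctInstances(hostsA, hostsB));
    internStringsInArray(hostsA);
    internStringsInArray(hostsB);
    System.out.println("after:  " + countDistinctInstances(hostsA, hostsB));
  }
}
```

With thousands of partitions referring to the same small set of hosts, the 
number of retained String objects should then scale with the number of 
distinct values rather than with the number of BlockLocation instances.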



