[ https://issues.apache.org/jira/browse/HIVE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430748#comment-13430748 ]
Rohini Palaniswamy commented on HIVE-3098:
------------------------------------------

Daryn,
bq. trying to cache UGIs will provide a negligible performance increase. Hence why I question if this is premature optimization.

Ran a small test on a 10-node cluster with Kerberos:
- The first FileSystem.get() took 350-700ms. On one node it was ~350ms and on another ~700ms; I did not spend time analyzing the reason for the difference. Some of the time, I assume, is spent fetching the service ticket for the NameNode, opening the socket, etc.
- Subsequent FileSystem.get() calls for the same NameNode took ~75ms (with fs.cache disabled, or with the cache enabled but different UGIs).
- If the FileSystem is fetched from the cache because the UGI is the same, it takes 0-1ms.

The times might be slightly higher on a bigger cluster with the NameNode under heavy load. ~75ms is negligible at the moment, but as we move toward using Hive for more real-time queries and look at reducing the response times of Hive metastore operations, the UGI/FS-cache patch becomes worth considering: ~75ms is roughly 25-40% of the response time of some Hive metastore operations.

Hope this data is useful for HADOOP-7973 too.

> Memory leak from large number of FileSystem instances in FileSystem.CACHE. (Must cache UGIs.)
> ---------------------------------------------------------------------------------------------
>
>                 Key: HIVE-3098
>                 URL: https://issues.apache.org/jira/browse/HIVE-3098
>             Project: Hive
>          Issue Type: Bug
>          Components: Shims
>    Affects Versions: 0.9.0
>        Environment: Running with Hadoop 20.205.0.3+ / 1.0.x with security turned on.
>            Reporter: Mithun Radhakrishnan
>            Assignee: Mithun Radhakrishnan
>         Attachments: Hive-3098_(FS_closeAllForUGI()).patch, Hive_3098.patch
>
> The problem manifested while stress-testing HCatalog 0.4.1 (as part of testing the Oracle backend).
> The HCatalog server ran out of memory (-Xmx2048m) in under 24 hours when pounded by 60 threads. The heap dump shows that hadoop::FileSystem.CACHE held 1,000,000 instances of FileSystem, whose combined retained memory consumed the entire heap.
> It boiled down to hadoop::UserGroupInformation::equals() being implemented such that the "Subject" member is compared for identity ("=="), not equivalence (".equals()"). This causes equivalent UGI instances to compare as unequal, and a new FileSystem instance to be created and cached for each one.
> UGI.equals() is implemented that way, incidentally, as a fix for another problem (HADOOP-6670), so it is unlikely that that implementation can be changed.
> The solution is to check for UGI equivalence in HCatalog (i.e. in the Hive metastore), using a cache for UGI instances in the shims.
> I have a patch to fix this; I'll upload it shortly. I just ran an overnight test confirming that the memory leak has been arrested.
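
For illustration, here is a minimal sketch of the leak mechanism described above. It is not taken from the attached patches; the class name FsCacheLeakDemo and the user name "hcat" are made up for the example. Because UGI.equals() compares the wrapped Subject by reference, two UGIs created for the same user act as distinct FileSystem.CACHE keys, so every request adds another FileSystem to the cache:

{code:java}
import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

public class FsCacheLeakDemo {
  public static void main(String[] args) throws Exception {
    final Configuration conf = new Configuration();

    // Two UGI instances for the same user. UGI.equals() compares the wrapped
    // Subject by reference ("=="), so these two compare as unequal.
    UserGroupInformation ugi1 = UserGroupInformation.createRemoteUser("hcat");
    UserGroupInformation ugi2 = UserGroupInformation.createRemoteUser("hcat");

    FileSystem fs1 = ugi1.doAs(
        (PrivilegedExceptionAction<FileSystem>) () -> FileSystem.get(conf));
    FileSystem fs2 = ugi2.doAs(
        (PrivilegedExceptionAction<FileSystem>) () -> FileSystem.get(conf));

    // FileSystem.CACHE keys include the UGI, so each equivalent-but-unequal
    // UGI gets its own cached FileSystem instance: prints "false".
    System.out.println("same cached instance? " + (fs1 == fs2));
  }
}
{code}

With 60 client threads doing this continuously, the cache grows without bound, which matches the heap-dump observation above.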
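And a sketch of the proposed fix, a UGI cache in the shims so that requests for the same user share one canonical UGI (and therefore one FileSystem.CACHE entry). This is only an outline of the idea, not the attached patch: the class and method names (UgiCache, getUgi, release) are hypothetical, and keying on the bare user-name string is a simplification; the real shims may need to account for proxy users and credentials. The eviction path uses FileSystem.closeAllForUGI(), in the spirit of the Hive-3098_(FS_closeAllForUGI()).patch attachment:

{code:java}
import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

/** Hypothetical sketch: one canonical UGI per user, so repeated requests
 *  for the same user reuse a single FileSystem.CACHE entry. */
public class UgiCache {
  private final ConcurrentMap<String, UserGroupInformation> ugis =
      new ConcurrentHashMap<>();

  /** Returns the cached UGI for this user, creating it on first use. */
  public UserGroupInformation getUgi(String userName) {
    return ugis.computeIfAbsent(userName,
        UserGroupInformation::createRemoteUser);
  }

  /** Evicts a user's UGI and closes the FileSystem instances cached for it,
   *  so the cache cannot grow without bound. */
  public void release(String userName) throws IOException {
    UserGroupInformation ugi = ugis.remove(userName);
    if (ugi != null) {
      FileSystem.closeAllForUGI(ugi);
    }
  }
}
{code}

Per the measurements above, reusing the UGI turns a ~75ms FileSystem.get() into a 0-1ms cache hit, which is where the claimed 25-40% saving on some metastore operations would come from.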