[ 
https://issues.apache.org/jira/browse/HIVE-24569?focusedWorklogId=541681&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-541681
 ]

ASF GitHub Bot logged work on HIVE-24569:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 26/Jan/21 04:09
            Start Date: 26/Jan/21 04:09
    Worklog Time Spent: 10m 
      Work Description: zabetak commented on a change in pull request #1858:
URL: https://github.com/apache/hive/pull/1858#discussion_r563642345



##########
File path: llap-server/src/main/resources/llap-daemon-log4j2.properties
##########
@@ -81,23 +81,21 @@ appender.query-routing.type = Routing
 appender.query-routing.name = query-routing
 appender.query-routing.routes.type = Routes
 appender.query-routing.routes.pattern = $${ctx:queryId}
-#Purge polciy for query-based Routing Appender
-appender.query-routing.purgePolicy.type = LlapRoutingAppenderPurgePolicy
-# Note: Do not change this name without changing the corresponding entry in 
LlapConstants
-appender.query-routing.purgePolicy.name = llapLogPurgerQueryRouting
+#Purge policy for query-based Routing Appender
+appender.query-routing.purgePolicy.type = IdlePurgePolicy
+appender.query-routing.purgePolicy.timeToLive = 5

Review comment:
       @prasanthj I increased the `timeToLive` for production to 60s and made 
appropriate changes to keep the `.done` when the query completes. I tested the 
changes on the cluster and things seem to work properly (no descriptor leak, 
yarn log aggregation working normally). Can you have another look when you find 
some time? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 541681)
    Time Spent: 1.5h  (was: 1h 20m)

> LLAP daemon leaks file descriptors/log4j appenders
> --------------------------------------------------
>
>                 Key: HIVE-24569
>                 URL: https://issues.apache.org/jira/browse/HIVE-24569
>             Project: Hive
>          Issue Type: Bug
>          Components: llap
>    Affects Versions: 2.2.0
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>         Attachments: llap-appender-gc-roots.png
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> With HIVE-9756 query logs in LLAP are directed to different files (file per 
> query) using a Log4j2 routing appender. Without a purge policy in place, 
> appenders are created dynamically by the routing appender, one for each 
> query, and remain in memory forever. The dynamic appenders write to files so 
> each appender holds to a file descriptor. 
> Further work HIVE-14224 has mitigated the issue by introducing a custom 
> purging policy (LlapRoutingAppenderPurgePolicy) which deletes the dynamic 
> appenders (and closes the respective files) when the query is completed 
> (org.apache.hadoop.hive.llap.daemon.impl.QueryTracker#handleLogOnQueryCompletion).
>  
> However, in the presence of multiple threads appending to the logs there are 
> race conditions. In an internal Hive cluster the number of file descriptors 
> started going up approx one descriptor leaking per query. After some 
> debugging it turns out that one thread (running the 
> QueryTracker#handleLogOnQueryCompletion) signals that the query has finished 
> and thus the purge policy should get rid of the respective appender (and 
> close the file) while another (Task-Executor-0) attempts to append another 
> log message for the same query. The initial appender is closed after the 
> request from the query tracker but a new one is created to accomodate the 
> message from the task executor and the latter is never removed thus creating 
> a leak. 
> Similar leaks have been identified and fixed for HS2 with the most similar 
> one being that described 
> [here|https://issues.apache.org/jira/browse/HIVE-22753?focusedCommentId=17021041&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17021041].
>  
> The problem relies on the timing of threads so it may not manifestate in all 
> versions between 2.2.0 and 4.0.0. Usually the leak can be seen either via 
> lsof (or other similar command) with the following output:
> {noformat}
> # 1494391 is the PID of the LLAP daemon process
> ls -ltr /proc/1494391/fd
> ...
> lrwx------ 1 hive hadoop 64 Dec 24 12:08 978 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121724_66ce273d-54a9-4dcd-a9fb-20cb5691cef7-dag_1608659125567_0008_194.log
> lrwx------ 1 hive hadoop 64 Dec 24 12:08 977 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121804_ce53eeb5-c73f-4999-b7a4-b4dd04d4e4de-dag_1608659125567_0008_197.log
> lrwx------ 1 hive hadoop 64 Dec 24 12:08 974 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224122002_1693bd7d-2f0e-4673-a8d1-b7cb14a02204-dag_1608659125567_0008_204.log
> lrwx------ 1 hive hadoop 64 Dec 24 12:08 989 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121909_6a56218f-06c7-4906-9907-4b6dd824b100-dag_1608659125567_0008_201.log
> lrwx------ 1 hive hadoop 64 Dec 24 12:08 984 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121754_78ef49a0-bc23-478f-9a16-87fa25e7a287-dag_1608659125567_0008_196.log
> lrwx------ 1 hive hadoop 64 Dec 24 12:08 983 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121855_e65b9ebf-b2ec-4159-9570-1904442b7048-dag_1608659125567_0008_200.log
> lrwx------ 1 hive hadoop 64 Dec 24 12:08 981 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121818_e9051ae3-1316-46af-aabb-22c53ed2fda7-dag_1608659125567_0008_198.log
> lrwx------ 1 hive hadoop 64 Dec 24 12:08 980 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121744_fcf37921-4351-4368-95ee-b5be2592d89a-dag_1608659125567_0008_195.log
> lrwx------ 1 hive hadoop 64 Dec 24 12:08 979 -> 
> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121837_e80c0024-f6bc-4b3c-85ed-5c0c85c55787-dag_1608659125567_0008_199.log
> {noformat}
> or in the heap dump with many appenders (in my case {{LlapWrappedAppender}}) 
> holding indirectly open file descriptors:
> !llap-appender-gc-roots.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to