Stamatis Zampetakis created HIVE-24569:
------------------------------------------

             Summary: LLAP daemon leaks file descriptors/log4j appenders
                 Key: HIVE-24569
                 URL: https://issues.apache.org/jira/browse/HIVE-24569
             Project: Hive
          Issue Type: Bug
          Components: llap
    Affects Versions: 2.2.0
            Reporter: Stamatis Zampetakis
            Assignee: Stamatis Zampetakis
             Fix For: 4.0.0
         Attachments: llap-appender-gc-roots.png

With HIVE-9756 query logs in LLAP are directed to different files (file per 
query) using a Log4j2 routing appender. Without a purge policy in place, 
appenders are created dynamically by the routing appender, one for each query, 
and remain in memory forever. The dynamic appenders write to files so each 
appender holds to a file descriptor. 

Further work HIVE-14224 has mitigated the issue by introducing a custom purging 
policy (LlapRoutingAppenderPurgePolicy) which deletes the dynamic appenders 
(and closes the respective files) when the query is completed 
(org.apache.hadoop.hive.llap.daemon.impl.QueryTracker#handleLogOnQueryCompletion).
 

However, in the presence of multiple threads appending to the logs there are 
race conditions. In an internal Hive cluster the number of file descriptors 
started going up approx one descriptor leaking per query. After some debugging 
it turns out that one thread (running the 
QueryTracker#handleLogOnQueryCompletion) signals that the query has finished 
and thus the purge policy should get rid of the respective appender (and close 
the file) while another (Task-Executor-0) attempts to append another log 
message for the same query. The initial appender is closed after the request 
from the query tracker but a new one is created to accomodate the message from 
the task executor and the latter is never removed thus creating a leak. 

Similar leaks have been identified and fixed for HS2 with the most similar one 
being that described 
[here|https://issues.apache.org/jira/browse/HIVE-22753?focusedCommentId=17021041&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17021041].
 

The problem relies on the timing of threads so it may not manifestate in all 
versions between 2.2.0 and 4.0.0. Usually the leak can be seen either via lsof 
(or other similar command) with the following output:

{noformat}
# 1494391 is the PID of the LLAP daemon process
ls -ltr /proc/1494391/fd
...
lrwx------ 1 hive hadoop 64 Dec 24 12:08 978 -> 
/hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121724_66ce273d-54a9-4dcd-a9fb-20cb5691cef7-dag_1608659125567_0008_194.log
lrwx------ 1 hive hadoop 64 Dec 24 12:08 977 -> 
/hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121804_ce53eeb5-c73f-4999-b7a4-b4dd04d4e4de-dag_1608659125567_0008_197.log
lrwx------ 1 hive hadoop 64 Dec 24 12:08 974 -> 
/hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224122002_1693bd7d-2f0e-4673-a8d1-b7cb14a02204-dag_1608659125567_0008_204.log
lrwx------ 1 hive hadoop 64 Dec 24 12:08 989 -> 
/hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121909_6a56218f-06c7-4906-9907-4b6dd824b100-dag_1608659125567_0008_201.log
lrwx------ 1 hive hadoop 64 Dec 24 12:08 984 -> 
/hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121754_78ef49a0-bc23-478f-9a16-87fa25e7a287-dag_1608659125567_0008_196.log
lrwx------ 1 hive hadoop 64 Dec 24 12:08 983 -> 
/hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121855_e65b9ebf-b2ec-4159-9570-1904442b7048-dag_1608659125567_0008_200.log
lrwx------ 1 hive hadoop 64 Dec 24 12:08 981 -> 
/hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121818_e9051ae3-1316-46af-aabb-22c53ed2fda7-dag_1608659125567_0008_198.log
lrwx------ 1 hive hadoop 64 Dec 24 12:08 980 -> 
/hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121744_fcf37921-4351-4368-95ee-b5be2592d89a-dag_1608659125567_0008_195.log
lrwx------ 1 hive hadoop 64 Dec 24 12:08 979 -> 
/hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121837_e80c0024-f6bc-4b3c-85ed-5c0c85c55787-dag_1608659125567_0008_199.log
{noformat}

or in the heap dump with many appenders (in my case {{LlapWrappedAppender}}) 
holding indirectly open file descriptors.





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to