[ https://issues.apache.org/jira/browse/HIVE-16854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rui Li updated HIVE-16854:
--------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0
           Status: Resolved  (was: Patch Available)

Pushed to master. Thanks Xuefu for the review.

> SparkClientFactory is locked too aggressively
> ---------------------------------------------
>
>                 Key: HIVE-16854
>                 URL: https://issues.apache.org/jira/browse/HIVE-16854
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 1.1.0
>            Reporter: Xuefu Zhang
>            Assignee: Rui Li
>             Fix For: 3.0.0
>
>         Attachments: 15763.jstack, HIVE-16854.2.patch, HIVE-16854.patch
>
>
> Most methods in SparkClientFactory are synchronized on the SparkClientFactory
> singleton. However, some methods are very expensive, such as createClient(),
> which returns a SparkClientImpl instance. However, creating a SparkClientImpl
> instance requires starting a remote driver to connect back to RPCServer. This
> process can take a long time such as in case of a busy yarn queue. When this
> happens, all pending calls on SparkClientFactory will have to wait for a
> long time.
> In our case, hive.spark.client.server.connect.timeout is set to 1hr. This
> makes some queries waiting for hours before starting.
> The current implementation seems pretty much making all remote driver
> launches serialized. If one of them takes time, the following ones will have
> to wait.
> HS2 stacktrace is attached for reference. It's based on earlier version of
> Hive, so the line numbers might be slightly off.
> The following shows the locking effect:
> {code}
> xuefu@hadoopservice20-sjc1:~$ grep org.apache.hive.spark.client.SparkClientFactory 15763.jstack
>       at org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:79)
>       - waiting to lock <0x00007f78fa1a9cc0> (a java.lang.Class for org.apache.hive.spark.client.SparkClientFactory)
>       at org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:79)
>       - waiting to lock <0x00007f78fa1a9cc0> (a java.lang.Class for org.apache.hive.spark.client.SparkClientFactory)
>       at org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80)
>       - locked <0x00007f78fa1a9cc0> (a java.lang.Class for org.apache.hive.spark.client.SparkClientFactory)
>       at org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:79)
>       - waiting to lock <0x00007f78fa1a9cc0> (a java.lang.Class for org.apache.hive.spark.client.SparkClientFactory)
>       at org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:79)
>       - waiting to lock <0x00007f78fa1a9cc0> (a java.lang.Class for org.apache.hive.spark.client.SparkClientFactory)
>       at org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:79)
>       - waiting to lock <0x00007f78fa1a9cc0> (a java.lang.Class for org.apache.hive.spark.client.SparkClientFactory)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
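
As a rough illustration of the locking pattern discussed in this issue, the Java sketch below keeps only a cheap state check under a lock and runs the slow launch outside it, so concurrent callers no longer queue behind one another. It is not the committed patch and does not use Hive's classes: the class name NarrowLockFactorySketch, the STATE_LOCK object, the string "client" return value, and the Thread.sleep standing in for the remote driver launch are all assumptions made for the example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a client factory; NOT the actual Hive code or patch.
final class NarrowLockFactorySketch {
    // Dedicated lock guarding only the cheap "is the factory running" flag.
    private static final Object STATE_LOCK = new Object();
    private static boolean initialized = false;

    static void initialize() {
        synchronized (STATE_LOCK) { initialized = true; }
    }

    static void stop() {
        synchronized (STATE_LOCK) { initialized = false; }
    }

    // Deliberately not "static synchronized": the lock is held only for the
    // state check, never for the slow launch itself.
    static String createClient(String name, long launchMillis) throws InterruptedException {
        synchronized (STATE_LOCK) {
            if (!initialized) {
                throw new IllegalStateException("factory not initialized");
            }
        }
        // Assumption: sleep simulates waiting for a remote driver to connect back.
        Thread.sleep(launchMillis);
        return "client-" + name;
    }

    // Starts `callers` concurrent createClient calls; returns elapsed wall time in ms.
    static long timeConcurrentLaunches(int callers, long launchMillis) throws InterruptedException {
        initialize();
        CountDownLatch done = new CountDownLatch(callers);
        long start = System.nanoTime();
        for (int i = 0; i < callers; i++) {
            final String name = "q" + i;
            new Thread(() -> {
                try {
                    createClient(name, launchMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            }).start();
        }
        done.await();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws InterruptedException {
        long elapsed = timeConcurrentLaunches(4, 300);
        System.out.println("4 concurrent launches took about " + elapsed + " ms");
    }
}
```

If createClient were declared static synchronized, as the jstack output above suggests for the affected version, the four simulated launches would run one after another and take roughly four times the launch delay; here they overlap. A real fix would also have to decide how a concurrent stop() interacts with a launch already in progress, which this sketch does not address.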