[ 
https://issues.apache.org/jira/browse/HIVE-16854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-16854:
-------------------------------
    Description: 
Most methods in SparkClientFactory are synchronized on the SparkClientFactory
class. Some of them are very expensive, such as createClient(), which returns
a SparkClientImpl instance: creating a SparkClientImpl instance requires
starting a remote driver that connects back to the RPCServer. This process can
take a long time, for example when the YARN queue is busy. When this happens,
all pending calls on SparkClientFactory have to wait for a long time.

In our case, hive.spark.client.server.connect.timeout is set to 1hr. This
causes some queries to wait for hours before starting.

The current implementation effectively serializes all remote driver launches.
If one of them takes a long time, the following ones have to wait.

An HS2 stack trace is attached for reference. It is based on an earlier
version of Hive, so the line numbers might be slightly off. The following
shows the locking effect:

{code}
xuefu@hadoopservice20-sjc1:~$ grep org.apache.hive.spark.client.SparkClientFactory 15763.jstack
        at org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:79)
        - waiting to lock <0x00007f78fa1a9cc0> (a java.lang.Class for org.apache.hive.spark.client.SparkClientFactory)
        at org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:79)
        - waiting to lock <0x00007f78fa1a9cc0> (a java.lang.Class for org.apache.hive.spark.client.SparkClientFactory)
        at org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80)
        - locked <0x00007f78fa1a9cc0> (a java.lang.Class for org.apache.hive.spark.client.SparkClientFactory)
        at org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:79)
        - waiting to lock <0x00007f78fa1a9cc0> (a java.lang.Class for org.apache.hive.spark.client.SparkClientFactory)
        at org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:79)
        - waiting to lock <0x00007f78fa1a9cc0> (a java.lang.Class for org.apache.hive.spark.client.SparkClientFactory)
        at org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:79)
        - waiting to lock <0x00007f78fa1a9cc0> (a java.lang.Class for org.apache.hive.spark.client.SparkClientFactory)
{code}
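
For illustration only, the locking effect described above can be reproduced
with a small stand-alone sketch. Nothing here is taken from the Hive source:
LockScopeSketch, createClientCoarse(), createClientNarrow() and the 150 ms
sleep that stands in for a remote driver launch are all hypothetical. The
sketch only contrasts holding the class monitor for the whole call with
holding it just around the shared-state update; whether the real launch can
safely run outside the lock is not established here.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a factory with class-level locking; not Hive code.
class LockScopeSketch {
    // Assumed stand-in for the time it takes to launch a remote driver.
    static final long LAUNCH_MILLIS = 150;

    // Shared state that genuinely needs the class monitor.
    private static int launches = 0;

    private static void slowLaunch() {
        try {
            Thread.sleep(LAUNCH_MILLIS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Coarse pattern: the class monitor is held for the entire slow launch,
    // so concurrent callers queue up behind one another.
    static synchronized String createClientCoarse(String sessionId) {
        slowLaunch();
        launches++;
        return "client-" + sessionId;
    }

    // Narrow pattern: the slow launch runs outside the monitor, which is
    // taken only for the brief shared-state update.
    static String createClientNarrow(String sessionId) {
        slowLaunch();
        synchronized (LockScopeSketch.class) {
            launches++;
        }
        return "client-" + sessionId;
    }

    static synchronized int launchCount() {
        return launches;
    }

    // Runs `callers` concurrent createClient calls and returns the elapsed
    // wall-clock time in milliseconds.
    static long timeConcurrentCalls(int callers, boolean coarse) {
        ExecutorService pool = Executors.newFixedThreadPool(callers);
        long start = System.nanoTime();
        for (int i = 0; i < callers; i++) {
            final String id = String.valueOf(i);
            Callable<String> task =
                () -> coarse ? createClientCoarse(id) : createClientNarrow(id);
            pool.submit(task);
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("callers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        System.out.println("coarse lock, 4 callers: " + timeConcurrentCalls(4, true) + " ms");
        System.out.println("narrow lock, 4 callers: " + timeConcurrentCalls(4, false) + " ms");
    }
}
```

With four callers, the coarse variant should take roughly four launch times
and the narrow variant roughly one, which mirrors the one "locked" and five
"waiting to lock" frames in the jstack output above.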

  was:
Most methods in SparkClientFactory are synchronized on the SparkClientFactory
class. Some of them are very expensive, such as createClient(), which returns
a SparkClientImpl instance: creating a SparkClientImpl instance requires
starting a remote driver that connects back to the RPCServer. This process can
take a long time, for example when the YARN queue is busy. When this happens,
all pending calls on SparkClientFactory have to wait for a long time.

In our case, hive.spark.client.server.connect.timeout is set to 1hr. This
causes some queries to wait for hours before starting.

The current implementation effectively serializes all remote driver launches.
If one of them takes a long time, the following ones have to wait.

An HS2 stack trace is attached for reference. It is based on an earlier
version of Hive, so the line numbers might be slightly off.


> SparkClientFactory is locked too aggressively
> ---------------------------------------------
>
>                 Key: HIVE-16854
>                 URL: https://issues.apache.org/jira/browse/HIVE-16854
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 1.1.0
>            Reporter: Xuefu Zhang
>            Assignee: Rui Li
>         Attachments: 15763.jstack
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)