[ https://issues.apache.org/jira/browse/HIVE-17837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sahil Takiar updated HIVE-17837:
--------------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0
           Status: Resolved  (was: Patch Available)

Merged to master. Thanks Rui for the review.

> Explicitly check if the HoS Remote Driver has been lost in the RemoteSparkJobMonitor
> -------------------------------------------------------------------------------------
>
>                 Key: HIVE-17837
>                 URL: https://issues.apache.org/jira/browse/HIVE-17837
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Hive
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>             Fix For: 3.0.0
>
>         Attachments: HIVE-17837.1.patch, HIVE-17837.2.patch
>
>
> Right now the {{RemoteSparkJobMonitor}} implicitly checks if the connection
> to the Spark remote driver is active. It does this every time it triggers an
> invocation of the {{Rpc#call}} method (so any call to {{SparkClient#run}}).
> There are scenarios where we have seen the {{RemoteSparkJobMonitor}} hang when
> the connection to the driver dies, because the implicit call fails to be
> invoked (see HIVE-15860).
> It would be ideal if we made this check explicit, so we fail as soon as we
> know that the connection to the driver has died.
> The fix has the added benefit that it allows us to fail faster in the case
> where the {{RemoteSparkJobMonitor}} is in the QUEUED / SENT state. If it's
> stuck in that state, it won't fail until it hits the monitor timeout (by
> default 1 minute), even though we already know the connection has died. The
> error message that is thrown is also a little imprecise: it says there could
> be queue contention, even though we know the real reason is that the
> connection was lost.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
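
For readers following the issue, the idea of an explicit liveness check can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the attached patches: the class DriverLivenessSketch, the JobState and Outcome enums, and the monitor method with its parameters are all hypothetical names, and the driver connection is modeled as a plain BooleanSupplier rather than any real Hive or Spark API.

```java
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names come from the Hive code base.
final class DriverLivenessSketch {

    // Hypothetical job states, loosely mirroring the QUEUED / SENT states in the issue.
    enum JobState { QUEUED, SENT, STARTED, SUCCEEDED, FAILED }

    // Possible results of one monitoring run.
    enum Outcome { SUCCEEDED, FAILED, DRIVER_LOST, TIMED_OUT }

    /**
     * Polls the job state, but checks the driver connection explicitly at the
     * top of every iteration instead of relying on a later RPC to notice it.
     */
    static Outcome monitor(BooleanSupplier driverAlive,
                           Supplier<JobState> jobState,
                           long timeoutMillis,
                           long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            // Explicit check: fail right away with a precise reason.
            if (!driverAlive.getAsBoolean()) {
                return Outcome.DRIVER_LOST;
            }
            JobState state = jobState.get();
            if (state == JobState.SUCCEEDED) {
                return Outcome.SUCCEEDED;
            }
            if (state == JobState.FAILED) {
                return Outcome.FAILED;
            }
            // The monitor timeout only applies while the job has not started.
            boolean waiting = state == JobState.QUEUED || state == JobState.SENT;
            if (waiting && System.nanoTime() - deadline >= 0) {
                return Outcome.TIMED_OUT;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("monitor interrupted", e);
            }
        }
    }

    public static void main(String[] args) {
        // A dead connection is reported immediately, even with a 60 s timeout.
        System.out.println(monitor(() -> false, () -> JobState.QUEUED, 60_000L, 10L));
        // prints DRIVER_LOST
    }
}
```

In this sketch a job stuck in QUEUED with a dead connection returns DRIVER_LOST on the first iteration rather than waiting out the timeout, which is the "fail faster" behavior the description asks for.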