[ https://issues.apache.org/jira/browse/HIVE-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Dai updated HIVE-15947:
------------------------------
     Resolution: Fixed
   Hadoop Flags: Reviewed
  Fix Version/s: 2.2.0
         Status: Resolved  (was: Patch Available)

+1ed on RB. The precommit run failed to publish its result; there is one unrelated failure, org.apache.hive.service.server.TestHS2HttpServer.testContextRootUrlRewrite, and all other tests pass. Link: https://builds.apache.org/job/PreCommit-HIVE-Build/4180/

Patch pushed to master. Thanks Subramanyam, Kiran!

> Enhance Templeton service job operations reliability
> -----------------------------------------------------
>
>                 Key: HIVE-15947
>                 URL: https://issues.apache.org/jira/browse/HIVE-15947
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Subramanyam Pattipaka
>            Assignee: Subramanyam Pattipaka
>             Fix For: 2.2.0
>
>         Attachments: HIVE-15947.10.patch, HIVE-15947.2.patch, HIVE-15947.3.patch, HIVE-15947.4.patch, HIVE-15947.6.patch, HIVE-15947.7.patch, HIVE-15947.8.patch, HIVE-15947.9.patch, HIVE-15947.patch
>
>
> Currently the Templeton service does not restrict the number of job operation requests; it simply accepts and tries to run all of them. When many concurrent job submit requests arrive, the time to complete job operations can increase significantly. Templeton uses HDFS to store the staging files for a job, and if HDFS cannot keep up with the request volume and throttles, job submission can take a very long time, on the order of minutes.
> This behavior may not be suitable for all applications. Client applications may expect a predictable, low-latency response for successful requests, or a throttle response telling them to wait for some time before re-requesting the job operation.
> In this JIRA, I am trying to address the following job operations:
> 1) Submit new job
> 2) Get job status
> 3) List jobs
> These three operations have different complexity because they use cluster resources such as YARN and HDFS differently.
> The idea is to introduce a new config, templeton.job.submit.exec.max-procs, which controls the maximum number of concurrent active job submissions within Templeton, and to use this config to keep response times predictable. If a new job submission request sees that templeton.job.submit.exec.max-procs jobs are already being submitted concurrently, the request fails with HTTP error 503 and the reason
> "Too many concurrent job submission requests received. Please wait for some time before retrying."
> The client is expected to catch this response and retry after waiting for some time (a sketch of this throttling idea appears below). The default value of templeton.job.submit.exec.max-procs is "0", which means job submission requests are always accepted by default; the behavior has to be enabled explicitly based on requirements.
> We can have similar behavior for the status and list operations with the configs templeton.job.status.exec.max-procs and templeton.list.job.exec.max-procs respectively.
> Once a job operation has started, it can run for a long time, and the client that requested it may not be willing to wait indefinitely. This work introduces the configurations
> templeton.exec.job.submit.timeout
> templeton.exec.job.status.timeout
> templeton.exec.job.list.timeout
> to specify the maximum amount of time a job operation may execute. If a timeout happens, list and status requests return to the client with the message
> "List job request got timed out. Please retry the operation after waiting for some time."
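> A minimal sketch of the throttling idea, assuming a semaphore-backed permit pool sized from templeton.job.submit.exec.max-procs; the class and method names (SubmitThrottle, runGuarded) are illustrative only, not the actual Templeton implementation:
> {code:java}
> import java.util.concurrent.Callable;
> import java.util.concurrent.Semaphore;
>
> // Illustrative sketch, not Templeton code. A permit pool sized from
> // templeton.job.submit.exec.max-procs guards the operation; a value of 0
> // disables the check, i.e. requests are always accepted.
> public class SubmitThrottle {
>   private final Semaphore permits;   // null means "no limit" (max-procs = 0)
>
>   public SubmitThrottle(int maxProcs) {
>     this.permits = maxProcs > 0 ? new Semaphore(maxProcs) : null;
>   }
>
>   /**
>    * Runs the operation if a permit is available; otherwise returns the
>    * given "busy" result, which the caller maps to an HTTP 503 response
>    * carrying the "Too many concurrent job submission requests received"
>    * message.
>    */
>   public <T> T runGuarded(Callable<T> operation, T busyResult) throws Exception {
>     if (permits != null && !permits.tryAcquire()) {
>       return busyResult;
>     }
>     try {
>       return operation.call();
>     } finally {
>       if (permits != null) {
>         permits.release();
>       }
>     }
>   }
> }
> {code}
> The same kind of guard, driven by templeton.job.status.exec.max-procs and templeton.list.job.exec.max-procs, would apply to the status and list operations.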
> If a submit job request times out, then
> i) the job submit request thread that hits the timeout checks whether a valid job id has already been generated for the request;
> ii) if it has, a kill job request is issued on a cancel thread pool. The thread does not wait for the kill to complete and returns to the client with the timeout message (a sketch of this flow appears below).
> Side effects of enabling the timeout for submit operations:
> 1) A job can remain active for some time after the client has received the timeout response, so a list operation from the client could potentially show the newly created job before it gets killed.
> 2) Killing the job is best effort, with no guarantees, so there is a possibility of a duplicate job being created. One possible cause is that the job is created, the operation times out, but the kill request fails because the resource manager is unavailable; when the resource manager restarts, it restarts the job that was created.
> Fixing this scenario is not in the scope of this JIRA. The timeout functionality should be enabled only if the above side effects are acceptable.
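> A minimal sketch of the submit-timeout flow described above, assuming the submission runs on a worker pool, the timeout corresponds to templeton.exec.job.submit.timeout, and kills are handed to a separate cancel pool; launchJob, finishSubmission and killJob are hypothetical placeholders, not Templeton APIs:
> {code:java}
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.Future;
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.TimeoutException;
> import java.util.concurrent.atomic.AtomicReference;
>
> // Illustrative sketch, not Templeton code.
> public class TimedSubmit {
>   private final ExecutorService submitPool = Executors.newCachedThreadPool();
>   private final ExecutorService cancelPool = Executors.newFixedThreadPool(1);
>
>   public String submitWithTimeout(long timeoutMillis) throws Exception {
>     // Record the job id as soon as it is known, so the timeout path can
>     // decide whether there is anything to kill.
>     AtomicReference<String> jobId = new AtomicReference<>();
>     Future<String> result = submitPool.submit(() -> {
>       String id = launchJob();      // placeholder: create the job, get its id
>       jobId.set(id);
>       finishSubmission(id);         // placeholder: remaining (possibly slow) work
>       return id;
>     });
>     try {
>       return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
>     } catch (TimeoutException e) {
>       String id = jobId.get();
>       if (id != null) {
>         // Best-effort kill on the cancel pool; do not wait for it to finish.
>         cancelPool.submit(() -> killJob(id));
>       }
>       // Illustrative message text for the submit timeout.
>       throw new Exception("Job submit request got timed out. "
>           + "Please retry the operation after waiting for some time.");
>     }
>   }
>
>   private String launchJob() { return "job_0001"; }  // placeholder
>   private void finishSubmission(String id) { }       // placeholder
>   private void killJob(String id) { }                // placeholder
> }
> {code}
> Handing the kill to a separate pool is what lets the request thread return the timeout response immediately instead of blocking on cleanup; as noted above, the kill remains best effort.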