[ https://issues.apache.org/jira/browse/HIVE-15473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15842353#comment-15842353 ]
anishek edited comment on HIVE-15473 at 1/27/17 7:16 AM:
---------------------------------------------------------

There are a few observations / limitations that [~thejas] cited while reviewing this. I am writing down the reasoning here, along with the steps for how we can move forward.

Given that we use SynchronizedHandler for the client on the beeline side, only one operation / API call at a time can be in execution from a single beeline session to hiveserver2. The current flow for updating the progress bar on the client side is:

Thread 1 -- does statement execution: this is achieved by calling GetOperationStatus for the operation from beeline until the execution of the operation is complete. The server-side implementation of GetOperationStatus uses a timeout mechanism (which waits for the query execution to finish) before it sends the status to the client. The timeout value is decided by a step function; for long-running queries this can lead to a wait of approximately 5 seconds per call to GetOperationStatus.
Thread 2 -- prints query logs and progress logs.

*Problem Space:*
# Since the client synchronizes the various API calls, only one call from either Thread 1 or Thread 2 is effectively executed at a time. The attempt to project a concurrent execution capability in the beeline code therefore seems misleading, and with the current patch the progress bar / query log updates can be delayed by 5 or more seconds ( _I don't think we can avoid this anyway, as I will discuss later_ ).
# Additionally, since no *order* is maintained among threads requesting synchronization on an object, there is a possibility that Thread 1 acquires the lock again without Thread 2 getting a chance to obtain it, leading to long delays in updating the query log or progress log ( _I am not sure how this would happen for long-running queries, since while Thread 1 is executing, Thread 2 would already be blocked waiting to synchronize on the object.
Once Thread 1 completes, and before it comes around the while loop in_ {code} HiveStatement.waitForOperationToComplete() {code} _Thread 2 should start executing. It seems highly improbable that Thread 1 completes, executes additional statements and gets the lock again before Thread 2 gets a chance to acquire it_ ).

So in summary:
* Avoid multi-threaded code in beeline for interactions with hiveserver2, as no concurrency is supported by the Thrift protocol, unless we move to ThriftHttpCliService using an HTTP-based connection, or use a NonBlockingThrift server for the binary protocol on the server side.
* Address the issue of responsiveness if we can.

*Solution Space:*
Since concurrent execution is not supported, programming anything to that effect should be avoided in the beeline client. Hence, we should strive to remove the multi-threaded code from the beeline side, in effect merging the query log and progress bar log into the GetOperationStatus API. This would still not address the responsiveness issue indicated in 1. above, as GetOperationStatus will still apply its wait time before responding to calls from the beeline side, unless we decide to remove the wait or reduce it to a default value of, say, 500 milliseconds. I am not sure why the step function is used -- _to prevent the server from wasting CPU resources on non-critical operations?_ This will address 2. above, though, since we are going to get all the information in a single call.

*Implementation Considerations:*
# Merge the QueryLog and ProgressBarLog request / response into GetOperationStatus.
# To get this working, we have to extend HiveStatement with a few non-JDBC-compliant setters (one interface for displaying the progress bar, another for displaying query logs) -- the default implementations of these interfaces will be _do nothing_ implementations.
# Have setters on HiveStatement for both interfaces, used by beeline to provide the required implementations.
# As part of the HiveStatement execute(*) call, we create the appropriate request if custom implementations of the interfaces above are provided.
# We might need to create an additional function signature for GetOperationStatus for backward compatibility.
# _Not related to above_ : make sure we pass the vertex progress as a string (for progress bar display) and the query progress as a custom enum for decision making (with implementations on the server side mapping from the execution-engine-specific state to our generic enum state).

If we are too worried about the responsiveness of the progress bar, or about *2. in Problem Space* being a major impediment to Hive usage, we should go with the new implementation proposal; otherwise, just additionally implement *5. in Implementation Considerations*.

was (Author: anishek):
There are a few observations / limitations that [~thejas] cited while reviewing this. I am writing down the reasoning here, along with the steps for how we can move forward.

Given that we use SynchronizedHandler for the client on the beeline side, only one operation / API call at a time can be in execution from a single beeline session to hiveserver2. The current flow for updating the progress bar on the client side is:

Thread 1 -- does statement execution: this is achieved by calling GetOperationStatus for the operation from beeline until the execution of the operation is complete. The server-side implementation of GetOperationStatus uses a timeout mechanism (which waits for the query execution to finish) before it sends the status to the client. The timeout value is decided by a step function; for long-running queries this can lead to a wait of approximately 5 seconds per call to GetOperationStatus.
Thread 2 -- prints query logs and progress logs.
*Problem Space:*
# Since the client synchronizes the various API calls, only one call from either Thread 1 or Thread 2 is effectively executed at a time. The attempt to project a concurrent execution capability in the beeline code therefore seems misleading, and with the current patch the progress bar / query log updates can be delayed by 5 or more seconds ( _I don't think we can avoid this anyway, as I will discuss later_ ).
# Additionally, since no *order* is maintained among threads requesting synchronization on an object, there is a possibility that Thread 1 acquires the lock again without Thread 2 getting a chance to obtain it, leading to long delays in updating the query log or progress log ( _I am not sure how this would happen for long-running queries, since while Thread 1 is executing, Thread 2 would already be blocked waiting to synchronize on the object. Once Thread 1 completes, and before it comes around the while loop in_ {code} HiveStatement.waitForOperationToComplete() {code} _Thread 2 should start executing. It seems highly improbable that Thread 1 completes, executes additional statements and gets the lock again before Thread 2 gets a chance to acquire it_ ).

So in summary:
* Avoid multi-threaded code in beeline for interactions with hiveserver2, as no concurrency is supported by the Thrift protocol, unless we move to ThriftHttpCliService using an HTTP-based connection, or use a NonBlockingThrift server for the binary protocol on the server side.
* Address the issue of responsiveness if we can.

*Solution Space:*
Since concurrent execution is not supported, programming anything to that effect should be avoided in the beeline client. Hence, we should strive to remove the multi-threaded code from the beeline side, in effect merging the query log and progress bar log into the GetOperationStatus API. This would still not address the responsiveness issue indicated in 1. above, as GetOperationStatus will still apply its wait time before responding to calls from the beeline side, unless we decide to remove the wait or reduce it to a default value of, say, 500 milliseconds. I am not sure why the step function is used -- _to prevent the server from wasting CPU resources on non-critical operations?_ This will address 2. above, though, since we are going to get all the information in a single call.

*Implementation Considerations:*
# Merge the QueryLog and ProgressBarLog request / response into GetOperationStatus.
# To get this working, we have to extend HiveStatement with a few non-JDBC-compliant setters (one interface for displaying the progress bar, another for displaying query logs) -- the default implementations of these interfaces will be _do nothing_ implementations.
# Have setters on HiveStatement for both interfaces, used by beeline to provide the required implementations.
# As part of the HiveStatement execute(*) call, we create the appropriate request if custom implementations of the interfaces above are provided.
# _Not related to above_ : make sure we pass the vertex progress as a string (for progress bar display) and the query progress as a custom enum for decision making (with implementations on the server side mapping from the execution-engine-specific state to our generic enum state).

If we are too worried about the responsiveness of the progress bar, or about *2. in Problem Space* being a major impediment to Hive usage, we should go with the new implementation proposal; otherwise, just additionally implement *5.
in Implementation Considerations*.

> Progress Bar on Beeline client
> ------------------------------
>
> Key: HIVE-15473
> URL: https://issues.apache.org/jira/browse/HIVE-15473
> Project: Hive
> Issue Type: Improvement
> Components: Beeline, HiveServer2
> Affects Versions: 2.1.1
> Reporter: anishek
> Assignee: anishek
> Priority: Minor
> Attachments: HIVE-15473.2.patch, HIVE-15473.3.patch,
> HIVE-15473.4.patch, HIVE-15473.5.patch, screen_shot_beeline.jpg
>
>
> Hive CLI can show a progress bar for the Tez execution engine, as shown in
> https://issues.apache.org/jira/secure/attachment/12678767/ux-demo.gif
> It would be great to have a similar progress bar displayed when the user is
> connecting via the beeline command line client as well.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)