Hi Benyi,

The quote from the HiveServer2 proposal reads in full:

"In fact, it's impossible for HiveServer to support concurrent connections
using the current Thrift API, *a result of the fact that Thrift doesn't
provide server-side access to connection handles*"

The point I'm trying to make with this statement is that HiveServer
maintains session state using thread-local variables and implicitly relies
on Thrift consistently mapping the same connection to the same Thrift
worker thread, but this isn't a valid assumption to make. For example, if a
client executes "set mapred.reduce.tasks=1" followed by "select .....", you
can't assume that both of these statements will be executed by the same
worker thread. Furthermore, the Thrift API doesn't provide any mechanism
for detecting client disconnects (see THRIFT-1195), which results in
incorrect behavior like this:

% hive -h localhost -p 10000
[localhost:10000] hive> set x=1;
set x=1;
[localhost:10000] hive> set x;
set x;
x=1
[localhost:10000] hive> quit;
quit;
% hive -h localhost -p 10000
[localhost:10000] hive> set x;
set x;
x=1
[localhost:10000] hive> quit;
quit;

In this example I opened a connection to HiveServer and modified my
session state on the server by setting x=1. I then killed the connection,
reconnected, and printed the value of x again. Since I'm creating a new
connection/session I expect x to be undefined, but I actually see the
value of x that I set in the previous connection. This happens because
Thrift assigns the same worker thread to service the second connection, and
since there's no way of detecting client disconnects, HiveServer was unable
to clear the thread-local session state associated with that worker thread
before Thrift reassigned it to the second connection.
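
To make the failure mode concrete, here is a minimal Python sketch of the
same pattern. It is not HiveServer or Thrift code: the names
handle_statement and run_connection are hypothetical, and a single-worker
thread pool stands in for Thrift handing the same worker thread to two
successive connections.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Thread-local storage standing in for per-thread session state (assumption:
# this only mimics the pattern described above, not HiveServer's actual code).
_state = threading.local()


def session_conf():
    """Return the settings dict bound to the current worker thread."""
    if not hasattr(_state, "conf"):
        _state.conf = {}
    return _state.conf


def handle_statement(stmt):
    """Handle 'set k=v;' (assign) or 'set k;' (print) on the current thread."""
    body = stmt.strip().rstrip(";")[len("set "):]
    conf = session_conf()
    if "=" in body:
        key, value = body.split("=", 1)
        conf[key] = value
        return ""
    return f"{body}={conf[body]}" if body in conf else f"{body} is undefined"


def run_connection(pool, statements):
    """Simulate one client connection: each statement goes to a pool worker."""
    return [pool.submit(handle_statement, s).result() for s in statements]


# One worker guarantees that the second connection reuses the first one's thread.
pool = ThreadPoolExecutor(max_workers=1)
first = run_connection(pool, ["set x=1;", "set x;"])
# The first client disconnects here, but nothing resets the thread-local state.
second = run_connection(pool, ["set x;"])
pool.shutdown()

print(first)
print(second)
```

The second "connection" sees x=1 even though it never set it. With more
than one worker the outcome depends on which thread picks up each
statement, which is the other half of the problem described above.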

While it's tempting to try to solve these problems by modifying Thrift to
provide direct access to the connection handle (which would allow us to map
connections to session state on the server side), this approach makes it
very hard to support HA, since it depends on the physical connection
lasting as long as the user session. That isn't a fair assumption to make
in the context of queries that can take many hours to complete.

Instead, the approach we're taking with HiveServer2 is to provide explicit
support for sessions in the client API, e.g. every RPC call references a
session ID, which the server then maps to persistent session state. This
makes it possible for any worker thread to service any request from any
client connection.
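
As a rough illustration of that idea (a sketch only, not the HiveServer2
API: SessionManager, open_session, execute, and close_session are
hypothetical names), session state can be kept in a server-side map keyed
by session ID, so it no longer matters which worker thread handles a call:

```python
import threading
import uuid
from concurrent.futures import ThreadPoolExecutor


class SessionManager:
    """Hypothetical server-side registry mapping session IDs to state."""

    def __init__(self):
        self._lock = threading.Lock()
        self._sessions = {}

    def open_session(self):
        """Create empty session state and return its ID to the client."""
        session_id = uuid.uuid4().hex
        with self._lock:
            self._sessions[session_id] = {}
        return session_id

    def close_session(self, session_id):
        """Explicitly discard the state; no disconnect detection is needed."""
        with self._lock:
            self._sessions.pop(session_id, None)

    def execute(self, session_id, stmt):
        """Handle 'set k=v;' or 'set k;' against the named session's state."""
        body = stmt.strip().rstrip(";")[len("set "):]
        with self._lock:
            conf = self._sessions[session_id]
            if "=" in body:
                key, value = body.split("=", 1)
                conf[key] = value
                return ""
            return f"{body}={conf[body]}" if body in conf else f"{body} is undefined"


manager = SessionManager()
pool = ThreadPoolExecutor(max_workers=4)


def call(session_id, stmt):
    """Simulate an RPC: any free worker thread may service the request."""
    return pool.submit(manager.execute, session_id, stmt).result()


s1 = manager.open_session()
call(s1, "set x=1;")
seen_in_s1 = call(s1, "set x;")
manager.close_session(s1)

s2 = manager.open_session()
seen_in_s2 = call(s2, "set x;")
manager.close_session(s2)
pool.shutdown()

print(seen_in_s1)
print(seen_in_s2)
```

Here the second session reports x as undefined regardless of thread
assignment, because the state lives with the session ID rather than with
the thread or the physical connection.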

I hope this clarifies the limitations of the current HiveServer
implementation as well as the motivations for implementing HiveServer2.
Please let me know if you have any more questions.

Thanks.

Carl

On Thu, Apr 26, 2012 at 11:55 AM, Benyi Wang <bewang.t...@gmail.com> wrote:

> I'm a little confused with "In fact, it's impossible for HiveServer to
> support concurrent connections using the current Thrift API" in hive wiki
> page
> https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Thrift+API.
>
> I started a hive server on hostA using cdh3u3
>
> hadoop-hive.noarch                  0.7.1+42.36-2                  installed
>
> Then I logged on two nodes: hostB, and hostC, then start hive client
>
> $ hive -h hostA -p 10000
>
> It seems that both of two hive clients work normally.
>
> Am I wrong? or the issue in the wiki page has been resolved?
>
