> 1) What exactly is the relationship between the thrift server and Hive?
> I'm guessing Spark is just making use of the Hive metastore to access table
> definitions, and maybe some other things, is that the case?
Underneath the covers, the Spark SQL thrift server executes queries using a HiveContext. In this mode, nearly all computation is done with Spark SQL, but we try to maintain compatibility with Hive wherever possible. This means that you can write your queries in HiveQL, read tables from the Hive metastore, and use Hive UDFs, UDAFs, UDTFs, etc.

The one exception is Hive DDL operations (CREATE TABLE, etc.). These are passed directly to Hive code and executed there. The Spark SQL DDL is sufficiently different that we always try to parse it first, and fall back to Hive when it does not parse.

One possibly confusing point here is that you can persist Spark SQL tables into the Hive metastore, but this is not the same as a Hive table. We only use the metastore as a repository for metadata; we do not use Hive's format for the data in this case (since we support data sources that Hive does not understand, along with features like schema auto-discovery).

HiveQL DDL, run by Hive, but readable by Spark SQL:

  CREATE TABLE t (x INT) STORED AS PARQUET

Spark SQL DDL, run by Spark SQL and stored in the metastore, but not readable by Hive:

  CREATE TABLE t USING parquet (path '/path/to/data')

> 2) Am I therefore right in thinking that SQL queries sent to the thrift
> server are still executed on the Spark cluster, using Spark SQL, and Hive
> plays no active part in computation of results?

Correct.

> 3) What SQL flavour is actually supported by the Thrift Server? Is it Spark
> SQL, Hive, or both? I'm confused, because I've seen it accepting Hive
> CREATE TABLE syntax, but Spark SQL seems to work too?

HiveQL++ (with Spark SQL DDL). You can make it use our simple SQL parser with `SET spark.sql.dialect=sql`, but honestly you probably don't want to do this. The included SQL parser is mostly there for people who have dependency conflicts with Hive.

> 4) When I run SQL queries using the Scala or Python shells, Spark seems to
> figure out the schema by itself from my Parquet files very well, if I use
> registerTempTable on the DataFrame. It seems when running the thrift server,
> I need to create a Hive table definition first? Is that the case, or did I
> miss something? If it is, is there some sensible way to automate this?

Temporary tables are only visible to the SQLContext that creates them. If you want a table to be visible to the server, you need to either start the thrift server with the same context your program is using (see HiveThriftServer2.startWithContext, sketched below) or create a metastore table. The latter can be done using Spark SQL DDL:

  CREATE TABLE t USING parquet (path '/path/to/data')

Michael
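A minimal sketch of the shared-context approach described under 4), assuming Spark 1.x with the spark-hive-thriftserver module on the classpath and a hypothetical Parquet directory at /path/to/data:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext
  import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

  object SharedContextServer {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("shared-thrift-server"))
      val hiveContext = new HiveContext(sc)

      // Schema is discovered automatically from the Parquet files.
      val df = hiveContext.read.parquet("/path/to/data")

      // A temporary table is only visible to this HiveContext...
      df.registerTempTable("t")

      // ...so start the thrift server with that same context, making "t"
      // queryable over JDBC/ODBC without writing any DDL.
      HiveThriftServer2.startWithContext(hiveContext)
    }
  }

Once the server is up, a JDBC client such as beeline (connecting to jdbc:hive2://localhost:10000, the default port) should be able to run queries like SELECT * FROM t against the registered table.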