Excellent, thanks for your help, I appreciate your advice!

On 7 Apr 2015 20:43, "Michael Armbrust" <mich...@databricks.com> wrote:
> That should totally work. The other option would be to run a persistent
> metastore that multiple contexts can talk to and periodically run a job
> that creates missing tables. The trade-off here would be more complexity,
> but less downtime due to the server restarting.
>
> On Tue, Apr 7, 2015 at 12:34 PM, James Aley <james.a...@swiftkey.com> wrote:
>
>> Hi Michael,
>>
>> Thanks so much for the reply - that really cleared a lot of things up
>> for me!
>>
>> Let me just check that I've interpreted one of your suggestions for (4)
>> correctly... Would it make sense for me to write a small wrapper app that
>> pulls in hive-thriftserver as a dependency, iterates my Parquet directory
>> structure to discover "tables", and registers each as a temp table in
>> some context, before calling HiveThriftServer2.startWithContext as you
>> suggest?
>>
>> This would mean that to add new content, all I need to do is restart
>> that app, which presumably could also be avoided fairly trivially by
>> periodically restarting the server with a new context internally. That
>> certainly beats manual curation of Hive table definitions, if it will
>> work?
>>
>> Thanks again,
>>
>> James.
>>
>> On 7 April 2015 at 19:30, Michael Armbrust <mich...@databricks.com> wrote:
>>
>>>> 1) What exactly is the relationship between the thrift server and
>>>> Hive? I'm guessing Spark is just making use of the Hive metastore to
>>>> access table definitions, and maybe some other things, is that the
>>>> case?
>>>
>>> Underneath the covers, the Spark SQL thrift server is executing queries
>>> using a HiveContext. In this mode, nearly all computation is done with
>>> Spark SQL, but we try to maintain compatibility with Hive wherever
>>> possible. This means that you can write your queries in HiveQL, read
>>> tables from the Hive metastore, and use Hive UDFs, UDTFs, UDAFs, etc.
>>>
>>> The one exception here is Hive DDL operations (CREATE TABLE, etc.).
>>> These are passed directly to Hive code and executed there. The Spark
>>> SQL DDL is sufficiently different that we always try to parse that
>>> first, and fall back to Hive when it does not parse.
>>>
>>> One possibly confusing point here is that you can persist Spark SQL
>>> tables into the Hive metastore, but this is not the same as a Hive
>>> table. We only use the metastore as a repository for metadata, and do
>>> not use Hive's format for the information in this case (as we have data
>>> sources that Hive does not understand, including things like schema
>>> auto-discovery).
>>>
>>> HiveQL DDL, run by Hive but readable by Spark SQL:
>>>   CREATE TABLE t (x INT) STORED AS PARQUET
>>> Spark SQL DDL, run by Spark SQL and stored in the metastore, but not
>>> readable by Hive:
>>>   CREATE TABLE t USING parquet (path '/path/to/data')
>>>
>>>> 2) Am I therefore right in thinking that SQL queries sent to the
>>>> thrift server are still executed on the Spark cluster, using Spark
>>>> SQL, and Hive plays no active part in computation of results?
>>>
>>> Correct.
>>>
>>>> 3) What SQL flavour is actually supported by the Thrift Server? Is it
>>>> Spark SQL, Hive, or both? I'm confused, because I've seen it accepting
>>>> Hive CREATE TABLE syntax, but Spark SQL seems to work too?
>>>
>>> HiveQL++ (with Spark SQL DDL). You can make it use our simple SQL
>>> parser with `SET spark.sql.dialect=sql`, but honestly you probably
>>> don't want to do this. The included SQL parser is mostly there for
>>> people who have dependency conflicts with Hive.
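As a rough illustration of the wrapper app James describes in his message above, here is a minimal sketch against the Spark 1.3-era API. The warehouse root, the one-Parquet-directory-per-table layout, and all object and table names are assumptions made for the example, not details from the thread:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object ParquetThriftServer {
  def main(args: Array[String]): Unit = {
    // Hypothetical root directory containing one Parquet "table" per sub-directory.
    val warehouseRoot = "/data/parquet"

    val sc = new SparkContext(new SparkConf().setAppName("parquet-thrift-server"))
    val hiveContext = new HiveContext(sc)

    // Discover each sub-directory and register it as a temp table named after
    // the directory; the schema is inferred from the Parquet files themselves.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    fs.listStatus(new Path(warehouseRoot)).filter(_.isDirectory).foreach { dir =>
      val tableName = dir.getPath.getName
      hiveContext.parquetFile(dir.getPath.toString).registerTempTable(tableName)
    }

    // Start the JDBC/ODBC server against the same context, so that the temp
    // tables registered above are visible to its clients.
    HiveThriftServer2.startWithContext(hiveContext)
  }
}

Built against the spark-hive and spark-hive-thriftserver artifacts and launched with spark-submit, this would expose each discovered directory as a temp table over JDBC/ODBC; picking up newly added directories still means re-running (or periodically restarting) the app, as discussed above.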
>>>> 4) When I run SQL queries using the Scala or Python shells, Spark
>>>> seems to figure out the schema by itself from my Parquet files very
>>>> well, if I use registerTempTable on the DataFrame. It seems when
>>>> running the thrift server, I need to create a Hive table definition
>>>> first? Is that the case, or did I miss something? If it is, is there
>>>> some sensible way to automate this?
>>>
>>> Temporary tables are only visible to the SQLContext that creates them.
>>> If you want a table to be visible to the server, you need to either
>>> start the thrift server with the same context your program is using
>>> (see HiveThriftServer2.startWithContext) or make a metastore table.
>>> This can be done using Spark SQL DDL:
>>>
>>>   CREATE TABLE t USING parquet (path '/path/to/data')
>>>
>>> Michael
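For the metastore-table route, here is a similar sketch of the periodic "create missing tables" job suggested at the top of the thread, under the same assumed directory layout and hypothetical names as the wrapper sketch above. Because these are persistent data-source tables rather than temp tables, a separately running thrift server that shares the same metastore can see them without a restart:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object RegisterMissingTables {
  def main(args: Array[String]): Unit = {
    // Hypothetical root directory containing one Parquet "table" per sub-directory.
    val warehouseRoot = "/data/parquet"

    val sc = new SparkContext(new SparkConf().setAppName("register-missing-tables"))
    val hiveContext = new HiveContext(sc)

    // Tables already visible in this context (including metastore tables in
    // the current database).
    val existing = hiveContext.tableNames().map(_.toLowerCase).toSet

    val fs = FileSystem.get(sc.hadoopConfiguration)
    fs.listStatus(new Path(warehouseRoot))
      .filter(_.isDirectory)
      .map(_.getPath)
      .filterNot(path => existing.contains(path.getName.toLowerCase))
      .foreach { path =>
        // Spark SQL DDL: only the metadata goes into the Hive metastore; the
        // schema is still auto-discovered from the Parquet files at query time.
        hiveContext.sql(
          s"CREATE TABLE ${path.getName} USING parquet OPTIONS (path '$path')")
      }
  }
}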