Excellent, thanks for your help, I appreciate your advice!

On 7 Apr 2015 20:43, "Michael Armbrust" <mich...@databricks.com> wrote:
> That should totally work. The other option would be to run a persistent
> metastore that multiple contexts can talk to and periodically run a job
> that creates missing tables. The trade-off here would be more complexity,
> but less downtime due to the server restarting.
>
> On Tue, Apr 7, 2015 at 12:34 PM, James Aley <james.a...@swiftkey.com> wrote:
>
>> Hi Michael,
>>
>> Thanks so much for the reply - that really cleared a lot of things up
>> for me!
>>
>> Let me just check that I've interpreted one of your suggestions for (4)
>> correctly... Would it make sense for me to write a small wrapper app that
>> pulls in hive-thriftserver as a dependency, iterates my Parquet directory
>> structure to discover "tables", and registers each as a temp table in
>> some context, before calling HiveThriftServer2.startWithContext as you
>> suggest?
>>
>> This would mean that to add new content, all I need to do is restart
>> that app, which presumably could also be avoided fairly trivially by
>> periodically restarting the server with a new context internally. That
>> certainly beats manual curation of Hive table definitions, if it will
>> work?
>>
>> Thanks again,
>>
>> James.
>>
>> On 7 April 2015 at 19:30, Michael Armbrust <mich...@databricks.com> wrote:
>>
>>>> 1) What exactly is the relationship between the thrift server and
>>>> Hive? I'm guessing Spark is just making use of the Hive metastore to
>>>> access table definitions, and maybe some other things, is that the
>>>> case?
>>>
>>> Underneath the covers, the Spark SQL thrift server is executing queries
>>> using a HiveContext. In this mode, nearly all computation is done with
>>> Spark SQL, but we try to maintain compatibility with Hive wherever
>>> possible. This means that you can write your queries in HiveQL, read
>>> tables from the Hive metastore, and use Hive UDFs, UDTFs, UDAFs, etc.
>>>
>>> The one exception here is Hive DDL operations (CREATE TABLE, etc.).
>>> These are passed directly to Hive code and executed there. The Spark
>>> SQL DDL is sufficiently different that we always try to parse that
>>> first, and fall back to Hive when it does not parse.
>>>
>>> One possibly confusing point here is that you can persist Spark SQL
>>> tables into the Hive metastore, but this is not the same as a Hive
>>> table. We only use the metastore as a repository for metadata, and do
>>> not use Hive's format for the information in this case (as we have data
>>> sources that Hive does not understand, including things like schema
>>> auto-discovery).
>>>
>>> HiveQL DDL, run by Hive but readable by Spark SQL:
>>>   CREATE TABLE t (x INT) STORED AS PARQUET
>>> Spark SQL DDL, run by Spark SQL and stored in the metastore, but not
>>> readable by Hive:
>>>   CREATE TABLE t USING parquet (path '/path/to/data')
>>>
>>>> 2) Am I therefore right in thinking that SQL queries sent to the
>>>> thrift server are still executed on the Spark cluster, using Spark
>>>> SQL, and Hive plays no active part in computation of results?
>>>
>>> Correct.
>>>
>>>> 3) What SQL flavour is actually supported by the Thrift Server? Is it
>>>> Spark SQL, Hive, or both? I'm confused, because I've seen it accepting
>>>> Hive CREATE TABLE syntax, but Spark SQL seems to work too?
>>>
>>> HiveQL++ (with Spark SQL DDL). You can make it use our simple SQL
>>> parser with `SET spark.sql.dialect=sql`, but honestly you probably
>>> don't want to do this. The included SQL parser is mostly there for
>>> people who have dependency conflicts with Hive.
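As a rough illustration of the wrapper app James describes in his message above, here is a minimal sketch against the Spark 1.3-era API. The warehouse root, the one-Parquet-directory-per-table layout, and all object and table names are assumptions made for the example, not details from the thread:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object ParquetThriftServer {
  def main(args: Array[String]): Unit = {
    // Hypothetical root directory containing one Parquet "table" per sub-directory.
    val warehouseRoot = "/data/parquet"

    val sc = new SparkContext(new SparkConf().setAppName("parquet-thrift-server"))
    val hiveContext = new HiveContext(sc)

    // Discover each sub-directory and register it as a temp table named after
    // the directory; the schema is inferred from the Parquet files themselves.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    fs.listStatus(new Path(warehouseRoot)).filter(_.isDirectory).foreach { dir =>
      val tableName = dir.getPath.getName
      hiveContext.parquetFile(dir.getPath.toString).registerTempTable(tableName)
    }

    // Start the JDBC/ODBC server against the same context, so that the temp
    // tables registered above are visible to its clients.
    HiveThriftServer2.startWithContext(hiveContext)
  }
}

Built against the spark-hive and spark-hive-thriftserver artifacts and launched with spark-submit, this would expose each discovered directory as a temp table over JDBC/ODBC; picking up newly added directories still means re-running (or periodically restarting) the app, as discussed above.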
>>>> 4) When I run SQL queries using the Scala or Python shells, Spark
>>>> seems to figure out the schema by itself from my Parquet files very
>>>> well, if I use registerTempTable on the DataFrame. It seems when
>>>> running the thrift server, I need to create a Hive table definition
>>>> first? Is that the case, or did I miss something? If it is, is there
>>>> some sensible way to automate this?
>>>
>>> Temporary tables are only visible to the SQLContext that creates them.
>>> If you want a table to be visible to the server, you need to either
>>> start the thrift server with the same context your program is using
>>> (see HiveThriftServer2.startWithContext) or make a metastore table.
>>> This can be done using Spark SQL DDL:
>>>
>>>   CREATE TABLE t USING parquet (path '/path/to/data')
>>>
>>> Michael
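For the metastore-table route, here is a similar sketch of the periodic "create missing tables" job suggested at the top of the thread, under the same assumed directory layout and hypothetical names as the wrapper sketch above. Because these are persistent data-source tables rather than temp tables, a separately running thrift server that shares the same metastore can see them without a restart:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object RegisterMissingTables {
  def main(args: Array[String]): Unit = {
    // Hypothetical root directory containing one Parquet "table" per sub-directory.
    val warehouseRoot = "/data/parquet"

    val sc = new SparkContext(new SparkConf().setAppName("register-missing-tables"))
    val hiveContext = new HiveContext(sc)

    // Tables already visible in this context (including metastore tables in
    // the current database).
    val existing = hiveContext.tableNames().map(_.toLowerCase).toSet

    val fs = FileSystem.get(sc.hadoopConfiguration)
    fs.listStatus(new Path(warehouseRoot))
      .filter(_.isDirectory)
      .map(_.getPath)
      .filterNot(path => existing.contains(path.getName.toLowerCase))
      .foreach { path =>
        // Spark SQL DDL: only the metadata goes into the Hive metastore; the
        // schema is still auto-discovered from the Parquet files at query time.
        hiveContext.sql(
          s"CREATE TABLE ${path.getName} USING parquet OPTIONS (path '$path')")
      }
  }
}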