> 1) What exactly is the relationship between the thrift server and Hive?
> I'm guessing Spark is just making use of the Hive metastore to access table
> definitions, and maybe some other things, is that the case?
Underneath the covers, the Spark SQL thrift server executes queries using a HiveContext. In this mode, nearly all computation is done with Spark SQL, but we try to maintain compatibility with Hive wherever possible. This means that you can write your queries in HiveQL, read tables from the Hive metastore, and use Hive UDFs, UDAFs, UDTFs, etc.

The one exception is Hive DDL operations (CREATE TABLE, etc.). These are passed directly to Hive code and executed there. The Spark SQL DDL is sufficiently different that we always try to parse it first, and fall back to Hive when it does not parse.

One possibly confusing point here is that you can persist Spark SQL tables into the Hive metastore, but this is not the same as a Hive table. We only use the metastore as a repository for metadata; we do not use Hive's format for the data in this case (since we support data sources that Hive does not understand, along with features like schema auto-discovery).

HiveQL DDL, run by Hive, but readable by Spark SQL:

  CREATE TABLE t (x INT) STORED AS PARQUET

Spark SQL DDL, run by Spark SQL and stored in the metastore, but not readable by Hive:

  CREATE TABLE t USING parquet (path '/path/to/data')

> 2) Am I therefore right in thinking that SQL queries sent to the thrift
> server are still executed on the Spark cluster, using Spark SQL, and Hive
> plays no active part in computation of results?

Correct.

> 3) What SQL flavour is actually supported by the Thrift Server? Is it Spark
> SQL, Hive, or both? I'm confused, because I've seen it accepting Hive
> CREATE TABLE syntax, but Spark SQL seems to work too?

HiveQL++ (with Spark SQL DDL). You can make it use our simple SQL parser with `SET spark.sql.dialect=sql`, but honestly you probably don't want to do this. The included SQL parser is mostly there for people who have dependency conflicts with Hive.

> 4) When I run SQL queries using the Scala or Python shells, Spark seems to
> figure out the schema by itself from my Parquet files very well, if I use
> registerTempTable on the DataFrame. It seems when running the thrift server,
> I need to create a Hive table definition first? Is that the case, or did I
> miss something? If it is, is there some sensible way to automate this?

Temporary tables are only visible to the SQLContext that creates them. If you want a table to be visible to the server, you need to either start the thrift server with the same context your program is using (see HiveThriftServer2.startWithContext, sketched below) or create a metastore table. The latter can be done using Spark SQL DDL:

  CREATE TABLE t USING parquet (path '/path/to/data')

Michael
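A minimal sketch of the shared-context approach described under 4), assuming Spark 1.x with the spark-hive-thriftserver module on the classpath and a hypothetical Parquet directory at /path/to/data:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext
  import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

  object SharedContextServer {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("shared-thrift-server"))
      val hiveContext = new HiveContext(sc)

      // Schema is discovered automatically from the Parquet files.
      val df = hiveContext.read.parquet("/path/to/data")

      // A temporary table is only visible to this HiveContext...
      df.registerTempTable("t")

      // ...so start the thrift server with that same context, making "t"
      // queryable over JDBC/ODBC without writing any DDL.
      HiveThriftServer2.startWithContext(hiveContext)
    }
  }

Once the server is up, a JDBC client such as beeline (connecting to jdbc:hive2://localhost:10000, the default port) should be able to run queries like SELECT * FROM t against the registered table.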