>> and then use Hive's dynamic partitioned insert syntax

What does this entail? The same SQL, but you need to do

    set hive.exec.dynamic.partition = true;

in the Hive/SQL context (along with several other related dynamic partition settings). Is there anything else/special required?
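For concreteness, here is roughly what I have in mind, as a sketch against a HiveContext (called sqlContext below). The events / events_staging table names and columns are made up for illustration, and I am not certain this is the complete set of settings:

    // enable dynamic partitioning for the Hive-style insert
    sqlContext.sql("set hive.exec.dynamic.partition = true")
    sqlContext.sql("set hive.exec.dynamic.partition.mode = nonstrict")

    // dynamic partitioned insert: the partition column comes last in the select
    sqlContext.sql("""
      insert overwrite table events partition (event_date)
      select id, payload, event_date from events_staging
    """)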
2015-11-22 17:32 GMT-08:00 Deenar Toraskar <deenar.toras...@gmail.com>:

> Thanks Michael
>
> Thanks for the response. Here is my understanding; correct me if I am wrong.
>
> 1) Spark SQL written partitioned tables do not write metadata to the Hive
> metastore. Spark SQL discovers partitions from the table location on the
> underlying DFS, not from the metastore. It does this the first time a table
> is accessed, so if the underlying partitions change, a refresh table
> <tableName> is required. Is there a way to see the partitions discovered by
> Spark SQL? show partitions <tableName> does not work on Spark SQL
> partitioned tables. Also, Hive allows different partitions in different
> physical locations; I guess this won't be possible in Spark SQL.
>
> 2) If you want to retain compatibility with other SQL-on-Hadoop engines,
> register your dataframe as a temp table and then use Hive's dynamic
> partitioned insert syntax. Spark SQL uses this for Hive-style tables.
>
> 3) Automatic schema discovery. I presume this is Parquet-only and only if
> spark.sql.parquet.mergeSchema / mergeSchema is set to true. What happens
> when mergeSchema is set to false? (I guess I can check this out.)
>
> My two cents:
>
> a) It would help if there was a kind of Hive nonstrict mode equivalent,
> which would enforce schema compatibility for all partitions written to a
> table.
> b) refresh table is annoying for tables where partitions are being written
> frequently for other reasons; not sure if there is a way around this.
> c) It would be great if DataFrameWriter had an option to maintain
> compatibility with the Hive metastore. registerTempTable and "insert
> overwrite table select from" is quite ugly and cumbersome.
> d) It would be helpful to resurrect
> https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/;
> happy to help out with the Spark SQL portions.
>
> Regards
> Deenar
>
>
> On 22 November 2015 at 18:54, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>>> Is it possible to add a new partition to a persistent table using Spark
>>> SQL? The following call works and data gets written in the correct
>>> directories, but no partition metadata is added to the Hive metastore.
>>
>> I believe if you use Hive's dynamic partitioned insert syntax then we
>> will fall back on the metastore and do the update.
>>
>>> In addition I see nothing preventing any arbitrary schema being appended
>>> to the existing table.
>>
>> This is perhaps kind of a feature; we do automatic schema discovery and
>> merging when loading a new parquet table.
>>
>>> Does SparkSQL not need partition metadata when reading data back?
>>
>> No, we dynamically discover it in a distributed job when the table is
>> loaded.
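PS: for anyone following along, the DataFrameWriter route being discussed looks roughly like the sketch below (table and column names are made up, and I may be missing details). Per this thread, it lays out the partition directories correctly but does not add partition metadata to the Hive metastore, so a refresh table is needed when the partitions change underneath Spark SQL:

    // df: an existing DataFrame that includes an event_date column
    df.write
      .mode("append")
      .partitionBy("event_date")
      .saveAsTable("events")

    // re-discover partitions if they were changed outside this Spark context
    sqlContext.sql("refresh table events")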