>> and then use Hive's dynamic partitioned insert syntax

What does this entail? Same SQL, but you need to run

               set hive.exec.dynamic.partition = true;

in the Hive/SQL context (along with several other related dynamic
partition settings).
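
For reference, here's roughly what I'm planning to run (a sketch only; the
table and column names are made up, and the table is assumed to be
partitioned by dt):

    // Spark 1.x with a HiveContext
    sqlContext.sql("SET hive.exec.dynamic.partition=true")
    sqlContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    // dynamic partition insert: dt is the last column in the select list
    sqlContext.sql(
      "INSERT OVERWRITE TABLE target PARTITION (dt) " +
      "SELECT col1, col2, dt FROM staging")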

Is there anything else/special required?


2015-11-22 17:32 GMT-08:00 Deenar Toraskar <deenar.toras...@gmail.com>:

> Thanks Michael
>
> Thanks for the response. Here is my understanding, correct me if I am wrong
>
> 1) Partitioned tables written by Spark SQL do not write partition metadata
> to the Hive metastore. Spark SQL discovers partitions from the table
> location on the underlying DFS, not from the metastore. It does this the
> first time a table is accessed, so if the underlying partitions change, a
> refresh table <tableName> is required. Is there a way to see the partitions
> discovered by Spark SQL? show partitions <tableName> does not work on Spark
> SQL partitioned tables. Also, Hive allows different partitions to live in
> different physical locations; I guess this won't be possible in Spark SQL.
>
> 2) If you want to retain compatibility with other SQL-on-Hadoop engines,
> register your dataframe as a temp table and then use Hive's dynamic
> partitioned insert syntax. Spark SQL uses this for Hive-style tables.
>
> 3) Automatic schema discovery. I presume this is Parquet only, and only if
> spark.sql.parquet.mergeSchema / mergeSchema is set to true. What happens
> when mergeSchema is set to false? (I guess I can check this out; see the
> sketch after this list.)
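>
> A rough sketch of what I mean above (the table name and path are made up;
> a Parquet-backed table is assumed):
>
>     // re-scan the table location to pick up newly written partitions
>     sqlContext.refreshTable("my_table")
>
>     // read with schema merging across partition files, i.e. the
>     // mergeSchema read option / spark.sql.parquet.mergeSchema
>     val merged = sqlContext.read
>       .option("mergeSchema", "true")
>       .parquet("/warehouse/my_table")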
>
> My two cents
>
> a) it would help if there were an equivalent of Hive's nonstrict mode that
> enforced schema compatibility for all partitions written to a table.
> b) refresh table is annoying for tables where partitions are being written
> frequently for other reasons; not sure if there is a way around this.
> c) it would be great if DataFrameWriter had an option to maintain
> compatibility with the Hive metastore. registerTempTable plus "insert
> overwrite table select from" is quite ugly and cumbersome (a sketch of what
> I mean follows this list).
> d) It would be helpful to resurrect the
> https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/,
> happy to help out with the Spark SQL portions.
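>
> To illustrate c) (table and column names are hypothetical):
>
>     // what I would like to be sufficient: the partition directories get
>     // written, but the Hive metastore does not learn about the partitions
>     df.write.partitionBy("dt").saveAsTable("target")
>
>     // the workaround today: a temp table plus a Hive dynamic partition
>     // insert, which does keep the metastore updated
>     df.registerTempTable("staging")
>     sqlContext.sql(
>       "INSERT OVERWRITE TABLE target PARTITION (dt) " +
>       "SELECT col1, col2, dt FROM staging")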
>
> Regards
> Deenar
>
>
> On 22 November 2015 at 18:54, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>>> Is it possible to add a new partition to a persistent table using Spark
>>> SQL? The following call works and data gets written in the correct
>>> directories, but no partition metadata is added to the Hive metastore.
>>>
>> I believe if you use Hive's dynamic partitioned insert syntax then we
>> will fall back on the metastore and do the update.
>>
>>> In addition I see nothing preventing any arbitrary schema being appended
>>> to the existing table.
>>>
>> This is perhaps kind of a feature: we do automatic schema discovery and
>> merging when loading a new parquet table.
>>
>>> Does SparkSQL not need partition metadata when reading data back?
>>>
>> No, we dynamically discover it in a distributed job when the table is
>> loaded.
>>
>
>
