I must have missed something important here. Could you please provide more detail on Parquet “schema versioning”? I wasn’t aware of this feature (which sounds really useful).

Specifically, are you referring to the following scenario (sketched in code right after the list):

1. Write some data whose schema is A to “t.parquet”, resulting in a file
   “t.parquet/parquet-r-1.part” on HDFS
2. Append more data whose schema B “contains” A but has more columns
   to “t.parquet”, resulting in another file “t.parquet/parquet-r-2.part”
   on HDFS
3. Now read “t.parquet”, expecting schemas A and B to be merged

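For concreteness, here is a minimal sketch of the scenario I have in mind, against the Spark 1.2 SchemaRDD API in spark-shell (the case classes and sample rows are just for illustration, not taken from your code):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit RDD[Product] => SchemaRDD

    case class SchemaA(id: Int)                 // schema A
    case class SchemaB(id: Int, extra: String)  // schema B "contains" A plus one column

    // 1. Write data with schema A, producing the first part-file under t.parquet/
    sc.parallelize(Seq(SchemaA(1))).saveAsParquetFile("t.parquet")

    // 2. Register the Parquet directory as a table and append data with schema B,
    //    producing a second part-file under t.parquet/
    sqlContext.parquetFile("t.parquet").registerTempTable("t")
    sc.parallelize(Seq(SchemaB(2, "x"))).insertInto("t")

    // 3. Read t.parquet back; the question is whether the schema printed here
    //    is the merged schema (id, extra) or just schema A
    sqlContext.parquetFile("t.parquet").printSchema()
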
If this is the case, then Spark SQL doesn’t currently support it. We assume that the schemas of all data within a single Parquet file (i.e., an HDFS directory containing multiple part-files) are identical.

On 12/22/14 1:11 PM, Adam Gilmore wrote:

Hi all,

I understand that Parquet allows for schema versioning as part of the format; however, I'm not sure whether Spark supports this.

I'm saving a SchemaRDD to a Parquet file, registering it as a table, and then doing an insertInto with a SchemaRDD that has an extra column.

The second SchemaRDD does in fact get inserted, but the extra column isn't present when I try to query it with Spark SQL.
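
For example (names here are placeholders: assume "events" is the table I registered and "extra" is the column that only exists in the second SchemaRDD):

    // Assumes the save/register/insertInto steps described above have already run.
    val result = sqlContext.sql("SELECT * FROM events")
    result.printSchema()   // "extra" doesn't appear in the schema
    result.collect()       // the inserted rows come back, but without the extra column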

Is there anything I can do to get this working the way I'm hoping?
