I must have missed something important here. Could you please provide more detail on Parquet “schema versioning”? I wasn’t aware of this feature (which sounds really useful).

Specifically, are you referring to the following scenario (sketched in code right after the list):

1. Write some data whose schema is A to “t.parquet”, resulting in a file
   “t.parquet/parquet-r-1.part” on HDFS
2. Append more data whose schema B “contains” A but has more columns
   to “t.parquet”, resulting in another file “t.parquet/parquet-r-2.part”
   on HDFS
3. Now read “t.parquet”, expecting schemas A and B to be merged

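For concreteness, here is a minimal sketch of the scenario I have in mind, against the Spark 1.2 SchemaRDD API in spark-shell (the case classes and sample rows are just for illustration, not taken from your code):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit RDD[Product] => SchemaRDD

    case class SchemaA(id: Int)                 // schema A
    case class SchemaB(id: Int, extra: String)  // schema B "contains" A plus one column

    // 1. Write data with schema A, producing the first part-file under t.parquet/
    sc.parallelize(Seq(SchemaA(1))).saveAsParquetFile("t.parquet")

    // 2. Register the Parquet directory as a table and append data with schema B,
    //    producing a second part-file under t.parquet/
    sqlContext.parquetFile("t.parquet").registerTempTable("t")
    sc.parallelize(Seq(SchemaB(2, "x"))).insertInto("t")

    // 3. Read t.parquet back; the question is whether the schema printed here
    //    is the merged schema (id, extra) or just schema A
    sqlContext.parquetFile("t.parquet").printSchema()
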
If this is the case, then Spark SQL doesn’t currently support it. We assume that the schemas of all data within a single Parquet file (i.e., an HDFS directory containing multiple part-files) are identical.

On 12/22/14 1:11 PM, Adam Gilmore wrote:

Hi all,

I understand that Parquet allows for schema versioning as part of the format; however, I'm not sure whether Spark supports this.

I'm saving a SchemaRDD to a Parquet file, registering it as a table, and then doing an insertInto with a SchemaRDD that has an extra column.

The second SchemaRDD does in fact get inserted, but the extra column isn't present when I try to query it with Spark SQL.
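
For example (names here are placeholders: assume "events" is the table I registered and "extra" is the column that only exists in the second SchemaRDD):

    // Assumes the save/register/insertInto steps described above have already run.
    val result = sqlContext.sql("SELECT * FROM events")
    result.printSchema()   // "extra" doesn't appear in the schema
    result.collect()       // the inserted rows come back, but without the extra column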

Is there anything I can do to get this working the way I'm hoping?
