I must have missed something important here. Could you please give some
more detail on Parquet “schema versioning”? I wasn’t aware of this
feature (which sounds really useful).
In particular, are you referring to the following scenario:
1. Write some data with schema A to “t.parquet”, resulting in a file
“t.parquet/parquet-r-1.part” on HDFS
2. Append more data with schema B, which “contains” A but has more
columns, to “t.parquet”, resulting in another file
“t.parquet/parquet-r-2.part” on HDFS
3. Now read “t.parquet”, expecting schemas A and B to be merged
If this is the case, then current Spark SQL doesn’t support it. We
assume that the schemas of all data within a single Parquet file (which
is actually an HDFS directory containing multiple part-files) are
identical.
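For concreteness, here is a minimal sketch of that scenario against the
Spark 1.2 SchemaRDD API; the case classes A and B and the table name “t”
are just placeholders, and the actual part-file names will differ:

    import org.apache.spark.sql.SQLContext

    // Hypothetical schemas: B “contains” A plus one extra column
    case class A(id: Int, name: String)
    case class B(id: Int, name: String, extra: String)

    val sqlContext = new SQLContext(sc)  // sc is the usual SparkContext
    import sqlContext.createSchemaRDD    // implicit RDD[case class] -> SchemaRDD

    // Step 1: write data with schema A, producing one part-file under t.parquet
    sc.parallelize(Seq(A(1, "a"))).saveAsParquetFile("t.parquet")

    // Step 2: register the Parquet file as a table and append data with
    // schema B, producing a second part-file under t.parquet
    sqlContext.parquetFile("t.parquet").registerTempTable("t")
    sc.parallelize(Seq(B(2, "b", "x"))).insertInto("t")

    // Step 3: read t.parquet back; since Spark SQL assumes all part-files
    // share a single schema, what you get is not the merged schema of A and B
    sqlContext.parquetFile("t.parquet").printSchema()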
On 12/22/14 1:11 PM, Adam Gilmore wrote:
Hi all,
I understand that Parquet allows for schema versioning automatically
in the format; however, I'm not sure whether Spark supports this.
I'm saving a SchemaRDD to a Parquet file, registering it as a table,
then doing an insertInto with a SchemaRDD that has an extra column.
The second SchemaRDD does in fact get inserted, but the extra column
isn't present when I try to query it with Spark SQL.
Is there anything I can do to get this working how I'm hoping?
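The symptom described above can be checked with something like the
following, assuming the Parquet file was registered as the hypothetical
table “t” from the sketch earlier:

    // Per the report above, only the columns of the original schema show
    // up here; the extra column from the second SchemaRDD is missing
    sqlContext.sql("SELECT * FROM t").printSchema()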