I saw that in the source, which is why I was wondering. I was mainly reading:
http://blog.cloudera.com/blog/2013/10/parquet-at-salesforce-com/

"A query that tries to parse the organizationId and userId from the 2
logTypes should be able to do so correctly, though they are positioned
differently in the schema. With Parquet, it’s not a problem. It will merge
‘A’ and ‘V’ schemas and project columns accordingly. It does so by
maintaining a file schema in addition to merged schema and parsing the
columns by referencing the 2."

I know that each part-file can have its own schema, but I saw in the Spark
implementation that, if there is no metadata file, it just picks the first
part-file and uses that schema across the board. I'm not quite sure how
other implementations (Impala, etc.) deal with this, but I was really
hoping there would be a way to "version" the schema as new records are
added and just project it through. That would be a godsend for
semi-structured data.

On Tue, Dec 23, 2014 at 3:33 PM, Cheng Lian <lian.cs....@gmail.com> wrote:

> I must have missed something important here. Could you please provide
> more detail on Parquet “schema versioning”? I wasn’t aware of this
> feature (which sounds really useful).
>
> In particular, are you referring to the following scenario:
>
> 1. Write some data whose schema is A to “t.parquet”, resulting in a file
>    “t.parquet/parquet-r-1.part” on HDFS
> 2. Append more data whose schema B “contains” A but has more columns to
>    “t.parquet”, resulting in another file “t.parquet/parquet-r-2.part”
>    on HDFS
> 3. Now read “t.parquet”, expecting schemas A and B to be merged
>
> If this is the case, then current Spark SQL doesn’t support it. We
> assume the schemas of all data within a single Parquet file (which is an
> HDFS directory with multiple part-files) are identical.
>
> On 12/22/14 1:11 PM, Adam Gilmore wrote:
>
> Hi all,
>
> I understand that Parquet allows for schema versioning automatically in
> the format; however, I'm not sure whether Spark supports this.
>
> I'm saving a SchemaRDD to a Parquet file, registering it as a table,
> then doing an insertInto with a SchemaRDD that has an extra column.
>
> The second SchemaRDD does in fact get inserted, but the extra column
> isn't present when I try to query it with Spark SQL.
>
> Is there anything I can do to get this working how I'm hoping?
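For reference, here is a minimal sketch of the scenario discussed above,
assuming the Spark 1.2-era SchemaRDD API in a spark-shell session where
`sc` is already defined; the case classes, row values, and the "t.parquet"
path are made up purely for illustration, not taken from the thread:

// Rough sketch of the append-with-a-wider-schema scenario (assumptions:
// Spark 1.2-era SchemaRDD API, spark-shell with `sc` available, and
// illustrative case classes / paths).
import org.apache.spark.sql.SQLContext

case class A(organizationId: String)                  // schema A
case class B(organizationId: String, userId: String)  // schema B = A plus one extra column

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD[case class] -> SchemaRDD conversion

// 1. Write data with schema A, producing the first part-file under t.parquet/.
sc.parallelize(Seq(A("org1"))).saveAsParquetFile("t.parquet")

// 2. Register the Parquet file as a table, then append rows with the wider schema B.
sqlContext.parquetFile("t.parquet").registerTempTable("t")
sc.parallelize(Seq(B("org2", "user2"))).insertInto("t")

// 3. Read it back. Since Spark SQL takes the schema from a single part-file
//    rather than merging the schemas of all part-files, the extra userId
//    column does not show up here, matching the behaviour reported above.
sqlContext.parquetFile("t.parquet").printSchema()
sqlContext.sql("SELECT * FROM t").collect()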