Re: Parquet schema migrations

2014-10-24 Thread Gary Malouf
Hi Michael,

Does this affect people who use Hive for their metadata store as well? I'm wondering whether the issue is as bad as I think it is - namely that if you build up a year's worth of data, adding a field forces you to migrate that entire year's data.

Gary

Re: Parquet schema migrations

2014-10-08 Thread Cody Koeninger
On Wed, Oct 8, 2014 at 3:19 PM, Michael Armbrust wrote:
> I was proposing you manually convert each different format into one
> unified format (by adding literal nulls and such for missing columns) and
> then union these converted datasets. It would be weird to have union all
> try and do this.
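A minimal sketch of the manual conversion Michael describes, written against the DataFrame API from later Spark releases; the paths and the user_agent column are hypothetical:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StringType

val spark = SparkSession.builder().appName("unify-schemas").getOrCreate()

// Hypothetical paths: one dataset written before the column existed, one after.
val oldData = spark.read.parquet("/data/events/old")   // lacks user_agent
val newData = spark.read.parquet("/data/events/new")   // includes user_agent

// Pad the old side with a literal null for the missing column, then
// project both sides to the same column order and union them.
val oldPadded = oldData.withColumn("user_agent", lit(null).cast(StringType))
val unified   = oldPadded.select(newData.columns.map(col): _*).union(newData)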

Re: Parquet schema migrations

2014-10-08 Thread Michael Armbrust
> The kind of change we've made that it probably makes most sense to support
> is adding a nullable column. I think that also implies supporting
> "removing" a nullable column, as long as you don't end up with columns of
> the same name but different type.

Filed here: https://issues.apache.org
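Later Spark releases expose this kind of reconciliation directly through the Parquet mergeSchema read option, which merges file footers that differ only by added or "removed" nullable columns. A sketch, with a hypothetical path:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// mergeSchema reads the footers of all files under the path and
// reconciles schemas that differ only by nullable columns.
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("/data/events")

merged.printSchema()   // the union of all column sets; absent values read back as null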

Re: Parquet schema migrations

2014-10-06 Thread Cody Koeninger
Sorry, by "raw parquet" I just meant there is no external metadata store, only the schema written as part of the parquet format. We've done several different kinds of changes, including column rename and widening the data type of an existing column. I don't think it's feasible to support those.
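Changes like these can still be handled with an explicit convert-on-read step rather than automatic schema merging. A sketch, assuming a hypothetical rename of ts to timestamp together with an Int-to-Long widening:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

// Hypothetical example: old files called the column `ts` and stored an Int;
// the current schema calls it `timestamp` and stores a Long. Neither change
// can be inferred from the Parquet footers alone, so convert explicitly.
def toCurrentSchema(old: DataFrame): DataFrame =
  old.withColumnRenamed("ts", "timestamp")
     .withColumn("timestamp", col("timestamp").cast(LongType))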

Re: Parquet schema migrations

2014-10-05 Thread Michael Armbrust
Hi Cody,

Assuming you are talking about 'safe' changes to the schema (i.e. existing column names are never reused with incompatible types), this is something I'd love to support. Perhaps you can describe more what sorts of changes you are making, and whether simple merging of the schemas would be sufficient.

Re: Parquet schema migrations

2014-10-05 Thread Andrew Ash
Hi Cody,

I wasn't aware there were different versions of the parquet format. What's the difference between "raw parquet" and the Hive-written parquet files?

As for your migration question, the approaches I've often seen are convert-on-read and convert-all-at-once. Apache Cassandra for example d
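A sketch of the convert-all-at-once approach: rewrite every historical partition into the current schema a single time, so readers never have to reconcile schemas again. The conversion function and paths here are hypothetical:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

val spark = SparkSession.builder().getOrCreate()

// Hypothetical per-partition conversion: pad a column that old data lacks.
def toCurrentSchema(df: DataFrame): DataFrame =
  if (df.columns.contains("user_agent")) df
  else df.withColumn("user_agent", lit(null).cast(StringType))

// Rewrite each old partition once, into a new versioned location.
Seq("/data/events/2014-01", "/data/events/2014-02").foreach { path =>
  toCurrentSchema(spark.read.parquet(path))
    .write.mode("overwrite").parquet(path + "-v2")
}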