I think what Michael means is that people often use this to read existing partitioned Parquet tables that are defined in a Hive metastore, rather than generating data directly from within Spark and then reading it back as a table. I'd expect the latter case to become more common, but for now most users connect to an existing metastore.
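For concreteness, here is a minimal sketch of that metastore-backed pattern. The table name ("events"), its columns, and the partition column ("ds") are hypothetical, and it assumes a HiveContext pointed at an existing metastore:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object ReadPartitionedParquetTable {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("read-partitioned-parquet"))
    // HiveContext talks to the existing metastore where the partitioned
    // Parquet table is already defined.
    val hive = new HiveContext(sc)

    // Hypothetical table "events" partitioned by a string column "ds".
    // A predicate on the partition column lets Spark SQL skip whole
    // directories, in addition to the usual Parquet column pruning.
    val recent = hive.sql("SELECT id, value FROM events WHERE ds >= '2014-09-01'")
    recent.collect().foreach(println)

    sc.stop()
  }
}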
I think you could go this route by creating a partitioned external table based on the on-disk layout you create. The downside is that you'd have to go through a Hive metastore, whereas what you are doing now doesn't need Hive at all. We should also just fix the case you are mentioning, where a union is used directly from within Spark. But that's the context.

- Patrick

On Tue, Sep 9, 2014 at 12:01 PM, Cody Koeninger <c...@koeninger.org> wrote:
> Maybe I'm missing something; I thought Parquet was generally a write-once format, and the sqlContext interface to it seems that way as well.
>
> d1.saveAsParquetFile("/foo/d1")
>
> // another day, another table, with same schema
> d2.saveAsParquetFile("/foo/d2")
>
> will give a directory structure like
>
> /foo/d1/_metadata
> /foo/d1/part-r-1.parquet
> /foo/d1/part-r-2.parquet
> /foo/d1/_SUCCESS
>
> /foo/d2/_metadata
> /foo/d2/part-r-1.parquet
> /foo/d2/part-r-2.parquet
> /foo/d2/_SUCCESS
>
> // ParquetFileReader will fail, because /foo/d1 is a directory, not a Parquet partition
> sqlContext.parquetFile("/foo")
>
> // works, but has the noted lack of pushdown
> sqlContext.parquetFile("/foo/d1").unionAll(sqlContext.parquetFile("/foo/d2"))
>
> Is there another alternative?
>
> On Tue, Sep 9, 2014 at 1:29 PM, Michael Armbrust <mich...@databricks.com> wrote:
>> I think usually people add these directories as multiple partitions of the same table instead of using a union. This actually allows us to efficiently prune directories when reading, in addition to standard column pruning.
>>
>> On Tue, Sep 9, 2014 at 11:26 AM, Gary Malouf <malouf.g...@gmail.com> wrote:
>>> I'm kind of surprised this was not run into before. Do people not segregate their data by day/week in the HDFS directory structure?
>>>
>>> On Tue, Sep 9, 2014 at 2:08 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>> Thanks!
>>>>
>>>> On Tue, Sep 9, 2014 at 11:07 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>> > Opened https://issues.apache.org/jira/browse/SPARK-3462
>>>> >
>>>> > I'll take a look at ColumnPruning and see what I can do.
>>>> >
>>>> > On Tue, Sep 9, 2014 at 12:46 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>> >> On Tue, Sep 9, 2014 at 10:17 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>> >>> Is there a reason in general not to push projections and predicates down into the individual ParquetTableScans in a union?
>>>> >>
>>>> >> This would be a great case to add to ColumnPruning. Would be awesome if you could open a JIRA or even a PR :)
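As a rough sketch of the external-table route mentioned above, applied to the /foo/d1 and /foo/d2 layout from the thread: the table name ("events"), the column types, and the partition column ("batch") are made up for illustration, and STORED AS PARQUET assumes a Hive version with native Parquet support (older versions need the explicit Parquet SerDe instead).

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object RegisterParquetPartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("register-parquet-partitions"))
    val hive = new HiveContext(sc)

    // Hypothetical schema (id INT, value STRING) with partition column "batch".
    hive.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS events (id INT, value STRING)
      PARTITIONED BY (batch STRING)
      STORED AS PARQUET
      LOCATION '/foo'
    """)

    // Register each existing directory as a partition instead of unioning them.
    hive.sql("ALTER TABLE events ADD IF NOT EXISTS PARTITION (batch='d1') LOCATION '/foo/d1'")
    hive.sql("ALTER TABLE events ADD IF NOT EXISTS PARTITION (batch='d2') LOCATION '/foo/d2'")

    // A filter on the partition column now prunes whole directories,
    // and projections/predicates apply per Parquet scan.
    hive.sql("SELECT count(*) FROM events WHERE batch = 'd2'").collect().foreach(println)

    sc.stop()
  }
}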