I've run some tests with real and synthetic parquet data containing nested columns, with and without the Hive metastore, on our Spark 1.5, 1.6, and 2.0 versions. I haven't seen any performance surprises, except that Spark 2.0 now does schema inference across all files in a partitioned parquet metastore table. Granted, you aren't using a metastore table, but maybe Spark does that for partitioned non-metastore tables as well.
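For concreteness, something along these lines exercises that inference pass on a partitioned non-metastore table. This is just a minimal sketch, not my exact harness; the path, row count, and partition count are illustrative, and the SparkSession API is 2.0-only (on 1.5/1.6 you'd use SQLContext instead):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("parquet-inference-check").getOrCreate()
    import spark.implicits._

    // Synthetic nested data, partitioned by `part` (illustrative sizes).
    val df = (1 to 100000).toDF("id")
      .selectExpr("id", "named_struct('a', id, 'b', id * 2) AS nested", "id % 50 AS part")
    df.write.partitionBy("part").parquet("/tmp/parquet_inference_check")

    // Reading the partitioned directory back without a metastore; in 2.0 the
    // initial read may touch footers across all part files to infer the schema.
    val t0 = System.nanoTime()
    spark.read.parquet("/tmp/parquet_inference_check").printSchema()
    println(s"schema inference + planning: ${(System.nanoTime() - t0) / 1e6} ms")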
Michael

> On Jul 20, 2016, at 2:16 PM, Maciej Bryński <mac...@brynski.pl> wrote:
>
> @Michael,
> I answered in Jira and can repeat it here.
> I think my problem is unrelated to Hive, because I'm using the read.parquet
> method.
> I also attached some VisualVM snapshots to SPARK-16321 (I think I should
> merge both issues).
> Code profiling suggests the bottleneck is in reading the parquet file.
>
> I wonder if there are any other benchmarks related to parquet performance.
>
> Regards,
> --
> Maciek Bryński