1.9.0 includes some fixes intended specifically for Spark:

* PARQUET-389: Evaluates push-down predicates for missing columns as though
they are null. This is to address Spark's work-around that requires reading
and merging file schemas, even for metastore tables.
* PARQUET-654: Adds an option to disable record-level predicate push-down,
but keep row group evaluation. This allows Spark to skip row groups based
on stats and dictionaries, but implement its own vectorized record
filtering.
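
To make the two fixes concrete, here is a toy model of the behavior they describe. It is only a sketch, not the parquet-mr API: `ROW_GROUPS`, `can_drop_gt`, and `scan` are hypothetical names, and the stats are plain (min, max) tuples.

```python
# Toy model: each row group carries per-column (min, max) stats and its rows.
# All names here are hypothetical; this is not the parquet-mr API.
ROW_GROUPS = [
    {"stats": {"id": (1, 10)}, "rows": [{"id": 1}, {"id": 7}, {"id": 10}]},
    {"stats": {"id": (11, 20)}, "rows": [{"id": 11}, {"id": 15}, {"id": 20}]},
]


def can_drop_gt(group, column, value):
    """Return True if no row in the group can satisfy `column > value`."""
    stats = group["stats"].get(column)
    if stats is None:
        # PARQUET-389 idea: a column missing from the file is treated as
        # all nulls. `null > value` is never true, so the group is skipped.
        return True
    _, col_max = stats
    return col_max <= value


def scan(groups, column, value, record_filter=True):
    """Return rows matching `column > value`.

    With record_filter=False (the PARQUET-654 idea), row groups are still
    pruned by stats, but surviving rows are returned unfiltered so the
    caller can apply its own record-level filter.
    """
    out = []
    for group in groups:
        if can_drop_gt(group, column, value):
            continue
        for row in group["rows"]:
            cell = row.get(column)
            if not record_filter or (cell is not None and cell > value):
                out.append(row)
    return out


pruned_only = scan(ROW_GROUPS, "id", 12, record_filter=False)
fully_filtered = scan(ROW_GROUPS, "id", 12, record_filter=True)
missing_column = scan(ROW_GROUPS, "not_in_file", 0)
```

In this model `pruned_only` still contains the row with `id` 11, because only the first row group was eliminated by stats. Dropping that row is left to the caller, which is the role Spark's vectorized filtering would play.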

The Parquet community also evaluated performance to confirm that moving to
the ByteBuffer read path introduces no regressions.

There is one concern about 1.9.0 that will be addressed in 1.9.1: stats
calculations were incorrectly using signed byte order for string
comparison. This means that min/max stats can't be used if the data
contains (or may contain) non-ASCII UTF8 characters, whose encoded bytes
have the most significant bit set. For correctness, 1.9.0 won't return the
bad min/max values, but there is a property to override this behavior for
data that doesn't use the affected code points.

Upgrading to 1.9.0 depends on how the community wants to handle the sort
order bug: whether correctness or performance should be the default.

rb

On Tue, Nov 1, 2016 at 2:22 AM, Sean Owen <so...@cloudera.com> wrote:

> Yes this came up from a different direction:
> https://issues.apache.org/jira/browse/SPARK-18140
>
> I think it's fine to pursue an upgrade to fix these several issues. The
> question is just how well it will play with other components, so it bears
> some testing and evaluation of the changes from 1.8, but yes this would be
> good.
>
> On Mon, Oct 31, 2016 at 9:07 PM Michael Allman <mich...@videoamp.com>
> wrote:
>
>> Hi All,
>>
>> Is anyone working on updating Spark's Parquet library dep to 1.9? If not,
>> I can at least get started on it and publish a PR.
>>
>> Cheers,
>>
>> Michael


-- 
Ryan Blue
Software Engineer
Netflix
