Since it's a backport from master to branch-2.3 for ORC 1.4.3, I made a backport PR.
https://github.com/apache/spark/pull/21093

Thank you for raising this issue and confirming, Henry and Xiao. :)

Bests,
Dongjoon.

On Tue, Apr 17, 2018 at 12:01 AM, Xiao Li <gatorsm...@gmail.com> wrote:

> Yes, it sounds good to me. We can upgrade both Parquet 1.8.2 to 1.8.3 and
> ORC 1.4.1 to 1.4.3 in our upcoming Spark 2.3.1 release.
>
> Thanks for your efforts! @Henry and @Dongjoon
>
> Xiao
>
> 2018-04-16 14:41 GMT-07:00 Henry Robinson <he...@apache.org>:
>
>> Seems like there aren't any objections. I'll pick this thread back up
>> when a Parquet maintenance release has happened.
>>
>> Henry
>>
>> On 11 April 2018 at 14:00, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>
>>> Great.
>>>
>>> If we can upgrade the parquet dependency from 1.8.2 to 1.8.3 in Apache
>>> Spark 2.3.1, let's upgrade orc dependency from 1.4.1 to 1.4.3 together.
>>>
>>> Currently, the patch is only merged into master branch now. 1.4.1 has
>>> the following issue.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-23340
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Wed, Apr 11, 2018 at 1:23 PM, Reynold Xin <r...@databricks.com>
>>> wrote:
>>>
>>>> Seems like this would make sense... we usually make maintenance
>>>> releases for bug fixes after a month anyway.
>>>>
>>>> On Wed, Apr 11, 2018 at 12:52 PM, Henry Robinson <he...@apache.org>
>>>> wrote:
>>>>
>>>>> On 11 April 2018 at 12:47, Ryan Blue <rb...@netflix.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> I think a 1.8.3 Parquet release makes sense for the 2.3.x releases of
>>>>>> Spark.
>>>>>>
>>>>>> To be clear though, this only affects Spark when reading data written
>>>>>> by Impala, right? Or does Parquet CPP also produce data like this?
>>>>>
>>>>> I don't know about parquet-cpp, but yeah, the only implementation I've
>>>>> seen writing the half-completed stats is Impala. (as you know, that's
>>>>> compliant with the spec, just an unusual choice).
>>>>>
>>>>>> On Wed, Apr 11, 2018 at 12:35 PM, Henry Robinson <he...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all -
>>>>>>>
>>>>>>> SPARK-23852 (where a query can silently give wrong results thanks to
>>>>>>> a predicate pushdown bug in Parquet) is a fairly bad bug. In other
>>>>>>> projects I've been involved with, we've released maintenance releases
>>>>>>> for bugs of this severity.
>>>>>>>
>>>>>>> Since Spark 2.4.0 is probably a while away, I wanted to see if there
>>>>>>> was any consensus over whether we should consider (at least) a 2.3.1.
>>>>>>>
>>>>>>> The reason this particular issue is a bit tricky is that the Parquet
>>>>>>> community haven't yet produced a maintenance release that fixes the
>>>>>>> underlying bug, but they are in the process of releasing a new minor
>>>>>>> version, 1.10, which includes a fix. Having spoken to a couple of
>>>>>>> Parquet developers, they'd be willing to consider a maintenance
>>>>>>> release, but would probably only bother if we (or another affected
>>>>>>> project) asked them to.
>>>>>>>
>>>>>>> My guess is that we wouldn't want to upgrade to a new minor version
>>>>>>> of Parquet for a Spark maintenance release, so asking for a Parquet
>>>>>>> maintenance release makes sense.
>>>>>>>
>>>>>>> What does everyone think?
>>>>>>>
>>>>>>> Best,
>>>>>>> Henry
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix