Is this about column names containing dots that do not target fields inside structs? So not a.b as in field b inside struct a, but somehow a field called a.b? I didn't even know that was supported at all. It's something I would never try, because it sounds like a bad idea to go there...
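To make the two readings of a dotted name concrete, here is a minimal sketch in plain Python. It is not Spark or Parquet code: the rows, `lookup_literal`, `lookup_as_path`, and `filter_eq` are all hypothetical names made up for illustration. It only shows how the same string can be treated either as one top-level column name or as a struct.field path, and why the second reading matches nothing when the data holds a flat column.

```python
# Hypothetical illustration only: plain dicts stand in for rows.
# Each row has a single top-level column literally named "col.dots";
# there is no struct column called "col".
rows = [
    {"col.dots": 1},
    {"col.dots": 2},
]


def lookup_literal(row, name):
    """Treat the whole string as one top-level column name."""
    return row.get(name)


def lookup_as_path(row, name):
    """Split on dots and walk into nested structs (struct.field reading)."""
    value = row
    for part in name.split("."):
        if not isinstance(value, dict) or part not in value:
            return None  # path does not exist in this row
        value = value[part]
    return value


def filter_eq(rows, name, expected, lookup):
    """Keep the rows where the looked-up value equals `expected`."""
    return [row for row in rows if lookup(row, name) == expected]


# Reading 1: "col.dots" is the column name, so one row matches.
literal_matches = filter_eq(rows, "col.dots", 1, lookup_literal)

# Reading 2: "col.dots" means field "dots" of struct "col", which does
# not exist here, so no row can ever match.
path_matches = filter_eq(rows, "col.dots", 1, lookup_as_path)

print(literal_matches)  # [{'col.dots': 1}]
print(path_matches)     # []
```

The empty result under the second reading is the toy analogue of a filter that silently returns zero rows. In Spark itself, a column whose name contains a dot is normally referenced by quoting it in backticks, e.g. `a.b` wrapped in backticks, so that it is not parsed as a struct field access.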
On Fri, Apr 28, 2017 at 12:17 PM, Andrew Ash <and...@andrewash.com> wrote:

> -1 due to regression from 2.1.1
>
> In 2.2.0-rc1 we bumped the Parquet version from 1.8.1 to 1.8.2 in commit
> 26a4cba3ff <https://github.com/apache/spark/commit/26a4cba3ff>. Parquet
> 1.8.2 includes a backport from 1.9.0: PARQUET-389
> <https://issues.apache.org/jira/browse/PARQUET-389> in commit 2282c22c
> <https://github.com/apache/parquet-mr/commit/2282c22c>
>
> This backport caused a regression in Spark, where filtering on columns
> containing dots in the column name pushes the filter down into Parquet
> where Parquet incorrectly handles the predicate. Spark pushes the String
> "col.dots" as the column name, but Parquet interprets this as
> "struct.field" where the predicate is on a field of a struct. The ultimate
> result is that the predicate always returns zero results, causing a data
> correctness issue.
>
> This issue is filed in Spark as SPARK-20364
> <https://issues.apache.org/jira/browse/SPARK-20364> and has a PR fix up
> at PR #17680 <https://github.com/apache/spark/pull/17680>.
>
> I nominate SPARK-20364 <https://issues.apache.org/jira/browse/SPARK-20364> as
> a release blocker due to the data correctness regression.
>
> Thanks!
> Andrew
>
> On Thu, Apr 27, 2017 at 6:49 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> By the way the RC looks good. Sigs and license are OK, tests pass with
>> -Phive -Pyarn -Phadoop-2.7. +1 from me.
>>
>> On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00 PST
>>> and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.2.0
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v2.2.0-rc1
>>> <https://github.com/apache/spark/tree/v2.2.0-rc1> (8ccb4a57c82146c1a8f8966c7e64010cf5632cb6)
>>>
>>> List of JIRA tickets resolved can be found with this filter
>>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>
>>> .
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1235/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.1.1.
>>>
>>
>