[jira] [Resolved] (SPARK-17213) Parquet String Pushdown for Non-Eq Comparisons Broken

Reynold Xin (JIRA) Thu, 01 Dec 2016 22:03:08 -0800

     [ 
https://issues.apache.org/jira/browse/SPARK-17213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Reynold Xin resolved SPARK-17213.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.1.0

> Parquet String Pushdown for Non-Eq Comparisons Broken
> -----------------------------------------------------
>
>                 Key: SPARK-17213
>                 URL: https://issues.apache.org/jira/browse/SPARK-17213
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.1.0
>            Reporter: Andrew Duffy
>            Assignee: Cheng Lian
>             Fix For: 2.1.0
>
>
> Spark defines ordering over strings based on comparison of UTF8 byte arrays, 
> which compare bytes as unsigned integers. Currently however Parquet does not 
> respect this ordering. This is currently in the process of being fixed in 
> Parquet, JIRA and PR link below, but currently all filters are broken over 
> strings, with there actually being a correctness issue for {{>}} and {{<}}.
> *Repro:*
> Querying directly from in-memory DataFrame:
> {code}
>     > Seq("a", "é").toDF("name").where("name > 'a'").count
>     1
> {code}
> Querying from a parquet dataset:
> {code}
>     > Seq("a", "é").toDF("name").write.parquet("/tmp/bad")
>     > spark.read.parquet("/tmp/bad").where("name > 'a'").count
>     0
> {code}
> This happens because Spark sorts the rows to be {{[a, é]}}, but Parquet's 
> implementation of comparison of strings is based on signed byte array 
> comparison, so it will actually create 1 row group with statistics 
> {{min=é,max=a}}, and so the row group will be dropped by the query.
> Based on the way Parquet pushes down Eq, it will not be affecting correctness 
> but it will force you to read row groups you should be able to skip.
> Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686
> Link to PR: https://github.com/apache/parquet-mr/pull/362



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Resolved] (SPARK-17213) Parquet String Pushdown for Non-Eq Comparisons Broken

Reply via email to