[
https://issues.apache.org/jira/browse/SPARK-17913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15663462#comment-15663462
]
Apache Spark commented on SPARK-17913:
--------------------------------------
User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/15880
> Filter/join expressions can return incorrect results when comparing strings
> to longs
> ------------------------------------------------------------------------------------
>
> Key: SPARK-17913
> URL: https://issues.apache.org/jira/browse/SPARK-17913
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.2, 2.0.0
> Reporter: Ming Beckwith
>
> Reproducer:
> {code}
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.SQLContext
>
> case class E(subject: Long, predicate: String, objectNode: String)
>
> def test(sc: SparkContext): Unit = {
>   val sqlContext: SQLContext = new SQLContext(sc)
>   import sqlContext.implicits._
>   // The third field mixes Long, Int, and String values; all are
>   // converted to String when the rows are mapped to E below.
>   val broken = List(
>     (19157170390056969L, "right", 19157170390056969L),
>     (19157170390056973L, "wrong", 19157170390056971L),
>     (19157190254313477L, "wrong", 19157190254313475L),
>     (19157180859056133L, "wrong", 19157180859056131L),
>     (19157170390056969L, "number", 161),
>     (19157170390056971L, "string", "a string"),
>     (19157190254313475L, "string", "another string"),
>     (19157180859056131L, "number", 191)
>   )
>   val brokenDF = sc.parallelize(broken)
>     .map(b => E(b._1, b._2, b._3.toString)).toDF()
>   // Long compared with String: both sides are implicitly cast to double.
>   val brokenFilter = brokenDF.filter($"subject" === $"objectNode")
>   // Workaround: cast the long side to string so the comparison is exact.
>   val fixed = brokenDF.filter(
>     brokenDF("subject").cast("string") === brokenDF("objectNode"))
>   // show() and explain() print directly and return Unit, so they are
>   // not wrapped in println.
>   println("***** incorrect filter results *****")
>   brokenFilter.show()
>   println("***** correct filter results *****")
>   fixed.show()
>   println("***** both sides cast to double *****")
>   brokenFilter.explain()
> }
> {code}
> Broken filter returns:
> {code}
> +-----------------+---------+-----------------+
> | subject|predicate| objectNode|
> +-----------------+---------+-----------------+
> |19157170390056969| right|19157170390056969|
> |19157170390056973| wrong|19157170390056971|
> |19157190254313477| wrong|19157190254313475|
> |19157180859056133| wrong|19157180859056131|
> +-----------------+---------+-----------------+
> {code}
> The physical plan shows that both sides of the expression are cast to Double
> before evaluation. Comparing a long to a numeric string therefore appears to
> work in many cases, but when the numbers are large enough that a double can
> no longer represent every integer (above 2^53) and close together, the loss
> of precision causes incorrect results.
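The rounding described above can be reproduced without Spark. The following is a minimal plain-Scala sketch using one "wrong" row from the reproducer; the value names are illustrative only and are not taken from Spark's code.

```scala
// Plain Scala, no Spark required: one "wrong" row from the reproducer.
val subject: Long = 19157170390056973L
val objectNode: String = "19157170390056971"

// A double has a 53-bit significand, so above 2^53 not every Long is
// representable and nearby values can round to the same double.
val equalAsDoubles: Boolean = subject.toDouble == objectNode.toDouble

// Comparing as strings keeps every digit.
val equalAsStrings: Boolean = subject.toString == objectNode

println(s"equal as doubles: $equalAsDoubles") // prints "equal as doubles: true"
println(s"equal as strings: $equalAsStrings") // prints "equal as strings: false"
```

Both values lie between 2^54 and 2^55, where adjacent doubles are 4 apart, so 19157170390056973 and 19157170390056971 both round to 19157170390056972.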
> {code}
> == Physical Plan ==
> Filter (cast(subject#0L as double) = cast(objectNode#2 as double))
> {code}
> After casting the left side into strings, the filter returns the expected
> result:
> {code}
> +-----------------+---------+-----------------+
> | subject|predicate| objectNode|
> +-----------------+---------+-----------------+
> |19157170390056969| right|19157170390056969|
> +-----------------+---------+-----------------+
> {code}
> The expected behavior in this case is probably to choose one side and cast
> the other (compare string to string or long to long) instead of casting both
> sides to a data type with less precision.
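As a rough illustration of the "long to long" option suggested above, here is a plain-Scala sketch. The helper `longEqualsString` is hypothetical: it is not a Spark API and is not necessarily how the linked pull request resolves the issue. It assumes that a string which does not parse as a Long should simply not match.

```scala
import scala.util.Try

// Hypothetical helper, not a Spark API: parse the string side as a Long and
// compare long to long; a string that is not a valid Long never matches.
def longEqualsString(subject: Long, objectNode: String): Boolean =
  Try(objectNode.toLong).toOption.exists(_ == subject)

// A few (subject, objectNode) pairs taken from the reproducer's data.
val rows = List(
  (19157170390056969L, "19157170390056969"),
  (19157170390056973L, "19157170390056971"),
  (19157170390056971L, "a string")
)

val matches = rows.filter { case (s, o) => longEqualsString(s, o) }
println(matches)
```

Only the first pair survives the filter, which matches the result the reporter obtained by casting the long side to string.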