[
https://issues.apache.org/jira/browse/SPARK-50525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-50525:
-----------------------------------
Labels: correctness pull-request-available (was: correctness)
> Do not allow repartition by map
> -------------------------------
>
> Key: SPARK-50525
> URL: https://issues.apache.org/jira/browse/SPARK-50525
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Herman van Hövell
> Priority: Blocker
> Labels: correctness, pull-request-available
>
>
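> We allow users to repartition by a map column. This leads to incorrect
> results: maps that are equal but were built with a different insertion
> order can be sent to different partitions.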
> {code:java}
> // Create a sequence of maps that all have the same elements, but a
> // different insertion order.
> import scala.util.Random
> // With 4 elements, toMap uses a scala.collection.immutable.Map$Map4,
> // which retains the insertion order.
> val elements = Seq.tabulate(4)(i => i -> s"v$i")
> val maps = Seq.fill(10)(Random.shuffle(elements).toMap)
> // Check if they are all the same in scala land.
> assert(maps.distinct.size == 1)
> // This fails, which is good.
> maps.toDF.distinct.show()
> // All rows should land in a single partition. However, they are spread
> // across multiple partitions.
> maps.toDF.repartition(4, $"value").groupBy(spark_partition_id()).count().show()
> // +--------------------+-----+
> // |SPARK_PARTITION_ID()|count|
> // +--------------------+-----+
> // |                   0|    2|
> // |                   1|    4|
> // |                   2|    2|
> // |                   3|    2|
> // +--------------------+-----+{code}
>
>
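The effect in the report can be illustrated without Spark. The sketch below is only an assumption-laden model, not Spark's actual hashing code: `orderSensitiveHash`, `normalizedHash`, `partitionOf` and the `MapPartitionSketch` object are hypothetical names introduced here. It shows that two equal small maps can iterate in different orders, so any hash that folds entries in iteration order may disagree for equal maps, while a hash over key-sorted entries cannot. Note that the issue itself proposes disallowing repartitioning by a map; the normalization here only illustrates why entry order matters.

```scala
import scala.util.Random

object MapPartitionSketch {
  // Hypothetical stand-in for an order-sensitive hash: it folds the entries
  // in iteration order, so the result depends on that order.
  def orderSensitiveHash(m: Map[Int, String]): Int =
    m.foldLeft(17) { case (acc, entry) => 31 * acc + entry.hashCode }

  // Hypothetical normalization: sort the entries by key before folding, so
  // equal maps always produce the same hash.
  def normalizedHash(m: Map[Int, String]): Int =
    m.toSeq.sortBy(_._1).foldLeft(17) { case (acc, entry) => 31 * acc + entry.hashCode }

  // Map a hash to a non-negative partition index.
  def partitionOf(hash: Int, numPartitions: Int): Int =
    Math.floorMod(hash, numPartitions)

  // Same setup as in the report: 4 elements, so toMap builds a Map4.
  val elements: Seq[(Int, String)] = Seq.tabulate(4)(i => i -> s"v$i")
  val forward: Map[Int, String] = elements.toMap
  val reversed: Map[Int, String] = elements.reverse.toMap
  val shuffled: Seq[Map[Int, String]] = Seq.fill(10)(Random.shuffle(elements).toMap)

  def main(args: Array[String]): Unit = {
    // The maps are equal, yet they iterate in different orders.
    println(s"equal maps: ${forward == reversed}")
    println(s"same iteration order: ${forward.toSeq == reversed.toSeq}")

    // Partitions chosen by the order-sensitive hash may differ per map.
    println(shuffled.map(m => partitionOf(orderSensitiveHash(m), 4)).distinct)

    // Partitions chosen by the normalized hash always collapse to one value.
    println(shuffled.map(m => partitionOf(normalizedHash(m), 4)).distinct)
  }
}
```

Running `MapPartitionSketch.main(Array.empty)` prints the partition indices chosen by each hash; only the normalized variant is guaranteed to yield a single distinct index for the shuffled maps.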