[ https://issues.apache.org/jira/browse/SPARK-47650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-47650:
----------------------------------
Priority: Critical (was: Blocker)
> Spark DataFrame processed the same partition twice, which can be reproduced
> with very simple code.
> --------------------------------------------------------------------------------------------------
>
> Key: SPARK-47650
> URL: https://issues.apache.org/jira/browse/SPARK-47650
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 3.5.1
> Reporter: yinan zhan
> Priority: Critical
> Labels: pull-request-available
>
> The data in partition 0 was processed twice.
> Here is the reproduction code; the issue occurs on every run.
> The line starting with "0cool" appears to be printed twice.
>
> {code:java}
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder \
>     .appName("llm hard negs records") \
>     .master(f"local[8]") \
>     .getOrCreate()
>
> def process_partition(index, partition):
>     results = []
>     s = 0
>     for _ in partition:
>         row = {"Result": "cool"}
>         results.append(row)
>         s += 1
>     print(str(index) + "cool" + str(s))
>     return results
>
> data = list(range(2000))
> results_rdd = spark.sparkContext.parallelize(data).repartition(8).mapPartitionsWithIndex(process_partition)
> results_df = results_rdd.toDF(["Query"])
> output_path = "/tmp/bc_inputs6"
> results_df.write.json(output_path, mode="overwrite")
> {code}
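
To check the symptom mechanically rather than by eye, one could capture the driver's stdout and count how often each partition index shows up. The sketch below assumes the captured lines have the "<index>cool<count>" form produced by the print call in the reproduction. The function names and the sample lines are hypothetical and are not taken from the report.

```python
import re
from collections import Counter

# Matches lines of the form "<index>cool<count>", as printed by process_partition.
LINE_RE = re.compile(r"^(\d+)cool(\d+)$")


def count_partition_runs(lines):
    """Map each partition index to the number of times it was printed."""
    runs = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # ignore unrelated log lines
            runs[int(match.group(1))] += 1
    return runs


def duplicated_partitions(lines):
    """Return the sorted partition indices that were printed more than once."""
    runs = count_partition_runs(lines)
    return sorted(index for index, n in runs.items() if n > 1)


# Hypothetical captured output (illustrative values only).
sample = ["0cool250", "1cool250", "some unrelated log line", "0cool250", "2cool250"]
print(duplicated_partitions(sample))
```

Any index this returns corresponds to a partition whose function ran more than once during the job.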