Hi Devs,
this was raised by Swetha on the user mailing list, and it also hit us
recently.
Here is the question again:
*Is it guaranteed that written files are sorted as specified by
**sortWithinPartitions**?*
ds.repartition($"day")
  .sortWithinPartitions($"day", $"id")
  .write
  .partitionBy("day")
  .csv("interleaved.csv")
This construct is a common way to generate partitioned and sorted files,
where downstream systems depend on the guaranteed order.
Instead of
0
1
2
3
4
...
9999995
9999996
9999997
9999998
9999999
You get the following, where two sorted runs (0, 1, 2, ... and 8388608,
8388609, ...) are interleaved:
0
8388608
1
8388609
2
8388610
3
8388611
4
...
1611390
9999998
1611391
9999999
1611392
1611393
1611394
...
8388600
8388601
8388602
8388603
8388604
8388605
8388606
8388607
This used to work up to and including 3.0.3. *Was this guaranteed to
work, or did it just happen to be correct?*
It stopped working with 3.1.0, but we can work around it by setting
spark.sql.adaptive.coalescePartitions.enabled="false". *Is that
guaranteed to fix it?*
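For reference, this is how we set it in the spark-shell session (the same
key can also be passed via --conf to spark-submit):

spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "false")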
With 3.2.x and 3.3.x, that workaround no longer works. *Is there a
workaround for these versions?*
It has been fixed in 3.4.0-SNAPSHOT. *Was that fixed intentionally or
accidentally?*
Code to reproduce (paste into spark-shell):
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.SaveMode

val ids = 10000000
val days = 10

case class Value(day: Long, id: Long)

val ds = spark.range(days)
  .withColumnRenamed("id", "day")
  .join(spark.range(ids))
  .as[Value]

// days * 10 partitions are required, as well as a sufficiently large
// value for ids (10m) and days (10)
ds.repartition(days * 10, $"day")
  .sortWithinPartitions($"day", $"id")
  .write
  .partitionBy("day")
  .mode(SaveMode.Overwrite)
  .csv("interleaved.csv")

val df = spark.read
  .schema(Encoders.product[Value].schema)
  .csv("interleaved.csv")
Check that the written files are sorted (prints OK for each file that is
sorted):
# a file is sorted iff sorting it does not change its md5 sum
for file in interleaved.csv/day\=*/part-*
do
  echo "$(sort -n "$file" | md5sum | cut -d " " -f 1) $file"
done | md5sum -c
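In case a pure-Spark check is preferred, a rough equivalent of the above
in Scala might look like this (just a sketch: it assumes each part file
fits in memory, since wholeTextFiles loads files whole, and relies on the
written files containing only the id column, day being encoded in the
directory name):

val unsortedFiles = spark.sparkContext
  .wholeTextFiles("interleaved.csv/day=*")
  .filter { case (_, content) =>
    // a file is unsorted if any id is smaller than its predecessor
    val ids = content.split("\n").filter(_.nonEmpty).map(_.toLong)
    ids.indices.drop(1).exists(i => ids(i - 1) > ids(i))
  }
  .keys
  .collect()

// prints OK when all files are sorted
println(if (unsortedFiles.isEmpty) "OK" else unsortedFiles.mkString("\n"))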
Thanks for sharing any background knowledge on this.
Cheers,
Enrico