Re: [PR] chore: Improve shuffle configuration [datafusion-comet]

via GitHub Wed, 08 Jan 2025 13:56:17 -0800


kazuyukitanimura commented on code in PR #1207:
URL: https://github.com/apache/datafusion-comet/pull/1207#discussion_r1907901646



##########
docs/source/user-guide/tuning.md:
##########
@@ -78,43 +78,47 @@ It must be set before the Spark context is created. You can enable or disable Co
 at runtime by setting `spark.comet.exec.shuffle.enabled` to `true` or `false`.
 Once it is disabled, Comet will fall back to the default Spark shuffle manager.
 
-### Shuffle Mode
+### Shuffle Implementations
 
-Comet provides three shuffle modes: Columnar Shuffle, Native Shuffle and Auto Mode.
+Comet provides two shuffle implementations.
 
-#### Auto Mode
-
-`spark.comet.exec.shuffle.mode` to `auto` will let Comet choose the best shuffle mode based on the query plan. This
-is the default.
+#### Native Shuffle
 
-#### Columnar (JVM) Shuffle
+Comet Native Shuffle reads columnar batches and repartitions them before writing the shuffled columnar data
+to the output file. Native Shuffle supports `HashPartitioning` and `SinglePartition`. Only primitive types
+are supported for partitioning expressions (`Boolean`, `Byte`, `Short`, `Integer`, `Long`, `Float`, `Double`,
+`Decimal`, `Date`, `Timestamp`, `String`, and `Binary`).
 
-Comet Columnar shuffle is JVM-based and supports `HashPartitioning`, `RoundRobinPartitioning`, `RangePartitioning`, and
-`SinglePartitioning`. This mode has the highest query coverage.
+Native shuffle is enabled by default and can be disabled by setting `spark.comet.exec.shuffle.native.enabled=false`.
 
-Columnar shuffle can be enabled by setting `spark.comet.exec.shuffle.mode` to `jvm`. If this mode is explicitly set,
-then any shuffle operations that cannot be supported in this mode will fall back to Spark.
+#### Columnar (JVM) Shuffle
 
-#### Native Shuffle
+Comet Columnar Shuffle is used for cases where Native Shuffle is not supported. Columnar Shuffle supports
+`HashPartitioning`, `RangePartitioning`, `RoundRobinPartitioning` and `SinglePartition` and supports complex
+types for hash-partitioning expressions in addition to the primitive types supported by Native Shuffle.
 
-Comet also provides a fully native shuffle implementation, which generally provides the best performance. However,
-native shuffle currently only supports `HashPartitioning` and `SinglePartitioning`.
+Columnar Shuffle inserts a `ColumnarToRowExec` transition on the input data (this does not appear in the query
+plan) and delegates the partitioning to Spark. The partitioned output rows are then converted back into columnar
+format before being written to the shuffle output file.
 
-To enable native shuffle, set `spark.comet.exec.shuffle.mode` to `native`. If this mode is explicitly set,
-then any shuffle operations that cannot be supported in this mode will fall back to Spark.
+Columnar shuffle is enabled by default and can be disabled by setting `spark.comet.exec.shuffle.columnar.enabled=false`.
 
 ### Shuffle Compression
 
 By default, Spark compresses shuffle files using LZ4 compression. Comet overrides this behavior with ZSTD compression.

Review Comment:
   `Comet overrides this behavior with ZSTD compression.`
   I think this statement is no longer accurate.


