Re: [PR] docs: various improvements to tuning guide [datafusion-comet]

via GitHub Tue, 18 Mar 2025 23:13:40 -0700


kazuyukitanimura commented on code in PR #1525:
URL: https://github.com/apache/datafusion-comet/pull/1525#discussion_r2002475656



##########
common/src/main/scala/org/apache/comet/CometConf.scala:
##########
@@ -274,11 +272,9 @@ object CometConf extends ShimCometConf {
       .createWithDefault(true)
 
   val COMET_SHUFFLE_MODE: ConfigEntry[String] = 
conf(s"$COMET_EXEC_CONFIG_PREFIX.shuffle.mode")
-    .doc("The mode of Comet shuffle. This config is only effective if Comet 
shuffle " +
-      "is enabled. Available modes are 'native', 'jvm', and 'auto'. " +
-      "'native' is for native shuffle which has best performance in general. " 
+
-      "'jvm' is for jvm-based columnar shuffle which has higher coverage than 
native shuffle. " +
-      "'auto' is for Comet to choose the best shuffle mode based on the query 
plan.")
+    .doc(
+      "This is test config to allow tests to force a particular shuffle 
implementation to be " +
+        "used.")

Review Comment:
   Should we put `$TUNING_GUIDE` as we are removing explanations on what each 
mode means?
   



##########
docs/source/user-guide/tuning.md:
##########
@@ -141,30 +191,22 @@ It must be set before the Spark context is created. You 
can enable or disable Co
 at runtime by setting `spark.comet.exec.shuffle.enabled` to `true` or `false`.
 Once it is disabled, Comet will fall back to the default Spark shuffle manager.
 
-### Shuffle Mode
+### Shuffle Implementations
 
-Comet provides three shuffle modes: Columnar Shuffle, Native Shuffle and Auto 
Mode.
+Comet provides two shuffle implementations: Native Shuffle and Columnar 
Shuffle. Comet will use Native Shuffle 
+where possible, then will use Columnar Shuffle where possible, and will fall 
back to Spark for shuffle operations 
+that cannot be supported by either. 
 
-#### Auto Mode

Review Comment:
   Should we keep the explanation how the auto mode works? e.g. native -> 
columnar -> spark



##########
docs/source/user-guide/tuning.md:
##########
@@ -141,30 +191,22 @@ It must be set before the Spark context is created. You 
can enable or disable Co
 at runtime by setting `spark.comet.exec.shuffle.enabled` to `true` or `false`.
 Once it is disabled, Comet will fall back to the default Spark shuffle manager.
 
-### Shuffle Mode
+### Shuffle Implementations
 
-Comet provides three shuffle modes: Columnar Shuffle, Native Shuffle and Auto 
Mode.
+Comet provides two shuffle implementations: Native Shuffle and Columnar 
Shuffle. Comet will use Native Shuffle 
+where possible, then will use Columnar Shuffle where possible, and will fall 
back to Spark for shuffle operations 
+that cannot be supported by either. 
 
-#### Auto Mode
+#### Native Shuffle
 
-`spark.comet.exec.shuffle.mode` to `auto` will let Comet choose the best 
shuffle mode based on the query plan. This
-is the default.
+Comet provides a fully native shuffle implementation, which generally provides 
the best performance. However,
+native shuffle currently only supports `HashPartitioning` and 
`SinglePartitioning` and has some restrictions on
+supported data types.
 
 #### Columnar (JVM) Shuffle
 
 Comet Columnar shuffle is JVM-based and supports `HashPartitioning`, 
`RoundRobinPartitioning`, `RangePartitioning`, and
-`SinglePartitioning`. This mode has the highest query coverage.
-
-Columnar shuffle can be enabled by setting `spark.comet.exec.shuffle.mode` to 
`jvm`. If this mode is explicitly set,
-then any shuffle operations that cannot be supported in this mode will fall 
back to Spark.

Review Comment:
   I feel this is still informative to say it will fallback to Spark if 
necessary 



##########
docs/source/user-guide/tuning.md:
##########
@@ -17,18 +17,96 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# Tuning Guide
+# Comet Tuning Guide
 
 Comet provides some tuning options to help you get the best performance from 
your queries.
 
 ## Memory Tuning
 
-### Unified Memory Management with Off-Heap Memory
+It is necessary to specify how much memory Comet can use in addition to memory 
already allocated to Spark. In some
+cases, it may be possible to reduce the amount of memory allocated to Spark so 
that overall memory allocation is
+the same or lower than the original configuration. In other cases, enabling 
Comet may require allocating more memory
+than before. See the [Determining How Much Memory to Allocate] section for 
more details.
 
-The recommended way to share memory between Spark and Comet is to set 
`spark.memory.offHeap.enabled=true`. This allows
-Comet to share an off-heap memory pool with Spark. The size of the pool is 
specified by `spark.memory.offHeap.size`. For more details about Spark off-heap 
memory mode, please refer to Spark documentation: 
https://spark.apache.org/docs/latest/configuration.html.
+[Determining How Much Memory to Allocate]: 
#determining-how-much-memory-to-allocate
 
-The type of pool can be specified with `spark.comet.exec.memoryPool`.
+Comet supports Spark's on-heap (the default) and off-heap mode for allocating 
memory. However, we strongly recommend
+using off-heap mode. Comet has some limitations when running in on-heap mode, 
such as requiring more memory overall,
+and requiring shuffle memory to be separately configured.
+
+### Configuring Comet Memory in Off-Heap Mode
+
+The recommended way to allocate memory for Comet is to set 
`spark.memory.offHeap.enabled=true`. This allows
+Comet to share an off-heap memory pool with Spark, reducing the overall memory 
overhead. The size of the pool is
+specified by `spark.memory.offHeap.size`. For more details about Spark 
off-heap memory mode, please refer to
+Spark documentation: https://spark.apache.org/docs/latest/configuration.html.

Review Comment:
   Should we mention that `spark.comet.memoryOverhead` is used for off-heap as 
well? Otherwise, it will involve `spark.comet.memory.overhead.factor`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] docs: various improvements to tuning guide [datafusion-comet]

Reply via email to