Shekharrajak opened a new issue, #3596:
URL: https://github.com/apache/datafusion-comet/issues/3596
### What is the problem the feature request solves?
Replace disk-based shuffle with Arrow Flight for direct memory-to-memory
data exchange between executors, eliminating intermediate disk I/O and
leveraging Arrow's native IPC format for efficient shuffle.
#### Motivation
Comet already uses Arrow RecordBatches internally, but shuffle still goes
through disk:
Current Flow:
Arrow RecordBatch → Arrow IPC → Compress → DISK → Network → DISK → Decompress → Arrow RecordBatch
Proposed Flow:
Arrow RecordBatch → Arrow Flight (gRPC) → Arrow RecordBatch
### Describe the potential solution
_No response_
### Additional context
#### Configuration
```
# Enable Arrow Flight shuffle
spark.shuffle.manager=org.apache.comet.shuffle.CometFlightShuffleManager
# Flight server configuration
spark.comet.shuffle.flight.enabled=true
spark.comet.shuffle.flight.port=50051
# Memory management
spark.comet.shuffle.flight.memoryFraction=0.3
spark.comet.shuffle.flight.spillThreshold=0.8
# Network configuration
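# Maximum message size in bytes (64 MiB); kept on its own line because
# a trailing "# ..." would be read as part of the property value
spark.comet.shuffle.flight.maxMessageSize=67108864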
spark.comet.shuffle.flight.compression=zstd
# Fault tolerance
spark.comet.shuffle.flight.retryAttempts=3
spark.comet.shuffle.flight.retryDelayMs=1000
```
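To make the intended semantics of these settings easier to discuss, here is a minimal Python sketch of how they might be parsed and interpreted. None of these keys exist in Comet today. The class `FlightShuffleConf`, its helper methods, and the interpretation of each setting are assumptions made for illustration: `memoryFraction` is taken to be a share of executor memory, `spillThreshold` a share of that shuffle budget, and `retryAttempts` the number of retries after the first failure with a fixed delay.

```python
from dataclasses import dataclass

# Hypothetical key prefix taken from the proposal above.
PREFIX = "spark.comet.shuffle.flight."

EXAMPLE_CONF = """
# Memory management
spark.comet.shuffle.flight.memoryFraction=0.3
spark.comet.shuffle.flight.spillThreshold=0.8
# Network configuration
spark.comet.shuffle.flight.maxMessageSize=67108864
# Fault tolerance
spark.comet.shuffle.flight.retryAttempts=3
spark.comet.shuffle.flight.retryDelayMs=1000
"""


def parse_properties(text):
    """Parse key=value lines, skipping blank lines and full-line # comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        conf[key.strip()] = value.strip()
    return conf


@dataclass(frozen=True)
class FlightShuffleConf:
    """Illustrative holder for the proposed settings (not a Comet API)."""
    memory_fraction: float
    spill_threshold: float
    max_message_size: int
    retry_attempts: int
    retry_delay_ms: int

    @classmethod
    def from_properties(cls, conf):
        parsed = cls(
            memory_fraction=float(conf[PREFIX + "memoryFraction"]),
            spill_threshold=float(conf[PREFIX + "spillThreshold"]),
            max_message_size=int(conf[PREFIX + "maxMessageSize"]),
            retry_attempts=int(conf[PREFIX + "retryAttempts"]),
            retry_delay_ms=int(conf[PREFIX + "retryDelayMs"]),
        )
        # Basic sanity checks; the valid ranges are assumptions.
        if not 0.0 < parsed.memory_fraction <= 1.0:
            raise ValueError("memoryFraction must be in (0, 1]")
        if not 0.0 < parsed.spill_threshold <= 1.0:
            raise ValueError("spillThreshold must be in (0, 1]")
        if parsed.max_message_size <= 0 or parsed.retry_attempts < 0:
            raise ValueError("maxMessageSize must be > 0 and retryAttempts >= 0")
        return parsed

    def shuffle_budget_bytes(self, executor_memory_bytes):
        # Assumption: memoryFraction is a share of total executor memory.
        return int(executor_memory_bytes * self.memory_fraction)

    def should_spill(self, used_bytes, executor_memory_bytes):
        # Assumption: spill once usage exceeds spillThreshold of the budget.
        budget = self.shuffle_budget_bytes(executor_memory_bytes)
        return used_bytes > self.spill_threshold * budget

    def retry_delays_ms(self):
        # Assumption: fixed delay between retries, no exponential backoff.
        return [self.retry_delay_ms] * self.retry_attempts


if __name__ == "__main__":
    conf = FlightShuffleConf.from_properties(parse_properties(EXAMPLE_CONF))
    print(conf)
    print("spill at 290 of 1000 bytes:", conf.should_spill(290, 1000))
```

Whether the spill threshold should be relative to the shuffle budget or to total executor memory, and whether retries should back off exponentially, are open design questions that the proposal does not settle.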
#### Related Work
- Ballista: DataFusion's distributed query engine uses Arrow Flight
- Dask: Exploring Arrow Flight for task communication
- Ray: Uses gRPC for object transfer (similar concept)
- Spark 3.2 Push-Based Shuffle: Inspiration for push model
#### References
- https://arrow.apache.org/docs/format/Flight.html
- https://arrow.apache.org/docs/format/FlightSql.html
- https://github.com/apache/datafusion-comet/tree/main/native/core/src/execution/shuffle
- https://spark.apache.org/docs/latest/api/java/org/apache/spark/shuffle/ShuffleManager.html