Shekharrajak opened a new issue, #3596:
URL: https://github.com/apache/datafusion-comet/issues/3596

   ### What is the problem the feature request solves?
   
   
   Replace disk-based shuffle with Arrow Flight for direct memory-to-memory data exchange between executors, eliminating intermediate disk I/O and leveraging Arrow's native IPC format for efficient shuffle.
   
    #### Motivation
   
     Comet already uses Arrow RecordBatches internally, but shuffle still goes through disk:
   
     Current Flow:
     Arrow RecordBatch → Arrow IPC → Compress → DISK → Network → DISK → Decompress → Arrow RecordBatch
   
     Proposed Flow:
     Arrow RecordBatch → Arrow Flight (gRPC) → Arrow RecordBatch
   
   
   ### Describe the potential solution
   
   _No response_
   
   ### Additional context
   
   #### Configuration
   ```
   
     # Enable Arrow Flight shuffle
     spark.shuffle.manager=org.apache.comet.shuffle.CometFlightShuffleManager
   
     # Flight server configuration
     spark.comet.shuffle.flight.enabled=true
     spark.comet.shuffle.flight.port=50051
   
     # Memory management
     spark.comet.shuffle.flight.memoryFraction=0.3
     spark.comet.shuffle.flight.spillThreshold=0.8
   
     # Network configuration
     # 64 MiB, in bytes (comment kept on its own line so it is not parsed as part of the value)
     spark.comet.shuffle.flight.maxMessageSize=67108864
     spark.comet.shuffle.flight.compression=zstd
   
     # Fault tolerance
     spark.comet.shuffle.flight.retryAttempts=3
     spark.comet.shuffle.flight.retryDelayMs=1000
   ```
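
As a rough illustration of how the proposed settings could be sanity-checked before a Flight server starts, here is a minimal Python sketch. The `spark.comet.shuffle.flight.*` keys are copied from the proposal and do not exist in Comet today; `parse_conf`, `validate`, and the chosen value ranges are hypothetical helpers and assumptions made for this example only.

```python
# Keys copied from the proposal above; they are proposed, not existing, Comet settings.
PROPOSED_CONF = """
spark.shuffle.manager=org.apache.comet.shuffle.CometFlightShuffleManager
spark.comet.shuffle.flight.enabled=true
spark.comet.shuffle.flight.port=50051
spark.comet.shuffle.flight.memoryFraction=0.3
spark.comet.shuffle.flight.spillThreshold=0.8
# 64 MiB, in bytes
spark.comet.shuffle.flight.maxMessageSize=67108864
spark.comet.shuffle.flight.compression=zstd
spark.comet.shuffle.flight.retryAttempts=3
spark.comet.shuffle.flight.retryDelayMs=1000
"""

PREFIX = "spark.comet.shuffle.flight."


def parse_conf(text):
    """Parse key=value lines, skipping blank lines and full-line '#' comments."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        conf[key.strip()] = value.strip()
    return conf


def validate(conf):
    """Return a list of problems; an empty list means the values look sane.

    The accepted ranges are assumptions for illustration, not rules from the proposal.
    """
    problems = []
    port = int(conf[PREFIX + "port"])
    if not 1 <= port <= 65535:
        problems.append(f"port out of range: {port}")
    for name in ("memoryFraction", "spillThreshold"):
        value = float(conf[PREFIX + name])
        if not 0.0 < value <= 1.0:
            problems.append(f"{name} must be in (0, 1]: {value}")
    if int(conf[PREFIX + "maxMessageSize"]) <= 0:
        problems.append("maxMessageSize must be positive")
    if int(conf[PREFIX + "retryAttempts"]) < 0:
        problems.append("retryAttempts must be non-negative")
    return problems


conf = parse_conf(PROPOSED_CONF)
problems = validate(conf)

# Convert the byte count to MiB to confirm it matches the "64MB" intent.
max_message_mib = int(conf[PREFIX + "maxMessageSize"]) / (1024 * 1024)

# Assumes a fixed delay between attempts; the proposal does not say whether backoff is used.
total_retry_wait_ms = int(conf[PREFIX + "retryAttempts"]) * int(conf[PREFIX + "retryDelayMs"])
```

With the values from the proposal, `problems` is empty, `max_message_mib` is 64.0, and a fetch that exhausts all retries would wait 3000 ms in total under the fixed-delay assumption.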
   
   
   #### Related Work
   
     - Ballista: DataFusion's distributed query engine uses Arrow Flight
     - Dask: Exploring Arrow Flight for task communication
     - Ray: Uses gRPC for object transfer (similar concept)
     - Spark 3.2 Push-Based Shuffle: Inspiration for push model
   
    #### References
   
     - https://arrow.apache.org/docs/format/Flight.html
     - https://arrow.apache.org/docs/format/FlightSql.html
     - https://github.com/apache/datafusion-comet/tree/main/native/core/src/execution/shuffle
     - https://spark.apache.org/docs/latest/api/java/org/apache/spark/shuffle/ShuffleManager.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

