Re: [PR] Expose Arrow C stream and DataFrame iterator (zero‑copy streaming to PyArrow) [datafusion-python]

via GitHub Wed, 03 Sep 2025 10:03:38 -0700


kylebarron commented on code in PR #1222:
URL: https://github.com/apache/datafusion-python/pull/1222#discussion_r2319615655



##########
python/datafusion/dataframe.py:
##########
@@ -1098,21 +1102,42 @@ def unnest_columns(self, *columns: str, preserve_nulls: bool = True) -> DataFram
         return DataFrame(self.df.unnest_columns(columns, preserve_nulls=preserve_nulls))
 
     def __arrow_c_stream__(self, requested_schema: object | None = None) -> object:
-        """Export an Arrow PyCapsule Stream.
+        """Export the DataFrame as an Arrow C Stream.
 
-        This will execute and collect the DataFrame. We will attempt to respect the
-        requested schema, but only trivial transformations will be applied such as only
-        returning the fields listed in the requested schema if their data types match
-        those in the DataFrame.
+        The DataFrame is executed using DataFusion's streaming APIs and exposed via
+        Arrow's C Stream interface. Record batches are produced incrementally, so the
+        full result set is never materialized in memory. When ``requested_schema`` is
+        provided, only straightforward projections such as column selection or
+        reordering are applied.
 
         Args:
             requested_schema: Attempt to provide the DataFrame using this schema.
 
         Returns:
-            Arrow PyCapsule object.
+            Arrow PyCapsule object representing an ``ArrowArrayStream``.
         """
+        # ``DataFrame.__arrow_c_stream__`` in the Rust extension leverages
+        # ``execute_stream_partitioned`` under the hood to stream batches while
+        # preserving the original partition order.
         return self.df.__arrow_c_stream__(requested_schema)
 
+    def __iter__(self) -> Iterator[pa.RecordBatch]:

Review Comment:
   `RecordBatchStream` already has `__iter__` and `__aiter__` methods: https://datafusion.apache.org/python/autoapi/datafusion/record_batch/index.html#datafusion.record_batch.RecordBatchStream
   
   Can we just have a method that converts a `DataFrame` into a `RecordBatchStream`? Then an `__iter__` on `DataFrame` would just convert to a `RecordBatchStream` under the hood.
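
A minimal sketch of the shape this suggestion could take, using pure-Python stand-ins rather than the real datafusion classes. The `to_stream` method name, the `partitions` constructor argument, and the string "batches" are all hypothetical and only illustrate the delegation; the actual conversion method in datafusion-python may be named and implemented differently.

```python
from __future__ import annotations

from collections.abc import Iterable, Iterator


class RecordBatchStream:
    """Stand-in for datafusion's RecordBatchStream: a one-shot iterator."""

    def __init__(self, batches: Iterable[object]) -> None:
        self._batches = iter(batches)

    def __iter__(self) -> Iterator[object]:
        return self

    def __next__(self) -> object:
        return next(self._batches)


class DataFrame:
    """Stand-in for datafusion's DataFrame; `partitions` fakes a query result."""

    def __init__(self, partitions: list[list[object]]) -> None:
        self._partitions = partitions

    def to_stream(self) -> RecordBatchStream:
        # Hypothetical conversion method: every call starts a fresh stream
        # that yields batches lazily, partition by partition.
        return RecordBatchStream(
            batch for part in self._partitions for batch in part
        )

    def __iter__(self) -> Iterator[object]:
        # No iteration logic lives on DataFrame; it delegates to the stream.
        return iter(self.to_stream())


df = DataFrame([["batch-0", "batch-1"], ["batch-2"]])
batches = list(df)
```

With this layout the iteration protocol is implemented once, on the stream, and `DataFrame.__iter__` stays a one-line wrapper. Because each call to `to_stream` builds a new stream, the `DataFrame` stand-in can be iterated more than once even though an individual stream is exhausted after a single pass.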



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


