allisonwang-db commented on code in PR #50684:
URL: https://github.com/apache/spark/pull/50684#discussion_r2056894761
##########
python/docs/source/user_guide/sql/python_data_source.rst:
##########
@@ -356,17 +356,28 @@ For library that are used inside a method, it must be imported inside the method

     from pyspark import TaskContext
     context = TaskContext.get()

+Mutating State
+~~~~~~~~~~~~~~
+Some methods such as DataSourceReader.read() and DataSourceReader.partitions() must be stateless. Changes to the object state made in these methods are not guaranteed to be visible or invisible to future invocations.
+
+Other methods such as DataSource.schema() and DataSourceStreamReader.latestOffset() can be stateful. Changes to the object state made in these methods are visible to future invocations.
+
+Refer to the documentation of each method for more details.

Review Comment:
   Can we also link to the documentation here?

##########
python/docs/source/user_guide/sql/python_data_source.rst:
##########
@@ -517,6 +530,121 @@ The following example demonstrates how to implement a basic Data Source using Ar
     df.show()

+Filter Pushdown in Python Data Sources
+--------------------------------------
+
+Filter pushdown is an optimization technique that allows data sources to handle filters natively, reducing the amount of data that needs to be transferred and processed by Spark.
+
+The filter pushdown API is introduced in Spark 4.1, enabling DataSourceReader to selectively push down filters from the query to the source.

Review Comment:
   We don't need to mention Spark 4.1 here (we don't backport documentation PRs)

##########
python/pyspark/sql/datasource.py:
##########
@@ -539,6 +539,11 @@ def pushFilters(self, filters: List["Filter"]) -> Iterable["Filter"]:
         This method is allowed to modify `self`. The object must remain picklable.
         Modifications to `self` are visible to the `partitions()` and `read()` methods.

+        Notes
+        -----
+        Configuration `spark.sql.python.filterPushdown.enabled` must be set to `true`
+        to implement this method.

Review Comment:
   Not sure if we should put this in the doc. Can we throw a warning in the code?
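For illustration, the runtime warning suggested here might look something like the sketch below. This is only an assumption about one possible implementation, not code from this PR: the helper `_maybe_warn_filter_pushdown_disabled` and its call site are hypothetical, and it presumes a build where `pushFilters` is defined on `DataSourceReader`.

```python
# Hypothetical sketch: warn when a reader overrides pushFilters but the
# spark.sql.python.filterPushdown.enabled flag is off, instead of only
# documenting the requirement in the docstring.
import warnings

from pyspark.sql.datasource import DataSourceReader


def _maybe_warn_filter_pushdown_disabled(
    reader: DataSourceReader, conf_enabled: bool
) -> None:
    # Only warn when the user actually overrode pushFilters; the base
    # implementation simply declines to push any filters down.
    overrides = type(reader).pushFilters is not DataSourceReader.pushFilters
    if overrides and not conf_enabled:
        warnings.warn(
            "DataSourceReader.pushFilters is implemented, but "
            "'spark.sql.python.filterPushdown.enabled' is disabled, "
            "so no filters will be pushed down to the source."
        )
```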
##########
python/docs/source/user_guide/sql/python_data_source.rst:
##########
@@ -517,6 +530,121 @@ The following example demonstrates how to implement a basic Data Source using Ar
     df.show()

+Filter Pushdown in Python Data Sources
+--------------------------------------
+
+Filter pushdown is an optimization technique that allows data sources to handle filters natively, reducing the amount of data that needs to be transferred and processed by Spark.
+
+The filter pushdown API is introduced in Spark 4.1, enabling DataSourceReader to selectively push down filters from the query to the source.
+
+You must turn on the configuration ``spark.sql.python.filterPushdown.enabled`` to enable filter pushdown.
+
+**How Filter Pushdown Works**
+
+When a query includes filter conditions, Spark can pass these filters to the data source implementation, which can then apply the filters during data retrieval. This is especially beneficial for:
+
+- Data sources backed by formats that allow efficient filtering (e.g. key-value stores)
+- APIs that support filtering (e.g. REST and GraphQL APIs)
+
+The data source receives the filters, decides which ones can be pushed down, and returns the remaining filters to Spark to be applied later.
+
+**Implementing Filter Pushdown**
+
+To enable filter pushdown in your Python Data Source, implement the ``pushFilters`` method in your ``DataSourceReader`` class:
+
+.. code-block:: python
+
+    from pyspark.sql.datasource import EqualTo, Filter, GreaterThan, LessThan
+
+    def pushFilters(self, filters: List[Filter]) -> Iterable[Filter]:

Review Comment:
   Can we add a complete example here so that people can copy paste and try it out?
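A complete, copy-pasteable example of the kind asked for here might look like the sketch below. The `FakeDataSource`/`FakeDataSourceReader` names and the in-memory rows are invented for illustration, and it assumes a Spark build that ships the `pushFilters` API with `spark.sql.python.filterPushdown.enabled` set to `true`.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

from pyspark.sql.datasource import DataSource, DataSourceReader, EqualTo, Filter
from pyspark.sql.types import StructType


class FakeDataSourceReader(DataSourceReader):
    def __init__(self) -> None:
        # State stored here by pushFilters() is visible later in read().
        self.pushed_id: Optional[int] = None

    def pushFilters(self, filters: List[Filter]) -> Iterable[Filter]:
        for f in filters:
            # attribute is a column-path tuple, e.g. ("id",).
            if isinstance(f, EqualTo) and f.attribute == ("id",):
                self.pushed_id = f.value  # handle this filter at the source
            else:
                yield f  # return the rest for Spark to apply itself

    def read(self, partition) -> Iterator[Tuple]:
        rows = [(i, f"name-{i}") for i in range(10)]
        if self.pushed_id is not None:
            # Pretend the backing store can look the id up directly.
            rows = [r for r in rows if r[0] == self.pushed_id]
        yield from rows


class FakeDataSource(DataSource):
    @classmethod
    def name(cls) -> str:
        return "fake"

    def schema(self) -> str:
        return "id int, name string"

    def reader(self, schema: StructType) -> DataSourceReader:
        return FakeDataSourceReader()


# Usage, assuming an active SparkSession named `spark`:
# spark.conf.set("spark.sql.python.filterPushdown.enabled", "true")
# spark.dataSource.register(FakeDataSource)
# spark.read.format("fake").load().filter("id = 3").show()
```

This leans on the contract quoted above from the `pushFilters` docstring: modifications to `self` made in `pushFilters` (here `pushed_id`) remain visible to `partitions()` and `read()`.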