wengh commented on code in PR #50684: URL: https://github.com/apache/spark/pull/50684#discussion_r2069715404
##########
python/docs/source/user_guide/sql/python_data_source.rst:
##########
@@ -517,6 +530,121 @@ The following example demonstrates how to implement a basic Data Source using Ar

     df.show()

+Filter Pushdown in Python Data Sources
+--------------------------------------
+
+Filter pushdown is an optimization technique that allows data sources to handle filters natively, reducing the amount of data that needs to be transferred to and processed by Spark.
+
+The filter pushdown API, introduced in Spark 4.1, enables a ``DataSourceReader`` to selectively push down filters from the query to the source.
+
+You must set the configuration ``spark.sql.python.filterPushdown.enabled`` to ``true`` to enable filter pushdown.
+
+**How Filter Pushdown Works**
+
+When a query includes filter conditions, Spark can pass these filters to the data source implementation, which can then apply them during data retrieval. This is especially beneficial for:
+
+- Data sources backed by formats that allow efficient filtering (e.g. key-value stores)
+- APIs that support filtering (e.g. REST and GraphQL APIs)
+
+The data source receives the filters, decides which ones it can push down, and returns the remaining filters to Spark to be applied afterwards.
+
+**Implementing Filter Pushdown**
+
+To enable filter pushdown in your Python Data Source, implement the ``pushFilters`` method in your ``DataSourceReader`` class:
+
+.. code-block:: python
+
+    from pyspark.sql.datasource import EqualTo, Filter, GreaterThan, LessThan
+
+    def pushFilters(self, filters: List[Filter]) -> Iterable[Filter]:

Review Comment:
   Changed to an example source that returns prime numbers sequentially

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
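The diff's code block stops at the ``pushFilters`` signature. Below is a minimal, self-contained sketch of how a reader might implement it: keep the filters it can handle and yield the rest back to Spark. The ``Filter``/``EqualTo``/``GreaterThan`` classes here are simplified stand-ins for the real ones in ``pyspark.sql.datasource`` (so the snippet runs without Spark installed), and ``PrimeSourceReader`` with its ``pushed`` attribute is a hypothetical example, not the reader from the PR.

```python
from dataclasses import dataclass
from typing import Any, Iterable, List, Tuple


# Simplified stand-ins for the filter classes in pyspark.sql.datasource
# (Spark 4.1+); used here only so the sketch runs without Spark.
@dataclass(frozen=True)
class Filter:
    pass


@dataclass(frozen=True)
class EqualTo(Filter):
    attribute: Tuple[str, ...]
    value: Any


@dataclass(frozen=True)
class GreaterThan(Filter):
    attribute: Tuple[str, ...]
    value: Any


class PrimeSourceReader:
    """Hypothetical reader that can filter on its 'value' column natively."""

    def __init__(self) -> None:
        # Filters accepted by the source; applied later during read().
        self.pushed: List[Filter] = []

    def pushFilters(self, filters: List[Filter]) -> Iterable[Filter]:
        for f in filters:
            if isinstance(f, (EqualTo, GreaterThan)) and f.attribute == ("value",):
                # Supported: remember it so the source applies it while reading.
                self.pushed.append(f)
            else:
                # Unsupported: yield it back so Spark evaluates it afterwards.
                yield f


reader = PrimeSourceReader()
remaining = list(
    reader.pushFilters([EqualTo(("value",), 7), GreaterThan(("other",), 1)])
)
# The reader kept the filter on "value" and returned the other one to Spark.
```

Note that returning a filter from ``pushFilters`` is always safe: Spark re-applies any filter the source did not accept, so a reader that pushes nothing down is still correct, just slower. The pushdown only takes effect when ``spark.sql.python.filterPushdown.enabled`` is set, as the section above states.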
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org