allisonwang-db commented on code in PR #50684:
URL: https://github.com/apache/spark/pull/50684#discussion_r2056894761
##########
python/docs/source/user_guide/sql/python_data_source.rst:
##########
@@ -356,17 +356,28 @@ For library that are used inside a method, it must be imported inside the method

     from pyspark import TaskContext
     context = TaskContext.get()

+Mutating State
+~~~~~~~~~~~~~~
+Some methods such as DataSourceReader.read() and DataSourceReader.partitions() must be stateless. Changes to the object state made in these methods are not guaranteed to be visible or invisible to future invocations.
+
+Other methods such as DataSource.schema() and DataSourceStreamReader.latestOffset() can be stateful. Changes to the object state made in these methods are visible to future invocations.
+
+Refer to the documentation of each method for more details.

Review Comment:
   Can we also link to the documentation here?

##########
python/docs/source/user_guide/sql/python_data_source.rst:
##########
@@ -517,6 +530,121 @@ The following example demonstrates how to implement a basic Data Source using Ar
     df.show()

+Filter Pushdown in Python Data Sources
+--------------------------------------
+
+Filter pushdown is an optimization technique that allows data sources to handle filters natively, reducing the amount of data that needs to be transferred and processed by Spark.
+
+The filter pushdown API is introduced in Spark 4.1, enabling DataSourceReader to selectively push down filters from the query to the source.

Review Comment:
   We don't need to mention Spark 4.1 here (we don't backport documentation PRs)

##########
python/pyspark/sql/datasource.py:
##########
@@ -539,6 +539,11 @@ def pushFilters(self, filters: List["Filter"]) -> Iterable["Filter"]:
         This method is allowed to modify `self`. The object must remain picklable.
         Modifications to `self` are visible to the `partitions()` and `read()` methods.

+        Notes
+        -----
+        Configuration `spark.sql.python.filterPushdown.enabled` must be set to `true`
+        to implement this method.

Review Comment:
   Not sure if we should put this in the doc. Can we throw a warning in the code?
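For illustration, the runtime warning suggested here might look something like the sketch below. This is only an assumption about one possible implementation, not code from this PR: the helper `_maybe_warn_filter_pushdown_disabled` and its call site are hypothetical, and it presumes a build where `pushFilters` is defined on `DataSourceReader`.

```python
# Hypothetical sketch: warn when a reader overrides pushFilters but the
# spark.sql.python.filterPushdown.enabled flag is off, instead of only
# documenting the requirement in the docstring.
import warnings

from pyspark.sql.datasource import DataSourceReader


def _maybe_warn_filter_pushdown_disabled(
    reader: DataSourceReader, conf_enabled: bool
) -> None:
    # Only warn when the user actually overrode pushFilters; the base
    # implementation simply declines to push any filters down.
    overrides = type(reader).pushFilters is not DataSourceReader.pushFilters
    if overrides and not conf_enabled:
        warnings.warn(
            "DataSourceReader.pushFilters is implemented, but "
            "'spark.sql.python.filterPushdown.enabled' is disabled, "
            "so no filters will be pushed down to the source."
        )
```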
##########
python/docs/source/user_guide/sql/python_data_source.rst:
##########
@@ -517,6 +530,121 @@ The following example demonstrates how to implement a basic Data Source using Ar
     df.show()

+Filter Pushdown in Python Data Sources
+--------------------------------------
+
+Filter pushdown is an optimization technique that allows data sources to handle filters natively, reducing the amount of data that needs to be transferred and processed by Spark.
+
+The filter pushdown API is introduced in Spark 4.1, enabling DataSourceReader to selectively push down filters from the query to the source.
+
+You must turn on the configuration ``spark.sql.python.filterPushdown.enabled`` to enable filter pushdown.
+
+**How Filter Pushdown Works**
+
+When a query includes filter conditions, Spark can pass these filters to the data source implementation, which can then apply the filters during data retrieval. This is especially beneficial for:
+
+- Data sources backed by formats that allow efficient filtering (e.g. key-value stores)
+- APIs that support filtering (e.g. REST and GraphQL APIs)
+
+The data source receives the filters, decides which ones can be pushed down, and returns the remaining filters to Spark to be applied later.
+
+**Implementing Filter Pushdown**
+
+To enable filter pushdown in your Python Data Source, implement the ``pushFilters`` method in your ``DataSourceReader`` class:
+
+.. code-block:: python
+
+    from pyspark.sql.datasource import EqualTo, Filter, GreaterThan, LessThan
+
+    def pushFilters(self, filters: List[Filter]) -> Iterable[Filter]:

Review Comment:
   Can we add a complete example here so that people can copy paste and try it out?
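A complete, copy-pasteable example of the kind asked for here might look like the sketch below. The `FakeDataSource`/`FakeDataSourceReader` names and the in-memory rows are invented for illustration, and it assumes a Spark build that ships the `pushFilters` API with `spark.sql.python.filterPushdown.enabled` set to `true`.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

from pyspark.sql.datasource import DataSource, DataSourceReader, EqualTo, Filter
from pyspark.sql.types import StructType


class FakeDataSourceReader(DataSourceReader):
    def __init__(self) -> None:
        # State stored here by pushFilters() is visible later in read().
        self.pushed_id: Optional[int] = None

    def pushFilters(self, filters: List[Filter]) -> Iterable[Filter]:
        for f in filters:
            # attribute is a column-path tuple, e.g. ("id",).
            if isinstance(f, EqualTo) and f.attribute == ("id",):
                self.pushed_id = f.value  # handle this filter at the source
            else:
                yield f  # return the rest for Spark to apply itself

    def read(self, partition) -> Iterator[Tuple]:
        rows = [(i, f"name-{i}") for i in range(10)]
        if self.pushed_id is not None:
            # Pretend the backing store can look the id up directly.
            rows = [r for r in rows if r[0] == self.pushed_id]
        yield from rows


class FakeDataSource(DataSource):
    @classmethod
    def name(cls) -> str:
        return "fake"

    def schema(self) -> str:
        return "id int, name string"

    def reader(self, schema: StructType) -> DataSourceReader:
        return FakeDataSourceReader()


# Usage, assuming an active SparkSession named `spark`:
# spark.conf.set("spark.sql.python.filterPushdown.enabled", "true")
# spark.dataSource.register(FakeDataSource)
# spark.read.format("fake").load().filter("id = 3").show()
```

This leans on the contract quoted above from the `pushFilters` docstring: modifications to `self` made in `pushFilters` (here `pushed_id`) remain visible to `partitions()` and `read()`.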