HyukjinKwon opened a new pull request, #46129:
URL: https://github.com/apache/spark/pull/46129

   ### What changes were proposed in this pull request?
   
   This PR proposes to introduce a parent `pyspark.sql.DataFrame` class from which both 
`pyspark.sql.connect.dataframe.DataFrame` and 
`pyspark.sql.classic.dataframe.DataFrame` inherit.
   
   **Before**
   
   1. `pyspark.sql.DataFrame` (Spark Classic)
       - docstrings
       - Spark Classic logic
   
   2. `pyspark.sql.connect.dataframe.DataFrame` (Spark Connect)
       - Spark Connect logic
   
   3. Users can only see the type hints from `pyspark.sql.DataFrame`.
   
   **After**
   
   1. `pyspark.sql.DataFrame` (Common)
       - docstrings
       - Supports classmethod usage (dispatching to either Spark Connect or Spark Classic)
   
   2. `pyspark.sql.classic.dataframe.DataFrame` (Spark Classic)
       - Spark Classic logic
   
   3. `pyspark.sql.connect.dataframe.DataFrame` (Spark Connect)
       - Spark Connect logic
   
   4. Users can only see the type hints from `pyspark.sql.DataFrame`.
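   The structure above can be sketched roughly as follows. This is a simplified, hypothetical illustration of the parent-class pattern (the names `ClassicDataFrame`, `ConnectDataFrame`, `_union_impl`, and the list-backed storage are invented for this sketch, not the actual PR code): a common base class that both backends subclass, so unbound-method calls and `isinstance` checks through `pyspark.sql.DataFrame` work for either backend.

   ```python
   class DataFrame:
       """Common parent: carries docstrings and shared type hints."""

       def union(self, other: "DataFrame") -> "DataFrame":
           # `self` is always a concrete subclass instance, so this
           # dispatches to the right backend implementation.
           return self._union_impl(other)


   class ClassicDataFrame(DataFrame):
       """Stand-in for the Spark Classic implementation."""

       def __init__(self, rows):
           self.rows = rows

       def _union_impl(self, other):
           return ClassicDataFrame(self.rows + other.rows)


   class ConnectDataFrame(DataFrame):
       """Stand-in for the Spark Connect implementation."""

       def __init__(self, rows):
           self.rows = rows

       def _union_impl(self, other):
           return ConnectDataFrame(self.rows + other.rows)


   df = ConnectDataFrame([1, 2])
   # Unbound-method call through the parent class now works:
   result = DataFrame.union(df, df)
   assert isinstance(df, DataFrame)        # True for either backend
   assert result.rows == [1, 2, 1, 2]
   ```

   Because the parent class never touches backend-specific attributes such as `_jdf` itself, calling `DataFrame.union(df, df)` no longer fails on Spark Connect.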
   
   ### Why are the changes needed?
   
   This fixes two issues from the current structure at Spark Connect:
   
   1. Supports calling regular methods as unbound class methods, e.g.,
   
   ```python
   from pyspark.sql import DataFrame
   df = spark.range(10)
   DataFrame.union(df, df)
   ```
   
   **Before**
   
   ```
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/.../spark/python/pyspark/sql/dataframe.py", line 4809, in union
       return DataFrame(self._jdf.union(other._jdf), self.sparkSession)
                        ^^^^^^^^^
      File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1724, in __getattr__
        raise PySparkAttributeError(
    pyspark.errors.exceptions.base.PySparkAttributeError: [JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `_jdf` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session. Visit https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession for creating regular Spark Session in detail.
   ```
   
   **After**
   
   ```
   DataFrame[id: bigint]
   ```
   
   2. Supports `isinstance` checks
   
   ```python
   from pyspark.sql import DataFrame
   isinstance(spark.range(1), DataFrame)
   ```
   
   **Before**
   
   ```
   False
   ```
   
   **After**
   
   ```
   True
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, as described above.
   
   ### How was this patch tested?
   
   Manually tested, and CI should verify the changes.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   

