[ https://issues.apache.org/jira/browse/SPARK-44546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Amanda Liu updated SPARK-44546:
-------------------------------
    Description:

h2. Summary

This ticket adds a dev utility script to help generate PySpark tests using LLM responses. The purpose of this experimental script is to encourage PySpark developers to test their code thoroughly and avoid introducing regressions into the codebase. Historically, PySpark has had code regressions due to insufficient testing of public APIs (see [https://databricks.atlassian.net/browse/ES-705815]).

Below, we outline some common edge-case scenarios for PySpark DataFrame APIs, from the perspective of arguments. Many of these edge cases are passed to the LLM through the script's base prompt.

Please note that this list is not exhaustive, but rather a starting point. Some of these cases may not apply, depending on the situation. We encourage all PySpark developers to carefully consider edge-case scenarios when writing tests. An illustrative test class exercising several of these cases is sketched after the list.

h2. Table of Contents
# None
# Ints
# Floats
# Strings
# Single column / column name
# Multi column / column names
# DataFrame argument

h3. 1. None
* Empty input
* None type

h3. 2. Ints
* Negatives
* 0
* value > Int.MaxValue
* value < Int.MinValue

h3. 3. Floats
* Negatives
* 0.0
* float("nan")
* float("inf")
* float("-inf")
* decimal.Decimal
* numpy.float16

h3. 4. Strings
* Special characters
* Spaces
* Empty strings

h3. 5. Single column / column name
* Non-existent column
* Empty column name
* Column name with special characters, e.g. dots
* Multiple columns with the same name
* Nested column vs. quoted column name, e.g. 'a.b.c' vs '`a.b.c`'
* Column of a special type, e.g. a nested type
* Column containing special values, e.g. null

h3. 6. Multi column / column names
* Empty input, e.g. DataFrame.drop()
* Special cases for each single column
* Mix of Column objects and column names, e.g. DataFrame.drop("col1", df.col2, "col3")
* Duplicated columns, e.g. DataFrame.drop("col1", col("col1"))

h3. 7. DataFrame argument
* Empty DataFrame, e.g. spark.range(5).limit(0)
* DataFrame with 0 columns, e.g. spark.range(5).drop('id')
* Dataset with repeated arguments
* Local dataset (pd.DataFrame) containing an unsupported datatype
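To make these scenarios concrete, below is a minimal sketch of the style of test the utility is meant to encourage, written as a plain unittest suite against a local SparkSession. The class and test names are illustrative only (they are not part of the proposed script), and the asserted behaviors reflect our understanding of current PySpark semantics rather than a specification of the generated output.

{code:python}
# Minimal sketch only: exercises a sample of the edge cases listed above.
# Class and test names are illustrative, not part of the proposed utility.
import unittest

from pyspark.errors import AnalysisException
from pyspark.sql import SparkSession


class DataFrameEdgeCaseTests(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.spark = SparkSession.builder.master("local[1]").getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_drop_with_empty_input(self):
        # 6. Multi column / column names: empty input, DataFrame.drop()
        df = self.spark.range(5)
        self.assertEqual(df.drop().columns, ["id"])

    def test_drop_duplicated_column_names(self):
        # 6. Duplicated columns passed to the same call
        df = self.spark.createDataFrame([(1, 2)], ["col1", "col2"])
        self.assertEqual(df.drop("col1", "col1").columns, ["col2"])

    def test_select_nonexistent_column(self):
        # 5. Single column: a non-existent column should fail analysis
        df = self.spark.range(5)
        with self.assertRaises(AnalysisException):
            df.select("no_such_column").collect()

    def test_dotted_vs_quoted_column_name(self):
        # 5. Nested column vs. quoted column name: 'a.b' vs '`a.b`'
        df = self.spark.createDataFrame([(1,)], ["a.b"])
        self.assertEqual(df.select("`a.b`").first()[0], 1)

    def test_empty_dataframe(self):
        # 7. DataFrame argument: empty DataFrame, spark.range(5).limit(0)
        self.assertEqual(self.spark.range(5).limit(0).count(), 0)

    def test_float_special_values(self):
        # 3. Floats: float("nan"), float("inf"), float("-inf")
        rows = [(float("nan"),), (float("inf"),), (float("-inf"),)]
        df = self.spark.createDataFrame(rows, ["v"])
        self.assertEqual(df.count(), 3)


if __name__ == "__main__":
    unittest.main()
{code}

Tests generated by the script would follow this general shape, with the LLM prompt seeded by the categories above.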
> Add a dev utility to generate PySpark tests with LLM
> ----------------------------------------------------
>
>                 Key: SPARK-44546
>                 URL: https://issues.apache.org/jira/browse/SPARK-44546
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>    Affects Versions: 3.5.0
>            Reporter: Amanda Liu
>            Priority: Major