[ https://issues.apache.org/jira/browse/SPARK-44546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Amanda Liu updated SPARK-44546:
-------------------------------
    Description:

h2. Summary

This ticket adds a dev utility script to help generate PySpark tests using LLM responses. The purpose of this experimental script is to encourage PySpark developers to test their code thoroughly and avoid introducing regressions into the codebase. Historically, PySpark has had code regressions due to insufficient testing of public APIs (see [https://databricks.atlassian.net/browse/ES-705815]).

Below, we outline some common edge-case scenarios for PySpark DataFrame APIs, from the perspective of arguments. Many of these edge cases are passed to the LLM through the script's base prompt.

Please note that this list is not exhaustive, but rather a starting point. Some of these cases may not apply, depending on the situation. We encourage all PySpark developers to carefully consider edge-case scenarios when writing tests. An illustrative test class exercising several of these cases is sketched after the list.

h2. Table of Contents
# None
# Ints
# Floats
# Strings
# Single column / column name
# Multi column / column names
# DataFrame argument

h3. 1. None
* Empty input
* None type

h3. 2. Ints
* Negatives
* 0
* value > Int.MaxValue
* value < Int.MinValue

h3. 3. Floats
* Negatives
* 0.0
* float("nan")
* float("inf")
* float("-inf")
* decimal.Decimal
* numpy.float16

h3. 4. Strings
* Special characters
* Spaces
* Empty strings

h3. 5. Single column / column name
* Non-existent column
* Empty column name
* Column name with special characters, e.g. dots
* Multiple columns with the same name
* Nested column vs. quoted column name, e.g. 'a.b.c' vs '`a.b.c`'
* Column of a special type, e.g. a nested type
* Column containing special values, e.g. null

h3. 6. Multi column / column names
* Empty input, e.g. DataFrame.drop()
* Special cases for each single column
* Mix of Column objects and column names, e.g. DataFrame.drop("col1", df.col2, "col3")
* Duplicated columns, e.g. DataFrame.drop("col1", col("col1"))

h3. 7. DataFrame argument
* Empty DataFrame, e.g. spark.range(5).limit(0)
* DataFrame with 0 columns, e.g. spark.range(5).drop('id')
* Dataset with repeated arguments
* Local dataset (pd.DataFrame) containing an unsupported datatype
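To make these scenarios concrete, below is a minimal sketch of the style of test the utility is meant to encourage, written as a plain unittest suite against a local SparkSession. The class and test names are illustrative only (they are not part of the proposed script), and the asserted behaviors reflect our understanding of current PySpark semantics rather than a specification of the generated output.

{code:python}
# Minimal sketch only: exercises a sample of the edge cases listed above.
# Class and test names are illustrative, not part of the proposed utility.
import unittest

from pyspark.errors import AnalysisException
from pyspark.sql import SparkSession


class DataFrameEdgeCaseTests(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.spark = SparkSession.builder.master("local[1]").getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_drop_with_empty_input(self):
        # 6. Multi column / column names: empty input, DataFrame.drop()
        df = self.spark.range(5)
        self.assertEqual(df.drop().columns, ["id"])

    def test_drop_duplicated_column_names(self):
        # 6. Duplicated columns passed to the same call
        df = self.spark.createDataFrame([(1, 2)], ["col1", "col2"])
        self.assertEqual(df.drop("col1", "col1").columns, ["col2"])

    def test_select_nonexistent_column(self):
        # 5. Single column: a non-existent column should fail analysis
        df = self.spark.range(5)
        with self.assertRaises(AnalysisException):
            df.select("no_such_column").collect()

    def test_dotted_vs_quoted_column_name(self):
        # 5. Nested column vs. quoted column name: 'a.b' vs '`a.b`'
        df = self.spark.createDataFrame([(1,)], ["a.b"])
        self.assertEqual(df.select("`a.b`").first()[0], 1)

    def test_empty_dataframe(self):
        # 7. DataFrame argument: empty DataFrame, spark.range(5).limit(0)
        self.assertEqual(self.spark.range(5).limit(0).count(), 0)

    def test_float_special_values(self):
        # 3. Floats: float("nan"), float("inf"), float("-inf")
        rows = [(float("nan"),), (float("inf"),), (float("-inf"),)]
        df = self.spark.createDataFrame(rows, ["v"])
        self.assertEqual(df.count(), 3)


if __name__ == "__main__":
    unittest.main()
{code}

Tests generated by the script would follow this general shape, with the LLM prompt seeded by the categories above.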
> Add a dev utility to generate PySpark tests with LLM
> ----------------------------------------------------
>
>                 Key: SPARK-44546
>                 URL: https://issues.apache.org/jira/browse/SPARK-44546
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>    Affects Versions: 3.5.0
>            Reporter: Amanda Liu
>            Priority: Major