[ https://issues.apache.org/jira/browse/SPARK-51847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stan Lochtenberg updated SPARK-51847:
-------------------------------------
    Description: 
*Background*

The PySpark testing framework currently provides utilities like 
{{assertDataFrameEqual}} and {{assertSchemaEqual}} for testing DataFrame 
operations. However, it lacks utilities for common data quality and integrity 
tests that are essential for data validation in ETL pipelines and data 
applications.
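
For context, a minimal example of the existing equality-based utilities 
(standard {{pyspark.testing}} usage as of Spark 3.5+):

{code:python}
# Minimal example of the existing equality utilities in pyspark.testing.
from pyspark.sql import SparkSession
from pyspark.testing import assertDataFrameEqual

spark = SparkSession.builder.getOrCreate()
actual = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
expected = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Passes; on mismatch it raises an assertion error describing the diff.
assertDataFrameEqual(actual, expected)
{code}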

*Proposal*

Extend the PySpark testing framework with four new utility functions that 
enable developers to perform common data quality tests (a usage sketch 
follows the list):
 # {{assertColumnUnique}}
Verifies that the specified column(s) contain only unique values.

 # {{assertColumnNonNull}}
Checks that the specified column(s) do not contain null values.

 # {{assertColumnValuesInSet}}
Ensures all values in the specified column(s) are within a given set of 
accepted values.

 # {{assertReferentialIntegrity}}
Validates that all non-null values in a source column exist in a target column 
(similar to a foreign key constraint).
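
A self-contained sketch of how these assertions could look and be used. The 
function names follow the proposal; the signatures, parameter names, and 
error messages below are assumptions for illustration, not a settled design:

{code:python}
from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F


def assertColumnUnique(df: DataFrame, column: str) -> None:
    # Values appearing in more than one row violate uniqueness.
    dupes = df.groupBy(column).count().filter(F.col("count") > 1).count()
    assert dupes == 0, f"Column '{column}' has {dupes} duplicated value(s)"


def assertColumnNonNull(df: DataFrame, column: str) -> None:
    nulls = df.filter(F.col(column).isNull()).count()
    assert nulls == 0, f"Column '{column}' has {nulls} null value(s)"


def assertColumnValuesInSet(df: DataFrame, column: str, accepted: set) -> None:
    # NULL comparisons yield NULL and are filtered out, so nulls pass here;
    # combine with assertColumnNonNull to reject nulls as well.
    bad = df.filter(~F.col(column).isin(list(accepted))).count()
    assert bad == 0, f"Column '{column}' has {bad} value(s) outside {accepted}"


def assertReferentialIntegrity(
    source: DataFrame, source_col: str, target: DataFrame, target_col: str
) -> None:
    # Every non-null source value must exist in the target column
    # (foreign-key style); a left anti join surfaces the orphans.
    non_null = source.filter(F.col(source_col).isNotNull())
    orphans = non_null.join(
        target, non_null[source_col] == target[target_col], "left_anti"
    ).count()
    assert orphans == 0, (
        f"{orphans} value(s) in '{source_col}' not found in '{target_col}'"
    )


spark = SparkSession.builder.getOrCreate()
orders = spark.createDataFrame(
    [(1, "open", 10), (2, "closed", 11)],
    ["order_id", "status", "customer_id"],
)
customers = spark.createDataFrame([(10,), (11,)], ["customer_id"])

assertColumnUnique(orders, "order_id")
assertColumnNonNull(orders, "customer_id")
assertColumnValuesInSet(orders, "status", {"open", "closed"})
assertReferentialIntegrity(orders, "customer_id", customers, "customer_id")
{code}

If adopted, these would presumably live alongside {{assertDataFrameEqual}} in 
{{pyspark.testing}} and raise {{PySparkAssertionError}} rather than a bare 
{{AssertionError}}, keeping error reporting consistent with the existing 
utilities.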

*Benefits*
 * Simplifies data validation in PySpark applications and tests

 * Reduces boilerplate code for common data quality checks

 * Provides consistent error reporting for data quality issues

 * Enables testing patterns similar to those in popular data testing 
frameworks such as dbt

 * Improves developer productivity when writing data quality tests

 

> Extend PySpark testing framework util functions with basic data tests
> ---------------------------------------------------------------------
>
>                 Key: SPARK-51847
>                 URL: https://issues.apache.org/jira/browse/SPARK-51847
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark, Tests
>    Affects Versions: 4.0.0
>            Reporter: Stan Lochtenberg
>            Priority: Major
>


