stanlocht opened a new pull request, #50644:
URL: https://github.com/apache/spark/pull/50644

   ### What changes were proposed in this pull request?
   This PR extends the PySpark testing framework with four new utility 
functions for data quality and integrity testing:
   
   1. `assertColumnUnique`: Verifies that specified column(s) contain only 
unique values
   2. `assertColumnNonNull`: Checks that specified column(s) do not contain 
null values
   3. `assertColumnValuesInSet`: Ensures all values in specified column(s) are 
within a given set of accepted values
   4. `assertReferentialIntegrity`: Validates that all non-null values in a 
source column exist in a target column (similar to foreign key constraints)
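
To make the semantics of the four checks concrete, here is a minimal pure-Python sketch over rows represented as dictionaries. It is not the PR's implementation and does not use Spark. The helper names (`duplicate_values`, `null_count`, `values_outside_set`, `missing_references`) are hypothetical. Treating `None` like any other value in the uniqueness and accepted-set checks is an assumption, since the description only specifies null handling for referential integrity.

```python
from typing import Any, Dict, Iterable, List, Set

Row = Dict[str, Any]


def duplicate_values(rows: Iterable[Row], column: str) -> Set[Any]:
    """Return the values that occur more than once in `column`."""
    seen: Set[Any] = set()
    dupes: Set[Any] = set()
    for row in rows:
        value = row[column]
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes


def null_count(rows: Iterable[Row], column: str) -> int:
    """Count the rows whose `column` is None."""
    return sum(1 for row in rows if row[column] is None)


def values_outside_set(rows: Iterable[Row], column: str, accepted: Set[Any]) -> Set[Any]:
    """Return the values in `column` that are not in `accepted`.

    Assumption: None is treated like any other value here.
    """
    return {row[column] for row in rows if row[column] not in accepted}


def missing_references(
    source_rows: Iterable[Row],
    source_column: str,
    target_rows: Iterable[Row],
    target_column: str,
) -> Set[Any]:
    """Return non-null source values that are absent from the target column."""
    target_values = {row[target_column] for row in target_rows}
    return {
        row[source_column]
        for row in source_rows
        if row[source_column] is not None and row[source_column] not in target_values
    }


# Illustrative data: one order has no customer, one points at an unknown customer.
customers: List[Row] = [{"id": 1}, {"id": 2}]
orders: List[Row] = [{"customer_id": 1}, {"customer_id": None}, {"customer_id": 3}]
```

In this sketch each helper returns the offending values instead of raising, so a check passes when the result is empty (or zero for `null_count`). An assertion-style function would raise an error that reports those values.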
   
   ### Why are the changes needed?
   
   These new utility functions provide standardized, well-tested 
implementations of the most common data quality checks. They reduce boilerplate 
code, improve test readability, and enable testing patterns similar to those in 
popular data testing frameworks like dbt.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, this PR introduces new public utility functions in the 
`pyspark.testing` module. These are additive changes that don't modify existing 
functionality.
   
   Example usage:
   ```python
   from pyspark.testing import assertColumnUnique, assertReferentialIntegrity
   
   # Check that 'id' column contains only unique values
   assertColumnUnique(df, "id")
   
   # Check that all customer_ids in orders exist in customers.id
   assertReferentialIntegrity(orders, "customer_id", customers, "id")
   ```
   
   
   ### How was this patch tested?
   
   Comprehensive tests were added for all new functions in 
`python/pyspark/sql/tests/test_utils.py`. The tests cover:
   
   - Basic functionality with valid inputs
   - Error cases with invalid inputs
   - Edge cases (e.g., null values, empty DataFrames)
   - Different DataFrame types (Spark, pandas, pandas-on-Spark)
   - Detailed validation of error messages
   
   Each function has multiple test methods that verify both positive and 
negative test cases. For example, `assertReferentialIntegrity` has tests for 
valid relationships, invalid relationships with a single missing value, 
multiple missing values, and proper handling of null values.
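
The positive and negative test pattern described above can be sketched with `unittest` as follows. This is an illustration only: `assert_referential_integrity` is a hypothetical pure-Python stand-in that works on plain lists, and its error message format is an assumption. It does not reproduce the PR's tests or the real function's signature.

```python
import unittest


def assert_referential_integrity(source_values, target_values):
    """Hypothetical stand-in: fail if a non-null source value is absent from target."""
    targets = set(target_values)
    missing = sorted({v for v in source_values if v is not None and v not in targets})
    if missing:
        # Assumed message format, for illustration only.
        raise AssertionError(f"Missing values in target: {missing}")


class ReferentialIntegrityTest(unittest.TestCase):
    def test_valid_relationship(self):
        # Every source value exists in the target, so no error is raised.
        assert_referential_integrity([1, 2, 2], [1, 2, 3])

    def test_single_missing_value(self):
        with self.assertRaises(AssertionError) as ctx:
            assert_referential_integrity([1, 4], [1, 2, 3])
        self.assertIn("4", str(ctx.exception))

    def test_multiple_missing_values(self):
        with self.assertRaises(AssertionError) as ctx:
            assert_referential_integrity([4, 5, 1], [1, 2, 3])
        self.assertIn("[4, 5]", str(ctx.exception))

    def test_nulls_are_ignored(self):
        # Null source values are skipped rather than reported as missing.
        assert_referential_integrity([1, None], [1, 2, 3])
```

Each negative case checks both that an error is raised and that the message names the offending values, which mirrors the "detailed validation of error messages" item in the list above.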
   
   All tests pass on the current master branch.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   Generated-by: Claude 3.7 Sonnet
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

