stanlocht opened a new pull request, #50644: URL: https://github.com/apache/spark/pull/50644
### What changes were proposed in this pull request?

This PR extends the PySpark testing framework with four new utility functions for data quality and integrity testing:

1. `assertColumnUnique`: Verifies that the specified column(s) contain only unique values
2. `assertColumnNonNull`: Checks that the specified column(s) do not contain null values
3. `assertColumnValuesInSet`: Ensures all values in the specified column(s) are within a given set of accepted values
4. `assertReferentialIntegrity`: Validates that all non-null values in a source column exist in a target column (similar to foreign key constraints)

### Why are the changes needed?

These new utility functions address a gap in the PySpark testing framework by providing standardized, well-tested implementations of the most common data quality checks. They reduce boilerplate code, improve test readability, and enable testing patterns similar to those in popular data testing frameworks like dbt.

### Does this PR introduce _any_ user-facing change?

Yes, this PR introduces new public utility functions in the `pyspark.testing` module. These are additive changes that don't modify existing functionality.

Example usage:

```python
from pyspark.testing import assertColumnUnique, assertReferentialIntegrity

# Check that 'id' column contains only unique values
assertColumnUnique(df, "id")

# Check that all customer_ids in orders exist in customers.id
assertReferentialIntegrity(orders, "customer_id", customers, "id")
```

### How was this patch tested?

Comprehensive tests were added for all new functions in `python/pyspark/sql/tests/test_utils.py`. The tests cover:

- Basic functionality with valid inputs
- Error cases with invalid inputs
- Edge cases (e.g., null values, empty DataFrames)
- Different DataFrame types (Spark, pandas, pandas-on-Spark)
- Detailed validation of error messages

Each function has multiple test methods that verify both positive and negative test cases.
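To make the positive and negative cases concrete, here is a minimal sketch of the referential-integrity rule described above (non-null source values must exist in the target; nulls are skipped). It is written in plain Python so it runs without a Spark session. The helper `find_orphans` is hypothetical and is not the implementation in this PR, which operates on DataFrames.

```python
from typing import Any, Iterable, List, Optional


def find_orphans(
    source_values: Iterable[Optional[Any]],
    target_values: Iterable[Optional[Any]],
) -> List[Any]:
    """Return the sorted distinct non-null source values absent from the target.

    Hypothetical helper for illustration only; not part of pyspark.testing.
    """
    target = {v for v in target_values if v is not None}
    # Nulls in the source are skipped, mirroring foreign-key semantics.
    missing = {v for v in source_values if v is not None and v not in target}
    return sorted(missing)


# Positive case: every non-null customer_id has a matching customer id.
assert find_orphans([1, 2, None, 2], [1, 2, 3]) == []

# Negative case: 4 and 5 have no matching customer; the null is ignored.
assert find_orphans([1, 4, None, 5, 4], [1, 2, 3]) == [4, 5]
```

An empty result corresponds to the case where an assertion of this kind would pass; a non-empty result lists the values an error message would be expected to report.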
For example, `assertReferentialIntegrity` has tests for valid relationships, invalid relationships with a single missing value, multiple missing values, and proper handling of null values. All tests pass on the current master branch.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude 3.7 Sonnet

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
For additional commands, e-mail: reviews-h...@spark.apache.org