[ https://issues.apache.org/jira/browse/SPARK-51062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon reassigned SPARK-51062:
------------------------------------

    Assignee: amoghsantarkar

> assertSchemaEqual Does Not Compare Decimal Precision and Scale
> --------------------------------------------------------------
>
>                 Key: SPARK-51062
>                 URL: https://issues.apache.org/jira/browse/SPARK-51062
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.5.0, 3.5.1, 3.5.2, 3.5.3, 3.5.4
>            Reporter: pscheurig
>            Assignee: amoghsantarkar
>            Priority: Major
>              Labels: pull-request-available
>
> h1. Summary
> The {{assertSchemaEqual}} function in PySpark's testing utilities does not
> properly compare DecimalType fields, as it only checks the base type name
> (e.g. "decimal") without comparing precision and scale parameters. This
> significantly reduces the utility of the function for schemas containing
> decimal fields.
>
> h2. Version
> * Apache Spark Version: >=3.5.0
> * Component: PySpark Testing Utils
> * Function: {{pyspark.testing.assertSchemaEqual}}
>
> h2. Description
> When comparing two schemas containing DecimalType fields with different
> precision and scale parameters, {{assertSchemaEqual}} incorrectly reports
> them as equal because it only compares the base type name ("decimal")
> without considering the precision and scale parameters.
>
> h3. Current Behavior
> {code:python}
> from pyspark.sql.types import StructType, StructField, DecimalType
> from pyspark.testing import assertSchemaEqual
>
> s1 = StructType(
>     [
>         StructField("price_102", DecimalType(10, 2), True),
>         StructField("price_80", DecimalType(8, 0), True),
>     ]
> )
> s2 = StructType(
>     [
>         StructField("price_102", DecimalType(10, 4), True),  # Different scale
>         StructField(
>             "price_80", DecimalType(10, 2), True
>         ),  # Different precision and scale
>     ]
> )
>
> # This passes when it should fail
> assertSchemaEqual(s1, s2)
> {code}
>
> h3. Expected Behavior
> The function should compare both precision and scale parameters of
> DecimalType fields and raise a PySparkAssertionError when they differ,
> similar to how it handles other type mismatches. The error message should
> indicate which fields have mismatched decimal parameters.
>
> h2. Impact
> This issue affects data quality validation and testing scenarios where
> precise decimal specifications are crucial, such as:
> * Financial data processing where decimal precision and scale are critical
> * ETL validation where source and target schemas must match exactly
>
> h2. Suggested Fix
> The {{compare_datatypes_ignore_nullable}} function in
> {{pyspark/testing/utils.py}} should be enhanced to compare precision and
> scale parameters when dealing with decimal types:
> {code:python}
> def compare_datatypes_ignore_nullable(dt1: Any, dt2: Any):
>     if dt1.typeName() == dt2.typeName():
>         if dt1.typeName() == "decimal":
>             return dt1.precision == dt2.precision and dt1.scale == dt2.scale
>         elif dt1.typeName() == "array":
>             return compare_datatypes_ignore_nullable(dt1.elementType, dt2.elementType)
>         elif dt1.typeName() == "struct":
>             return compare_schemas_ignore_nullable(dt1, dt2)
>         else:
>             return True
>     else:
>         return False
> {code}
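>
> A minimal way to check this change, assuming {{PySparkAssertionError}} is
> importable from {{pyspark.errors}} as in 3.5.x: run schemas like those in the
> Current Behavior example through {{assertSchemaEqual}} and see which branch is
> taken. On an unpatched 3.5.x build the call returns silently; with the
> suggested precision/scale comparison in place it should raise instead.
> {code:python}
> from pyspark.errors import PySparkAssertionError
> from pyspark.sql.types import StructType, StructField, DecimalType
> from pyspark.testing import assertSchemaEqual
>
> # Same field name on both sides; only the decimal precision/scale differ.
> s1 = StructType([StructField("price_102", DecimalType(10, 2), True)])
> s2 = StructType([StructField("price_102", DecimalType(10, 4), True)])
>
> try:
>     assertSchemaEqual(s1, s2)
>     print("Decimal mismatch went undetected (current 3.5.x behavior)")
> except PySparkAssertionError as e:
>     print("Decimal mismatch detected (expected behavior after the fix):")
>     print(e)
> {code}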