[ https://issues.apache.org/jira/browse/SPARK-51062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-51062:
------------------------------------

    Assignee: amoghsantarkar

> assertSchemaEqual Does Not Compare Decimal Precision and Scale
> --------------------------------------------------------------
>
>                 Key: SPARK-51062
>                 URL: https://issues.apache.org/jira/browse/SPARK-51062
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.5.0, 3.5.1, 3.5.2, 3.5.3, 3.5.4
>            Reporter: pscheurig
>            Assignee: amoghsantarkar
>            Priority: Major
>              Labels: pull-request-available
>
> h1. Summary
> The {{assertSchemaEqual}} function in PySpark's testing utilities does not 
> properly compare {{DecimalType}} fields: it only checks the base type name 
> ("decimal") and ignores the precision and scale parameters. This 
> significantly reduces the utility of the function for schemas containing 
> decimal fields.
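> At the root, {{DecimalType.typeName()}} erases the parameters, so a 
> name-only comparison cannot tell differently parameterized decimals apart:
> {code:python}
> from pyspark.sql.types import DecimalType
> # Both report the same base type name, which is all the current check inspects.
> DecimalType(10, 2).typeName()  # 'decimal'
> DecimalType(10, 4).typeName()  # 'decimal'
> {code}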
> h2. Version
>  * Apache Spark Version: >=3.5.0
>  * Component: PySpark Testing Utils
>  * Function: {{pyspark.testing.assertSchemaEqual}}
> h2. Description
> When comparing two schemas whose {{DecimalType}} fields differ in precision 
> or scale, {{assertSchemaEqual}} incorrectly reports them as equal, because 
> the comparison stops at the base type name ("decimal").
> h3. Current Behavior
> {code:python}
> from pyspark.sql.types import StructType, StructField, DecimalType
> from pyspark.testing import assertSchemaEqual
> s1 = StructType(
>     [
>         StructField("price_102", DecimalType(10, 2), True),
>         StructField("price_80", DecimalType(8, 0), True),
>     ]
> )
> s2 = StructType(
>     [
>         StructField("price_102", DecimalType(10, 4), True),  # Different scale
>         StructField(
>             "price_80", DecimalType(10, 2), True
>         ),  # Different precision and scale
>     ]
> )
> # This passes when it should fail
> assertSchemaEqual(s1, s2)
> {code}
> h3. Expected Behavior
> The function should compare both the precision and the scale of 
> {{DecimalType}} fields and raise a {{PySparkAssertionError}} when they 
> differ, as it does for other type mismatches. The error message should 
> indicate which fields have mismatched decimal parameters; a sketch of the 
> expected behavior follows.
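> For example, with the schemas {{s1}} and {{s2}} above, something along these 
> lines should hold after the fix (the exact message wording is left open):
> {code:python}
> from pyspark.errors import PySparkAssertionError
> try:
>     assertSchemaEqual(s1, s2)
> except PySparkAssertionError as e:
>     # The message should identify price_102 and price_80 and show the
>     # mismatched decimal(precision, scale) pairs.
>     print(e)
> {code}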
> h2. Impact
> This issue affects data quality validation and testing scenarios where exact 
> decimal specifications matter, such as:
>  * Financial data processing, where precision and scale are critical
>  * ETL validation, where source and target schemas must match exactly
> h2. Suggested Fix
> The {{compare_datatypes_ignore_nullable}} function in 
> {{pyspark/testing/utils.py}} should be enhanced to compare precision and 
> scale parameters when dealing with decimal types:
> {code:python}
> def compare_datatypes_ignore_nullable(dt1: Any, dt2: Any):
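>     # Compare data types structurally, recursing into arrays and structs
>     # so that nullability flags are ignored.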
>     if dt1.typeName() == dt2.typeName():
>         if dt1.typeName() == "decimal":
>             return dt1.precision == dt2.precision and dt1.scale == dt2.scale
>         elif dt1.typeName() == "array":
>             return compare_datatypes_ignore_nullable(dt1.elementType, dt2.elementType)
>         elif dt1.typeName() == "struct":
>             return compare_schemas_ignore_nullable(dt1, dt2)
>         else:
>             return True
>     else:
>         return False
> {code}
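> A quick sanity check of the proposed helper (assuming the function above is 
> in scope; in Spark itself it lives in {{pyspark/testing/utils.py}}):
> {code:python}
> from pyspark.sql.types import DecimalType
> # Identical precision and scale should still compare equal.
> assert compare_datatypes_ignore_nullable(DecimalType(10, 2), DecimalType(10, 2))
> # Mismatched scale, or mismatched precision and scale, should now compare unequal.
> assert not compare_datatypes_ignore_nullable(DecimalType(10, 2), DecimalType(10, 4))
> assert not compare_datatypes_ignore_nullable(DecimalType(8, 0), DecimalType(10, 2))
> {code}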



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
