[ 
https://issues.apache.org/jira/browse/SPARK-51426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-51426:
-----------------------------------
    Labels: pull-request-available  (was: )

> Setting metadata to empty dict does not work
> --------------------------------------------
>
>                 Key: SPARK-51426
>                 URL: https://issues.apache.org/jira/browse/SPARK-51426
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 3.5.0
>         Environment: PySpark in Databricks.
> Databricks Runtime Version: 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12)
>            Reporter: Sebastian Bengtsson
>            Priority: Major
>              Labels: pull-request-available
>
> It should be possible to remove column metadata from a DataFrame by setting 
> the metadata to an empty dictionary. Surprisingly, this does not work.
> If column "a" has metadata set, the following has no effect: 
> {code:java}
> df.withMetadata('a', {}){code}
> Expected: the metadata is removed/replaced by an empty dict.
> Actual: the metadata is still there, unaffected.
>  
> Code to demonstrate this behavior:
> {code:java}
> from pyspark.sql.functions import col
>
> df = spark.createDataFrame([('',)], ['a'])
> print('no metadata:', df.schema['a'].metadata)
> df = df.withMetadata('a', {'foo': 'bar'})
> print('metadata has been set:', df.schema['a'].metadata)
> df = df.select([col('a').alias('a', metadata={})])
> print('metadata has not been removed:', df.schema['a'].metadata)
> df = df.withMetadata('a', {'baz': 'burr'})
> print('metadata has been replaced:', df.schema['a'].metadata)
> df = df.withMetadata('a', {})
> print('metadata still there:', df.schema['a'].metadata){code}
> {code:java}
> no metadata: {}
> metadata has been set: {'foo': 'bar'}
> metadata has not been removed: {'foo': 'bar'}
> metadata has been replaced: {'baz': 'burr'}
> metadata still there: {'baz': 'burr'}
> {code}
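> As a workaround, the DataFrame can be rebuilt against a copy of the schema 
> with the metadata cleared. This is a minimal, untested sketch reusing df and 
> spark from the example above (clean_schema is just an illustrative name); it 
> assumes classic PySpark, since df.rdd is not available under Spark Connect:
> {code:java}
> from pyspark.sql.types import StructField, StructType
>
> # Copy the schema, dropping the metadata from every field.
> clean_schema = StructType([
>     StructField(f.name, f.dataType, f.nullable, metadata={})
>     for f in df.schema.fields
> ])
> # Rebuild the DataFrame from the same rows using the cleaned schema.
> df = spark.createDataFrame(df.rdd, clean_schema)
> print('metadata removed:', df.schema['a'].metadata){code}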
> A fix would include the following patch:
> {code:java}
> --- a/python/pyspark/sql/classic/column.py
> +++ b/python/pyspark/sql/classic/column.py
> @@ -518,7 +518,7 @@ class Column(ParentColumn):
>          sc = get_active_spark_context()
>          if len(alias) == 1:
> -            if metadata:
> +            if metadata is not None:
>                  assert sc._jvm is not None
>                  jmeta = getattr(sc._jvm, "org.apache.spark.sql.types.Metadata").fromJson(
>                      json.dumps(metadata) {code}
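> The root cause is Python truthiness: an empty dict evaluates as False, so the 
> existing check treats {} exactly like None and never reaches the JVM call 
> that would attach the (empty) metadata. A minimal illustration of the 
> distinction the patch relies on:
> {code:java}
> >>> metadata = {}
> >>> bool(metadata)        # an empty dict is falsy, so "if metadata:" skips it
> False
> >>> metadata is not None  # but it is distinct from None, so the patched check fires
> True{code}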
> But I suspect further changes are also required on the Scala side of Spark.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
