Sebastian Bengtsson created SPARK-51426:
-------------------------------------------

             Summary: Setting metadata to empty dict does not work
                 Key: SPARK-51426
                 URL: https://issues.apache.org/jira/browse/SPARK-51426
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Spark Core
    Affects Versions: 3.5.0
         Environment: PySpark in Databricks.
Databricks Runtime Version: 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12)
            Reporter: Sebastian Bengtsson


It should be possible to remove column metadata from a DataFrame by setting the
metadata to an empty dictionary. Surprisingly, doing so has no effect.

If column "a" has metadata set, the following has no effect: 
{code:java}
df.withMetadata('a', {}){code}
Expected: Metadata is removed, i.e. replaced by an empty dict.

Actual: Metadata is still there, unchanged.


Code to demonstrate this behavior:
{code:java}
from pyspark.sql.functions import col

df = spark.createDataFrame([('',)], ['a'])
print('no metadata:', df.schema['a'].metadata)

df = df.withMetadata('a', {'foo': 'bar'})
print('metadata has been set:', df.schema['a'].metadata)

df = df.select([col('a').alias('a', metadata={})])
print('metadata has not been removed:', df.schema['a'].metadata)

df = df.withMetadata('a', {'baz': 'burr'})
print('metadata has been replaced:', df.schema['a'].metadata)

df = df.withMetadata('a', {})
print('metadata still there:', df.schema['a'].metadata){code}
{code:java}
no metadata: {}
metadata has been set: {'foo': 'bar'}
metadata has not been removed: {'foo': 'bar'}
metadata has been replaced: {'baz': 'burr'}
metadata still there: {'baz': 'burr'}
{code}
A fix would likely include the following patch:
{code:java}
--- a/python/pyspark/sql/classic/column.py
+++ b/python/pyspark/sql/classic/column.py
@@ -518,7 +518,7 @@ class Column(ParentColumn):
         sc = get_active_spark_context()
         if len(alias) == 1:
-            if metadata:
+            if metadata is not None:
                 assert sc._jvm is not None
                 jmeta = getattr(sc._jvm, 
"org.apache.spark.sql.types.Metadata").fromJson(
                     json.dumps(metadata) {code}
But I suspect further changes on the Scala side of Spark are also required.
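The Python side of the problem is plain dict truthiness: an empty dict is falsy, so the {{if metadata:}} guard skips the branch that would forward the (empty) metadata to the JVM, while {{if metadata is not None:}} would take it. A minimal, Spark-free sketch of the difference (the helper names here are hypothetical, just mimicking the two guards from the patch):

```python
def guard_truthy(metadata):
    """Mimics the current guard: `if metadata:` skips falsy values, including {}."""
    if metadata:
        return "applied"
    return "skipped"

def guard_not_none(metadata):
    """Mimics the proposed guard: only None is skipped, {} is applied."""
    if metadata is not None:
        return "applied"
    return "skipped"

print(guard_truthy({}))       # skipped -- this is the bug
print(guard_not_none({}))     # applied -- empty dict now overwrites metadata
print(guard_truthy(None))     # skipped
print(guard_not_none(None))   # skipped -- None (the default) still means "leave as-is"
```

With the proposed guard, passing no metadata (the {{None}} default) still leaves existing metadata untouched, while an explicit empty dict replaces it.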




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
