This looks like a bug to me. The failure happens during
deserialization, here:
https://github.com/apache/spark/blob/master/python/pyspark/serializers.py

When cloudpickle.loads() rebuilds the collected rows, each ArrayData
key comes back as a Python list, and a list is unhashable, so a dict
such as {["A", "B"]: "foo"} cannot be constructed:

tuple ("A", "B") : Python input -> ArrayData : works fine
ArrayData -> list ["A", "B"] : breaks as a dict key
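To see the failure in isolation, here is a minimal plain-Python sketch
(no Spark involved) of the situation the deserializer ends up in:

key_as_tuple = ("A", "B")   # hashable: fine as a dict key
key_as_list = ["A", "B"]    # unhashable: what an ArrayData key deserializes to

ok = {key_as_tuple: "foo"}
print(ok)                    # {('A', 'B'): 'foo'}

try:
    broken = {key_as_list: "foo"}
except TypeError as e:
    print(e)                 # unhashable type: 'list'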

I don't see any validation of key hashability here:
https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/types/MapType.scala#L39
MapType never checks that the key type is actually usable as a map key
once the data is round-tripped back to Python.
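For illustration only, a guard of roughly this shape on the Python side
could reject key types that come back as unhashable objects.
_assert_valid_map_key below is a hypothetical helper, not an existing
PySpark API:

from pyspark.sql.types import ArrayType, DataType, MapType

# Hypothetical validator (not in PySpark today): ArrayType keys
# deserialize to lists and MapType keys to dicts, both unhashable.
def _assert_valid_map_key(key_type: DataType) -> None:
    if isinstance(key_type, (ArrayType, MapType)):
        raise TypeError(
            "%s cannot be used as a MapType key: it deserializes "
            "to an unhashable Python object" % key_type.simpleString()
        )

An equivalent check would of course belong in MapType itself on the
Scala side; the sketch just shows the shape of it.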

To my mind this is a bug, and you should create a bug ticket for it.

Best Regards
Soumasish Goswami
in: www.linkedin.com/in/soumasish
(415) 530-0405




On Fri, May 23, 2025 at 5:04 AM Eyck Troschke <e...@troschke.onmicrosoft.com>
wrote:

> Dear Spark Development Community,
>
> According to the PySpark documentation, it should be possible to have a
> MapType column with ArrayType keys. MapType supports keys of type DataType
> and ArrayType inherits from DataType.
> When I try that with PySpark 3.5.3, the show() method of the DataFrame
> works as expected, but the collect() method throws an exception:
>
> from pyspark.sql import SparkSession
> from pyspark.sql.types import MapType, ArrayType, StringType
>
> spark = SparkSession.builder.getOrCreate()
>
> schema = MapType(ArrayType(StringType()), StringType())
> data = [{("A", "B"): "foo", ("X", "Y", "Z"): "bar"}]
> df = spark.createDataFrame(data, schema)
> df.show()     # works
> df.collect()  # throws exception
>
>
> Is this behavior correct?
>
> Kind regards,
>
> Eyck
>
