This looks like a bug to me. In
https://github.com/apache/spark/blob/master/python/pyspark/serializers.py,
when cloudpickle.loads() deserializes the collected rows, the ArrayData key
comes back as a Python list, so rebuilding the dict {["A", "B"]: "foo"}
breaks. The round trip is asymmetric:

Python input tuple ("A", "B") -> ArrayData: works fine
ArrayData -> Python list ["A", "B"] as a map key: breaks
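The failure mode is plain Python dict semantics rather than anything
Spark-specific: tuples are hashable, lists are not. A minimal illustration,
independent of Spark:

    d = {("A", "B"): "foo"}   # tuple key: hashable, accepted on the way in
    print(d[("A", "B")])      # foo

    try:
        d = {["A", "B"]: "foo"}   # list key: what deserialization effectively builds
    except TypeError as e:
        print(e)                  # unhashable type: 'list'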
I also don't see a validator for key hashability here:
https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/types/MapType.scala#L39
MapType never validates that the key type is actually usable as a map key on
the Python side. To my mind this is a bug, and you should file a bug ticket.
Until it is fixed, collecting the map as an array of entries may work around
the problem; see the sketch below the quoted message.

Best Regards
Soumasish Goswami
in: www.linkedin.com/in/soumasish
# (415) 530-0405

On Fri, May 23, 2025 at 5:04 AM Eyck Troschke
<e...@troschke.onmicrosoft.com> wrote:

> Dear Spark Development Community,
>
> According to the PySpark documentation, it should be possible to have a
> MapType column with ArrayType keys: MapType accepts keys of type DataType,
> and ArrayType inherits from DataType.
> When I try that with PySpark 3.5.3, the show() method of the DataFrame
> works as expected, but the collect() method throws an exception:
>
> from pyspark.sql import SparkSession
> from pyspark.sql.types import MapType, ArrayType, StringType
>
> spark = SparkSession.builder.getOrCreate()
>
> schema = MapType(ArrayType(StringType()), StringType())
> data = [{("A", "B"): "foo", ("X", "Y", "Z"): "bar"}]
> df = spark.createDataFrame(data, schema)
> df.show()     # works
> df.collect()  # throws an exception
>
> Is this behavior correct?
>
> Kind regards,
>
> Eyck
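As a possible workaround in the meantime, map_entries converts the map
column into an array<struct<key, value>>, so no Python dict (and hence no
hashable key) is ever constructed on collect(). A rough sketch, not tested
against this exact case:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import MapType, ArrayType, StringType

    spark = SparkSession.builder.getOrCreate()

    schema = MapType(ArrayType(StringType()), StringType())
    data = [{("A", "B"): "foo", ("X", "Y", "Z"): "bar"}]
    df = spark.createDataFrame(data, schema)

    # "value" is the column name Spark assigns when createDataFrame is
    # given a bare (non-struct) DataType as the schema.
    entries = df.select(F.map_entries("value").alias("entries"))

    # Each entry comes back as a Row(key=..., value=...) inside a list, so
    # the array keys are materialized as plain Python lists and never need
    # to be hashed.
    print(entries.collect())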