Resubmitting after fixing my subscription to the mailing list. Based on the list of functions here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions
there doesn't seem to be a way to get the length of an array column in a DataFrame without defining a UDF. What I am looking for is something like this (except that length_udf would be pyspark.sql.functions.length or something similar):

    from pyspark.sql.functions import UserDefinedFunction
    from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                                   StructField, StructType)

    # sql is assumed to be an existing SQLContext, sc an existing SparkContext
    length_udf = UserDefinedFunction(len, IntegerType())

    test_schema = StructType([
        StructField('arr', ArrayType(IntegerType())),
        StructField('letter', StringType())
    ])
    test_df = sql.createDataFrame(sc.parallelize([
        [[1, 2, 3], 'a'],
        [[4, 5, 6, 7, 8], 'b']
    ]), test_schema)
    test_df.select(length_udf(test_df.arr)).collect()

Output:

    [Row(PythonUDF#len(arr)=3), Row(PythonUDF#len(arr)=5)]

Is there currently a way to accomplish this? If this doesn't exist and seems useful, I would be happy to contribute a PR with the function.

Pedro Rodriguez