Hi, I am trying to apply a UDF to a PySpark DataFrame column containing SparseVectors (created using pyspark.ml.feature.IDF). Originally I was trying to apply a more involved function, but I get the same error with any function I apply. So, for the sake of an example:
    udfSum = udf(lambda x: np.sum(x.values), FloatType())
    df = df.withColumn("vec_sum", udfSum(df.idf))
    df.take(10)

I am getting this error:

    Py4JJavaError: An error occurred while calling
    z:org.apache.spark.sql.execution.python.EvaluatePython.takeAndServe.
    : org.apache.spark.SparkException: Job aborted due to stage failure:
    Task 0 in stage 55.0 failed 4 times, most recent failure: Lost task
    0.3 in stage 55.0 (TID 111, 10.0.11.102):
    net.razorvine.pickle.PickleException: expected zero arguments for
    construction of ClassDict (for numpy.dtype)

If I convert the df to Pandas and apply the function there, I can confirm that FloatType() is the correct return type. This seems related: https://issues.apache.org/jira/browse/SPARK-7902, but I'm not sure how to proceed.

Any ideas?

Thanks!
Abby

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Apply-UDF-to-SparseVector-column-in-spark-2-0-tp27870.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
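[A likely explanation, sketched below under the assumption that this is the same issue as SPARK-7902: np.sum returns a numpy.float64 scalar, and Spark's pickle bridge (Pyrolite on the JVM side) cannot reconstruct numpy types, which produces exactly the "expected zero arguments for construction of ClassDict (for numpy.dtype)" error. Casting the result to a plain Python float inside the UDF, e.g. `udf(lambda x: float(np.sum(x.values)), FloatType())`, avoids it. The snippet demonstrates the type difference without needing a running Spark session:]

```python
import numpy as np

# np.sum over the SparseVector's values returns a numpy scalar,
# not a plain Python float -- this is what Spark fails to pickle.
values = np.array([0.1, 0.2, 0.3])

bad = np.sum(values)           # numpy.float64 scalar
good = float(np.sum(values))   # plain Python float, safe to return from a UDF

print(type(bad))   # <class 'numpy.float64'>
print(type(good))  # <class 'float'>

# So the UDF in question would become (sketch, not tested here):
#   udfSum = udf(lambda x: float(np.sum(x.values)), FloatType())
```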