Hi,
I've been trying to debug a Spark UDF for a couple of days now, but I can't
figure out what is going on. The UDF essentially pads a 2D array to a fixed
length. When the code uses NumPy, it fails with a PickleException; when I
rewrite it in plain Python, it works like a charm.
This does not work:

from typing import List
import numpy as np
from pyspark.sql.functions import udf

@udf("array<array<float>>")
def pad(arr: List[List[float]], n: int) -> List[List[float]]:
    # Prepend n rows of zeros; the number of columns stays the same.
    return np.pad(arr, [(n, 0), (0, 0)], "constant", constant_values=0.0).tolist()
But this works:

@udf("array<array<float>>")
def pad(arr, n):
    # Same idea without NumPy: prepend n zero rows, then the original rows.
    padded_arr = []
    for i in range(n):
        padded_arr.append([0.0] * len(arr[0]))
    padded_arr.extend(arr)
    return padded_arr
The code for calling them is exactly the same in both cases:

df.withColumn("test", pad(col("array_col"), expected_length - actual_length))
What am I missing?
The arrays do not contain any NaNs or nulls.
Any thoughts, suggestions, or tips for troubleshooting would be appreciated.
Best regards,
Sanket