Thanks for that. I have played with this a bit more after your feedback and found:
I can only recreate the problem with Python 3.6+. Switching between Python 2.7, 3.6, and 3.7, I find that the problem occurs with Python 3.6 and 3.7 but not with Python 2.7.

- I used minimal Python virtual environments with the same dependencies for Python 2.7 and Python 3.x (basically nothing installed except numpy), so I don't think it's a Python dependency version issue.
- I compared the DAGs and execution plans generated by Spark, and they look the same in the working and broken cases, so I don't think the Python version is impacting Spark's execution plan.

Note that in the Python 3.6+ case I still can't recreate the problem every time, but it does seem to happen the majority of the times I try.

I also tested with Spark 2.4.6 and still get the problem. I cannot try 3.0.0, as that hits a fatal exception due to defect SPARK-32232.

The workaround you suggest isn't going to work in my case, since the code sample I provided is a simplified version of what I'm actually doing in Python. However, I think I have a workaround: I force a cache/persist of the data after the model has transformed the features, because I cannot recreate the issue when the Python UDF runs on the cached data in a separate action.

I will add another message if I find any more info.