Thanks for that. I have played with this a bit more after your feedback and
found the following:

- I can only recreate the problem with Python 3.6+. Switching between Python
2.7, Python 3.6 and Python 3.7, the problem occurs with Python 3.6 and 3.7
but not with Python 2.7.
- I have used minimal Python virtual environments with the same dependencies
for Python 2.7 and Python 3.x (essentially nothing installed except numpy),
so I don't think it's a Python dependency version issue.
- I have compared the DAGs and execution plans generated by Spark, and they
look the same between the working and broken cases, so I don't think the
Python version is impacting Spark's execution plan.

Note that in the Python 3.6+ case I still can't recreate the problem every
time, but it does seem to happen the majority of the times I try.

I also tested with Spark 2.4.6 and still get the problem. I cannot try with
3.0.0, as that hits a fatal exception due to defect SPARK-32232.

The workaround you suggest isn't going to work in my case, as the code
sample I provided is a simplified version of what I'm actually doing in
Python. However, I think I have a workaround: forcing a cache/persist of the
data after the model has transformed the features. I cannot recreate the
issue if the Python UDF runs on the cached data in a separate action.

I will add another message if I find any more info.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
