Hey Jim, this is unfortunately harder than I’d like right now, but here’s how
to do it. Look at the stderr file of the executor on that machine, and you’ll
see lines like this:
14/04/24 19:17:24 INFO HadoopRDD: Input split:
file:/Users/matei/workspace/apache-spark/README.md:0+2000
This says which file and byte range that task was reading (here, the first
2000 bytes of README.md), so when a task fails you can narrow the bad data
down to that split of that file.
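Beyond reading the executor logs, one workaround is to carry provenance along with each record: tag every line with its source file and line number at load time, so a parse failure can report exactly where the bad data came from. Here is a minimal sketch of that idea in plain Python (in PySpark you would do the same tagging inside something like `sc.wholeTextFiles(...).flatMap(...)`; the file name, contents, and toy parser below are made up for illustration):

```python
def tag_lines(filename, text):
    # Pair every line with (filename, line_number) so any later
    # parse error can report exactly where the bad record came from.
    return [((filename, lineno), line)
            for lineno, line in enumerate(text.splitlines(), start=1)]

def parse_or_report(tagged):
    # Stand-in for the real record parser: try to parse, and on
    # failure raise an error that includes the record's provenance.
    (filename, lineno), line = tagged
    try:
        return int(line)
    except ValueError:
        raise ValueError(
            "bad record at %s:%d: %r" % (filename, lineno, line))

# Fake file contents standing in for data read from disk.
records = tag_lines("data/part-00000.txt", "10\n20\noops\n30")

parsed, errors = [], []
for rec in records:
    try:
        parsed.append(parse_or_report(rec))
    except ValueError as e:
        errors.append(str(e))
```

The caveat from the question still applies: the tags only survive as long as your transformations carry them along, so this locates bad input when parsing the loaded data directly, not records in arbitrarily derived RDDs.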
I'm using PySpark to load some data, and I'm getting an error while
parsing it. Is it possible to find the source file and line of the bad
data? I imagine that this would be extremely tricky when dealing with
multiple derived RDDs, so an answer with the caveat of "this only
works when running .map() on the data as originally loaded" would be fine.