Re: Finding bad data

2014-04-24 Thread Matei Zaharia
Hey Jim, this is unfortunately harder than I’d like right now, but here’s how to do it. Look at the stderr file of the executor on that machine, and you’ll see lines like this:

14/04/24 19:17:24 INFO HadoopRDD: Input split: file:/Users/matei/workspace/apache-spark/README.md:0+2000

This says wh
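The "Input split" line above names the source file plus the byte range (`start+length`) the executor was reading. A minimal sketch of pulling those fields out of an executor stderr line in Python (the regex here is an assumption based on the single log line shown, not a documented Spark log format):

```python
import re

# Assumed shape of HadoopRDD's "Input split" message: <path>:<start>+<length>
SPLIT_RE = re.compile(r"Input split: (?P<path>\S+):(?P<start>\d+)\+(?P<length>\d+)")

def parse_input_split(log_line):
    """Return (path, start_offset, length), or None if the line doesn't match."""
    m = SPLIT_RE.search(log_line)
    if m is None:
        return None
    return m.group("path"), int(m.group("start")), int(m.group("length"))

line = ("14/04/24 19:17:24 INFO HadoopRDD: Input split: "
        "file:/Users/matei/workspace/apache-spark/README.md:0+2000")
print(parse_input_split(line))
# → ('file:/Users/matei/workspace/apache-spark/README.md', 0, 2000)
```

With the path and byte range in hand, you can re-read just that slice of the input file locally to hunt for the bad record.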

Finding bad data

2014-04-24 Thread Jim Blomo
I'm using PySpark to load some data and getting an error while parsing it. Is it possible to find the source file and line of the bad data? I imagine that this would be extremely tricky when dealing with multiple derived RDDs, so an answer with the caveat of "this only works when running .map() o
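One common workaround for this kind of problem is a defensive-parse pattern: wrap the parser so a failure returns the offending raw record instead of crashing the job. The sketch below uses a plain list in place of an RDD so it runs without a Spark cluster; in real PySpark you would pass `safe_parse` to `rdd.map()`. The record format (`"key,int_value"`) is an assumption for illustration.

```python
def safe_parse(raw_line):
    """Tag each record as ("ok", parsed) or ("bad", (raw_line, error))."""
    try:
        # Stand-in parser: expects "key,int_value" records (an assumption).
        key, value = raw_line.split(",")
        return ("ok", (key, int(value)))
    except (ValueError, TypeError) as e:
        return ("bad", (raw_line, repr(e)))

records = ["a,1", "b,two", "c,3"]
parsed = [safe_parse(r) for r in records]   # in Spark: rdd.map(safe_parse)
bad = [payload for tag, payload in parsed if tag == "bad"]
print(bad)  # each entry carries the raw bad record plus the parse error
```

Filtering on the `"bad"` tag surfaces the exact offending records without digging through executor logs, though it won't tell you which source file they came from.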