Hi. As part of my attempt to port PySpark to Python 3, I've
re-applied, with modifications, Josh's old commit for using Dill with
PySpark (as Dill already supports Python 3). Alas, I ran into an odd
problem that I could use some help with.
Josh's old commit:
https://github.com/JoshRosen/incubator
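To make the mechanism concrete: PySpark has to ship Python functions from
the driver to worker processes as bytes, which is where the serializer
comes in. Here's a minimal sketch (my own toy names, not PySpark's actual
integration code) of Dill round-tripping a closure the way that shipping
requires:

```python
import dill

def make_adder(n):
    # A closure over `n`: the kind of object a shipped task function
    # often is, and one the standard pickle module rejects because it
    # can't pickle nested functions by reference.
    def add(x):
        return x + n
    return add

payload = dill.dumps(make_adder(10))  # driver side: function -> bytes
restored = dill.loads(payload)        # worker side: bytes -> function
assert restored(5) == 15
```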
I've created an issue for this but if anyone has any advice, please let me
know.
Basically, on about 10 GBs of data, saveAsTextFile() to HDFS hangs on two
remaining tasks (out of 320). Those tasks seem to be waiting on data from
another task on another node. Eventually (about 2 hours later) they t
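For concreteness, the job has roughly this shape (the paths, app name, and
transformation below are hypothetical stand-ins, not the actual job):

```python
from pyspark import SparkContext

sc = SparkContext(appName="save-as-text-repro")  # hypothetical app name
lines = sc.textFile("hdfs:///data/input")        # ~10 GB input, hypothetical path
result = lines.map(lambda line: line.upper())    # stand-in transformation
# This is the write that hangs on the last 2 of 320 tasks:
result.saveAsTextFile("hdfs:///data/output")     # hypothetical output path
```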
Hi all,
Could any admin assign this issue
https://issues.apache.org/jira/browse/SPARK-2126 to me?
I have started working on it.
Thanks,
--
Nan Zhu
I'll make a comment on the JIRA - thanks for reporting this, let's get
to the bottom of it.
On Thu, Jun 19, 2014 at 11:19 AM, Surendranauth Hiraman
wrote:
> I've created an issue for this but if anyone has any advice, please let me
> know.
>
> Basically, on about 10 GBs of data, saveAsTextFile()
Thanks for helping with the Dill integration; I had some early attempts
of my own, but had to set them aside when I got busy with other work.
Just to bring everyone up to speed on the context:
There are some objects that PySpark’s `cloudpickle` library doesn’t serialize
properly, such as ope
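As a rough illustration of the class of problem (using the standard
`pickle` module as a stand-in; cloudpickle's exact failure modes aren't
reproduced here), some callables that `pickle` rejects outright round-trip
fine through `dill`:

```python
import pickle
import dill

double = lambda x: x * 2

# Standard pickle serializes functions by reference (module + name), so a
# lambda, which has no usable name to look up, fails to pickle.
try:
    pickle.dumps(double)
except pickle.PicklingError as exc:
    print("pickle failed:", exc)

# dill serializes the function's code object by value, so this works.
blob = dill.dumps(double)
assert dill.loads(blob)(21) == 42
```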
While trying to execute a job using spark-submit, I discovered a
scala.MatchError at runtime: a DriverStateChanged.FAILED message was sent
to an actor, and the match statement handling it did not take that value
into account.
When I inspected the DriverStateChange.scala file I discovered that it