When I run a groupByKey it seems to create a single task after the
groupByKey that never stops executing. I'm loading a smallish JSON dataset
of about 4 million records. This is the code I'm running:

rdd = sql_context.jsonFile(uri)  # sql_context is a pyspark.sql.SQLContext
rdd = rdd.cache()

grouped = rdd.map(lambda row: (row.id, row)).groupByKey(160)  # 160 partitions

grouped.take(1)

The groupByKey stage takes a few minutes with 160 tasks, which is expected.
However, it then creates a single task, "runJob at PythonRDD.scala:300",
that never ends; I gave up after 30 minutes.
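
For what it's worth, the same grouping can be expressed with combineByKey,
which builds plain Python lists instead of the iterable wrapper that
groupByKey returns. This is only a sketch (same rdd and partition count as
above), and I haven't verified whether it sidesteps the hang:

# Possible workaround (untested): build the per-id lists explicitly with
# combineByKey rather than groupByKey.
grouped = (rdd
           .map(lambda row: (row.id, row))
           .combineByKey(lambda v: [v],             # start a list for the first value
                         lambda acc, v: acc + [v],  # append within a partition
                         lambda a, b: a + b,        # merge lists across partitions
                         160))                      # same partition count as before
grouped.take(1)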

Screenshot of the Spark UI showing the hung task:
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n20559/Screen_Shot_2014-12-05_at_6.png>