When I run a groupByKey, it seems to create a single task after the groupByKey stage that never stops executing. I'm loading a smallish JSON dataset of about 4 million records. This is the code I'm running:
    rdd = sql_context.jsonFile(uri)
    rdd = rdd.cache()
    grouped = rdd.map(lambda row: (row.id, row)).groupByKey(160)
    grouped.take(1)

The groupByKey stage takes a few minutes with 160 tasks, which is expected. However, it then creates a single task, "runJob at PythonRDD.scala:300", that never ends. I gave up after 30 minutes.

Screenshot: http://apache-spark-user-list.1001560.n3.nabble.com/file/n20559/Screen_Shot_2014-12-05_at_6.png
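For anyone trying to reproduce this, here is the same snippet as a self-contained script. The JSON path and app name are placeholders I picked for illustration, not part of my actual job:

    # Minimal self-contained reproduction sketch (PySpark, Spark 1.1.1).
    # The JSON path below is a placeholder; substitute your own dataset.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="groupByKeyRepro")
    sql_context = SQLContext(sc)

    # jsonFile returns a SchemaRDD; each element is a Row with attribute access
    rdd = sql_context.jsonFile("hdfs:///path/to/data.json")
    rdd = rdd.cache()

    # key each record by its id and shuffle into 160 partitions
    grouped = rdd.map(lambda row: (row.id, row)).groupByKey(160)

    # take(1) is the action that triggers the follow-up single-task
    # "runJob at PythonRDD.scala:300" job that hangs for me
    print(grouped.take(1))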