Task partition ID in Spark event logs

2017-07-20 Thread Michael Mior
I see there's a comment in the TaskInfo class that the index may not be the same as the ID of the RDD partition the task is computing. Under what circumstances *will* the ID be the same? If there are zero guarantees, any suggestions on how to grab this info from the scheduler to populate a new field…
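A minimal sketch of one way to surface the task index from the scheduler's event stream, assuming a Scala SparkListener registered on the driver. The listener class name is hypothetical, and this only exposes taskInfo.index alongside the task and stage IDs; it does not by itself answer when index equals the RDD partition ID:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Logs each finished task's index next to its task and stage IDs,
    // so index/partition questions can be inspected from the event stream.
    class TaskIndexListener extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val info = taskEnd.taskInfo
        println(s"stage=${taskEnd.stageId} taskId=${info.taskId} index=${info.index}")
      }
    }

    // Usage: sc.addSparkListener(new TaskIndexListener())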

What does spark.python.worker.memory affect?

2017-07-20 Thread Cyanny LIANG
Hi. As the documentation says: "spark.python.worker.memory — Amount of memory to use per python worker process during aggregation, in the same format as JVM memory strings (e.g. 512m, 2g). If the memory used during aggregation goes above this amount, it will spill the data into disks." I searched the con…
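A hedged sketch of where this setting lives, assuming it is set through SparkConf before the context starts; the config key and value format are from the documentation quoted above, and the app name is illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch, not a definitive answer to the thread: this key caps the
    // memory each PySpark worker process may use during aggregation
    // (e.g. reduceByKey/groupByKey); above the cap, partial aggregates
    // spill to disk. It sizes the Python worker processes, not the
    // executor JVM heap (that is spark.executor.memory).
    val conf = new SparkConf()
      .setAppName("python-worker-memory-demo") // illustrative name
      .set("spark.python.worker.memory", "512m") // same format as JVM memory strings
    val sc = new SparkContext(conf)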