Well, if we think of shuffling as a necessity to perform an operation, then the 
problem is that you are adding an aggregation stage to a job that is going to 
get shuffled anyway.  If you need to join two datasets, Spark will still 
shuffle the data, whether or not it was grouped by hostname beforehand.  My 
question is: what else would you expect to gain, beyond perhaps enforcing 
locality for a dataset that is already bucketed?  You could enforce that the 
data is where it is supposed to be, but what else would you avoid?
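
If the goal is to pre-organize data so that a later join or aggregation can 
skip the exchange, bucketing is the mechanism Spark actually supports for 
that.  A rough sketch of what I mean (the bucket count, column, and table 
names here are made up):

    # Write both sides bucketed and sorted on the join key (hypothetical names).
    df_a.write.bucketBy(64, 'user_id').sortBy('user_id').saveAsTable('events_bucketed')
    df_b.write.bucketBy(64, 'user_id').sortBy('user_id').saveAsTable('users_bucketed')

    # A join on the bucket column can then reuse the existing layout
    # instead of exchanging rows across the cluster.
    spark.table('events_bucketed').join(spark.table('users_bucketed'), 'user_id')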

Sent from my iPhone

> On Aug 27, 2018, at 12:22 PM, Patrick McCarthy 
> <pmccar...@dstillery.com.INVALID> wrote:
> 
> When debugging some behavior on my YARN cluster I wrote the following PySpark 
> UDF to figure out what host was operating on what row of data:
> 
> from pyspark.sql import functions as F, types as T
> 
> @F.udf(T.StringType())
> def add_hostname(x):
>     import socket  # imported inside the UDF so it resolves on the executors
>     return str(socket.gethostname())
> 
> It occurred to me that I could use this to enforce node-locality for other 
> operations:
> 
> df.withColumn('host', add_hostname(F.lit(''))).groupBy('host').apply(...)
> 
> When working on a big job without obvious partition keys, this seems like a 
> very straightforward way to avoid a shuffle, but it seems too easy. 
> 
> What problems would I introduce by trying to partition on hostname like this?
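
To make the shuffle point above concrete: the groupBy on the derived host 
column still shows up as an exchange in the physical plan.  A quick way to 
check, assuming the UDF from the quoted snippet:

    df.withColumn('host', add_hostname(F.lit(''))) \
      .groupBy('host').count().explain()
    # The plan contains an "Exchange hashpartitioning(host, ...)" node:
    # the hostname is computed on whatever node happens to hold each row,
    # and the rows are then shuffled by that value anyway.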
