e splittable). In reality, that's what I would really want to do in the first place.
Thanks again for your insights.
Darin.
From: Imran Rashid
To: Darin McBeath
Cc: User
Sent: Tuesday, February 17, 2015 3:29 PM
Subject: Re: MapValues and Shuffle Reads
Hi Darin,
When you say you "see 400GB of shuffle writes" from the first code snippet,
what do you mean? There is no action in that first set, so it won't do
anything. By itself, it won't do any shuffle writing, or anything else for
that matter.
Most likely, the .count() on your second code snip
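[Editor's note: the lazy-evaluation point above can be sketched as a minimal local-mode example. This is not code from the thread; the class name, sample data, and partition count are illustrative.]

```java
import java.util.Arrays;
import org.apache.spark.HashPartitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class LazyShuffleDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("LazyShuffleDemo")
            .setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaPairRDD<String, String> base = sc.parallelizePairs(
                Arrays.asList(new Tuple2<>("k1", "v1"),
                              new Tuple2<>("k2", "v2")));

            // partitionBy is a transformation: it only records lineage.
            // No job runs and no shuffle data is written at this point.
            JavaPairRDD<String, String> partitioned =
                base.partitionBy(new HashPartitioner(4));

            // Only an action triggers the job -- and with it the shuffle write
            // that shows up in the stage summary.
            long n = partitioned.count();
            System.out.println(n);
        }
    }
}
```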
In the following code, I read in a large sequence file from S3 (1TB) spread
across 1024 partitions. When I look at the job/stage summary, I see about
400GB of shuffle writes which seems to make sense as I'm doing a hash partition
on this file.
// Get the baseline input file
JavaPairRDD hsfBase
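[Editor's note: the snippet above is truncated after "JavaPairRDD hsfBase" (the archive likely also stripped the generic type parameters as HTML). A sketch of what the described step could look like follows; the Text/Text key and value types, the S3 path, and the class name are assumptions, not the original code.]

```java
import org.apache.hadoop.io.Text;
import org.apache.spark.HashPartitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HashPartitionSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("HashPartitionSketch")
            .setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Get the baseline input file: read the sequence file across
            // 1024 partitions, then hash-partition it. (Path and key/value
            // classes are placeholders; the original line is cut off.)
            JavaPairRDD<Text, Text> hsfBase =
                sc.sequenceFile("s3n://bucket/baseline",
                                Text.class, Text.class, 1024)
                  .partitionBy(new HashPartitioner(1024));

            // Hash-partitioning redistributes every record by key, so once an
            // action runs, the full dataset is shuffled -- consistent with the
            // ~400GB of shuffle writes reported for the 1TB input.
        }
    }
}
```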