Yep - that's correct. As an optimization, we save the shuffle output and
re-use it if you execute a stage twice. So this can make A/B tests like
this a bit confusing.
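
To make the effect concrete, here is a rough sketch (not taken from the original job) that runs the same uncached join twice and times both runs. The object name, the `timeMs` helper, the local master and the synthetic data are all assumptions for illustration. On the second run the map side of the shuffle should be able to re-use the files written by the first run, so it can look faster even though nothing was cached.

```scala
import org.apache.spark.{SparkConf, SparkContext}
// Needed for pair-RDD operations such as join on Spark versions before 1.3.
import org.apache.spark.SparkContext._

object ShuffleReuseSketch {
  // Hypothetical helper: evaluates `body` and returns elapsed wall-clock milliseconds.
  def timeMs[A](body: => A): Long = {
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1000000L
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, just so the sketch runs on one machine.
    val conf = new SparkConf().setAppName("shuffle-reuse-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Synthetic data (assumption): 100k entries per side over 1000 distinct keys.
      val left  = sc.parallelize(0 until 100000).map(i => (i % 1000, i))
      val right = sc.parallelize(0 until 100000).map(i => (i % 1000, i.toString))

      // No cache() anywhere; the join is what introduces the shuffle.
      val joined = left.join(right).map { case (x, (n, _)) => (x, n) }

      val first  = timeMs(joined.count()) // writes the shuffle output
      val second = timeMs(joined.count()) // same stages again, shuffle output already on disk
      println(s"first run: $first ms, second run: $second ms")
    } finally {
      sc.stop()
    }
  }
}
```

Timings from a single pair of runs are noisy (JIT warm-up alone can skew the first one), so treat the numbers as a hint rather than a measurement.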
- Patrick
On Friday, August 22, 2014, Nieyuan wrote:
> Because map-reduce tasks like join save shuffle data to disk, the
> only difference between the caching and no-caching versions is:
>> .map { case (x, (n, i)) => (x, n)}
-
Thanks,
Nieyuan
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Advantage-of-using
Hi,
Thank you for your response. I removed the issues you mentioned. Now I read
the RDDs from files, the whole RDD is cached, I don't use random numbers, and
rdd1 and rdd2 are identical.
The RDDs being joined contain 100k entries and the result contains 10m entries.
rdd1 and rdd2 after the join also contain 10m entries. Here
Your rdd2 and rdd3 differ in two ways, so it's hard to track the exact
effect of caching. In rdd3, in addition to the RDD being cached, you are
also doing a bunch of extra random number generation, which makes the
effect of caching hard to isolate.
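
One way to set up a cleaner comparison, sketched under assumptions (the `build` function, the `timeMs` helper, the object name and the synthetic data are hypothetical and not from the original job): build the same deterministic lineage twice and let a single flag decide whether `cache()` is called, so that caching is the only difference between the two variants.

```scala
import org.apache.spark.{SparkConf, SparkContext}
// Needed for pair-RDD operations such as mapValues on Spark versions before 1.3.
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object CacheOnlySketch {
  // Hypothetical helper: evaluates `body` and returns elapsed wall-clock milliseconds.
  def timeMs[A](body: => A): Long = {
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1000000L
  }

  // Builds the same lineage every time; `useCache` is the only thing that varies.
  // No random number generation in either variant.
  def build(sc: SparkContext, useCache: Boolean): RDD[(Int, Int)] = {
    val base = sc.parallelize(0 until 100000).map(i => (i % 1000, i))
    val transformed = base.mapValues(_ + 1)
    if (useCache) transformed.cache() else transformed
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, just so the sketch runs on one machine.
    val conf = new SparkConf().setAppName("cache-only-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      for (useCache <- Seq(false, true)) {
        val rdd = build(sc, useCache)
        val first  = timeMs(rdd.count()) // computes the RDD (and fills the cache if enabled)
        val second = timeMs(rdd.count()) // recomputes, or reads from the cache
        println(s"cache=$useCache first: $first ms, second: $second ms")
        rdd.unpersist() // no-op for the uncached variant
      }
    } finally {
      sc.stop()
    }
  }
}
```

This pipeline has no shuffle on purpose, so the shuffle-output re-use mentioned elsewhere in the thread does not blur the comparison.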
On Wed, Aug 20, 2014 at 7:48 AM, Gr