stage, it behaves as if persist(StorageLevel.DISK_ONLY) were called implicitly?
Regards,
Kang Liu
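For comparison with the implicit behaviour asked about above, here is a minimal
sketch of what an explicit DISK_ONLY persist looks like in the spark-shell
(variable names are reused from the trial further down purely for illustration):

import org.apache.spark.storage.StorageLevel

val r = sc.parallelize(0 to 50)
val r2 = r.keyBy(x => x).groupByKey(10)

// Explicitly ask Spark to keep r2's partitions on disk after the first computation.
r2.persist(StorageLevel.DISK_ONLY)

r2.count()   // first action materializes r2 and writes its partitions to disk
r2.count()   // later actions read the persisted partitions instead of recomputing

The implicit behaviour in question is different: it is the shuffle map output
files, not the RDD's own partitions, that stay on disk and get reused.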
From: Liu, Raymond
Date: 2014-06-27 11:02
To: user@spark.apache.org
Subject: RE: About StorageLevel
I think there is a shuffle stage involved. And the future count jobs will
depend on the output of that shuffle stage directly, as long as the map output
files are still available on disk, so the earlier stage does not need to be
recomputed.
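A rough way to observe this in the spark-shell (the time helper below is just
an ad-hoc utility, not a Spark API):

def time[T](body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"elapsed: ${(System.nanoTime() - start) / 1e6} ms")
  result
}

val r2 = sc.parallelize(0 to 50).keyBy(x => x).groupByKey(10)

time { r2.count() }   // runs the shuffle map stage and the result stage
time { r2.count() }   // the map output is reused, so the map stage is skipped

In recent Spark versions the web UI typically lists the reused map stage as
"skipped" for the second job.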
Sent: Friday, June 27, 2014 10:08 AM
To: user
Subject: Re: About StorageLevel
Thank you, Andrew, that's very helpful.
I still have some doubts about a simple trial: I opened a spark-shell in local
mode and typed in

val r = sc.parallelize(0 to 50)
val r2 = r.keyBy(x => x).groupByKey(10)

and then I invoked r2.count several times.
The first job obviously takes more time than the later ones. Is there some
magic underneath?
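One way to confirm that no RDD-level caching is happening in this trial (a
sketch; the exact output depends on the Spark version):

val r = sc.parallelize(0 to 50)
val r2 = r.keyBy(x => x).groupByKey(10)

r2.count()   // first run: executes both the shuffle map stage and the result stage
r2.count()   // later runs finish faster

// r2 was never persisted, so its storage level is still NONE:
println(r2.getStorageLevel == org.apache.spark.storage.StorageLevel.NONE)   // true
println(r2.toDebugString)   // shows the ShuffledRDD lineage, with no cached partitions

The speed-up of the later runs comes from reusing the shuffle files left by the
first job, not from any cache of r2 itself.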
Regards,
Kang Liu
From: Andrew Or
Date: 2014-06-27 02:25
To: user
Subject: Re: About StorageLevel
Hi Kang,
You raise a good point. Spark does not automatically cache all your RDDs.
Why? Simply because the application may create many RDDs, and not all of
them are to be reused. After all, there is only so much memory available to
each executor, and caching an RDD adds some overhead especially if