RE: About StorageLevel

2014-06-26 Thread tomsheep...@gmail.com
stage, it behaves like there is a "persist(StorageLevel.DISK_ONLY)" called implicitly? Regards, Kang Liu From: Liu, Raymond Date: 2014-06-27 11:02 To: user@spark.apache.org Subject: RE: About StorageLevel I think there is a shuffle stage involved. And the future count jobs will depend on
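
The distinction being discussed here can be sketched in a Spark shell; the snippet below is illustrative only (the data and partition counts are arbitrary, not taken from the thread) and contrasts relying on shuffle-file reuse with asking for disk storage explicitly:

    import org.apache.spark.storage.StorageLevel

    val pairs = sc.parallelize(0 to 50).keyBy(x => x)

    // Without any persist call: groupByKey still writes its map-side shuffle
    // output to local disk, and later jobs reuse those shuffle files and skip
    // the map stage. That reuse is what can look like an implicit DISK_ONLY persist.
    val grouped = pairs.groupByKey(10)
    grouped.count()   // runs the map stage and the shuffle read
    grouped.count()   // skips the map stage, only re-reads the shuffle files

    // With an explicit persist: the grouped partitions themselves are stored on
    // disk, so later jobs skip the shuffle read as well.
    val groupedOnDisk = pairs.groupByKey(10).persist(StorageLevel.DISK_ONLY)
    groupedOnDisk.count()   // computes the RDD and writes its blocks to disk
    groupedOnDisk.count()   // served from the persisted blocks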

RE: About StorageLevel

2014-06-26 Thread Liu, Raymond
: Friday, June 27, 2014 10:08 AM To: user Subject: Re: About StorageLevel Thank you Andrew, that's very helpful. I still have some doubts about a simple trial: I opened a Spark shell in local mode, and typed in val r=sc.parallelize(0 to 50) val r2=r.keyBy(x=>x).groupByKey(10) and then I inv
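
The trial quoted above, reproduced as a runnable Spark-shell session (only the two definitions come from the mail; the comments describe the usual behaviour, and actual timings depend on your machine):

    val r  = sc.parallelize(0 to 50)
    val r2 = r.keyBy(x => x).groupByKey(10)   // groupByKey introduces a shuffle into 10 partitions

    r2.count()   // first action: runs the map stage and builds the shuffle output, noticeably slower
    r2.count()   // later actions: reuse the shuffle files already on local disk, faster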

Re: About StorageLevel

2014-06-26 Thread tomsheep...@gmail.com
es) The first job obviously takes more time than the later ones. Is there some magic underneath? Regards, Kang Liu From: Andrew Or Date: 2014-06-27 02:25 To: user Subject: Re: About StorageLevel Hi Kang, You raise a good point. Spark does not automatically cache all your RDDs. Why? Simply bec

Re: About StorageLevel

2014-06-26 Thread Andrew Or
Hi Kang, You raise a good point. Spark does not automatically cache all your RDDs. Why? Simply because the application may create many RDDs, and not all of them are meant to be reused. After all, there is only so much memory available to each executor, and caching an RDD adds some overhead, especially if
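
Since caching is opt-in, a minimal sketch of asking for it explicitly follows; the input path and storage level are placeholders for illustration, not anything recommended in the thread:

    import org.apache.spark.storage.StorageLevel

    // Caching is opt-in: nothing is kept across jobs unless you ask for it.
    val words = sc.textFile("hdfs:///some/path")     // placeholder path
      .flatMap(_.split(" "))

    words.persist(StorageLevel.MEMORY_ONLY)          // or simply words.cache()

    words.count()      // first action computes the RDD and fills the cache
    words.count()      // later actions read the cached partitions

    words.unpersist()  // release executor memory once the RDD is no longer needed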