Re: Process time series RDD after sortByKey

2015-03-17 Thread Shawn Zheng
create my own RDD class before (not RDD instance :)). But this is a very valuable approach to me so I am eager to learn. Regards, Shuai *From:* Imran Rashid [mailto:iras...@cloudera.c

Re: Process time series RDD after sortByKey

2015-03-16 Thread Imran Rashid
*From:* Imran Rashid [mailto:iras...@cloudera.com] *Sent:* Monday, March 16, 2015 11:22 AM *To:* Shawn Zheng; user@spark.apache.org *Subject:* Re: Process time series RDD after sortByKey Hi Shuai, On Sat, Mar 14, 2015 at 11:02

RE: Process time series RDD after sortByKey

2015-03-16 Thread Shuai Zheng
valuable approach to me so I am eager to learn. Regards, Shuai From: Imran Rashid [mailto:iras...@cloudera.com] Sent: Monday, March 16, 2015 11:22 AM To: Shawn Zheng; user@spark.apache.org Subject: Re: Process time series RDD after sortByKey Hi Shuai, On Sat, Mar 14, 2015 at

Re: Process time series RDD after sortByKey

2015-03-16 Thread Imran Rashid
Hi Shuai, On Sat, Mar 14, 2015 at 11:02 AM, Shawn Zheng wrote: > Sorry for the late response. > > Zhan Zhang's solution is very interesting and I looked into it, but it is > not what I want. Basically I want to run the job sequentially and also gain > parallelism. So if possible, if I have 1000 parti

Re: Process time series RDD after sortByKey

2015-03-11 Thread Imran Rashid
This is a very interesting use case. First of all, it's worth pointing out that if you really need to process the data sequentially, you are fundamentally limiting the parallelism you can get. E.g., if you need to process the entire data set sequentially, then you can't get any parallelism. If you

Re: Process time series RDD after sortByKey

2015-03-09 Thread Zhan Zhang
Does a code flow similar to the following work for you? It processes each partition of an RDD sequentially: while( iterPartition < RDD.partitions.length) { val res = sc.runJob(this, (it: Iterator[T]) => somFunc, iterPartition, allowLocal = true) Some other function after processing

Process time series RDD after sortByKey

2015-03-09 Thread Shuai Zheng
Hi All, I am processing some time series data. One day of data might be 500GB, so each hour is around 20GB. I need to sort the data before I start processing. Assume I can sort it successfully with dayRDD.sortByKey, but after that I might have thousands of partitions (to m