>> create my own RDD class before (not an RDD instance :) ). But this is a
>> very valuable approach to me, so I am eager to learn.
>>
>> Regards,
>>
>> Shuai
>>
>> *From:* Imran Rashid [mailto:iras...@cloudera.com]
>> *Sent:* Monday, March 16, 2015 11:22 AM
>> *To:* Shawn Zheng; user@spark.apache.org
>> *Subject:* Re: Process time series RDD after sortByKey
Hi Shuai,
On Sat, Mar 14, 2015 at 11:02 AM, Shawn Zheng wrote:
> Sorry I responded late.
>
> Zhan Zhang's solution is very interesting and I looked into it, but it is
> not what I want. Basically I want to run the job sequentially and also gain
> parallelism. So if possible, if I have 1000 parti…
this is a very interesting use case. First of all, it's worth pointing out
that if you really need to process the data sequentially, you are fundamentally
limiting the parallelism you can get. E.g., if you need to process the
entire data set sequentially, then you can't get any parallelism. If you …

Does a code flow similar to the following work for you? It processes each
partition of an RDD sequentially:
var iterPartition = 0
while (iterPartition < rdd.partitions.length) {
  val res = sc.runJob(rdd, (it: Iterator[T]) => someFunc(it),
    Seq(iterPartition), allowLocal = true)
  // ... some other function after processing res ...
  iterPartition += 1
}
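The driver-side loop above can be simulated without a cluster using plain Scala collections. The sketch below is illustrative only (the object name and the `someFunc` parameter are made up, not Spark API): each inner `Seq` plays the role of one RDD partition, and the while loop processes them strictly in order, just as calling `sc.runJob` on one partition index per iteration would.

```scala
object SequentialPartitions {
  // Process the "partitions" one at a time, in order, collecting each
  // per-partition result -- a local stand-in for calling sc.runJob on
  // a single partition index per loop iteration.
  def processPartitions[T, U](partitions: Seq[Seq[T]])(someFunc: Iterator[T] => U): Seq[U] = {
    val results = scala.collection.mutable.ArrayBuffer.empty[U]
    var iterPartition = 0
    while (iterPartition < partitions.length) {
      results += someFunc(partitions(iterPartition).iterator)
      iterPartition += 1
    }
    results.toSeq
  }
}
```

For example, `processPartitions(Seq(Seq(1, 2), Seq(3, 4)))(_.sum)` yields `Seq(3, 7)`, and the second partition is only touched after the first has finished, which is exactly the sequential guarantee being asked for.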
Hi All,

I am processing some time series data. For one day, it might be 500GB; then
for each hour, it is around 20GB of data.

I need to sort the data before I start processing. Assume I can sort it
successfully:

dayRDD.sortByKey

but after that, I might have thousands of partitions (to m…
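To make the sortByKey step concrete without Spark, here is a local sketch. The object name `HourlyPartitions` and the (hour, minute) key layout are assumptions for illustration: records are sorted by their full key (analogous to `dayRDD.sortByKey`) and the sorted run is then split by hour, mirroring how the sorted RDD ends up range-partitioned into roughly per-hour chunks.

```scala
object HourlyPartitions {
  type Record = ((Int, Int), String) // ((hour, minute), payload)

  // Sort by the full key (like dayRDD.sortByKey), then split the sorted
  // run into per-hour groups that stand in for the range partitions.
  def sortIntoHours(day: Seq[Record]): Seq[(Int, Seq[Record])] =
    day.sortBy(_._1)
       .groupBy { case ((hour, _), _) => hour }
       .toSeq
       .sortBy(_._1)
}
```

Each resulting (hour, records) group could then be handed to a sequential per-partition loop like the `sc.runJob` pattern discussed above.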