Shiwei, yes, you might be right. Thanks. :)

Best,
Yifan LI





> On 12 Oct 2015, at 12:55, 郭士伟 <guoshi...@gmail.com> wrote:
> 
> I think this is not a problem Spark can solve effectively, cause RDD in 
> immutable. Every time you want to change an RDD, you create a new one, and 
> sort again. Maybe hbase or some other DB system will be a more suitable 
> solution. Or, if the data can fit into memory, use a simple heap will work.
> 
> 
> 2015-10-12 18:29 GMT+08:00 Yifan LI <iamyifa...@gmail.com 
> <mailto:iamyifa...@gmail.com>>:
> Hey Adrian,
> 
> Thanks for your fast reply. :)
> 
> Actually the “pre-condition” is not fixed in real application, e.g. it would 
> change based on counting of previous unmatched elements.
> So I need to use iterator operator, rather than flatMap-like operators…
> 
> Besides, do you have any idea on how to avoid that “sort again”? it is too 
> costly… :(
> 
> Anyway thank you again!
> 
> Best,
> Yifan LI
> 
> 
> 
> 
> 
>> On 12 Oct 2015, at 12:19, Adrian Tanase <atan...@adobe.com 
>> <mailto:atan...@adobe.com>> wrote:
>> 
>> I think you’re looking for the flatMap (or flatMapValues) operator – you can 
>> do something like
>> 
>> sortedRdd.flatMapValues( v =>
>> If (v % 2 == 0) {
>> Some(v / 2)
>> } else {
>> None
>> }
>> )
>> 
>> Then you need to sort again.
>> 
>> -adrian
>> 
>> From: Yifan LI
>> Date: Monday, October 12, 2015 at 1:03 PM
>> To: spark users
>> Subject: "dynamically" sort a large collection?
>> 
>> Hey,
>> 
>> I need to scan a large "key-value" collection as below:
>> 
>> 1) sort it on an attribute of “value” 
>> 2) scan it one by one, from element with largest value
>> 2.1) if the current element matches a pre-defined condition, its value will 
>> be reduced and the element will be inserted back to collection. 
>> if not, this current element should be removed from collection.
>> 
>> 
>> In my previous program, the 1) step can be easily conducted in Spark(RDD 
>> operation), but I am not sure how to do 2.1) step, esp. the “put/inserted 
>> back” operation on a sorted RDD.
>> I have tried to make a new RDD at every-time an element was found to 
>> inserted, but it is very costly due to a re-sorting…
>> 
>> 
>> Is there anyone having some ideas?
>> 
>> Thanks so much!
>> 
>> ******************
>> an example:
>> 
>> the sorted result of initial collection C(on bold value), sortedC:
>> (1, (71, “aaa"))
>> (2, (60, “bbb"))
>> (3, (53.5, “ccc”))
>> (4, (48, “ddd”))
>> (5, (29, “eee"))
>> …
>> 
>> pre-condition: its_value%2 == 0
>> if pre-condition is matched, its value will be reduce on half.
>> 
>> Thus:
>> 
>> #1:
>> 71 is not matched, so this element is removed.
>> (1, (71, “aaa”)) —> removed!
>> (2, (60, “bbb"))
>> (3, (53.5, “ccc”))
>> (4, (48, “ddd”))
>> (5, (29, “eee"))
>> …
>> 
>> #2:
>> 60 is matched! 60/2 = 30, the collection right now should be as:
>> (3, (53.5, “ccc”))
>> (4, (48, “ddd”))
>> (2, (30, “bbb”)) <— inserted back here
>> (5, (29, “eee"))
>> …
>> 
>> 
>> 
>> 
>> 
>> 
>> Best,
>> Yifan LI
>> 
>> 
>> 
>> 
>> 
> 
> 

Reply via email to