Shiwei, yes, you might be right. Thanks. :)

Best,
Yifan LI

> On 12 Oct 2015, at 12:55, 郭士伟 <guoshi...@gmail.com> wrote:
>
> I think this is not a problem Spark can solve effectively, because RDDs are
> immutable. Every time you want to change an RDD, you create a new one and
> sort again. Maybe HBase or some other DB system would be a more suitable
> solution. Or, if the data fits into memory, a simple heap will work.
>
>
> 2015-10-12 18:29 GMT+08:00 Yifan LI <iamyifa...@gmail.com>:
> Hey Adrian,
>
> Thanks for your fast reply. :)
>
> Actually the "pre-condition" is not fixed in the real application, e.g. it
> changes based on the count of previously unmatched elements.
> So I need an iterator-style operator rather than flatMap-like operators…
>
> Besides, do you have any idea how to avoid that "sort again"? It is too
> costly… :(
>
> Anyway, thank you again!
>
> Best,
> Yifan LI
>
>> On 12 Oct 2015, at 12:19, Adrian Tanase <atan...@adobe.com> wrote:
>>
>> I think you're looking for the flatMap (or flatMapValues) operator. You can
>> do something like
>>
>> // keep even values (halved), drop everything else
>> sortedRdd.flatMapValues( v =>
>>   if (v % 2 == 0) {
>>     Some(v / 2)
>>   } else {
>>     None
>>   }
>> )
>>
>> Then you need to sort again.
>>
>> -adrian
>>
>> From: Yifan LI
>> Date: Monday, October 12, 2015 at 1:03 PM
>> To: spark users
>> Subject: "dynamically" sort a large collection?
>>
>> Hey,
>>
>> I need to scan a large "key-value" collection as below:
>>
>> 1) sort it on an attribute of the "value"
>> 2) scan it one by one, starting from the element with the largest value
>> 2.1) if the current element matches a pre-defined condition, its value is
>> reduced and the element is inserted back into the collection;
>> if not, the current element is removed from the collection.
>>
>>
>> In my previous program, step 1) is easily done in Spark (an RDD
>> operation), but I am not sure how to do step 2.1), especially the
>> "put/insert back" operation on a sorted RDD.
>> I have tried creating a new RDD every time an element has to be
>> inserted back, but that is very costly because of the re-sorting…
>>
>>
>> Does anyone have any ideas?
>>
>> Thanks so much!
>>
>> ******************
>> An example:
>>
>> The sorted result of the initial collection C (sorted on the numeric value), sortedC:
>> (1, (71, "aaa"))
>> (2, (60, "bbb"))
>> (3, (53.5, "ccc"))
>> (4, (48, "ddd"))
>> (5, (29, "eee"))
>> …
>>
>> pre-condition: its_value % 2 == 0
>> If the pre-condition is matched, the value is reduced by half.
>>
>> Thus:
>>
>> #1:
>> 71 is not matched, so this element is removed.
>> (1, (71, "aaa")) --> removed!
>> (2, (60, "bbb"))
>> (3, (53.5, "ccc"))
>> (4, (48, "ddd"))
>> (5, (29, "eee"))
>> …
>>
>> #2:
>> 60 is matched! 60/2 = 30, so the collection should now be:
>> (3, (53.5, "ccc"))
>> (4, (48, "ddd"))
>> (2, (30, "bbb")) <-- inserted back here
>> (5, (29, "eee"))
>> …
>>
>>
>> Best,
>> Yifan LI
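
For reference, the in-memory heap idea suggested in this thread could look roughly like the sketch below. It assumes the whole collection fits on one machine and does not use Spark at all. The names HeapScan, Elem, scan and byValue are hypothetical, and the sample data is the sortedC example from the original message. The condition also receives the number of elements removed so far, since the real pre-condition is said to depend on that count.

```scala
import scala.collection.mutable

object HeapScan {
  // Hypothetical element shape, mirroring the thread's example: (id, (value, label)).
  type Elem = (Int, (Double, String))

  // Max-heap ordering on the numeric value only.
  private val byValue: Ordering[Elem] = new Ordering[Elem] {
    def compare(a: Elem, b: Elem): Int =
      java.lang.Double.compare(a._2._1, b._2._1)
  }

  /** Repeatedly pops the element with the largest value.
    * If cond(value, removedSoFar) holds, the value is replaced by reduce(value)
    * and the element is pushed back; otherwise the element is dropped.
    * Returns the ids in the order they were removed.
    * Assumption: reduce eventually makes cond false, otherwise the loop never ends.
    */
  def scan(
      items: Seq[Elem],
      cond: (Double, Int) => Boolean,
      reduce: Double => Double
  ): Vector[Int] = {
    val heap = mutable.PriorityQueue.empty[Elem](byValue)
    heap ++= items
    val removed = Vector.newBuilder[Int]
    var removedSoFar = 0
    while (heap.nonEmpty) {
      val (id, (value, label)) = heap.dequeue()
      if (cond(value, removedSoFar)) {
        // "Insert back": the heap restores the order in O(log n), no re-sort.
        heap.enqueue((id, (reduce(value), label)))
      } else {
        removed += id
        removedSoFar += 1
      }
    }
    removed.result()
  }

  // Sample data taken from the example in the original message.
  val sortedC: Seq[Elem] = Seq(
    (1, (71.0, "aaa")),
    (2, (60.0, "bbb")),
    (3, (53.5, "ccc")),
    (4, (48.0, "ddd")),
    (5, (29.0, "eee"))
  )
}
```

With the pre-condition from the example (value % 2 == 0, guarded against 0 so the loop terminates) and halving as the reduction, the call HeapScan.scan(HeapScan.sortedC, (v, _) => v != 0 && v % 2 == 0, _ / 2) removes the ids in the order 1, 3, 5, 2, 4. Each pop or push costs O(log n), so nothing is ever re-sorted.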