It seems you are not reducing the data in size. If that's the case, you are better off partitioning the data into buckets (folders?) and keeping the data sorted within those buckets. A cleaner approach is to use HBase to keep track of keys: keep adding keys as you find them and let HBase handle the lookups.
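Something like this, roughly (a sketch against the pre-1.0 HBase client API; the "parents" table name, the "e" column family and the serialized payload are placeholders, not anything prescribed):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    /** One row per parent, keyed by parentID; table/column names are made up. */
    public class ParentStore {
        private static final byte[] FAMILY = Bytes.toBytes("e");
        private static final byte[] QUALIFIER = Bytes.toBytes("payload");

        private final HTable table;

        public ParentStore() throws IOException {
            // HBase keeps rows sorted by key, so finding a parent later is a
            // cheap point lookup instead of a scan over old log files.
            Configuration conf = HBaseConfiguration.create();
            this.table = new HTable(conf, "parents");
        }

        /** Record a parent the moment it shows up in the logs. */
        public void putParent(String parentId, byte[] serializedParent) throws IOException {
            Put put = new Put(Bytes.toBytes(parentId));
            put.add(FAMILY, QUALIFIER, serializedParent);
            table.put(put);
        }

        /** Fetch the parent for an orphan child from an earlier day, or null. */
        public byte[] getParent(String parentId) throws IOException {
            Result result = table.get(new Get(Bytes.toBytes(parentId)));
            return result.isEmpty() ? null : result.getValue(FAMILY, QUALIFIER);
        }
    }

A child that shows up a day late then costs one point read instead of re-scanning yesterday's files.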
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>

On Mon, May 19, 2014 at 2:14 PM, Laurent T <laurent.thou...@ldmobile.net> wrote:

> (resending this as a lot of mails seem not to be delivered)
>
> Hi,
>
> I have some complex behavior I'd like to be advised on, as I'm really new
> to Spark.
>
> I'm reading some log files that contain various events. There are two
> types of events: parents and children. A child event can only have one
> parent, and a parent can have multiple children.
>
> Currently I'm mapping my lines to get a Tuple2(parentID, Tuple2(Parent,
> List<Child>)) and then reducing by key to combine all children into one
> list and associate them with their parent:
>
>     .reduceByKey(new Function2<Tuple2<Parent, List<Child>>,
>         Tuple2<Parent, List<Child>>, Tuple2<Parent, List<Child>>>(){...})
>
> It works fine on static data. But in production, I will have to process
> only part of the log files; for instance, every day at midnight I'll
> process the last day of logs.
>
> So I'm facing the problem that a parent may arrive one day and its
> children on the next day. Right after reducing, I'm left with tuples that
> have no parent, and I'd like, only for those, to go check the previous
> log files to find the parent in an efficient way.
>
> My first idea would be to branch the data using a filter and its
> opposite. I'd then read previous files one by one until I've found all
> parents or I've reached a predefined limit. I would finally merge
> everything back to finalize my job.
>
> The problem is, I'm not even sure how I can do that. The filter part
> should be easy, but how am I going to scan files one by one using Spark?
>
> I hope someone can guide me through this.
> FYI, there will be gigs of data to process.
>
> Thanks
> Laurent
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Advanced-log-processing-tp5743p6025.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
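For reference, the filter-and-its-opposite branching Laurent describes could look roughly like this in the Java API of that era (a sketch only: Parent and Child are stub placeholders, and "grouped" stands in for the RDD produced by the reduceByKey above):

    import java.util.List;

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.Function;

    import scala.Tuple2;

    public class BranchOrphans {
        // Stub placeholders for the domain classes from the message above.
        public static class Parent {}
        public static class Child {}

        /** True when the grouped record found its parent in today's logs. */
        static final Function<Tuple2<String, Tuple2<Parent, List<Child>>>, Boolean> HAS_PARENT =
            new Function<Tuple2<String, Tuple2<Parent, List<Child>>>, Boolean>() {
                @Override
                public Boolean call(Tuple2<String, Tuple2<Parent, List<Child>>> record) {
                    return record._2()._1() != null;
                }
            };

        public static void branch(JavaPairRDD<String, Tuple2<Parent, List<Child>>> grouped) {
            // Cache before branching, otherwise each filter below recomputes
            // the whole map/reduceByKey pipeline from the raw log files.
            grouped.cache();

            // Complete records: the parent arrived in the batch being processed.
            JavaPairRDD<String, Tuple2<Parent, List<Child>>> complete =
                grouped.filter(HAS_PARENT);

            // Orphans: children whose parent must be searched for in older files.
            JavaPairRDD<String, Tuple2<Parent, List<Child>>> orphans = grouped.filter(
                new Function<Tuple2<String, Tuple2<Parent, List<Child>>>, Boolean>() {
                    @Override
                    public Boolean call(Tuple2<String, Tuple2<Parent, List<Child>>> record)
                            throws Exception {
                        return !HAS_PARENT.call(record);
                    }
                });

            // From here, loop day by day: load the previous day's file, extract
            // parents keyed by ID, join against the orphans, move matches over
            // to the complete set, and stop once the orphans are gone or a
            // predefined day limit is reached.
        }
    }

Scanning old files one by one is then just an ordinary driver-side loop that calls sc.textFile(...) once per day and joins the result against the remaining orphans.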