(Resending this, as a lot of my mails do not seem to get delivered.)

Hi,
I have some complex behavior I'd like advice on, as I'm really new to Spark.

I'm reading log files that contain various events. There are two types of events: parents and children. A child event has exactly one parent, and a parent can have multiple children. Currently I'm mapping my lines into pairs of (parentId, Tuple2<Parent, List<Child>>) and then reducing by key to combine all the children into one list and associate them with their parent:

.reduceByKey(new Function2<Tuple2<Parent, List<Child>>, Tuple2<Parent, List<Child>>, Tuple2<Parent, List<Child>>>() {...})

This works fine on static data. But in production I will only process part of the log files: for instance, every day at midnight I'll process the previous day's logs. So I'm facing the problem that a parent may arrive one day and its children the next day. Right after reducing, I end up with tuples that have children but no parent, and only for those I'd like to go look in the previous log files to find the parent, in an efficient way.

My first idea is to branch the data using a filter and its opposite, then read the previous files one by one until I've found all the missing parents or reached a predefined limit, and finally merge everything back to finish the job. The problem is that I'm not even sure how to do this. The filter part should be easy, but how would I scan files one by one with Spark? (I've put rough sketches of both steps at the bottom of this mail.)

I hope someone can guide me through this. FYI, there will be gigabytes of data to process.

Thanks,
Laurent
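
P.S. Since code is easier to discuss than prose, here is roughly what my current map/reduceByKey step looks like. Parent, Child, isParentLine, parseParent and parseChild are placeholders for my own classes and helpers, not Spark APIs, the path is made up, and I'm using the 1.0-style Java API (mapToPair); on 0.9 it would be map with a PairFunction.

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("log-grouping"));

// Parse one day of logs into (parentId, (Parent, children)) pairs.
JavaRDD<String> lines = sc.textFile("hdfs:///logs/2014-05-20/*");   // made-up path

JavaPairRDD<String, Tuple2<Parent, List<Child>>> events =
    lines.mapToPair(new PairFunction<String, String, Tuple2<Parent, List<Child>>>() {
      public Tuple2<String, Tuple2<Parent, List<Child>>> call(String line) {
        if (isParentLine(line)) {                       // placeholder helper
          Parent p = parseParent(line);                 // placeholder helper
          return new Tuple2<String, Tuple2<Parent, List<Child>>>(
              p.getId(), new Tuple2<Parent, List<Child>>(p, new ArrayList<Child>()));
        } else {
          Child c = parseChild(line);                   // placeholder helper
          List<Child> children = new ArrayList<Child>();
          children.add(c);
          // parent not seen in this batch (yet): leave the Parent slot null for now
          return new Tuple2<String, Tuple2<Parent, List<Child>>>(
              c.getParentId(), new Tuple2<Parent, List<Child>>(null, children));
        }
      }
    });

// Combine everything sharing a parentId: keep whichever side has the Parent
// and concatenate the child lists.
JavaPairRDD<String, Tuple2<Parent, List<Child>>> byParent =
    events.reduceByKey(new Function2<Tuple2<Parent, List<Child>>,
                                     Tuple2<Parent, List<Child>>,
                                     Tuple2<Parent, List<Child>>>() {
      public Tuple2<Parent, List<Child>> call(Tuple2<Parent, List<Child>> a,
                                              Tuple2<Parent, List<Child>> b) {
        Parent parent = a._1() != null ? a._1() : b._1();
        List<Child> merged = new ArrayList<Child>(a._2());
        merged.addAll(b._2());
        return new Tuple2<Parent, List<Child>>(parent, merged);
      }
    });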
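
And here is a rough, untested sketch of the branch / backfill / merge idea, just to make it concrete (same imports as above, plus org.apache.spark.api.java.function.Function). pathForDaysAgo() and MAX_DAYS_BACK are made up, and I have no idea whether a join/subtractByKey per old file is a sensible way to do this; it's only meant to show what I'm after.

// Branch the result of the reduce: groups that found their parent vs. orphans.
JavaPairRDD<String, Tuple2<Parent, List<Child>>> complete =
    byParent.filter(new Function<Tuple2<String, Tuple2<Parent, List<Child>>>, Boolean>() {
      public Boolean call(Tuple2<String, Tuple2<Parent, List<Child>>> t) {
        return t._2()._1() != null;
      }
    });
JavaPairRDD<String, Tuple2<Parent, List<Child>>> orphans =
    byParent.filter(new Function<Tuple2<String, Tuple2<Parent, List<Child>>>, Boolean>() {
      public Boolean call(Tuple2<String, Tuple2<Parent, List<Child>>> t) {
        return t._2()._1() == null;
      }
    });

// Walk back over previous days' files until every orphan has found its parent
// or we give up.
final int MAX_DAYS_BACK = 7;   // made-up limit
for (int daysBack = 1; daysBack <= MAX_DAYS_BACK && orphans.count() > 0; daysBack++) {
  // Parents contained in one older log file, keyed by their id.
  JavaPairRDD<String, Parent> oldParents =
      sc.textFile(pathForDaysAgo(daysBack))             // placeholder helper building the path
        .filter(new Function<String, Boolean>() {
          public Boolean call(String line) { return isParentLine(line); }
        })
        .mapToPair(new PairFunction<String, String, Parent>() {
          public Tuple2<String, Parent> call(String line) {
            Parent p = parseParent(line);
            return new Tuple2<String, Parent>(p.getId(), p);
          }
        });

  // Orphans whose parent appears in this older file: adopt the parent, keep the children.
  JavaPairRDD<String, Tuple2<Parent, List<Child>>> repaired =
      orphans.join(oldParents).mapValues(
          new Function<Tuple2<Tuple2<Parent, List<Child>>, Parent>, Tuple2<Parent, List<Child>>>() {
            public Tuple2<Parent, List<Child>> call(
                Tuple2<Tuple2<Parent, List<Child>>, Parent> v) {
              return new Tuple2<Parent, List<Child>>(v._2(), v._1()._2());
            }
          });

  complete = complete.union(repaired);
  orphans = orphans.subtractByKey(oldParents);   // keep searching for the rest, one day further back
}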