It seems you are not reducing the data in size. If you are not, then you are
better off partitioning the data into buckets (folders?) and keeping the data
sorted within those buckets.
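
e.g. something along these lines (an untested sketch; the bucket count,
path layout and String value type are placeholders):

import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;

public class BucketWriter {
  // events: (parentId, rawEventLine) pairs, however you build them upstream
  public static void writeDailyBuckets(JavaPairRDD<String, String> events,
                                       String day) {
    int numBuckets = 64; // placeholder, tune to your volume
    events
        // the same parentId hashes to the same bucket on every daily run
        .partitionBy(new HashPartitioner(numBuckets))
        // one folder per day, one part-NNNNN file per bucket
        .saveAsTextFile("hdfs:///logs/by-parent/" + day);
  }
}

That way, when a child is missing its parent, you only re-read the matching
bucket from earlier days instead of the whole day's logs.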
A cleaner approach is to use HBase to keep track of keys: keep adding
keys as you find them and let HBase handle the lookups.
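
A rough sketch with the plain HBase client API (the table name, column
family and qualifier below are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ParentKeyStore {
  private final HTable table;

  public ParentKeyStore() throws Exception {
    Configuration conf = HBaseConfiguration.create();
    table = new HTable(conf, "parent_keys"); // placeholder table name
  }

  // record a parent as soon as it is seen, keyed by its id
  public void putParent(String parentId, byte[] serializedParent) throws Exception {
    Put put = new Put(Bytes.toBytes(parentId));
    put.add(Bytes.toBytes("f"), Bytes.toBytes("parent"), serializedParent);
    table.put(put);
  }

  // look up the parent of an orphaned child; null if not seen yet
  public byte[] getParent(String parentId) throws Exception {
    Result result = table.get(new Get(Bytes.toBytes(parentId)));
    return result.isEmpty()
        ? null
        : result.getValue(Bytes.toBytes("f"), Bytes.toBytes("parent"));
  }
}

You'd open one such store per partition (inside mapPartitions) rather than
per record, so you don't create a connection for every event.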

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Mon, May 19, 2014 at 2:14 PM, Laurent T <laurent.thou...@ldmobile.net> wrote:

> (resending this, as a lot of mails seem not to have been delivered)
>
> Hi,
>
> I have some complex behavior I'd like to be advised on, as I'm really new
> to Spark.
>
> I'm reading some log files that contain various events. There are two
> types of events: parents and children. A child event can only have one
> parent, and a parent can have multiple children.
>
> Currently I'm mapping my lines to get a Tuple2(parentID, Tuple2(Parent,
> List<Child>)) and then reducing by key to combine all children into one
> list and associate them with their parent:
>
> .reduceByKey(new Function2<Tuple2<Parent, List<Child>>,
>     Tuple2<Parent, List<Child>>, Tuple2<Parent, List<Child>>>(){...})
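>
> Spelled out, the merge function is roughly this (Parent and Child are my
> own classes, and the parent slot may be null on both sides):
>
> .reduceByKey(new Function2<Tuple2<Parent, List<Child>>,
>     Tuple2<Parent, List<Child>>, Tuple2<Parent, List<Child>>>() {
>   public Tuple2<Parent, List<Child>> call(Tuple2<Parent, List<Child>> a,
>                                           Tuple2<Parent, List<Child>> b) {
>     // keep whichever side actually carries the parent
>     Parent parent = a._1() != null ? a._1() : b._1();
>     // concatenate the two child lists
>     List<Child> children = new ArrayList<Child>(a._2());
>     children.addAll(b._2());
>     return new Tuple2<Parent, List<Child>>(parent, children);
>   }
> })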
>
> It works fine on static data. But in production, I will have to process
> only part of the log files: for instance, every day at midnight I'll
> process the last day of logs.
>
> So I'm facing the problem that a parent may arrive one day and its
> children the next day. Right after reducing, I'm left with tuples that
> have no parent, and only for those I'd like to go check the previous log
> files to find the parent in an efficient way.
>
> My first idea would be to branch the data using a filter and its
> opposite. I'd then read previous files one by one until I've found all
> parents or I've reached a predefined limit. I would finally merge
> everything back to finalize my job.
> The problem is, I'm not even sure how I can do that. The filter part
> should be easy, but how am I going to scan files one by one using Spark?
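>
> Roughly what I picture (sc is my JavaSparkContext; the filter predicates,
> line parsing and merge helpers are placeholders I'd still have to write):
>
> JavaPairRDD<String, Tuple2<Parent, List<Child>>> resolved =
>     reduced.filter(hasParent);
> JavaPairRDD<String, Tuple2<Parent, List<Child>>> orphans =
>     reduced.filter(hasNoParent);
>
> // driver-side loop over previous days, newest first, up to some limit
> for (String day : previousDays) {
>   if (orphans.count() == 0) break;
>   JavaPairRDD<String, Parent> oldParents =
>       sc.textFile("hdfs:///logs/" + day).mapToPair(parseParentLine);
>   // orphans whose parent shows up in this older file get resolved
>   resolved = resolved.union(orphans.join(oldParents).mapValues(attachParent));
>   orphans = orphans.subtractByKey(oldParents);
> }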
>
> I hope someone can guide me through this.
> FYI, there will be gigs of data to process.
>
> Thanks
> Laurent
>
>
>
>
