Re: Spark Streaming: process only last events

2016-01-06 Thread Julien Naour
an't just magically ignore some time > range of rdds, because they may contain events you care about. > > On Wed, Jan 6, 2016 at 10:55 AM, Julien Naour wrote: > >> The following lines are my understanding of Spark Streaming AFAIK, I >> could be wrong: >> >>

Re: Spark Streaming: process only last events

2016-01-06 Thread Julien Naour
n you can do foreachPartition with a local map to store just a single > event per user, e.g. > > foreachPartition { p => > val m = new HashMap > p.foreach { event => > m.put(event.user, event) > } > m.foreach { > ... do your computation > } >
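
A minimal sketch of the per-partition local map suggested in the quoted reply, written against plain Scala collections so it runs without a Spark cluster. The `Event` case class, its `timestamp` field, and the `latestPerUser` helper are hypothetical names introduced here for illustration; the thread does not show the real event schema. Comparing timestamps is also an assumption: the quoted snippet simply lets the last event seen overwrite earlier ones.

```scala
import scala.collection.mutable

// Hypothetical event type; the real schema is not shown in the thread.
final case class Event(user: String, timestamp: Long, payload: String)

object LatestEvents {
  // Keep a single event per user for one partition's iterator.
  // Assumption: "last" means the highest timestamp, not arrival order.
  def latestPerUser(partition: Iterator[Event]): Map[String, Event] = {
    val latest = mutable.HashMap.empty[String, Event]
    partition.foreach { event =>
      latest.get(event.user) match {
        // An event at least as recent is already stored: skip this one.
        case Some(prev) if prev.timestamp >= event.timestamp => ()
        // First event for this user, or a newer one: store it.
        case _ => latest.update(event.user, event)
      }
    }
    latest.toMap
  }
}
```

In a streaming job this helper would presumably be called as `rdd.foreachPartition { p => LatestEvents.latestPerUser(p).values.foreach(...) }`. It only deduplicates within one partition, so it relies on all events for a given user landing in the same partition, for example because the Kafka messages are keyed by user id.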

Re: Spark Streaming: process only last events

2016-01-06 Thread Julien Naour
keys corresponding to some kind of user id. I want to process only the last event for each user id, i.e. skip intermediate events by user id. I have only one Kafka topic with all these events. Regards, Julien Naour On Wed, Jan 6, 2016 at 16:13, Cody Koeninger wrote: > Have you read >

Spark Streaming: process only last events

2016-01-06 Thread Julien Naour
mingContext and process the same DStream at different speeds (slow processing vs fast)? Is it easily possible to share values (a map, for example) between pipelines without using an external database? I think an accumulator or broadcast variable could work, but between two pipelines I'm not sure. Regards, Julien Naour

Re: Broadcast vs simple variable

2014-08-21 Thread Julien Naour
427 > And the current k-means implementation in MLlib benefits from sparse > vector computing. > http://spark-summit.org/2014/talk/sparse-data-support-in-mllib-2 > > > > 2014-08-21 15:40 GMT+08:00 Julien Naour : > > My Arrays are in fact Array[Array[Long]] and l

Re: Broadcast vs simple variable

2014-08-21 Thread Julien Naour
My Arrays are in fact Array[Array[Long]] and like 17x15 (17 centers with 150 000 modalities, I'm working on qualitative variables) so they are pretty large. I'm working on making them smaller; it's mostly a sparse matrix. Good to know nevertheless. Thanks, Julien

Broadcast vs simple variable

2014-08-20 Thread Julien Naour
dcast instead of simple variable? Cheers, Julien Naour

Re: about spark and using machine learning model

2014-08-05 Thread Julien Naour
In the following presentation you can find a simple example of a clustering model used to classify new incoming tweets: https://www.youtube.com/watch?v=sPhyePwo7FA Regards, Julien 2014-08-05 7:08 GMT+02:00 Xiangrui Meng : > Some extra work is needed to close the loop. One related example is > st

Accumulator and Accumulable vs classic MR

2014-08-01 Thread Julien Naour
Hi, My question is simple: could there be a performance issue when using Accumulable/Accumulator instead of methods like map() and reduce()? My use case: implementing a clustering algorithm like k-means. At the beginning I used two steps, one to assign data to clusters and another to calculate new c
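
For reference, a minimal sketch of the map/reduce formulation the question compares against, written with plain Scala collections rather than RDDs so it is self-contained. It folds the two steps (assignment and center update) into one pass that maps each point to its cluster id and reduces per-cluster sums and counts. The `KMeansStep` object and its helpers are hypothetical names, not taken from the message, and the sketch says nothing about the relative performance of accumulators.

```scala
object KMeansStep {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // Index of the nearest center (the first one wins on ties).
  def closest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => squaredDistance(p, centers(i)))

  // One k-means iteration: map points to (clusterId, (point, 1)),
  // reduce to (sum, count) per cluster, then divide to get new centers.
  def step(points: Seq[Point], centers: Seq[Point]): Seq[Point] = {
    val sums: Map[Int, (Point, Int)] = points
      .map(p => closest(p, centers) -> (p, 1))
      .groupBy(_._1)
      .map { case (id, pairs) =>
        id -> pairs.map(_._2).reduce { (x, y) =>
          (x._1.zip(y._1).map { case (u, v) => u + v }, x._2 + y._2)
        }
      }
    centers.indices.map { i =>
      sums.get(i) match {
        case Some((sum, n)) => sum.map(_ / n)
        case None           => centers(i) // empty cluster keeps its old center
      }
    }
  }
}
```

On an RDD the same shape would typically be expressed as a `map` to keyed pairs followed by `reduceByKey`, with the current centers shipped to the workers once per iteration.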