> can't just magically ignore some time range of rdds, because they may
> contain events you care about.
>
> On Wed, Jan 6, 2016 at 10:55 AM, Julien Naour wrote:
>
>> The following lines are my understanding of Spark Streaming AFAIK, I
>> could be wrong:
>>
>>
Then you can do foreachPartition with a local map to store just a single
> event per user, e.g.
>
> foreachPartition { p =>
>   val m = new HashMap[UserId, Event]()
>   p.foreach { event =>
>     m.put(event.user, event)
>   }
>   m.foreach { case (user, event) =>
>     ... do your computation
>   }
> }
>
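A self-contained sketch of this partition-local pattern, using a plain Scala iterator in place of an RDD partition; the `Event` class and its field names are assumptions for illustration, not from the thread:

```scala
import scala.collection.mutable

// Hypothetical event type; the field names are assumptions.
case class Event(user: String, payload: String)

object LastEventPerUser {
  // Keep only the last event seen for each user within one partition's
  // iterator, mirroring the foreachPartition + local HashMap pattern above.
  def lastPerUser(partition: Iterator[Event]): Map[String, Event] = {
    val m = mutable.HashMap.empty[String, Event]
    partition.foreach(event => m.put(event.user, event)) // later events overwrite earlier ones
    m.toMap
  }
}
```

Because the map is local to the partition, no shuffle or shared state is needed; each partition ends up with at most one event per user id it saw.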
keys corresponding to some kind of user id. I want to process only the last
event per user id, i.e. skip the intermediate events for each user id.
I have only one Kafka topic with all these events.
Regards,
Julien Naour
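On a static collection, the "skip intermediate events per user id" idea can be sketched as a groupBy followed by maxBy on a timestamp; the `UserEvent` type and its fields are hypothetical names for illustration:

```scala
// Hypothetical event type with a timestamp; names are assumptions.
case class UserEvent(user: String, ts: Long, value: String)

object LatestByUser {
  // For each user id, keep only the most recent event; intermediate
  // events for the same user are dropped.
  def latest(events: Seq[UserEvent]): Map[String, UserEvent] =
    events.groupBy(_.user).map { case (u, evs) => u -> evs.maxBy(_.ts) }
}
```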
On Wed, Jan 6, 2016 at 4:13 PM, Cody Koeninger wrote:
> Have you read
>
StreamingContext and process the
same DStream at different speeds (low vs. high processing rate)?
Is it easily possible to share values (a map, for example) between pipelines
without using an external database? I think accumulators/broadcast variables
could work, but between two pipelines I'm not sure.
Regards,
Julien Naour
> And the current k-means implementation in MLlib benefits from sparse
> vector computing:
> http://spark-summit.org/2014/talk/sparse-data-support-in-mllib-2
>
>
>
> 2014-08-21 15:40 GMT+08:00 Julien Naour :
>
My arrays are in fact Array[Array[Long]] and around 17x150 000 (17 centers
with 150 000 modalities; I'm working on qualitative variables), so they are
pretty large. I'm working on making them smaller; it's mostly a sparse
matrix.
Good things to know nevertheless.
Thanks,
Julien
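To make the sparsity point concrete: a dense 17x150 000 Array[Array[Long]] stores every zero explicitly, while a sparse map keeps only the non-zero modalities. A rough sketch of one center in sparse form (the names here are illustrative, not MLlib's API):

```scala
object SparseCenter {
  // Sparse representation of one center: modality index -> non-zero count.
  type Sparse = Map[Int, Long]

  // Drop explicit zeros, keeping only index -> value pairs.
  def toSparse(dense: Array[Long]): Sparse =
    dense.zipWithIndex.collect { case (v, i) if v != 0L => i -> v }.toMap

  // A dot product over sparse centers touches only the stored entries,
  // instead of iterating all 150 000 modalities.
  def dot(a: Sparse, b: Sparse): Long =
    a.foldLeft(0L) { case (acc, (i, v)) => acc + v * b.getOrElse(i, 0L) }
}
```

When most modalities are zero, both memory and per-distance work scale with the number of non-zeros rather than the full dimension.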
broadcast instead of a
simple variable?
Cheers,
Julien Naour
You can find in the following presentation a simple example of a clustering
model used to classify new incoming tweets:
https://www.youtube.com/watch?v=sPhyePwo7FA
Regards,
Julien
2014-08-05 7:08 GMT+02:00 Xiangrui Meng :
> Some extra work is needed to close the loop. One related example is
> st
Hi,
My question is simple: could there be a performance issue in using
Accumulable/Accumulator instead of methods like map(), reduce(), etc.?
My use case: the implementation of a clustering algorithm like k-means.
At the beginning I used two steps, one to assign data to clusters and another
to calculate the new centers.
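The two steps described above can be sketched with plain map/groupBy over 1-D points, with no accumulators involved; this is a toy illustration of the structure, not MLlib's implementation:

```scala
object KMeansStep {
  // Assignment step: group each point under the index of its nearest center.
  def assign(points: Seq[Double], centers: Seq[Double]): Map[Int, Seq[Double]] =
    points.groupBy(p => centers.indices.minBy(i => math.abs(p - centers(i))))

  // Update step: each center becomes the mean of its assigned points;
  // a center with no points keeps its old position.
  def update(assigned: Map[Int, Seq[Double]], centers: Seq[Double]): Seq[Double] =
    centers.indices.map(i =>
      assigned.get(i).map(ps => ps.sum / ps.size).getOrElse(centers(i)))
}
```

Expressed this way, both steps are ordinary transformations whose results flow into the next iteration, rather than side effects collected through an accumulator.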