Thanks a lot for the response. Regarding the sampling part - yeah, that's what I need to do if there's no way of titrating the number of clusters online.
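Concretely, I am thinking of something along these lines for the sampling and the choice of k (a rough sketch only - `points` stands for the DStream[Vector] parsed from Kafka, and the candidate k values, the sampling fraction, and the feature dimension are all placeholders):

import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.dstream.DStream

// `points` is assumed to be a DStream[Vector] already parsed from Kafka.
def pickK(points: DStream[Vector]): Unit = {
  // Train on a 10% sample of each microbatch (the fraction is a placeholder).
  val training = points.transform(_.sample(false, 0.1))

  // One model per candidate k, all fed from the same sampled stream.
  val models = Seq(3, 5, 8).map { k =>
    k -> new StreamingKMeans()
      .setK(k)
      .setDecayFactor(1.0)
      .setRandomCenters(dim = 3, weight = 0.0) // dim must match the feature size
  }
  models.foreach { case (_, m) => m.trainOn(training) }

  // Compare the per-batch cost of each k on the full (unsampled) stream.
  points.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      models.foreach { case (k, m) =>
        println(s"k=$k cost=${m.latestModel().computeCost(rdd)}")
      }
    }
  }
}

The printed per-batch costs are just one way to eyeball which k tracks the stream best; a proper comparison would of course need more than that.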
I am using something like

dstream.foreachRDD { rdd =>
  if (rdd.count() > 0) {
    // .. logic
  }
}

Feels a little odd, but if that's the idiom then I will stick to it.

regards.

On Sat, Nov 19, 2016 at 10:52 PM, Cody Koeninger <c...@koeninger.org> wrote:

> So I haven't played around with streaming k-means at all, but given that
> no one responded to your message a couple of days ago, I'll say what I can.
>
> 1. Can you not sample out some % of the stream for training?
> 2. Can you run multiple streams at the same time with different values of
> k and compare their performance?
> 3. foreachRDD is fine in general; I can't speak to the specifics.
> 4. If you haven't done any transformations yet on a direct stream,
> foreachRDD will give you a KafkaRDD. Checking whether a KafkaRDD is empty
> is very cheap: it's done on the driver only, because the beginning and
> ending offsets are known. So you should be able to skip empty batches.
>
> On Sat, Nov 19, 2016 at 10:46 AM, debasishg <ghosh.debas...@gmail.com> wrote:
> > Hello -
> >
> > I am trying to implement an outlier detection application on streaming
> > data. I am a newbie to Spark and hence would like some advice on the
> > confusions that I have ..
> >
> > I am thinking of using StreamingKMeans - is this a good choice? I have
> > one stream of data and I need an online algorithm. But here are some
> > questions that immediately come to my mind ..
> >
> > 1. I cannot do separate training, cross-validation etc. Is it a good
> > idea to do training and prediction online?
> >
> > 2. The data will be read from the stream coming from Kafka in
> > microbatches of (say) 3 seconds. I get a DStream on which I train and
> > get the clusters. How can I decide on the number of clusters? Using
> > StreamingKMeans, is there any way I can iterate on microbatches with
> > different values of k to find the optimal one?
> >
> > 3. Even if I fix k, after training on every microbatch I get a DStream.
> > How can I compute things like a clustering score on the DStream?
> > StreamingKMeansModel has a computeCost function, but it takes an RDD.
> > I can use dstream.foreachRDD { // process the RDD for the microbatch
> > here } - is this the idiomatic way?
> >
> > 4. If I use dstream.foreachRDD { .. } and functions like new
> > StandardScaler().fit(rdd) to do feature normalization, it works when I
> > have data in the stream. But when the microbatch is empty (say I don't
> > have data for some time), the fit method throws an exception as it gets
> > an empty collection. Things start working again when data comes back to
> > the stream. But is this the way to go?
> >
> > Any suggestion will be welcome ..
> >
> > regards.

--
Debasish Ghosh
http://manning.com/ghosh2
http://manning.com/ghosh

Twttr: @debasishg
Blog: http://debasishg.blogspot.com
Code: http://github.com/debasishg
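P.S. For completeness, here is roughly what I have in mind for the empty-microbatch guard in point 4 (again a sketch only - `points` is the parsed DStream[Vector], and the scaler settings are placeholders). As Cody notes, for a KafkaRDD the emptiness check runs on the driver alone, because the beginning and ending offsets are known:

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.dstream.DStream

// `points` is assumed to be the DStream[Vector] from the Kafka direct stream.
def scaleEachBatch(points: DStream[Vector]): Unit = {
  points.foreachRDD { rdd =>
    // Skip empty microbatches so StandardScaler.fit never sees an
    // empty collection. For a KafkaRDD this check is driver-only.
    if (!rdd.isEmpty()) {
      val scaler = new StandardScaler(withMean = true, withStd = true).fit(rdd)
      val scaled = scaler.transform(rdd)
      // ... clustering / scoring logic on `scaled` goes here
    }
  }
}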