Re: [MLlib] kmeans random initialization, same seed every time

2017-03-14 Thread Julian Keppel
I'm sorry, I missed some important information. I use Spark version 2.0.2 with Scala 2.11.8. 2017-03-14 13:44 GMT+01:00 Julian Keppel : > Hi everybody, > > I make some experiments with the Spark kmeans implementation of the new > DataFrame-API. I compare clustering results of di

[MLlib] kmeans random initialization, same seed every time

2017-03-14 Thread Julian Keppel
Hi everybody, I am running some experiments with the Spark kmeans implementation of the new DataFrame API. I compare clustering results of different runs with different parameters. I noticed that for random initialization mode, the seed value is the same every time. How is it calculated? In my unders
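A possible explanation (an assumption based on Spark's shared `HasSeed` param, worth verifying against your Spark version): the seed defaults to the hash code of the estimator's class name, which is identical on every run unless `setSeed` is called explicitly. A minimal pure-Scala sketch of that behavior:

```scala
// Sketch of the assumed default-seed rule: seed = className.hashCode.toLong.
// If this is what Spark does, the "random" init is deterministic across runs
// unless setSeed(...) is called with a varying value.
object SeedDemo {
  def defaultSeedFor(className: String): Long = className.hashCode.toLong

  def main(args: Array[String]): Unit = {
    val a = defaultSeedFor("org.apache.spark.ml.clustering.KMeans")
    val b = defaultSeedFor("org.apache.spark.ml.clustering.KMeans")
    assert(a == b) // same value on every JVM run, hence identical inits
    println(a)
  }
}
```

If that is the cause, passing an explicit varying seed (e.g. `setSeed(System.nanoTime)`) would give a different initialization per run.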

[Spark DataFrames/Streaming]: Bad performance with window function in streaming job

2017-01-16 Thread Julian Keppel
Hi, I use Spark 2.0.2 and want to do the following: I extract features in a streaming job and then apply the records to a k-means model. Some of the features are simple ones which are calculated directly from the record. But I also have more complex features which depend on records from a specifi
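The "apply the records to a k-means model" step reduces, per record, to a nearest-centroid lookup. A self-contained illustrative sketch of that step (names here are mine for illustration, not Spark's API):

```scala
// Toy nearest-centroid assignment: given trained centroids, each incoming
// feature vector is mapped to the index of the closest centroid.
object NearestCentroid {
  def squaredDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => val d = x - y; d * d }.sum

  def predict(centroids: Seq[Array[Double]], point: Array[Double]): Int =
    centroids.zipWithIndex.minBy { case (c, _) => squaredDist(c, point) }._2

  def main(args: Array[String]): Unit = {
    val cs = Seq(Array(0.0, 0.0), Array(10.0, 10.0))
    println(NearestCentroid.predict(cs, Array(1.0, 1.0))) // closest to centroid 0
    println(NearestCentroid.predict(cs, Array(9.0, 8.0))) // closest to centroid 1
  }
}
```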

Re: Kafka direct approach, App UI shows wrong input rate

2016-11-22 Thread Julian Keppel
2016-11-19 18:06 GMT+01:00 Cody Koeninger : > There have definitely been issues with UI reporting for the direct > stream in the past, but I'm not able to reproduce this with 2.0.2 and > 0.8. See below: > > https://i.imgsafe.org/086019ae57.png > > > > On Fri, N

Re: using StreamingKMeans

2016-11-21 Thread Julian Keppel
I do research in anomaly detection with methods of machine learning at the moment, and currently I do kmeans clustering too, in an offline learning setting. In further work we want to compare the two paradigms of offline and online learning. I would like to share some thoughts on this discussion.
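For the online side of that comparison, MLlib's StreamingKMeans moves each centroid toward the incoming batch mean while discounting older data by a decay factor. A hedged pure-Scala sketch of that update rule (variable names are illustrative; check the MLlib documentation for the exact formula used by your version):

```scala
// Forgetful centroid update: old weight n is discounted by `decay`, then the
// centroid is pulled toward the batch mean weighted by the batch count m.
object StreamingUpdateDemo {
  def updateCentroid(c: Array[Double], n: Double,
                     batchMean: Array[Double], m: Double,
                     decay: Double): (Array[Double], Double) = {
    val discounted = n * decay            // decay = 1.0 remembers everything,
    val total = discounted + m            // decay = 0.0 forgets all old data
    val updated = c.zip(batchMean).map { case (ci, bi) =>
      (ci * discounted + bi * m) / total
    }
    (updated, total)
  }

  def main(args: Array[String]): Unit = {
    // With decay = 0 the centroid jumps straight to the batch mean.
    val (c, n) = updateCentroid(Array(0.0), 10.0, Array(4.0), 2.0, 0.0)
    println(c(0))
    println(n)
  }
}
```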

Kafka direct approach, App UI shows wrong input rate

2016-11-18 Thread Julian Keppel
Hello, I use Spark 2.0.2 with Kafka integration 0-8. The Kafka version is 0.10.0.1 (Scala 2.11). I read data from Kafka with the direct approach. The complete infrastructure runs on Google Container Engine. I wonder why the corresponding application UI says the input rate is zero records per seco

Re: Want to test spark-sql-kafka but get unresolved dependency error

2016-10-14 Thread Julian Keppel
Okay, thank you! Can you say, when this feature will be released? 2016-10-13 16:29 GMT+02:00 Cody Koeninger : > As Sean said, it's unreleased. If you want to try it out, build spark > > http://spark.apache.org/docs/latest/building-spark.html > > The easiest way to include the jar is probably to
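Once the connector is published for a released Spark version, the usual way to pull it in is via `--packages`. A hedged sketch (the artifact coordinates below are an assumption; match the Scala and Spark versions of your build):

```shell
# Pull the structured-streaming Kafka source from Maven Central at launch.
# Coordinates are illustrative: <group>:<artifact>_<scala-version>:<spark-version>
spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.0.2
```

The same flag works with `spark-submit`; for an unreleased snapshot you would instead build Spark yourself and pass the resulting jar via `--jars`, as Cody suggests.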

Re: Sharing object/state accross transformations

2015-12-06 Thread Julian Keppel
Yes, but what they do is only add new elements to a state which is passed as a parameter. But my problem is that my "counter" (the HyperLogLog object) comes from outside and is not passed to the function. So I have to track the state of this "external" HLL object across the whole lifecycle of the
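One common way around mutating an external object from inside transformations is to make the state itself mergeable: build a partial state per batch (or partition), then combine the partials. A toy sketch with a `Set` standing in for the HyperLogLog (a real HLL merge is structurally the same: add locally, merge globally):

```scala
// Mergeable state instead of an externally mutated counter. Each batch folds
// into its own DistinctState; merge() combines partial states at the end.
case class DistinctState(items: Set[String]) {
  def add(x: String): DistinctState = DistinctState(items + x)
  def merge(other: DistinctState): DistinctState = DistinctState(items ++ other.items)
  def estimate: Long = items.size.toLong  // a real HLL returns an approximation
}

object StateDemo {
  def main(args: Array[String]): Unit = {
    val batch1 = Seq("a", "b", "a")
    val batch2 = Seq("b", "c")
    val s1 = batch1.foldLeft(DistinctState(Set.empty[String]))(_ add _)
    val s2 = batch2.foldLeft(DistinctState(Set.empty[String]))(_ add _)
    println(s1.merge(s2).estimate) // distinct count over both batches
  }
}
```

Because the merged state is produced by the transformation chain rather than captured from the driver, nothing outside the pipeline needs to be mutated.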