Hi Lan,

Streaming K-Means, Linear Regression and Logistic Regression support online machine learning, as you mentioned. In online machine learning the model is trained and updated on every batch of streaming data. These models have trainOn() and predictOn() methods: you simply pass in the DStream you want to train the model on and the DStream you want the model to predict on. When the next batch of data arrives, the model is trained and updated again. The model weights are therefore updated continually, and hopefully the model gets better in terms of convergence and accuracy over time.

What we are really doing in the online learning case is showing the model only a few examples at a time (a stream of data) and updating the parameters in the case of Linear and Logistic Regression, or the cluster centers in the case of K-Means. For Linear and Logistic Regression this is possible because the optimizer used to minimize the cost function is Stochastic Gradient Descent. This optimizer moves us closer and closer to the optimal weights after every batch, and over time we end up with a model that has learned how to represent our data and predicts well.
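To make the idea of carrying weights from one batch to the next concrete, here is a minimal sketch in plain Python (not Spark code). The names sgd_update and make_batch, the step size and the synthetic data are all my own assumptions for illustration. trainOn() is similar in spirit, since each new batch starts from the current weights, but this is not MLlib's implementation.

```python
import random

def sgd_update(weights, batch, step_size=0.05):
    """One pass of per-example SGD over a batch (least-squares linear regression)."""
    w = list(weights)
    for features, label in batch:
        prediction = sum(wi * xi for wi, xi in zip(w, features))
        error = prediction - label
        # Gradient of 0.5 * error^2 with respect to each weight is error * feature.
        w = [wi - step_size * error * xi for wi, xi in zip(w, features)]
    return w

def make_batch(rng, true_weights, size=20):
    """Hypothetical stand-in for one batch of a stream: (features, label) pairs."""
    batch = []
    for _ in range(size):
        x = [1.0, rng.uniform(-1, 1), rng.uniform(-1, 1)]  # first entry acts as a bias term
        y = sum(t * xi for t, xi in zip(true_weights, x)) + rng.gauss(0, 0.01)
        batch.append((x, y))
    return batch

def distance(a, b):
    """Euclidean distance between two weight vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

rng = random.Random(42)
true_weights = [0.5, 2.0, -1.0]   # assumed "optimal" weights for the synthetic data
weights = [0.0, 0.0, 0.0]         # initial weights, set once before the stream starts
history = []

for _ in range(50):               # 50 batches arriving one after another
    batch = make_batch(rng, true_weights)
    weights = sgd_update(weights, batch)   # start from the previous weights, not from zero
    history.append(distance(weights, true_weights))

print("distance after first batch:", history[0])
print("distance after last batch: ", history[-1])
```

The important line is the one that feeds the previous weights back into sgd_update. How fast the distance shrinks depends on the step size and on how noisy the data is.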
In the scenario where you use any MLlib algorithm and train inside DStream.transform() and DStream.foreachRDD() operations, we build a model when the first batch of data arrives; let's call it model1. Once you have model1 you can make predictions on the same DStream or on a different DStream source. For the next batch, if you follow the same procedure you create a new model; let's call it model2. Because the model is not being continually updated, model2 can differ significantly from model1, depending on how different the data in the second batch is from the data in the first. It is as if the weight vectors jump from one place to another on every batch, and we never know whether the algorithm is converging to the optimal weights.

So I believe it is not possible to do true online learning with the other MLlib models in Spark Streaming. I am not sure whether this is because those models don't generally support streaming scenarios or because the streaming versions simply haven't been implemented yet.

Though technically you can use any of the MLlib algorithms in Spark Streaming with the procedure you mentioned and make predictions, it is important to figure out whether the model you choose can converge when it is shown only a subset of the data (the batches of the DStream) at a time. Depending on the algorithm, certain optimizers won't necessarily converge when shown only individual data points and need to see the majority of the data to learn optimal weights. In these cases you can still do offline training with Spark batch processing using any of the MLlib algorithms and save those models to HDFS. You can then start a streaming job, load the saved models into your streaming application and make predictions. This is traditional offline learning.

In general, online learning is hard because it is hard to evaluate: we are not holding out any test data during model training. We are simply training the model and predicting.
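The model1/model2 contrast described above can be sketched in plain Python (again, not Spark code). Everything here is an assumption made for illustration: the one-parameter model, the helper names refit_on_batch and online_update, the step size and the synthetic noisy data. The per-batch refit uses a closed-form least-squares fit on the current batch only, while the online model continues SGD from its previous weight.

```python
import random

def refit_on_batch(batch):
    """Fit a slope from scratch on one batch only (no memory of earlier batches)."""
    sxx = sum(x * x for x, _ in batch)
    sxy = sum(x * y for x, y in batch)
    return sxy / sxx

def online_update(weight, batch, step_size=0.05):
    """Continue SGD from the previous weight with one pass over the new batch."""
    for x, y in batch:
        weight -= step_size * (weight * x - y) * x
    return weight

def make_batch(rng, true_slope, size=5):
    """Hypothetical small, noisy batch of (x, y) pairs."""
    batch = []
    for _ in range(size):
        x = rng.uniform(-1, 1)
        batch.append((x, true_slope * x + rng.gauss(0, 1.0)))
    return batch

def spread(values):
    """Range covered by a list of estimates."""
    return max(values) - min(values)

rng = random.Random(7)
true_slope = 3.0          # assumed ground truth for the synthetic data
online_weight = 0.0
refit_history, online_history = [], []

for _ in range(200):
    batch = make_batch(rng, true_slope)
    refit_history.append(refit_on_batch(batch))            # model1, model2, ... each built fresh
    online_weight = online_update(online_weight, batch)    # one model, updated in place
    online_history.append(online_weight)

refit_spread = spread(refit_history[-20:])
online_spread = spread(online_history[-20:])
print("spread of per-batch refits over last 20 batches:", refit_spread)
print("spread of online model over last 20 batches:   ", online_spread)
```

In this toy setup the per-batch estimates keep jumping around because each one sees only five noisy points, while the online weight settles near the true slope. Real MLlib algorithms and real data will behave differently in the details.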
In the initial batches of online learning, results can vary quite a bit and predictions can have significant errors. Choosing between online learning and offline learning therefore depends on how much tolerance the application has for wild predictions in the beginning. Offline training is simple and cheap, whereas online training can be hard and needs to be monitored constantly to see how it is performing.

Hope this helps in understanding offline learning vs. online learning and which algorithms you can choose for online learning in MLlib.

Guru Medasani
gdm...@gmail.com

> On Mar 5, 2016, at 7:37 PM, Lan Jiang <ljia...@gmail.com> wrote:
>
> Hi, there
>
> I hope someone can clarify this for me. It seems that some of the MLlib
> algorithms such as K-Means, Linear Regression and Logistic Regression have a
> Streaming version, which can do online machine learning. But does that mean
> other MLlib algorithms cannot be used in Spark Streaming applications, such as
> random forest, SVM, collaborative filtering, etc.?
>
> DStreams are essentially a sequence of RDDs. We can use DStream.transform()
> and DStream.foreachRDD() operations, which allow you to access RDDs in a
> DStream and apply MLlib functions on them. So it looks like all MLlib
> algorithms should be able to run in the streaming application. Am I wrong?
>
> Lan
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org