Hi, Thanks a lot, Guru Medasani, for such an excellent theory rich intro to MLlib! I wish I found only such emails in my mailbox.
Sorry. Couldn't resist since I've just started with MLlib and your response has resonated so well with my initial experience. Thanks! Jacek 06.03.2016 6:55 AM "Guru Medasani" <gdm...@gmail.com> napisał(a): > Hi Lan, > > Streaming Means, Linear Regression and Logistic Regression support online > machine learning as you mentioned. Online machine learning is where model > is being trained and updated on every batch of streaming data. These models > have trainOn() and predictOn() methods where you can simply pass in > DStreams you want to train the model on and DStreams you want the model to > predict on. So when the next batch of data arrives model is trained and > updated again. In this case model weights are continually updated and > hopefully model performs better in terms of convergence and accuracy over > time. What we are really trying to do in online learning case is that we > are only showing few examples of the data at a time ( stream of data) and > updating the parameters in case of Linear and Logistic Regression and > updating the centers in case of K-Means. In the case of Linear or Logistic > Regression this is possible due to the optimizer that is chosen for > minimizing the cost function which is Stochastic Gradient Descent. This > optimizer helps us to move closer and closer to the optimal weights after > every batch and over the time we will have a model that has learned how to > represent our data and predict well. > > In the scenario of using any MLlib algorithms and doing training with > DStream.transform() and DStream.foreachRDD() operations, when the first > batch of data arrives we build a model, let’s call this model1. Once you > have the model1 you can make predictions on the same DStream or a different > DStream source. But for the next batch if you follow the same procedure and > create a model, let’s call this model2. This model2 will be significantly > different than model1 based on how different the data is in the second > DStream vs the first DStream as it is not continually updating the model. > It’s like weight vectors are jumping from one place to the other for every > batch and we never know if the algorithm is converging to the optimal > weights. So I believe it is not possible to do true online learning with > other MLLib models in Spark Streaming. I am not sure if this is because > the models don’t generally support this streaming scenarios or if the > streaming versions simply haven’t been implemented yet. > > Though technically you can use any of the MLlib algorithms in Spark > Streaming with the procedure you mentioned and make predictions, it is > important to figure out if the model you are choosing can converge by > showing only a subset(batches - DStreams) of the data over time. Based on > the algorithm you choose certain optimizers won’t necessarily be able to > converge by showing only individual data points and require to see majority > of the data to be able to learn optimal weights. In these cases, you can > still do offline learning/training with Spark bach processing using any of > the MLlib algorithms and save those models on hdfs. You can then start a > streaming job and load these saved models into your streaming application > and make predictions. This is traditional offline learning. > > In general, online learning is hard as it’s hard to evaluate since we are > not holding any test data during the model training. We are simply training > the model and predicting. So in the initial batches, results can vary quite > a bit and have significant errors in terms of the predictions. So choosing > online learning vs. offline learning depends on how much tolerance the > application can have towards wild predictions in the beginning. Offline > training is simple and cheap where as online training can be hard and needs > to be constantly monitored to see how it is performing. > > Hope this helps in understanding offline learning vs. online learning and > which algorithms you can choose for online learning in MLlib. > > Guru Medasani > gdm...@gmail.com > > > > > On Mar 5, 2016, at 7:37 PM, Lan Jiang <ljia...@gmail.com> wrote: > > > > Hi, there > > > > I hope someone can clarify this for me. It seems that some of the MLlib > algorithms such as KMean, Linear Regression and Logistics Regression have a > Streaming version, which can do online machine learning. But does that mean > other MLLib algorithm cannot be used in Spark streaming applications, such > as random forest, SVM, collaborate filtering, etc?? > > > > DStreams are essentially a sequence of RDDs. We can use > DStream.transform() and DStream.foreachRDD() operations, which allows you > access RDDs in a DStream and apply MLLib functions on them. So it looks > like all MLLib algorithms should be able to run in the streaming > application. Am I wrong? > > > > Lan > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > > For additional commands, e-mail: user-h...@spark.apache.org > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >