Re: MLLib + Streaming

Guru Medasani Sat, 05 Mar 2016 21:56:08 -0800

Hi Lan,

Streaming Means, Linear Regression and Logistic Regression support online 
machine learning as you mentioned. Online machine learning is where model is 
being trained and updated on every batch of streaming data. These models have 
trainOn() and predictOn() methods where you can simply pass in DStreams you 
want to train the model on and DStreams you want the model to predict on. So 
when the next batch of data arrives model is trained and updated again. In this 
case model weights are continually updated and hopefully model performs better 
in terms of convergence and accuracy over time. What we are really trying to do 
in online learning case is that we are only showing few examples of the data at 
a time ( stream of data) and updating the parameters in case of Linear and 
Logistic Regression and updating the centers in case of K-Means. In the case of 
Linear or Logistic Regression this is possible due to the optimizer that is 
chosen for minimizing the cost function which is Stochastic Gradient Descent. 
This optimizer helps us to move closer and closer to the optimal weights after 
every batch and over the time we will have a model that has learned how to 
represent our data and predict well.

In the scenario of using any MLlib algorithms and doing training with 
DStream.transform() and DStream.foreachRDD() operations, when the first batch 
of data arrives we build a model, let’s call this model1. Once you have the 
model1 you can make predictions on the same DStream or a different DStream 
source. But for the next batch if you follow the same procedure and create a 
model, let’s call this model2. This model2 will be significantly different than 
model1 based on how different the data is in the second DStream vs the first 
DStream as it is not continually updating the model. It’s like weight vectors 
are jumping from one place to the other for every batch and we never know if 
the algorithm is converging to the optimal weights. So I believe it is not 
possible to do true online learning with other MLLib models in Spark Streaming. 
 I am not sure if this is because the models don’t generally support this 
streaming scenarios or if the streaming versions simply haven’t been 
implemented yet. 

Though technically you can use any of the MLlib algorithms in Spark Streaming 
with the procedure you mentioned and make predictions, it is important to 
figure out if the model you are choosing can converge by showing only a 
subset(batches  - DStreams) of the data over time. Based on the algorithm you 
choose certain optimizers won’t necessarily be able to converge by showing only 
individual data points and require to see majority of the data to be able to 
learn optimal weights.  In these cases, you can still do offline 
learning/training with Spark bach processing using any of the MLlib algorithms 
and save those models on hdfs. You can then start a streaming job and load 
these saved models into your streaming application and make predictions. This 
is traditional offline learning.  

In general, online learning is hard as it’s hard to evaluate since we are not 
holding any test data during the model training. We are simply training the 
model and predicting. So in the initial batches, results can vary quite a bit 
and have significant errors in terms of the predictions. So choosing online 
learning vs. offline learning depends on how much tolerance the application can 
have towards wild predictions in the beginning. Offline training is simple and 
cheap where as online training can be hard and needs to be constantly monitored 
to see how it is performing. 

Hope this helps in understanding offline learning vs. online learning and which 
algorithms you can choose for online learning in MLlib. 

Guru Medasani
gdm...@gmail.com

> On Mar 5, 2016, at 7:37 PM, Lan Jiang <ljia...@gmail.com> wrote:
> 
> Hi, there
> 
> I hope someone can clarify this for me.  It seems that some of the MLlib 
> algorithms such as KMean, Linear Regression and Logistics Regression have a 
> Streaming version, which can do online machine learning. But does that mean 
> other MLLib algorithm cannot be used in Spark streaming applications, such as 
> random forest, SVM, collaborate filtering, etc??
> 
> DStreams are essentially a sequence of RDDs. We can use DStream.transform() 
> and DStream.foreachRDD() operations, which allows you access RDDs in a 
> DStream and apply MLLib functions on them. So it looks like all MLLib 
> algorithms should be able to run in the streaming application. Am I wrong? 
> 
> Lan
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: MLLib + Streaming

Reply via email to