Hi all,

We want to train a logistic regression (LR) model on socket streaming data with StreamingLogisticRegressionWithSGD, and we have two questions.

1. The trainOn method of StreamingLogisticRegressionWithSGD contains code like this:

data.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty) {
    ...
  }
}

We found that the rdd.isEmpty check alone costs too much time: about 2s, while training on this batch RDD takes about 9s. We believe this is a point we could optimize, but we don't know how.
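One idea we are considering (not tested yet) is to persist the parsed DStream before calling trainOn, hoping that the isEmpty check and the SGD pass can reuse the same cached blocks instead of re-parsing the socket input. A rough sketch of what we mean; the host, port, parsing logic, batch interval and numFeatures below are just placeholders for our real setup:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val numFeatures = 100  // placeholder: dimensionality of our feature vectors

val conf = new SparkConf().setAppName("streaming-lr")
val ssc = new StreamingContext(conf, Seconds(10))

// parse "label,f1,f2,..." lines from the socket into LabeledPoints
val labeled = ssc.socketTextStream("localhost", 9999).map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
}

// persist each parsed batch so that rdd.isEmpty and the SGD pass inside
// trainOn share the same materialised blocks instead of re-parsing the input
labeled.persist(StorageLevel.MEMORY_ONLY)

val model = new StreamingLogisticRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))
  .setStepSize(0.5)
  .setNumIterations(10)

model.trainOn(labeled)

ssc.start()
ssc.awaitTermination()

Would persisting the DStream like this actually reduce the rdd.isEmpty cost, or is most of that 2s just scheduling overhead?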
2. LogisticRegressionWithSGD and LogisticRegressionWithLBFGS use different Optimizer implementations: the former uses GradientDescent, the latter LBFGS. Here is the interesting part. GradientDescent contains code like this:

val numExamples = data.count()

// if no data, return initial weights to avoid NaNs
if (numExamples == 0) {
  logWarning("GradientDescent.runMiniBatchSGD returning initial weights, no data found")
  return (initialWeights, stochasticLossHistory.toArray)
}

if (numExamples * miniBatchFraction < 1) {
  logWarning("The miniBatchFraction is too small")
}

where data is the input training data in the form (label, [feature values]). We found that the data.count() action costs too much time: about 5s, while training on this data takes about 9s. The other Optimizer implementation, LBFGS, does not have this problem.

The interesting point is that the streaming implementation of LR is StreamingLogisticRegressionWithSGD, whose inner algorithm is LogisticRegressionWithSGD with the GradientDescent optimizer, while the batch implementation of LR is LogisticRegressionWithLBFGS with the LBFGS optimizer. The result is that the batch implementation of LR performs better. I think that's unacceptable. Please help us; any comment is appreciated.
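For what it's worth, one workaround we are also considering is to drop StreamingLogisticRegressionWithSGD and drive LogisticRegressionWithLBFGS ourselves inside foreachRDD, carrying the weights from one batch to the next as initial weights. A rough sketch of the idea follows; numFeatures, the optimizer settings and the warm-start handling are assumptions on our side, not something we have verified:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.dstream.DStream

val numFeatures = 100  // placeholder: must match the feature dimensionality
var weights: Vector = Vectors.zeros(numFeatures)  // kept on the driver across batches

def trainWithLBFGS(data: DStream[LabeledPoint]): Unit = {
  data.foreachRDD { rdd =>
    // cache so the emptiness check and the LBFGS passes share one materialisation
    rdd.cache()
    if (!rdd.isEmpty()) {
      val lr = new LogisticRegressionWithLBFGS().setNumClasses(2)
      lr.optimizer.setNumIterations(10)  // placeholder setting
      // warm-start this batch from the weights learned on previous batches
      val batchModel = lr.run(rdd, weights)
      weights = batchModel.weights
    }
    rdd.unpersist()
  }
}

Does this look reasonable, or is there a recommended way to use the LBFGS optimizer for LR in a streaming setting?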