Hi all,

We want to train a logistic regression (LR) model on socket streaming data with StreamingLogisticRegressionWithSGD, and we have two questions.

1. The trainOn method of StreamingLogisticRegressionWithSGD contains code like this:

data.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty) {
    ...
  }
}

We found that the rdd.isEmpty check alone costs too much time: about 2s, while training on this batch RDD takes about 9s. We believe this is a point we could optimize, but we don't know how.
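One idea we are considering (not tested yet) is to persist the parsed DStream before calling trainOn, hoping that the isEmpty check and the SGD pass can reuse the same cached blocks instead of re-parsing the socket input. A rough sketch of what we mean; the host, port, parsing logic, batch interval and numFeatures below are just placeholders for our real setup:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val numFeatures = 100  // placeholder: dimensionality of our feature vectors

val conf = new SparkConf().setAppName("streaming-lr")
val ssc = new StreamingContext(conf, Seconds(10))

// parse "label,f1,f2,..." lines from the socket into LabeledPoints
val labeled = ssc.socketTextStream("localhost", 9999).map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
}

// persist each parsed batch so that rdd.isEmpty and the SGD pass inside
// trainOn share the same materialised blocks instead of re-parsing the input
labeled.persist(StorageLevel.MEMORY_ONLY)

val model = new StreamingLogisticRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))
  .setStepSize(0.5)
  .setNumIterations(10)

model.trainOn(labeled)

ssc.start()
ssc.awaitTermination()

Would persisting the DStream like this actually reduce the rdd.isEmpty cost, or is most of that 2s just scheduling overhead?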
2. LogisticRegressionWithSGD and LogisticRegressionWithLBFGS use different Optimizer implementations: the former uses GradientDescent, the latter LBFGS. Here is the interesting part. GradientDescent contains code like this:

val numExamples = data.count()

// if no data, return initial weights to avoid NaNs
if (numExamples == 0) {
  logWarning("GradientDescent.runMiniBatchSGD returning initial weights, no data found")
  return (initialWeights, stochasticLossHistory.toArray)
}

if (numExamples * miniBatchFraction < 1) {
  logWarning("The miniBatchFraction is too small")
}

where data is the input training data in the form (label, [feature values]). We found that the data.count() action costs too much time: about 5s, while training on this data takes about 9s. The other Optimizer implementation, LBFGS, does not have this problem.

The interesting point is that the streaming implementation of LR is StreamingLogisticRegressionWithSGD, whose inner algorithm is LogisticRegressionWithSGD with the GradientDescent optimizer, while the batch implementation of LR is LogisticRegressionWithLBFGS with the LBFGS optimizer. The result is that the batch implementation of LR performs better. I think that's unacceptable. Please help us; any comment is appreciated.
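For what it's worth, one workaround we are also considering is to drop StreamingLogisticRegressionWithSGD and drive LogisticRegressionWithLBFGS ourselves inside foreachRDD, carrying the weights from one batch to the next as initial weights. A rough sketch of the idea follows; numFeatures, the optimizer settings and the warm-start handling are assumptions on our side, not something we have verified:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.dstream.DStream

val numFeatures = 100  // placeholder: must match the feature dimensionality
var weights: Vector = Vectors.zeros(numFeatures)  // kept on the driver across batches

def trainWithLBFGS(data: DStream[LabeledPoint]): Unit = {
  data.foreachRDD { rdd =>
    // cache so the emptiness check and the LBFGS passes share one materialisation
    rdd.cache()
    if (!rdd.isEmpty()) {
      val lr = new LogisticRegressionWithLBFGS().setNumClasses(2)
      lr.optimizer.setNumIterations(10)  // placeholder setting
      // warm-start this batch from the weights learned on previous batches
      val batchModel = lr.run(rdd, weights)
      weights = batchModel.weights
    }
    rdd.unpersist()
  }
}

Does this look reasonable, or is there a recommended way to use the LBFGS optimizer for LR in a streaming setting?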