I'm not sure that second count can be optimized away, as its result is used a few times. Are you sure it takes that long? How are you measuring it, and could it be the effect of caching the data the first time? What is the nature of the data such that it takes that long?
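If it is the caching effect, one quick check is to persist the parsed stream before calling trainOn, so that rdd.isEmpty, data.count() and the gradient passes all reuse the same materialized partitions instead of re-reading and re-parsing the socket stream. A rough sketch only; the host/port and the "label,f1 f2 ..." input format are my assumptions, not your actual pipeline:

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val ssc = new StreamingContext(
      new SparkConf().setAppName("streaming-lr"), Seconds(10))

    // Assumed input format: "label,f1 f2 f3" -- adapt to your data.
    val points = ssc.socketTextStream("localhost", 9999).map { line =>
      val Array(label, features) = line.split(",", 2)
      LabeledPoint(label.toDouble,
        Vectors.dense(features.split(' ').map(_.toDouble)))
    }

    // Persist each micro-batch; without this, every action on the batch
    // (isEmpty, count, each sampling/gradient pass) may recompute the parse.
    points.persist(StorageLevel.MEMORY_ONLY)

    val model = new StreamingLogisticRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(3)) // 3 = assumed feature dimension
    model.trainOn(points)

    ssc.start()
    ssc.awaitTermination()

If the times you measured mostly disappear once the data is persisted, the cost was materializing the data once, not count() or isEmpty themselves.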
On Wed, Sep 9, 2020 at 6:21 AM cfang1109 <cfang1...@aliyun.com.invalid> wrote:
>
> HI ALL,
>
> We want to use socket streaming data to train an LR model with
> StreamingLogisticRegressionWithSGD, and we have some questions.
>
> 1. The trainOn method of StreamingLogisticRegressionWithSGD contains code
> like this:
>
>     data.foreachRDD { (rdd, time) =>
>       if (!rdd.isEmpty) { ... }
>     }
>
> We found that rdd.isEmpty costs too much time: 2s, while training on this
> batch RDD costs 9s. We believe this is a point we could optimize, but we
> don't know how.
>
> 2. The Optimizer instance differs between LogisticRegressionWithSGD and
> LogisticRegressionWithLBFGS: the former uses GradientDescent while the
> latter uses LBFGS. Interestingly, GradientDescent contains code like this:
>
>     val numExamples = data.count()
>
>     // if no data, return initial weights to avoid NaNs
>     if (numExamples == 0) {
>       logWarning("GradientDescent.runMiniBatchSGD returning initial weights, no data found")
>       return (initialWeights, stochasticLossHistory.toArray)
>     }
>
>     if (numExamples * miniBatchFraction < 1) {
>       logWarning("The miniBatchFraction is too small")
>     }
>
> where data is the input training data in the form (label, [feature values]).
> We found that the data.count() action costs too much time: 5s, while
> training on this data costs 9s. The other Optimizer implementation, LBFGS,
> does not have this problem.
>
> The interesting point is that the streaming implementation of LR is
> StreamingLogisticRegressionWithSGD, whose inner algorithm is
> LogisticRegressionWithSGD with the GradientDescent optimizer, while the
> batch implementation of LR is LogisticRegressionWithLBFGS with the LBFGS
> optimizer. As a result, the batch LR implementation performs better. I
> think that's unacceptable; please help me, and any comment is appreciated.
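And on the 5s data.count() in GradientDescent quoted above: to separate "the first action pays for reading and parsing" from "count() itself is slow", a quick spark-shell comparison on a static sample of the same data might help. The path and parsing below are placeholders, not your code:

    // Runs in spark-shell, where `sc` is the provided SparkContext.
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    def time[A](label: String)(body: => A): A = {
      val t0 = System.nanoTime()
      val result = body
      println(f"$label: ${(System.nanoTime() - t0) / 1e9}%.2f s")
      result
    }

    // Placeholder path and "label,f1 f2 ..." format -- substitute your own.
    val data = sc.textFile("/path/to/sample.txt").map { line =>
      val Array(label, features) = line.split(",", 2)
      LabeledPoint(label.toDouble,
        Vectors.dense(features.split(' ').map(_.toDouble)))
    }

    data.cache()
    time("first count (reads + parses + fills cache)")(data.count())
    time("second count (served from cache)")(data.count())

If the second count is much cheaper, the 5s you measured is a one-time materialization cost, which persisting the input before training would absorb.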