Hi Sean,

I think the simplest way is to return a *breeze.linalg.HashVector* when *org.apache.spark.ml.linalg.SparseVector#asBreeze* is called, and to cache that vector in a lazy val, since constructing a *breeze.linalg.HashVector* has some extra cost.
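To illustrate why this should help: as Vincent notes below, *breeze.linalg.SparseVector#apply* binary-searches the index array (O(log numNonZero) per access), while *breeze.linalg.HashVector* is hash-backed (O(1) expected). Here is a rough standalone sketch, not part of the proposed patch (the object name, vector size, indices, and probe count are all made up for illustration), comparing random-access cost between the two:

    import breeze.linalg.{HashVector, SparseVector}
    import scala.util.Random

    object AccessSketch {
      def main(args: Array[String]): Unit = {
        val size    = 25
        val indices = Array(0, 3, 7, 12, 19)
        val values  = Array(1.0, 2.0, 3.0, 4.0, 5.0)

        // SparseVector#apply binary-searches `indices`: O(log numNonZero) per access.
        val sv = new SparseVector[Double](indices, values, size)

        // HashVector is backed by an open-addressing hash array: O(1) expected per access.
        val hv = HashVector(size)(indices.zip(values): _*)

        // Random probe positions, including inactive indices (both vectors return 0.0 there).
        val probes = Array.fill(10000000)(Random.nextInt(size))

        def time(label: String)(block: => Double): Unit = {
          val t0 = System.nanoTime()
          val checksum = block
          println(f"$label: ${(System.nanoTime() - t0) / 1e6}%.1f ms (checksum = $checksum)")
        }

        time("sparse") { var s = 0.0; probes.foreach(i => s += sv(i)); s }
        time("hash")   { var s = 0.0; probes.foreach(i => s += hv(i)); s }
      }
    }

Both loops compute the same checksum, since both vector types return the implicit zero for inactive indices; only the lookup cost should differ.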
The patch itself would look something like this:

    class SparseVector @Since("2.0.0") (
        override val size: Int,
        @Since("2.0.0") val indices: Array[Int],
        @Since("2.0.0") val values: Array[Double]) extends Vector {

      // Built once, on first access; subsequent asBreeze calls reuse it.
      private lazy val breezeVector =
        breeze.linalg.HashVector(size)(indices.zip(values): _*)

      // ...

      private[spark] override def asBreeze: BV[Double] = breezeVector

      // ...
    }

I'm not sure this is the best approach, so I'll file an issue for further discussion.

Thanks,
Huafeng

On Mon, Jul 2, 2018 at 11:38 AM, Sean Owen <sro...@gmail.com> wrote:
> This could be a good optimization. But can it be done without changing
> any APIs or slowing anything else down? If so, this could be worth a
> pull request.
>
> On Sun, Jul 1, 2018 at 9:21 PM Vincent Wang <fvunic...@gmail.com> wrote:
> >
> > Hi there,
> >
> > I'm using GBTClassifier for some classification jobs, and the
> > performance of the scoring stage is not quite satisfying. The trained
> > model has about 160 trees, and the input feature vectors are sparse,
> > with a size of about 20+.
> >
> > After some digging, I found that the model repeatedly and randomly
> > accesses features of the SparseVector when predicting an input vector,
> > which eventually calls breeze.linalg.SparseVector#apply. That function
> > uses a binary search to locate the requested index, so each access
> > costs O(log numNonZero).
> >
> > Then I tried converting my feature vectors to dense vectors before
> > inference, and the inference stage sped up by about 2-3x. (Random
> > access in a DenseVector is O(1).)
> >
> > So my question is: why not use breeze.linalg.HashVector when randomly
> > accessing values in a SparseVector? Its access complexity is O(1)
> > according to Breeze's documentation, much better than SparseVector's
> > in this case.
> >
> > Thanks,
> > Vincent