This could be a good optimization. But can it be done without changing any APIs or slowing anything else down? If so, this could be worth a pull request.

On Sun, Jul 1, 2018 at 9:21 PM Vincent Wang <fvunic...@gmail.com> wrote:
>
> Hi there,
>
> I'm using GBTClassifier to do some classification jobs and find the performance
> of the scoring stage is not quite satisfying. The trained model has about 160
> trees and the input feature vectors are sparse, with a size of about 20+.
>
> After some digging, I found that the model repeatedly and randomly accesses
> features in the SparseVector when predicting an input vector, which eventually
> calls breeze.linalg.SparseVector#apply. That function uses a binary search to
> locate the corresponding index, so the complexity is O(log numNonZero).
>
> Then I tried converting my feature vectors to dense vectors before inference,
> and the result shows the inference stage speeds up by about 2~3x.
> (Random access in a DenseVector is O(1).)
>
> So my question is: why not use breeze.linalg.HashVector when randomly accessing
> values in a SparseVector, since the complexity is O(1) according to Breeze's
> documentation, which is much better than SparseVector in this case?
>
> Thanks,
> Vincent
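For anyone hitting the same slowdown, the workaround Vincent describes (densifying the feature column before scoring) needs no API change on the Spark side. A minimal sketch, assuming a trained GBTClassificationModel named `model` and a DataFrame `df` with a sparse "features" column (both names are hypothetical placeholders):

    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.sql.functions.{col, udf}

    // Convert each (possibly sparse) feature vector to its dense form so the
    // per-tree feature lookups during prediction are O(1) array indexing
    // instead of O(log numNonZero) binary search.
    val toDense = udf { v: Vector => (v.toDense): Vector }
    val denseDf = df.withColumn("features", toDense(col("features")))

    // Score with the existing model; only the input representation changes.
    val scored = model.transform(denseDf)

This trades extra memory (20+ doubles per row here, which is cheap) for faster lookups; whether it pays off for much wider, very sparse feature vectors would need measuring.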