On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
> Would you mind letting us know the # training examples in the datasets?
> Also, what do your features look like? Are they text, categorical etc?
> You mention that most rows only have a few features, and all rows
> together have a few 10,000s features, yet your max feature value is 20
> million. How are you constructing your feature vectors to get a 20
> million size? The only realistic way I can see this situation occurring
> in practice is with feature hashing (HashingTF).

The sub-sample I'm currently training on is about 50K rows, so... small.

The features causing this issue are numeric (int) IDs for... let's call it
"Thing". For each Thing in the record, we set the feature Thing.id to a
value of 1.0 in our vector (which is of course a SparseVector). I'm not
sure how IDs are generated for Things, but they can be large numbers. The
largest Thing ID is around 20 million, so that ends up being the size of
the vector (a rough sketch of how we build these vectors is at the bottom
of this mail). But in fact there are fewer than 10,000 unique Thing IDs in
this data. The mean number of features per record in what I'm currently
training against is 41, and the maximum for any single record is 1754.

It is possible to map the features into a small set (just zipWithIndex
over the distinct IDs - second sketch below), but this is undesirable
because of the added complexity (not just for training, but also for
anything that wants to score against the model). It might be a little
easier if the mapping could be encapsulated within the model object itself
(perhaps via composition), though I'm not sure how feasible that is. But
I'd rather not bother with dimensionality reduction at all - since we can
train using liblinear in just a few minutes, it doesn't seem necessary.

> MultivariateOnlineSummarizer uses dense arrays, but it should be
> possible to enable sparse data. Though in theory, the result will tend
> to be dense anyway, unless you have very many entries in the input
> feature vector that never occur and are actually zero throughout the
> data set (which it seems is the case with your data?). So I doubt
> whether using sparse vectors for the summarizer would improve
> performance in general.

Yes, that is exactly my case - the vast majority of entries in the input
feature vector will *never* occur. Presumably that means most of the
values in the aggregators' arrays will be zero.

> LR doesn't accept a sparse weight vector, as it uses dense vectors for
> coefficients and gradients currently. When using L1 regularization, it
> could support sparse weight vectors, but the current implementation
> doesn't do that yet.

Good to know it is theoretically possible to implement. I'll have to give
it some thought. In the meantime I guess I'll experiment with coalescing
the data to minimize the communication overhead (last sketch below).

Thanks again.
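
---

For reference, this is roughly how we build a row's feature vector today.
The IDs and names below are made up and untested - just to show the shape
of the problem:

import org.apache.spark.mllib.linalg.Vectors

// Every Thing ID present in a record becomes an index set to 1.0, so the
// vector size has to cover the largest possible Thing ID (~20 million),
// even though fewer than 10,000 distinct IDs ever actually appear.
val maxThingId = 20000000                     // largest Thing ID we've seen
val thingIds   = Seq(17, 4032, 19876543)      // Thing IDs in one record
val features   = Vectors.sparse(maxThingId + 1, thingIds.map(id => (id, 1.0)))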
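
And this is the kind of remapping I'd like to avoid carrying around. It is
only a sketch: I'm assuming the per-row Thing IDs are available as an
RDD[Seq[Int]], and the idToIndex map would also have to be shipped to
anything scoring against the model later:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def compact(records: RDD[Seq[Int]]): RDD[Vector] = {
  // Build a compact ThingID -> column index map from the distinct IDs...
  val idToIndex: Map[Int, Int] = records
    .flatMap(identity)
    .distinct()
    .zipWithIndex()
    .mapValues(_.toInt)
    .collectAsMap()
    .toMap

  val numFeatures = idToIndex.size            // < 10,000 instead of ~20M

  // ...then rebuild every row's vector against the compact index space.
  records.map { thingIds =>
    Vectors.sparse(numFeatures, thingIds.map(id => (idToIndex(id), 1.0)))
  }
}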
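
The coalesce experiment is just something like the following - the
partition count is a guess I'd have to tune, and trainingData is
hypothetical:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Fewer partitions means fewer (dense, ~20M-element) partial aggregates
// shipped around per iteration, at the cost of less parallelism.
def shrink(trainingData: RDD[LabeledPoint]): RDD[LabeledPoint] =
  trainingData.coalesce(16)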