Thanks for the feedback. I think Spark can certainly meet your use case when your data size scales up, since the actual model dimension is very small - but you will need to use those indexers or some other mapping mechanism.
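For example, here is a rough sketch of the indexer + Pipeline approach discussed further down the thread (the column names are just placeholders, and it assumes a single categorical column per row, so it simplifies your multiple-Things-per-record case):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// Map raw Thing IDs into a compact index space, one-hot encode them,
// and train logistic regression, all inside one persistable Pipeline.
val indexer = new StringIndexer()
  .setInputCol("thingId")       // placeholder column name
  .setOutputCol("thingIndex")
val encoder = new OneHotEncoder()
  .setInputCol("thingIndex")
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(indexer, encoder, lr))
val model = pipeline.fit(trainingDF)  // trainingDF: your training DataFrame

The fitted PipelineModel bundles the ID-to-index mapping together with the trained model, which is what makes scoring inside Spark straightforward.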
There is ongoing work for Spark 2.0 to make it easier to use models outside of Spark - also see PMML export (I think mllib logistic regression is supported, but I'd have to check that). That will help with using Spark models in serving environments. Finally, I will add a JIRA to investigate sparse models for LR - maybe also a ticket for the multivariate summariser (though I don't think there will be much to gain in practice).

On Fri, 11 Mar 2016 at 21:35, Daniel Siegmann <daniel.siegm...@teamaol.com> wrote:

> Thanks for the pointer to those indexers; those are some good examples. A good way to go for the trainer and any scoring done in Spark. I will definitely have to deal with scoring in non-Spark systems, though.
>
> I think I will need to scale up beyond what single-node liblinear can practically provide. The system will need to handle much larger sub-samples of this data (and other projects might be larger still). Additionally, the system needs to train many models in parallel (hyper-parameter optimization with n-fold cross-validation, multiple algorithms, different sets of features).
>
> Still, I suppose we'll have to consider whether Spark is the best system for this. For now though, my job is to see what can be achieved with Spark.
>
> On Fri, Mar 11, 2016 at 12:45 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>
>> Ok, I think I understand things better now.
>>
>> For Spark's current implementation, you would need to map those features as you mention. You could also use, say, StringIndexer -> OneHotEncoder, or VectorIndexer. You could create a Pipeline to deal with the mapping and training (e.g. http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline). Pipeline supports persistence.
>>
>> But it depends on your scoring use case too - a Spark pipeline can be saved and then reloaded, but you need all of Spark's dependencies in your serving app, which is often not ideal. If you're doing bulk scoring offline, then it may suit.
>>
>> Honestly though, for that data size I'd certainly go with something like Liblinear :) Spark will ultimately scale better with # training examples for very large scale problems. However, there are definitely limitations on model dimension and sparse weight vectors currently. There are potential solutions to these, but they haven't been implemented as yet.
>>
>> On Fri, 11 Mar 2016 at 18:35 Daniel Siegmann <daniel.siegm...@teamaol.com> wrote:
>>
>>> On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>>>
>>>> Would you mind letting us know the # training examples in the datasets? Also, what do your features look like? Are they text, categorical, etc.? You mention that most rows only have a few features, and all rows together have a few tens of thousands of features, yet your max feature value is 20 million. How are you constructing your feature vectors to get a size of 20 million? The only realistic way I can see this situation occurring in practice is with feature hashing (HashingTF).
>>>
>>> The sub-sample I'm currently training on is about 50K rows, so ... small.
>>>
>>> The features causing this issue are numeric (int) IDs for ... let's call it "Thing". For each Thing in the record, we set the feature Thing.id to a value of 1.0 in our vector (which is of course a SparseVector). I'm not sure how IDs are generated for Things, but they can be large numbers.
>>>
>>> The largest Thing ID is around 20 million, so that ends up being the size of the vector. But in fact there are fewer than 10,000 unique Thing IDs in this data. The mean number of features per record in what I'm currently training against is 41, while the maximum for any given record was 1754.
>>>
>>> It is possible to map the features into a small set (just need to zipWithIndex), but this is undesirable because of the added complexity (not just for the training, but also anything wanting to score against the model). It might be a little easier if this could be encapsulated within the model object itself (perhaps via composition), though I'm not sure how feasible that is.
>>>
>>> But I'd rather not bother with dimensionality reduction at all - since we can train using liblinear in just a few minutes, it doesn't seem necessary.
>>>
>>>> MultivariateOnlineSummarizer uses dense arrays, but it should be possible to enable sparse data. Though in theory, the result will tend to be dense anyway, unless you have very many entries in the input feature vector that never occur and are actually zero throughout the data set (which it seems is the case with your data?). So I doubt whether using sparse vectors for the summarizer would improve performance in general.
>>>
>>> Yes, that is exactly my case - the vast majority of entries in the input feature vector will *never* occur. Presumably that means most of the values in the aggregators' arrays will be zero.
>>>
>>>> LR doesn't accept a sparse weight vector, as it uses dense vectors for coefficients and gradients currently. When using L1 regularization, it could support sparse weight vectors, but the current implementation doesn't do that yet.
>>>
>>> Good to know it is theoretically possible to implement. I'll have to give it some thought. In the meantime I guess I'll experiment with coalescing the data to minimize the communication overhead.
>>>
>>> Thanks again.
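For reference, a rough sketch of the zipWithIndex-style remapping mentioned above (the record and field names are hypothetical; this just illustrates one way the ~20 million raw ID space could be collapsed to the fewer than 10,000 IDs that actually occur):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical record type: a label plus the Thing IDs present in the row.
case class Record(label: Double, thingIds: Seq[Int])

def buildTrainingData(records: RDD[Record]): (RDD[LabeledPoint], Map[Int, Int]) = {
  // Assign each distinct Thing ID a compact index. The distinct IDs fit
  // comfortably on the driver since there are fewer than 10,000 of them.
  val idToIndex: Map[Int, Int] =
    records.flatMap(_.thingIds).distinct().collect().zipWithIndex.toMap
  val numFeatures = idToIndex.size

  val training = records.map { r =>
    // Vector size is now ~10K instead of ~20 million.
    val indices = r.thingIds.map(idToIndex).distinct.sorted.toArray
    val values  = Array.fill(indices.length)(1.0)
    LabeledPoint(r.label, Vectors.sparse(numFeatures, indices, values))
  }
  (training, idToIndex)  // keep the mapping so scoring can apply the same indices
}

As noted in the thread, the downside is that the idToIndex mapping has to travel with the model to any system that scores against it.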