Any reason why the regularization path cannot be implemented using the current OWLQN PR?
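(For concreteness, here is a rough sketch of how a path could sit on top of OWLQN: solve for a decreasing sequence of L1 penalties, warm-starting each solve from the previous solution. This is only an illustration, not the PR's code — the least-squares objective and the lambda grid are made up, and breeze's OWLQN constructor signature has varied across versions, with older releases taking a single type parameter:)

import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.optimize.{DiffFunction, OWLQN}

object OwlqnPathSketch {
  // Smooth part of the objective: f(w) = 0.5 * ||Xw - y||^2.
  // OWLQN itself handles the non-smooth L1 penalty.
  def leastSquares(x: DenseMatrix[Double], y: DenseVector[Double]) =
    new DiffFunction[DenseVector[Double]] {
      def calculate(w: DenseVector[Double]): (Double, DenseVector[Double]) = {
        val residual = x * w - y
        (0.5 * (residual dot residual), x.t * residual)
      }
    }

  // Warm-started sweep over a decreasing L1 penalty sequence: each solve
  // starts from the previous solution, which is what makes a path cheap.
  def path(x: DenseMatrix[Double], y: DenseVector[Double],
           lambdas: Seq[Double]): Seq[(Double, DenseVector[Double])] = {
    val f = leastSquares(x, y)
    var w = DenseVector.zeros[Double](x.cols)
    lambdas.sorted.reverse.map { lambda =>
      val owlqn = new OWLQN[Int, DenseVector[Double]](100, 10, lambda)
      w = owlqn.minimize(f, w) // warm start from the previous lambda's solution
      (lambda, w)
    }
  }
}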
We can change owlqn in breeze to fit your needs...

On Feb 24, 2015 3:27 PM, "Joseph Bradley" <jos...@databricks.com> wrote:

> Hi Mike,
>
> I'm not aware of a "standard" big dataset, but there are a number
> available:
> * The YearPredictionMSD dataset from the LIBSVM datasets is sizeable (in #
> instances but not # features):
> www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html
> * I've used this text dataset from which one can generate lots of n-gram
> features (but not many instances): http://www.ark.cs.cmu.edu/10K/
> * I've seen some papers use the KDD Cup datasets, which might be the best
> option I know of. The KDD Cup 2012 track 2 one seems promising.
>
> Good luck!
> Joseph
>
> On Tue, Feb 24, 2015 at 1:56 PM, <m...@mbowles.com> wrote:
>
> > Joseph,
> > Thanks for your reply. We'll take the steps you suggest - generate some
> > timing comparisons and post them in the GLMNET JIRA with a link from the
> > OWLQN JIRA.
> >
> > We've got the regression version of GLMNET programmed. The regression
> > version only requires a pass through the data each time the active set
> > of coefficients changes. That's usually less than or equal to the number
> > of decrements in the penalty coefficient (typical default = 100). The
> > intermediate iterations can be done using results of previous passes
> > through the full data set. We're expecting the number of data passes to
> > be independent of both the number of rows and the number of columns in
> > the data set. We're eager to demonstrate this scaling. Do you have any
> > suggestions regarding data sets for large-scale regression problems? It
> > would be nice to demonstrate scaling for both number of rows and number
> > of columns.
> >
> > Thanks for your help.
> > Mike
> >
> > -----Original Message-----
> > From: Joseph Bradley [mailto:jos...@databricks.com]
> > Sent: Sunday, February 22, 2015 06:48 PM
> > To: m...@mbowles.com
> > Cc: dev@spark.apache.org
> > Subject: Re: Have Friedman's glmnet algo running in Spark
> >
> > Hi Mike, glmnet has definitely been very successful, and it would be
> > great to see how we can improve optimization in MLlib! There is some
> > related work ongoing; here are the JIRAs:
> > * GLMNET implementation in Spark
> > * LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package
> > The GLMNET JIRA has actually been closed in favor of the latter JIRA.
> > However, if you're getting good results in your experiments, could you
> > please post them on the GLMNET JIRA and link them from the other JIRA?
> > If it's faster and more scalable, that would be great to find out. As
> > far as where the code should go and the APIs, that can be discussed on
> > the JIRA. I hope this helps, and I'll keep an eye out for updates on the
> > JIRAs!
> > Joseph
> >
> > On Thu, Feb 19, 2015 at 10:59 AM, wrote:
> >
> > > Dev List,
> > > A couple of colleagues and I have gotten several versions of the
> > > glmnet algo coded and running on Spark RDDs. The glmnet algo
> > > (http://www.jstatsoft.org/v33/i01/paper) is a very fast algorithm for
> > > generating coefficient paths solving penalized regression with elastic
> > > net penalties. The algorithm runs fast by taking an approach that
> > > generates solutions for a wide variety of penalty parameters. We're
> > > able to integrate it into the MLlib class structure a couple of
> > > different ways. The algorithm may fit better into the new pipeline
> > > structure, since it naturally returns a multitude of models
> > > (corresponding to different values of the penalty parameters).
> > > That appears to fit better into the pipeline structure than MLlib
> > > linear regression (for example). We've got regression running with
> > > the speed optimizations that Friedman recommends. We'll start working
> > > on the logistic regression version next. We're eager to make the code
> > > available as open source and would like to get some feedback about
> > > how best to do that. Any thoughts?
> > > Mike Bowles.
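(For anyone wanting the gist of why glmnet needs so few data passes: its inner loop is a coordinate-wise soft-thresholding update per the paper linked above (Friedman, Hastie & Tibshirani 2010), and the active-set and covariance-update tricks reuse cached inner products so a full pass is only needed when the active set changes. Below is a minimal single-machine sketch in Scala/breeze of that inner update only — the naive form with explicit residual maintenance, not Mike's distributed code, and with the active-set bookkeeping omitted:)

import breeze.linalg.{DenseMatrix, DenseVector}

object GlmnetInnerLoopSketch {
  // Soft-thresholding S(z, g) = sign(z) * max(|z| - g, 0): the closed-form
  // solution of the one-dimensional lasso subproblem.
  def softThreshold(z: Double, g: Double): Double =
    math.signum(z) * math.max(math.abs(z) - g, 0.0)

  // Naive coordinate descent for elastic net, assuming the columns of x
  // are standardized (mean 0, variance 1). Update from the glmnet paper:
  //   beta_j <- S((1/n) * x_j.r + beta_j, lambda*alpha) / (1 + lambda*(1 - alpha))
  // where r is the current residual, kept in sync incrementally.
  def fit(x: DenseMatrix[Double], y: DenseVector[Double],
          lambda: Double, alpha: Double, sweeps: Int = 100): DenseVector[Double] = {
    val n = x.rows.toDouble
    val beta = DenseVector.zeros[Double](x.cols)
    val r = y.copy // residual y - x*beta; beta starts at zero
    for (_ <- 0 until sweeps; j <- 0 until x.cols) {
      val xj = x(::, j)
      val z = (xj dot r) / n + beta(j)
      val updated = softThreshold(z, lambda * alpha) / (1.0 + lambda * (1.0 - alpha))
      if (updated != beta(j)) {
        r -= xj * (updated - beta(j)) // cheap residual update, no full data pass
        beta(j) = updated
      }
    }
    beta
  }
}

(The division by 1 + lambda*(1 - alpha) is the elastic net shrinkage; with alpha = 1 this reduces to plain lasso coordinate descent.)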