Here's the JIRA: https://issues.apache.org/jira/browse/SPARK-1473
Future discussions should take place in its comments section. Thanks.

On Fri, Apr 11, 2014 at 11:26 AM, Ignacio Zendejas <ignacio.zendejas...@gmail.com> wrote:
> Thanks for the response, Xiangrui.
>
> And sounds good, Héctor. I look forward to working on this together.
>
> A common interface is definitely required. I'll create a JIRA shortly and
> will explore design options myself to bring ideas to the table.
>
> Cheers.
>
> On Fri, Apr 11, 2014 at 5:44 AM, Héctor Mouriño-Talín <hmou...@gmail.com> wrote:
>> Hi,
>>
>> Regarding the implementation of feature selection techniques, I'm
>> implementing some iterative algorithms based on a paper by Gavin Brown et
>> al. [1]. In this paper, they propose a common framework for many
>> Information Theory-based criteria, namely those that use relevancy (the
>> mutual information between one feature and the label; Information Gain),
>> redundancy, and conditional redundancy. The latter two are interpreted
>> differently depending on the criterion, but all of them work with the
>> mutual information between the feature being analyzed and the already
>> selected ones, and with that same mutual information conditioned on the
>> label.
>>
>> I think we should have a common interface for plugging in different
>> feature selection techniques. I already have the algorithm implemented,
>> but I still have to test it. Right now I'm working on the design. Next
>> week I can share a proposal with you, so we can work together to bring
>> feature selection to Spark.
>>
>> [1] Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
>> likelihood maximisation: a unifying framework for information theoretic
>> feature selection. The Journal of Machine Learning Research, 13, 27-66.
>>
>> ---
>> Héctor
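The unifying criterion in [1] scores a candidate feature X_k against the
set S of already selected features as

    J(X_k) = I(X_k; Y) - beta * sum_{j in S} I(X_k; X_j)
                       + gamma * sum_{j in S} I(X_k; X_j | Y)

and different choices of beta and gamma recover familiar criteria: beta =
gamma = 0 gives plain Information Gain ranking (MIM), beta = 1/|S| with
gamma = 0 gives mRMR, and beta = gamma = 1/|S| gives JMI. As a rough Scala
sketch of the pluggable-criterion idea (names and structure are
illustrative only, not Héctor's implementation):

    // Hypothetical common interface for the criteria unified in [1].
    // The mutual-information estimators themselves are left abstract.
    trait InfoTheoreticCriterion {
      def beta(numSelected: Int): Double
      def gamma(numSelected: Int): Double

      def relevancy(k: Int): Double                // I(X_k; Y)
      def redundancy(k: Int, j: Int): Double       // I(X_k; X_j)
      def condRedundancy(k: Int, j: Int): Double   // I(X_k; X_j | Y)

      // J(X_k) for candidate k given the selected feature indices S.
      def score(k: Int, selected: Seq[Int]): Double =
        relevancy(k) -
          beta(selected.size) * selected.map(redundancy(k, _)).sum +
          gamma(selected.size) * selected.map(condRedundancy(k, _)).sum
    }

A greedy forward search would then repeatedly add the unselected feature
with the highest score.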
>> On Fri, Apr 11, 2014 at 5:20 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>> Hi Ignacio,
>>>
>>> Please create a JIRA and send a PR for the information gain
>>> computation, so it is easy to track the progress.
>>>
>>> The sparse vector support for NaiveBayes is already implemented in
>>> branch-1.0 and master. You only need to provide an RDD of sparse
>>> vectors (created from Vectors.sparse).
>>>
>>> MLUtils.loadLibSVMData reads sparse features in LIBSVM format.
>>>
>>> Best,
>>> Xiangrui
>>>
>>> On Thu, Apr 10, 2014 at 5:18 PM, Ignacio Zendejas
>>> <ignacio.zendejas...@gmail.com> wrote:
>>>> Hi, again -
>>>>
>>>> As part of the next step, I'd like to make a more substantive
>>>> contribution and propose some initial work on feature selection,
>>>> primarily as it relates to text classification.
>>>>
>>>> Specifically, I'd like to contribute very straightforward code to
>>>> perform information gain feature evaluation. Below is a good primer
>>>> showing that Information Gain is a very good option in many cases. If
>>>> successful, BNS (introduced in the paper) would be another approach
>>>> worth looking into, as it actually improves the F-score with a
>>>> smaller feature space.
>>>>
>>>> http://machinelearning.wustl.edu/mlpapers/paper_files/Forman03.pdf
>>>>
>>>> And here's my first cut:
>>>> https://github.com/izendejas/spark/commit/e5a0620838841c99865ffa4fb0d2b449751236a8
>>>>
>>>> I don't like that I do two passes to compute the class priors and
>>>> joint distributions, so I'll look into using combineByKey as in the
>>>> NaiveBayes implementation. Also, this is still untested code, but it
>>>> gets my ideas out there, and I think it'd be best to define a
>>>> FeatureEval trait or the like that helps with ranking and selection.
>>>>
>>>> I also realize the above methods are probably more suitable for MLI
>>>> than MLlib, but there doesn't seem to be much activity on the former.
>>>>
>>>> Second, is there a plan to support sparse vector representations for
>>>> NaiveBayes? This would probably be more efficient in, for example,
>>>> text classification tasks with lots of features (consider the case
>>>> where n-grams with n > 1 are used).
>>>>
>>>> And on a related note, MLUtils.loadLabeledData doesn't support
>>>> loading sparse data. Are there any plans to do so? There also doesn't
>>>> seem to be a defined file format for MLlib. Has there been any
>>>> consideration of supporting multiple standard formats rather than
>>>> defining one (e.g., CSV, TSV, Weka's ARFF)?
>>>>
>>>> Thanks for your time,
>>>> Ignacio
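On the two-pass concern: a hypothetical single-pass sketch along the lines
Ignacio describes (untested, names made up; this is not the code in the
commit above). It piggybacks the label counts onto the same combineByKey
pass that gathers the per-feature joint counts:

    import org.apache.spark.SparkContext._
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // One pass over the data. The sentinel index -1 accumulates plain
    // label counts (for the class priors); every other key counts
    // (feature, label) co-occurrences (for the joint distributions).
    def labelAndJointCounts(
        data: RDD[LabeledPoint]): RDD[(Int, Map[Double, Long])] =
      data.flatMap { p =>
        val active = p.features.toArray.zipWithIndex.collect {
          case (v, i) if v != 0.0 => (i, p.label)
        }
        (-1, p.label) +: active
      }.combineByKey[Map[Double, Long]](
        label => Map(label -> 1L),
        (m, label) => m.updated(label, m.getOrElse(label, 0L) + 1L),
        (m1, m2) => m2.foldLeft(m1) { case (m, (l, c)) =>
          m.updated(l, m.getOrElse(l, 0L) + c)
        }
      )

From these counts one can derive P(c) and P(c | t) and rank each binary
feature t by IG(t) = H(C) - P(t) H(C | t) - P(!t) H(C | !t); smoothing and
the actual selection step are left out.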
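For completeness, a minimal sketch of the sparse-vector usage Xiangrui
describes, as run from the spark-shell (where sc is predefined); the data
and file path are made up, and the API names are as of branch-1.0 at the
time of this thread:

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils

    // Sparse vectors of size 1000, listing only the nonzero entries.
    val train = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.sparse(1000, Array(3, 42), Array(1.0, 2.0))),
      LabeledPoint(1.0, Vectors.sparse(1000, Array(7, 42), Array(3.0, 1.0)))))

    // NaiveBayes in branch-1.0/master accepts sparse vectors directly.
    val model = NaiveBayes.train(train, lambda = 1.0)

    // Or load sparse features already in LIBSVM format.
    val data = MLUtils.loadLibSVMData(sc, "data/sample_libsvm_data.txt")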