Thanks for the response, Xiangrui. And sounds good, Héctor. Look forward to working on this together.
A common interface is definitely required. I'll create a JIRA shortly and will
explore design options myself to bring ideas to the table. Cheers.

On Fri, Apr 11, 2014 at 5:44 AM, Héctor Mouriño-Talín <hmou...@gmail.com> wrote:

> Hi,
>
> Regarding the implementation of feature selection techniques, I'm
> implementing some iterative algorithms based on a paper by Gavin Brown et
> al. [1]. In this paper, he proposes a common framework for many
> Information Theory-based criteria, namely those that use relevancy (the
> mutual information between one feature and the label; Information Gain),
> redundancy, and conditional redundancy. The latter two are interpreted
> differently depending on the criterion, but all of them combine the
> mutual information between the feature being analyzed and the already
> selected ones with that same mutual information conditioned on the label.
>
> I think we should have a common interface for plugging in different
> feature selection techniques. I already have the algorithm implemented,
> but I still have to test it. Right now I'm working on the design. Next
> week I can share a proposal with you, so we can work together to bring
> feature selection to Spark.
>
> [1] Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
> likelihood maximisation: a unifying framework for information theoretic
> feature selection. *The Journal of Machine Learning Research*, *13*,
> 27-66.
>
> ---
> Héctor
>
>
> On Fri, Apr 11, 2014 at 5:20 AM, Xiangrui Meng <men...@gmail.com> wrote:
>
> > Hi Ignacio,
> >
> > Please create a JIRA and send a PR for the information gain
> > computation, so it is easy to track the progress.
> >
> > The sparse vector support for NaiveBayes is already implemented in
> > branch-1.0 and master. You only need to provide an RDD of sparse
> > vectors (created from Vectors.sparse).
> >
> > MLUtils.loadLibSVMData reads sparse features in LIBSVM format.
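For reference, the LIBSVM format mentioned above stores each example as
`label index:value index:value ...` with 1-based indices, which maps directly
onto the (indices, values) pair that Vectors.sparse takes. A minimal sketch of
that parsing in plain Scala (no Spark dependency; `parseLine` is an
illustrative name, not MLlib API):

```scala
object LibSVMParseSketch {
  // Parse one LIBSVM line, e.g. "1.0 3:0.5 7:1.2", into
  // (label, 0-based indices, values) -- the shape Vectors.sparse expects.
  def parseLine(line: String): (Double, Array[Int], Array[Double]) = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val pairs = tokens.tail.map { t =>
      val Array(i, v) = t.split(":")
      (i.toInt - 1, v.toDouble) // LIBSVM indices are 1-based
    }
    val (indices, values) = pairs.unzip
    (label, indices, values)
  }
}
```

For example, `parseLine("1.0 3:0.5 7:1.2")` yields label 1.0 with 0-based
indices Array(2, 6) and values Array(0.5, 1.2).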
> >
> > Best,
> > Xiangrui
> >
> > On Thu, Apr 10, 2014 at 5:18 PM, Ignacio Zendejas
> > <ignacio.zendejas...@gmail.com> wrote:
> >
> > > Hi, again -
> > >
> > > As part of the next step, I'd like to make a more substantive
> > > contribution and propose some initial work on feature selection,
> > > primarily as it relates to text classification.
> > >
> > > Specifically, I'd like to contribute very straightforward code to
> > > perform information gain feature evaluation. Below is a good primer
> > > showing that information gain is a very good option in many cases.
> > > If successful, BNS (introduced in the paper) would be another
> > > approach worth looking into, as it actually improves the F-score
> > > with a smaller feature space.
> > >
> > > http://machinelearning.wustl.edu/mlpapers/paper_files/Forman03.pdf
> > >
> > > And here's my first cut:
> > >
> > > https://github.com/izendejas/spark/commit/e5a0620838841c99865ffa4fb0d2b449751236a8
> > >
> > > I don't like that I do two passes to compute the class priors and
> > > joint distributions, so I'll look into using combineByKey as in the
> > > NaiveBayes implementation. Also, this is still untested code, but it
> > > gets my ideas out there, and I think it'd be best to define a
> > > FeatureEval trait or the like to help with ranking and selecting.
> > >
> > > I also realize the above methods are probably more suitable for MLI
> > > than MLlib, but there doesn't seem to be much activity on the
> > > former.
> > >
> > > Second, is there a plan to support sparse vector representations for
> > > NaiveBayes? This would probably be more efficient in, for example,
> > > text classification tasks with lots of features (consider the case
> > > where n-grams with n > 1 are used).
> > >
> > > And on a related note, MLUtils.loadLabeledData doesn't support
> > > loading sparse data. Any plans to do so? There also doesn't seem to
> > > be a defined file format for MLlib.
> > > Has there been any consideration to supporting multiple standard
> > > formats, rather than defining one: e.g., CSV, TSV, Weka's ARFF,
> > > etc.?
> > >
> > > Thanks for your time,
> > > Ignacio
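The information gain evaluator proposed above boils down to
IG(X) = H(C) - H(C|X) for the class variable C and a (here binary) feature X.
A small self-contained sketch of that criterion in plain Scala, independent of
the linked commit (object and method names are illustrative, not the actual
patch):

```scala
object InfoGainSketch {
  // Shannon entropy in bits of a distribution given as raw counts.
  def entropy(counts: Iterable[Int]): Double = {
    val total = counts.sum.toDouble
    counts.filter(_ > 0).map { c =>
      val p = c / total
      -p * math.log(p) / math.log(2)
    }.sum
  }

  // Information gain of a binary feature w.r.t. the class label:
  // IG = H(C) - sum_x P(X = x) * H(C | X = x)
  // `data` is a sequence of (featurePresent, label) pairs.
  def infoGain(data: Seq[(Boolean, String)]): Double = {
    val n = data.size.toDouble
    val hClass = entropy(data.groupBy(_._2).values.map(_.size))
    val hCond = data.groupBy(_._1).values.map { group =>
      (group.size / n) * entropy(group.groupBy(_._2).values.map(_.size))
    }.sum
    hClass - hCond
  }
}
```

A feature that perfectly separates two balanced classes scores 1.0 bit, while
a feature independent of the label scores 0; ranking features by this score
and keeping the top k is the selection step.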