Thanks for the response, Xiangrui. And sounds good, Héctor. Look forward to working on this together.
A common interface is definitely required. I'll create a JIRA shortly and will
explore design options myself to bring ideas to the table. Cheers.

On Fri, Apr 11, 2014 at 5:44 AM, Héctor Mouriño-Talín <hmou...@gmail.com> wrote:

> Hi,
>
> Regarding the implementation of feature selection techniques, I'm
> implementing some iterative algorithms based on a paper by Gavin Brown et
> al. [1]. In this paper, he proposes a common framework for many
> Information Theory-based criteria, namely those that use relevancy (the
> mutual information between one feature and the label; Information Gain),
> redundancy, and conditional redundancy. The latter two are interpreted
> differently depending on the criterion, but all of them combine the
> mutual information between the feature being analyzed and the already
> selected ones with that same mutual information conditioned on the label.
>
> I think we should have a common interface for plugging in different
> feature selection techniques. I already have the algorithm implemented,
> but I still have to test it. Right now I'm working on the design. Next
> week I can share a proposal with you, so we can work together to bring
> feature selection to Spark.
>
> [1] Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
> likelihood maximisation: a unifying framework for information theoretic
> feature selection. *The Journal of Machine Learning Research*, *13*,
> 27-66.
>
> ---
> Héctor
>
>
> On Fri, Apr 11, 2014 at 5:20 AM, Xiangrui Meng <men...@gmail.com> wrote:
>
> > Hi Ignacio,
> >
> > Please create a JIRA and send a PR for the information gain
> > computation, so it is easy to track the progress.
> >
> > The sparse vector support for NaiveBayes is already implemented in
> > branch-1.0 and master. You only need to provide an RDD of sparse
> > vectors (created from Vectors.sparse).
> >
> > MLUtils.loadLibSVMData reads sparse features in LIBSVM format.
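For reference, the LIBSVM format mentioned above stores each example as
`label index:value index:value ...` with 1-based indices, which maps directly
onto the (indices, values) pair that Vectors.sparse takes. A minimal sketch of
that parsing in plain Scala (no Spark dependency; `parseLine` is an
illustrative name, not MLlib API):

```scala
object LibSVMParseSketch {
  // Parse one LIBSVM line, e.g. "1.0 3:0.5 7:1.2", into
  // (label, 0-based indices, values) -- the shape Vectors.sparse expects.
  def parseLine(line: String): (Double, Array[Int], Array[Double]) = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val pairs = tokens.tail.map { t =>
      val Array(i, v) = t.split(":")
      (i.toInt - 1, v.toDouble) // LIBSVM indices are 1-based
    }
    val (indices, values) = pairs.unzip
    (label, indices, values)
  }
}
```

For example, `parseLine("1.0 3:0.5 7:1.2")` yields label 1.0 with 0-based
indices Array(2, 6) and values Array(0.5, 1.2).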
> >
> > Best,
> > Xiangrui
> >
> > On Thu, Apr 10, 2014 at 5:18 PM, Ignacio Zendejas
> > <ignacio.zendejas...@gmail.com> wrote:
> >
> > > Hi, again -
> > >
> > > As part of the next step, I'd like to make a more substantive
> > > contribution and propose some initial work on feature selection,
> > > primarily as it relates to text classification.
> > >
> > > Specifically, I'd like to contribute very straightforward code to
> > > perform information gain feature evaluation. Below is a good primer
> > > showing that information gain is a very good option in many cases.
> > > If successful, BNS (introduced in the paper) would be another
> > > approach worth looking into, as it actually improves the F-score
> > > with a smaller feature space.
> > >
> > > http://machinelearning.wustl.edu/mlpapers/paper_files/Forman03.pdf
> > >
> > > And here's my first cut:
> > >
> > > https://github.com/izendejas/spark/commit/e5a0620838841c99865ffa4fb0d2b449751236a8
> > >
> > > I don't like that I do two passes to compute the class priors and
> > > joint distributions, so I'll look into using combineByKey as in the
> > > NaiveBayes implementation. Also, this is still untested code, but it
> > > gets my ideas out there, and I think it'd be best to define a
> > > FeatureEval trait or the like to help with ranking and selecting.
> > >
> > > I also realize the above methods are probably more suitable for MLI
> > > than MLlib, but there doesn't seem to be much activity on the
> > > former.
> > >
> > > Second, is there a plan to support sparse vector representations for
> > > NaiveBayes? This would probably be more efficient in, for example,
> > > text classification tasks with lots of features (consider the case
> > > where n-grams with n > 1 are used).
> > >
> > > And on a related note, MLUtils.loadLabeledData doesn't support
> > > loading sparse data. Any plans to do so? There also doesn't seem to
> > > be a defined file format for MLlib.
> > > Has there been any consideration to supporting multiple standard
> > > formats, rather than defining one: e.g., CSV, TSV, Weka's ARFF,
> > > etc.?
> > >
> > > Thanks for your time,
> > > Ignacio
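The information gain evaluator proposed above boils down to
IG(X) = H(C) - H(C|X) for the class variable C and a (here binary) feature X.
A small self-contained sketch of that criterion in plain Scala, independent of
the linked commit (object and method names are illustrative, not the actual
patch):

```scala
object InfoGainSketch {
  // Shannon entropy in bits of a distribution given as raw counts.
  def entropy(counts: Iterable[Int]): Double = {
    val total = counts.sum.toDouble
    counts.filter(_ > 0).map { c =>
      val p = c / total
      -p * math.log(p) / math.log(2)
    }.sum
  }

  // Information gain of a binary feature w.r.t. the class label:
  // IG = H(C) - sum_x P(X = x) * H(C | X = x)
  // `data` is a sequence of (featurePresent, label) pairs.
  def infoGain(data: Seq[(Boolean, String)]): Double = {
    val n = data.size.toDouble
    val hClass = entropy(data.groupBy(_._2).values.map(_.size))
    val hCond = data.groupBy(_._1).values.map { group =>
      (group.size / n) * entropy(group.groupBy(_._2).values.map(_.size))
    }.sum
    hClass - hCond
  }
}
```

A feature that perfectly separates two balanced classes scores 1.0 bit, while
a feature independent of the label scores 0; ranking features by this score
and keeping the top k is the selection step.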