Here's the JIRA: https://issues.apache.org/jira/browse/SPARK-1473
Future discussions should take place in its comments section. Thanks.

On Fri, Apr 11, 2014 at 11:26 AM, Ignacio Zendejas <ignacio.zendejas...@gmail.com> wrote:
> Thanks for the response, Xiangrui.
>
> And sounds good, Héctor. I look forward to working on this together.
>
> A common interface is definitely required. I'll create a JIRA shortly and
> will explore design options myself to bring ideas to the table.
>
> Cheers.
>
> On Fri, Apr 11, 2014 at 5:44 AM, Héctor Mouriño-Talín <hmou...@gmail.com> wrote:
>> Hi,
>>
>> Regarding the implementation of feature selection techniques, I'm
>> implementing some iterative algorithms based on a paper by Gavin Brown et
>> al. [1]. In this paper, they propose a common framework for many
>> Information Theory-based criteria, namely those that use relevancy (the
>> mutual information between one feature and the label; Information Gain),
>> redundancy, and conditional redundancy. The latter two are interpreted
>> differently depending on the criterion, but all of them work with the
>> mutual information between the feature being analyzed and the already
>> selected ones, and with that same mutual information conditioned on the
>> label.
>>
>> I think we should have a common interface for plugging in different
>> feature selection techniques. I already have the algorithm implemented,
>> but I still have to test it. Right now I'm working on the design. Next
>> week I can share a proposal with you, so we can work together to bring
>> feature selection to Spark.
>>
>> [1] Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
>> likelihood maximisation: a unifying framework for information theoretic
>> feature selection. The Journal of Machine Learning Research, 13, 27-66.
>>
>> ---
>> Héctor
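The unifying criterion in [1] scores a candidate feature X_k against the
set S of already selected features as

    J(X_k) = I(X_k; Y) - beta * sum_{j in S} I(X_k; X_j)
                       + gamma * sum_{j in S} I(X_k; X_j | Y)

and different choices of beta and gamma recover familiar criteria: beta =
gamma = 0 gives plain Information Gain ranking (MIM), beta = 1/|S| with
gamma = 0 gives mRMR, and beta = gamma = 1/|S| gives JMI. As a rough Scala
sketch of the pluggable-criterion idea (names and structure are
illustrative only, not Héctor's implementation):

    // Hypothetical common interface for the criteria unified in [1].
    // The mutual-information estimators themselves are left abstract.
    trait InfoTheoreticCriterion {
      def beta(numSelected: Int): Double
      def gamma(numSelected: Int): Double

      def relevancy(k: Int): Double                // I(X_k; Y)
      def redundancy(k: Int, j: Int): Double       // I(X_k; X_j)
      def condRedundancy(k: Int, j: Int): Double   // I(X_k; X_j | Y)

      // J(X_k) for candidate k given the selected feature indices S.
      def score(k: Int, selected: Seq[Int]): Double =
        relevancy(k) -
          beta(selected.size) * selected.map(redundancy(k, _)).sum +
          gamma(selected.size) * selected.map(condRedundancy(k, _)).sum
    }

A greedy forward search would then repeatedly add the unselected feature
with the highest score.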
>> On Fri, Apr 11, 2014 at 5:20 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>> Hi Ignacio,
>>>
>>> Please create a JIRA and send a PR for the information gain
>>> computation, so it is easy to track the progress.
>>>
>>> The sparse vector support for NaiveBayes is already implemented in
>>> branch-1.0 and master. You only need to provide an RDD of sparse
>>> vectors (created from Vectors.sparse).
>>>
>>> MLUtils.loadLibSVMData reads sparse features in LIBSVM format.
>>>
>>> Best,
>>> Xiangrui
>>>
>>> On Thu, Apr 10, 2014 at 5:18 PM, Ignacio Zendejas
>>> <ignacio.zendejas...@gmail.com> wrote:
>>>> Hi, again -
>>>>
>>>> As part of the next step, I'd like to make a more substantive
>>>> contribution and propose some initial work on feature selection,
>>>> primarily as it relates to text classification.
>>>>
>>>> Specifically, I'd like to contribute very straightforward code to
>>>> perform information gain feature evaluation. Below is a good primer
>>>> showing that Information Gain is a very good option in many cases. If
>>>> successful, BNS (introduced in the paper) would be another approach
>>>> worth looking into, as it actually improves the F-score with a
>>>> smaller feature space.
>>>>
>>>> http://machinelearning.wustl.edu/mlpapers/paper_files/Forman03.pdf
>>>>
>>>> And here's my first cut:
>>>> https://github.com/izendejas/spark/commit/e5a0620838841c99865ffa4fb0d2b449751236a8
>>>>
>>>> I don't like that I do two passes to compute the class priors and
>>>> joint distributions, so I'll look into using combineByKey as in the
>>>> NaiveBayes implementation. Also, this is still untested code, but it
>>>> gets my ideas out there, and I think it'd be best to define a
>>>> FeatureEval trait or the like that helps with ranking and selection.
>>>>
>>>> I also realize the above methods are probably more suitable for MLI
>>>> than MLlib, but there doesn't seem to be much activity on the former.
>>>>
>>>> Second, is there a plan to support sparse vector representations for
>>>> NaiveBayes? This would probably be more efficient in, for example,
>>>> text classification tasks with lots of features (consider the case
>>>> where n-grams with n > 1 are used).
>>>>
>>>> And on a related note, MLUtils.loadLabeledData doesn't support
>>>> loading sparse data. Are there any plans to do so? There also doesn't
>>>> seem to be a defined file format for MLlib. Has there been any
>>>> consideration of supporting multiple standard formats rather than
>>>> defining one (e.g., CSV, TSV, Weka's ARFF)?
>>>>
>>>> Thanks for your time,
>>>> Ignacio
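On the two-pass concern: a hypothetical single-pass sketch along the lines
Ignacio describes (untested, names made up; this is not the code in the
commit above). It piggybacks the label counts onto the same combineByKey
pass that gathers the per-feature joint counts:

    import org.apache.spark.SparkContext._
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // One pass over the data. The sentinel index -1 accumulates plain
    // label counts (for the class priors); every other key counts
    // (feature, label) co-occurrences (for the joint distributions).
    def labelAndJointCounts(
        data: RDD[LabeledPoint]): RDD[(Int, Map[Double, Long])] =
      data.flatMap { p =>
        val active = p.features.toArray.zipWithIndex.collect {
          case (v, i) if v != 0.0 => (i, p.label)
        }
        (-1, p.label) +: active
      }.combineByKey[Map[Double, Long]](
        label => Map(label -> 1L),
        (m, label) => m.updated(label, m.getOrElse(label, 0L) + 1L),
        (m1, m2) => m2.foldLeft(m1) { case (m, (l, c)) =>
          m.updated(l, m.getOrElse(l, 0L) + c)
        }
      )

From these counts one can derive P(c) and P(c | t) and rank each binary
feature t by IG(t) = H(C) - P(t) H(C | t) - P(!t) H(C | !t); smoothing and
the actual selection step are left out.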
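For completeness, a minimal sketch of the sparse-vector usage Xiangrui
describes, as run from the spark-shell (where sc is predefined); the data
and file path are made up, and the API names are as of branch-1.0 at the
time of this thread:

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils

    // Sparse vectors of size 1000, listing only the nonzero entries.
    val train = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.sparse(1000, Array(3, 42), Array(1.0, 2.0))),
      LabeledPoint(1.0, Vectors.sparse(1000, Array(7, 42), Array(3.0, 1.0)))))

    // NaiveBayes in branch-1.0/master accepts sparse vectors directly.
    val model = NaiveBayes.train(train, lambda = 1.0)

    // Or load sparse features already in LIBSVM format.
    val data = MLUtils.loadLibSVMData(sc, "data/sample_libsvm_data.txt")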