[jira] [Created] (SOLR-9252) Feature selection and logistic regression on text

Cao Manh Dat (JIRA) Sun, 26 Jun 2016 08:34:05 -0700

Cao Manh Dat created SOLR-9252:
----------------------------------

             Summary: Feature selection and logistic regression on text
                 Key: SOLR-9252
                 URL: https://issues.apache.org/jira/browse/SOLR-9252
             Project: Solr
          Issue Type: Improvement
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Cao Manh Dat



SOLR-9186 raised a challenge: on each iteration we have to rebuild the tf-idf 
vector for every document. This is a costly computation when each document is 
represented by many terms. Feature selection can help reduce the computation.

Due to its computational efficiency and simple interpretation, information gain 
is one of the most popular feature selection methods. It measures the 
dependence between features and labels: for the i-th feature, it computes the 
information gain between that feature and the class labels 
(http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).
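
To make the scoring concrete, here is a minimal Python sketch of information 
gain for a binary outcome, computed as H(label) - H(label | term present). It 
is only an illustration under stated assumptions, not the code proposed in this 
ticket: the names `entropy`, `information_gain` and `top_terms` are 
hypothetical, and each document is assumed to be a plain set of terms.

```python
import math


def entropy(pos, total):
    """Binary entropy (in bits) of a label set with `pos` positives out of `total`."""
    if total == 0 or pos == 0 or pos == total:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(docs, labels, term):
    """Information gain between the presence of `term` and the 0/1 labels.

    docs   -- list of sets of terms (assumed representation, one set per document)
    labels -- list of 0/1 outcomes, aligned with docs
    """
    n = len(docs)
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    h_labels = entropy(sum(labels), n)
    # Conditional entropy: weight each partition's entropy by its share of the documents.
    h_conditional = sum(
        len(part) / n * entropy(sum(part), len(part))
        for part in (with_term, without_term)
    )
    return h_labels - h_conditional


def top_terms(docs, labels, num_terms):
    """Return the `num_terms` terms with the highest information gain (ties broken alphabetically)."""
    vocabulary = set().union(*docs)
    ranked = sorted(vocabulary, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:num_terms]
```

A term that appears only in positive documents (or only in negative ones) 
gets the highest score, while a term spread evenly across both classes scores 
close to zero.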

I confirmed this by running logistic regression on the Enron mail dataset (in 
which each email is represented by the 100 terms that have the highest 
information gain) and obtained 92% accuracy and 82% precision.

This ticket will create two new streaming expressions. Both of them use the 
same *parallel iterative framework* as SOLR-8492.

```
featuresSelection(collection1, q="*:*", field="tv_text", outcome="out_i", 
                  positiveLabel=1, numTerms=100)
```
featuresSelection emits the top terms that have the highest information gain 
scores. It can be combined with the new tlogit stream.

```
tlogit(collection1, q="*:*",
       featuresSelection(collection1,
                         q="*:*",
                         field="tv_text",
                         outcome="out_i",
                         positiveLabel=1,
                         numTerms=100),
       field="tv_text",
       outcome="out_i",
       maxIterations=100)
```
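
As a rough illustration of what training on the selected terms involves, the 
Python sketch below restricts each document to a given list of selected terms 
and fits a logistic regression with batch gradient descent. It is a sketch 
under assumptions, not the tlogit implementation: `vectorize`, `train_logit`, 
`predict` and `learning_rate` are hypothetical names, and binary term presence 
stands in for the tf-idf weights mentioned above to keep the example short.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def vectorize(doc_terms, selected_terms):
    """Binary presence vector over the selected terms only (simplification of tf-idf)."""
    return [1.0 if t in doc_terms else 0.0 for t in selected_terms]


def predict(weights, x):
    """Probability of the positive label; the last weight is the bias."""
    z = weights[-1] + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)


def train_logit(vectors, labels, max_iterations=100, learning_rate=0.5):
    """Fit weights with batch gradient descent on the log loss."""
    n_features = len(vectors[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(max_iterations):
        gradient = [0.0] * (n_features + 1)
        # One full pass over the documents per iteration.
        for x, y in zip(vectors, labels):
            error = predict(weights, x) - y
            for i, xi in enumerate(x):
                gradient[i] += error * xi
            gradient[-1] += error
        weights = [
            w - learning_rate * g / len(vectors)
            for w, g in zip(weights, gradient)
        ]
    return weights
```

Because every iteration makes a full pass over the document vectors, keeping 
only the selected terms directly reduces the work done per iteration.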

This will support use cases such as building models for spam detection, 
sentiment analysis and threat detection.




