Cao Manh Dat created SOLR-9252:
----------------------------------
Summary: Feature selection and logistic regression on text
Key: SOLR-9252
URL: https://issues.apache.org/jira/browse/SOLR-9252
Project: Solr
Issue Type: Improvement
Security Level: Public (Default Security Level. Issues are Public)
Reporter: Cao Manh Dat
SOLR-9186 raised a challenge: on each iteration we have to rebuild the tf-idf
vector for every document. This computation is costly when a document is
represented by a large number of terms. Feature selection can help reduce it.
Due to its computational efficiency and simple interpretation, information gain
is one of the most popular feature selection methods. It measures the
dependence between features and labels by computing the information gain
between the i-th feature and the class labels
(http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).
I confirmed this by running logistic regression on the Enron mail dataset (in
which each email is represented by the top 100 terms with the highest
information gain), which gave an accuracy of 92% and a precision of 82%.
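To make the scoring concrete, here is a minimal Python sketch of information
gain for a binary outcome. It is only an illustration, not the code proposed in
this ticket: it assumes each document is a set of terms and each label is 0 or
1, and the names entropy, information_gain and top_terms are hypothetical.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Information gain between the presence of `term` and the class label.

    docs   -- list of sets of terms (assumed representation)
    labels -- list of 0/1 outcomes, 1 being the positive label
    """
    n = len(docs)
    positives = sum(labels)
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]

    # H(class) minus the weighted entropy of the class given term presence.
    h_class = entropy([positives, n - positives])
    h_conditional = 0.0
    for subset in (with_term, without_term):
        if subset:
            pos = sum(subset)
            h_conditional += len(subset) / n * entropy([pos, len(subset) - pos])
    return h_class - h_conditional


def top_terms(docs, labels, num_terms):
    """Return the num_terms terms with the highest information gain."""
    vocabulary = set().union(*docs)
    # Sort by descending score; ties are broken alphabetically.
    ranked = sorted(vocabulary,
                    key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:num_terms]


docs = [{"free", "money"}, {"free", "offer"},
        {"meeting", "notes"}, {"meeting", "agenda"}]
labels = [1, 1, 0, 0]
selected = top_terms(docs, labels, 2)
```

In this toy input "free" and "meeting" each split the two classes perfectly, so
they rank above terms that appear in only one document.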
This ticket will create two new streaming expressions. Both use the same
*parallel iterative framework* as SOLR-8492.
```
featuresSelection(collection1, q="*:*", field="tv_text", outcome="out_i",
                  positiveLabel=1, numTerms=100)
```
featuresSelection will emit the top terms with the highest information gain
scores. It can be combined with the new tlogit stream.
```
tlogit(collection1, q="*:*",
       featuresSelection(collection1,
                         q="*:*",
                         field="tv_text",
                         outcome="out_i",
                         positiveLabel=1,
                         numTerms=100),
       field="tv_text",
       outcome="out_i",
       maxIterations=100)
```
This will support use cases such as building models for spam detection,
sentiment analysis and threat detection.
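As a rough illustration of what training on the selected terms involves, the
following Python sketch runs batch gradient descent for logistic regression
over only those terms. It is not the tlogit implementation: it uses binary
term presence instead of tf-idf weights, runs on a single node rather than in
the parallel iterative framework, and learning_rate and all function names are
assumptions made for the example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def vectorize(doc_terms, selected_terms):
    """Bias feature followed by one 0/1 presence feature per selected term."""
    return [1.0] + [1.0 if t in doc_terms else 0.0 for t in selected_terms]


def train_logit(docs, labels, selected_terms,
                max_iterations=100, learning_rate=0.5):
    """Fit weights by batch gradient descent on the logistic loss."""
    xs = [vectorize(d, selected_terms) for d in docs]
    weights = [0.0] * (len(selected_terms) + 1)
    for _ in range(max_iterations):
        gradient = [0.0] * len(weights)
        for x, y in zip(xs, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xj in enumerate(x):
                gradient[j] += error * xj
        # Average the gradient over the documents and take one step.
        weights = [w - learning_rate * g / len(xs)
                   for w, g in zip(weights, gradient)]
    return weights


def predict(weights, doc_terms, selected_terms):
    """Return 1 (positive label) when the predicted probability is >= 0.5."""
    x = vectorize(doc_terms, selected_terms)
    score = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
    return 1 if score >= 0.5 else 0


docs = [{"free", "money"}, {"free", "offer"},
        {"meeting", "notes"}, {"meeting", "agenda"}]
labels = [1, 1, 0, 0]
selected_terms = ["free", "meeting"]  # e.g. the output of feature selection
weights = train_logit(docs, labels, selected_terms, max_iterations=100)
```

Because each document vector has only len(selected_terms) + 1 entries, the
per-iteration cost depends on the number of selected terms rather than on the
full vocabulary.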