[ https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445830#comment-13445830 ]
Robert Muir commented on LUCENE-4345:
-------------------------------------
docsWithClassSize should ideally be terms.getDocCount() for the field as well
rather than maxDoc.
docCount() should not do a search; instead, I think it should just return
IR.docFreq(term)?
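
To make the difference between the two denominators concrete, here is a minimal sketch. It assumes the class prior is the fraction of documents carrying a given class label and that the class field is single-valued. The class ClassPriorSketch, its constants, and the sample numbers are hypothetical stand-ins: in Lucene the two statistics would come from IR.docFreq(term) and terms.getDocCount(), not from in-memory values.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the index statistics of the class field.
// Nothing here touches a real index; the numbers are made up for illustration.
class ClassPriorSketch {

    // class label -> number of documents carrying that label (the docFreq analogue)
    static final Map<String, Integer> DOC_FREQ = new TreeMap<>();

    static {
        DOC_FREQ.put("politics", 3);
        DOC_FREQ.put("sports", 1);
    }

    // Documents that have any value in the class field (the getDocCount analogue).
    static final int DOC_COUNT = 4;

    // All documents in the index, labeled or not (the maxDoc analogue).
    static final int MAX_DOC = 10;

    // Prior of a class as docFreq / denominator; unknown labels get 0.
    static double prior(String label, int denominator) {
        Integer df = DOC_FREQ.get(label);
        return df == null ? 0.0 : (double) df / denominator;
    }

    public static void main(String[] args) {
        System.out.println(prior("politics", DOC_COUNT)); // 0.75
        System.out.println(prior("politics", MAX_DOC));   // 0.3
    }
}
```

With the doc count as denominator the priors of the two classes sum to 1; with maxDoc they sum to 0.4, because the six unlabeled documents dilute every class equally.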
One more piece: if classCount is just a Map<UniqueValues,DocFreq>,
it would be a lot better to just compute this with a TermsEnum,
just iterating over the terms for the field.
It seems the "value" part is not used, so for now it could be
just a hashset as well?
This would remove the stored fields loop (replacing it with a termsenum
loop), but I think we can probably remove the loop entirely too as
a second step.
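
A rough sketch of that first step, under the assumption that the field's term dictionary can be walked in sorted order with a docFreq available for each term. The TreeMap below is only a hypothetical stand-in for what a TermsEnum would expose; ClassEnumerationSketch and its method are illustrative names, not part of the patch.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stand-in: a sorted map of term -> docFreq plays the role
// of the class field's term dictionary.
class ClassEnumerationSketch {

    // Collects the unique class values by walking the term dictionary once.
    // The work is proportional to the number of unique terms, not to the
    // number of documents, and no stored fields are loaded.
    static Set<String> classes(TreeMap<String, Integer> termDictionary) {
        Set<String> result = new LinkedHashSet<>();
        for (Map.Entry<String, Integer> entry : termDictionary.entrySet()) {
            // entry.getValue() is the docFreq analogue; it is available here
            // for free, so no separate counting pass is needed.
            result.add(entry.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> dictionary = new TreeMap<>();
        dictionary.put("sports", 1);
        dictionary.put("politics", 3);
        System.out.println(classes(dictionary)); // [politics, sports]
    }
}
```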
I don't like that assignClass has a loop over all possible terms in the
field, re-tokenizing the doc for each one!
It seems we don't need this classCount map at all, nor the priors map?
Instead we would just tokenize each doc a single time and compute the prior of
the terms we find on the fly (it seems to just be IDF anyway, really).
And we wouldn't need any maps for that.
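
One possible reading of this, as a sketch only: tokenize the text once and, for each token, look up its document frequency at that moment and turn it into an IDF-style weight. The names below are hypothetical, whitespace splitting stands in for a real analyzer, the docFreqLookup function stands in for a call such as IR.docFreq(term), and log(docCount / docFreq) is one common IDF form rather than the formula the patch uses. This is not the full Naive Bayes scoring.

```java
import java.util.Locale;
import java.util.function.ToIntFunction;

// Hypothetical sketch: one tokenization pass, per-term weights computed
// on the fly, no classCount or priors map kept by the classifier.
class SinglePassSketch {

    // IDF-style weight; assumes docFreq > 0 and docCount >= docFreq.
    static double idf(int docFreq, int docCount) {
        return Math.log((double) docCount / docFreq);
    }

    // Sums the weights of the tokens in the text. docFreqLookup stands in
    // for asking the index how many documents contain the token.
    static double score(String text, ToIntFunction<String> docFreqLookup, int docCount) {
        double sum = 0.0;
        // Whitespace splitting is a placeholder for a real analyzer; the
        // text is tokenized exactly once.
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields a single empty token
            }
            int docFreq = docFreqLookup.applyAsInt(token);
            if (docFreq > 0) {
                sum += idf(docFreq, docCount); // unseen tokens contribute nothing
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up statistics: "goal" occurs in 1 of 4 docs, everything else is unseen.
        ToIntFunction<String> lookup = t -> t.equals("goal") ? 1 : 0;
        System.out.println(score("Goal scored late", lookup, 4));
    }
}
```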
> Create a Classification module
> ------------------------------
>
> Key: LUCENE-4345
> URL: https://issues.apache.org/jira/browse/LUCENE-4345
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Tommaso Teofili
> Assignee: Tommaso Teofili
> Priority: Minor
> Attachments: LUCENE-4345.patch, SOLR-3700_2.patch, SOLR-3700.patch
>
>
> Lucene/Solr can host huge sets of documents containing lots of information in
> fields so that these can be used as training examples (w/ features) in order
> to very quickly create classifier algorithms to use on new documents and /
> or to provide an additional service.
> So the idea is to create a contrib module (called 'classification') to host a
> ClassificationComponent that will use already seen data (the indexed
> documents / fields) to classify new documents / text fragments.
> The first version will contain a (simplistic) Lucene based Naive Bayes
> classifier but more implementations should be added in the future.