[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456415#comment-13456415
 ] 

Robert Muir commented on LUCENE-4345:
-------------------------------------

I don't think this should be using payloads to pull POS tags: payloads are
for when you need something stored in the actual index (and should be
limited to e.g. a single byte), and they are not type-safe but
application-specific.

Instead, such taggers should expose a type-safe PartOfSpeechAttribute as
suggested in the o.a.l.analysis package javadocs. If they want to put POS
into the index for e.g. payload-based queries, that's a separate concern:
they should have a separate tokenfilter that encodes the POS attribute into
the payload, so this is optional (as it has tradeoffs in the index). See
TypeAsPayloadTokenFilter etc. as an example of what I mean. But for this
module we don't need anything in the index.
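
To illustrate the separation described above, here is a minimal plain-Java
sketch. It deliberately does not use Lucene's TokenStream or attribute
classes: TaggedToken, PosPayloadEncoder, and the tag-to-byte table are all
hypothetical stand-ins. The point is only that the tagger sets a typed POS
value, and a separate, optional stage copies it into a one-byte payload.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java stand-in for a token; NOT Lucene's attribute API. */
final class TaggedToken {
    final String term;
    final String posTag;   // typed value set by the tagger
    final byte[] payload;  // stays null unless the encoding stage is added

    TaggedToken(String term, String posTag, byte[] payload) {
        this.term = term;
        this.posTag = posTag;
        this.payload = payload;
    }
}

/** Hypothetical optional stage: encodes the POS tag into a single payload byte. */
final class PosPayloadEncoder {
    // Illustrative table only; a real mapping would be application-specific.
    static byte encode(String posTag) {
        switch (posTag) {
            case "NN": return 1;
            case "VB": return 2;
            default:   return 0; // unknown tag
        }
    }

    /** Returns copies of the tokens with a one-byte payload attached. */
    static List<TaggedToken> apply(List<TaggedToken> in) {
        List<TaggedToken> out = new ArrayList<>();
        for (TaggedToken t : in) {
            out.add(new TaggedToken(t.term, t.posTag, new byte[] { encode(t.posTag) }));
        }
        return out;
    }
}
```

A chain that only needs POS at analysis time would skip PosPayloadEncoder
entirely, so nothing extra ends up in the index.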

If we think it's useful for classifiers to limit the analysis to certain POS
categories, then instead we should factor out a *minimal* POSAttribute
sub-interface with something very generic like isNominal()/isVerbal() that
can actually be implemented by different taggers with different tag sets
across different languages.

Then things like kuromoji's POSAttribute, OpenNLP's POSAttribute, or even
your custom home-grown one, or some commercial one could extend this
sub-interface and plug into it.

At least I think this is possible with our attributes API :)
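
A rough plain-Java sketch of the minimal sub-interface idea follows. It does
not extend Lucene's Attribute interface. CoarsePos, PennStylePos,
UniversalStylePos, PosToken, and NominalOnly are hypothetical names, and the
tag-to-category mappings are illustrative assumptions rather than any real
tagger's behavior.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical minimal interface: coarse questions any tagger could answer. */
interface CoarsePos {
    boolean isNominal();
    boolean isVerbal();
}

/** Stand-in for a tagger with Penn-Treebank-style tags such as NN, NNS, VB, VBD. */
final class PennStylePos implements CoarsePos {
    private final String tag;

    PennStylePos(String tag) { this.tag = tag; }

    @Override public boolean isNominal() { return tag.startsWith("NN"); }
    @Override public boolean isVerbal()  { return tag.startsWith("VB"); }
}

/** Stand-in for a tagger with a different, coarser tag set. */
enum UniversalStylePos implements CoarsePos {
    NOUN, PROPN, VERB, ADJ;

    @Override public boolean isNominal() { return this == NOUN || this == PROPN; }
    @Override public boolean isVerbal()  { return this == VERB; }
}

/** A term paired with whatever POS implementation its tagger produced. */
final class PosToken {
    final String term;
    final CoarsePos pos;

    PosToken(String term, CoarsePos pos) {
        this.term = term;
        this.pos = pos;
    }
}

/** Consumer that depends only on the minimal interface, not on a tag set. */
final class NominalOnly {
    static List<String> terms(List<PosToken> tokens) {
        List<String> kept = new ArrayList<>();
        for (PosToken t : tokens) {
            if (t.pos.isNominal()) {
                kept.add(t.term);
            }
        }
        return kept;
    }
}
```

NominalOnly never inspects a concrete tag, so a classifier written against
CoarsePos would work with either tagger, or with one added later.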

                
> Create a Classification module
> ------------------------------
>
>                 Key: LUCENE-4345
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4345
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Tommaso Teofili
>            Assignee: Tommaso Teofili
>            Priority: Minor
>         Attachments: LUCENE-4345_2.patch, LUCENE-4345.patch, 
> SOLR-3700_2.patch, SOLR-3700.patch
>
>
> Lucene/Solr can host huge sets of documents containing lots of information
> in fields, so these can be used as training examples (w/ features) in order
> to very quickly create classifier algorithms to use on new documents and /
> or to provide an additional service.
> So the idea is to create a contrib module (called 'classification') to host
> a ClassificationComponent that will use already seen data (the indexed
> documents / fields) to classify new documents / text fragments.
> The first version will contain a (simplistic) Lucene-based Naive Bayes
> classifier, but more implementations should be added in the future.

