[jira] Commented: (SOLR-2051) analysis.jsp is incorrect for protWords etc

Uwe Schindler (JIRA) Mon, 16 Aug 2010 15:16:42 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899133#action_12899133
 ]


Uwe Schindler commented on SOLR-2051:
-------------------------------------

After a discussion with Robert, I also think that a Tap would be an elegant and 
less intrusive approach (from the TokenStream's point of view). The whole thing 
would simply create the Tokenizer, wrap the tap filter around it, then add the 
next filter in the chain, add the tap again, and so on.

The filter simply calls input.incrementToken() and then prints the current 
attributes. It can also hold a local "pos" field that is updated with 
positionIncrement so the formatting comes out right. The code to re-sort tokens 
when negative position increments occur is useless, as Lucene no longer allows 
negative position increments (from what I know). The whole JSP would use no 
caching lists of tokens, no iterators, no array copy, no copyTo(). It just 
builds a TokenStream and consumes it. The Tap filter can also be added around 
a generic (non-TokenizerChain) Lucene Analyzer. The main code would simply do 
"while (ts.incrementToken())" - nothing more. All printout is done in the 
filters added between each chain step (or after the generic Lucene Analyzer).
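
To make the idea concrete, here is a minimal sketch of the tap pattern in plain 
Java. It deliberately does not use Lucene's classes: State, Stream, 
ListTokenizer, KeywordMarker, ToyStemmer, Tap and analyze() are all 
hypothetical stand-ins invented for illustration, and the "stemmer" just strips 
a trailing "s". It only shows the shape of the approach: shared token state, 
one tap after each step, a local "pos" counter, and a main loop that does 
nothing but consume the stream.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Illustrative model only; these are NOT Lucene's TokenStream classes. */
final class TapSketch {

    /** Shared, mutable token state (a stand-in for the attribute source). */
    static final class State {
        String term;
        int positionIncrement = 1;
        boolean keyword; // plays the role of KeywordAttribute
    }

    /** Stand-in for a token stream: advance to the next token, or return false. */
    interface Stream {
        boolean incrementToken();
    }

    /** Hypothetical tokenizer that emits pre-split words. */
    static final class ListTokenizer implements Stream {
        private final State state;
        private final Iterator<String> words;
        ListTokenizer(State state, List<String> words) {
            this.state = state;
            this.words = words.iterator();
        }
        public boolean incrementToken() {
            if (!words.hasNext()) return false;
            state.term = words.next();
            state.positionIncrement = 1;
            state.keyword = false;
            return true;
        }
    }

    /** Marks protected words so later filters leave them alone. */
    static final class KeywordMarker implements Stream {
        private final Stream input;
        private final State state;
        private final List<String> protectedWords;
        KeywordMarker(Stream input, State state, List<String> protectedWords) {
            this.input = input;
            this.state = state;
            this.protectedWords = protectedWords;
        }
        public boolean incrementToken() {
            if (!input.incrementToken()) return false;
            state.keyword = protectedWords.contains(state.term);
            return true;
        }
    }

    /** Toy stemmer: strips one trailing "s" unless the token is a keyword. */
    static final class ToyStemmer implements Stream {
        private final Stream input;
        private final State state;
        ToyStemmer(Stream input, State state) {
            this.input = input;
            this.state = state;
        }
        public boolean incrementToken() {
            if (!input.incrementToken()) return false;
            if (!state.keyword && state.term.length() > 1 && state.term.endsWith("s")) {
                state.term = state.term.substring(0, state.term.length() - 1);
            }
            return true;
        }
    }

    /** The tap: pulls from its input, records the current state, passes it on. */
    static final class Tap implements Stream {
        private final Stream input;
        private final State state;
        private final String label;
        private final List<String> out;
        private int pos = 0; // local position, updated from positionIncrement
        Tap(Stream input, State state, String label, List<String> out) {
            this.input = input;
            this.state = state;
            this.label = label;
            this.out = out;
        }
        public boolean incrementToken() {
            if (!input.incrementToken()) return false;
            pos += state.positionIncrement;
            out.add(label + "[" + pos + "]=" + state.term);
            return true;
        }
    }

    /** Builds tokenizer -> tap -> filter -> tap -> ... and simply consumes it. */
    static List<String> analyze(List<String> words, List<String> protectedWords) {
        State state = new State();
        List<String> out = new ArrayList<>();
        Stream ts = new Tap(new ListTokenizer(state, words), state, "tokenizer", out);
        ts = new Tap(new KeywordMarker(ts, state, protectedWords), state, "keyword", out);
        ts = new Tap(new ToyStemmer(ts, state), state, "stem", out);
        while (ts.incrementToken()) {
            // nothing to do here: all recording happens in the taps
        }
        return out;
    }
}
```

Because the keyword flag lives in the shared state and is never copied into a 
separate token object, the toy stemmer still sees it, which is the point of the 
issue. Note that the taps fire per token, so the recorded lines are interleaved 
across stages; a real page would presumably group them by stage when rendering.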

> analysis.jsp is incorrect for protWords etc
> -------------------------------------------
>
>                 Key: SOLR-2051
>                 URL: https://issues.apache.org/jira/browse/SOLR-2051
>             Project: Solr
>          Issue Type: Bug
>          Components: web gui
>    Affects Versions: 3.1, 4.0
>            Reporter: Robert Muir
>         Attachments: SOLR-2051.patch, SOLR-2051.patch
>
>
> Analysis.jsp gives the incorrect results if you use "protwords.txt" or 
> "stemdict.txt" or the like.
> This is because this is now implemented with KeywordAttribute (so you can 
> easily override any stemmer etc).
> For example, if your schema had "foobars" in protwords.txt, analysis.jsp 
> would show it being stemmed to "foobar", even though this doesn't actually 
> happen.
> The problem is that this jsp is downconverting the entire tokenstream to 
> Token in between processing, so it silently discards KeywordAttribute and you 
> get the wrong result.
> Note: this issue isn't about *displaying* other attributes such as 
> KeywordAttribute (which would be a new feature). It's about not throwing them 
> away so that the analysis actually represents what happens.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


