[jira] [Resolved] (LUCENE-4216) Token X exceeds length of provided text sized X

Robert Muir (JIRA) Sat, 04 Aug 2012 03:36:07 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Robert Muir resolved LUCENE-4216.
---------------------------------

    Resolution: Not A Problem

The bugs are in your custom tokenizer. I would recommend looking at 
lucene-test-framework.jar (especially BaseTokenStreamTestCase) and writing some 
tests for it.

Problems I see at a glance:
* it doesn't implement reset(), so it's not safe at all. This is the main reason 
it doesn't work for you in 4.0, because analysis reuse is mandatory there and 
it doesn't reset its state.
* it doesn't implement end(), so multi-valued fields won't work
* it doesn't call correctOffset(), so charfilters won't work
* it removes tashkeel in the tokenizer itself without adjusting offsets; that's 
unsafe.
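
To illustrate the first point, here is a toy model in plain Java. It is not 
Lucene code: ToyTokenizer, endOffsets and the clearsState flag are hypothetical 
names, and the only assumption is that the tokenizer keeps its running offset 
in a field. It sketches how state that is never cleared can push offsets past 
the end of the next text once the same instance is reused, which is the kind of 
condition the highlighter complains about.

```java
import java.util.ArrayList;
import java.util.List;

public class ReuseDemo {

    /** Toy whitespace tokenizer (hypothetical, not a Lucene class). */
    static class ToyTokenizer {
        private final boolean clearsState; // whether reset() actually clears state
        private int offset = 0;            // running position, kept as a field

        ToyTokenizer(boolean clearsState) {
            this.clearsState = clearsState;
        }

        /** Plays the role of reset(): called before each new text. */
        void reset() {
            if (clearsState) {
                offset = 0;
            }
        }

        /** Returns the end offset of each whitespace-separated token. */
        List<Integer> endOffsets(String text) {
            reset();
            List<Integer> ends = new ArrayList<>();
            boolean inToken = false;
            for (int i = 0; i < text.length(); i++) {
                boolean tokenChar = !Character.isWhitespace(text.charAt(i));
                if (!tokenChar && inToken) {
                    ends.add(offset); // token ended just before this character
                }
                inToken = tokenChar;
                offset++;
            }
            if (inToken) {
                ends.add(offset); // last token runs to the end of the text
            }
            return ends;
        }
    }

    public static void main(String[] args) {
        ToyTokenizer broken = new ToyTokenizer(false);
        ToyTokenizer fixed = new ToyTokenizer(true);
        // The same instances are reused for both texts, as an analyzer would do.
        for (String text : new String[] {"ab cd", "xy"}) {
            System.out.println("length=" + text.length()
                + " broken=" + broken.endOffsets(text)
                + " fixed=" + fixed.endOffsets(text));
        }
    }
}
```

With the second text ("xy", length 2), the instance that never clears its state 
reports an end offset of 7, while the one that resets reports 2.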

Really you can fix this easily, by:
1. instead of extending Tokenizer, extend CharTokenizer and implement 
isTokenChar via isArabicChar. Or just use StandardTokenizer; it tokenizes 
Arabic just fine.
2. instead of removing tashkeel in your tokenizer itself with your pattern 
([\u0650\u064D\u064E\u064B\u064F\u064C\u0652\u0651]), just pass that pattern to 
PatternReplaceFilter.
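
The idea behind these two steps can be sketched in plain Java without any 
Lucene classes. Everything here is illustrative: Token, tokenize and this 
isArabicChar (approximated by the Unicode Arabic block) are hypothetical 
stand-ins, not the reporter's code or the Lucene API. The sketch splits on a 
per-character predicate, records offsets against the original text, and only 
then strips tashkeel from the term text, so the offsets still line up with 
what the highlighter is given.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class TashkeelDemo {

    // The tashkeel pattern quoted in the message above.
    static final Pattern TASHKEEL =
        Pattern.compile("[\u0650\u064D\u064E\u064B\u064F\u064C\u0652\u0651]");

    /** Hypothetical token: term text plus offsets into the ORIGINAL text. */
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Assumed stand-in for isArabicChar: any character in the Arabic block. */
    static boolean isArabicChar(int c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.ARABIC;
    }

    /** Splits on non-Arabic characters; strips tashkeel from terms only. */
    static List<Token> tokenize(String text) {
        List<Token> out = new ArrayList<>();
        int start = -1; // -1 means "not inside a token"
        for (int i = 0; i <= text.length(); i++) {
            boolean tokenChar = i < text.length() && isArabicChar(text.charAt(i));
            if (tokenChar && start < 0) {
                start = i;
            }
            if (!tokenChar && start >= 0) {
                String raw = text.substring(start, i);
                // Offsets stay untouched; only the term text is normalized.
                out.add(new Token(TASHKEEL.matcher(raw).replaceAll(""), start, i));
                start = -1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // One vocalized word followed by the same word without tashkeel.
        String text = "\u0643\u064E\u062A\u064E\u0628\u064E \u0643\u062A\u0628";
        for (Token t : tokenize(text)) {
            System.out.println(t.term + " [" + t.start + "," + t.end + ") of " + text.length());
        }
    }
}
```

Both tokens end up with the same term text, and every end offset stays within 
the length of the original string.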

                
> Token X exceeds length of provided text sized X
> -----------------------------------------------
>
>                 Key: LUCENE-4216
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4216
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/highlighter
>    Affects Versions: 4.0-ALPHA
>         Environment: Windows 7, jdk1.6.0_27
>            Reporter: Ibrahim
>         Attachments: myApp.zip
>
>
> I'm facing this exception:
> org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token رأيكم 
> exceeds length of provided text sized 170
>       at 
> org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:233)
>       at classes.myApp$16$1.run(myApp.java:1508)
> I could not find anything wrong in my code when I started migrating from 
> Lucene 3.6 to 4.0. I found similar issues with HTMLStripCharFilter, e.g. 
> LUCENE-3690 and LUCENE-2208, but none with SimpleHTMLFormatter, so I'm 
> raising this here to see whether there is really a bug or something is wrong 
> in my code with v4. The code that I'm using:
> final Highlighter highlighter = new Highlighter(new 
> SimpleHTMLFormatter("<font color=red>", "</font>"), new QueryScorer(query));
> .......
> final TokenStream tokenStream = 
> TokenSources.getAnyTokenStream(defaultSearcher.getIndexReader(), j, "Line", 
> analyzer);
> final TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, 
> doc.get("Line"), false, 10);
> Please note that this works fine with v3.6.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


