[ https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-4072:
--------------------------------
Attachment: LUCENE-4072.patch
Whew, thank you!
I did some minor cleanup: I toned down the very slow tests I had added
(added a multiplier, so they will do more work in Jenkins), added
testMassiveLigature (just to test the case where normalization increases the
length), and removed the stuff around reset(): since mark isn't supported, the
default UOE (UnsupportedOperationException) is the right thing.
I'll commit shortly.
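To illustrate the "normalization increases the length" case mentioned above, here is a minimal sketch. It uses the JDK's java.text.Normalizer rather than ICU4J so that it runs without extra dependencies, and the class name NormalizationGrowth is hypothetical. It is not the code of testMassiveLigature, only an example of the kind of input such a test would exercise.

```java
import java.text.Normalizer;

public class NormalizationGrowth {
    /** Applies NFKC with the JDK normalizer (a stand-in for ICU4J in this sketch). */
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FB03 LATIN SMALL LIGATURE FFI: one char in, three chars ("ffi") out.
        String ffi = "\uFB03";
        // U+FDFA ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM: one char in,
        // eighteen chars out under compatibility normalization.
        String arabicLigature = "\uFDFA";

        System.out.println(ffi.length() + " -> " + nfkc(ffi).length());
        System.out.println(arabicLigature.length() + " -> " + nfkc(arabicLigature).length());
    }
}
```

Because a single input char can expand this much, a buffering filter cannot assume that its output fits in a buffer the size of its input.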
> CharFilter that Unicode-normalizes input
> ----------------------------------------
>
> Key: LUCENE-4072
> URL: https://issues.apache.org/jira/browse/LUCENE-4072
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Ippei UKAI
> Attachments: 4072.patch, 4072.patch, DebugCode.txt,
> LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch,
> LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch,
> ippeiukai-ICUNormalizer2CharFilter-4752cad.zip
>
>
> I'd like to contribute a CharFilter that Unicode-normalizes input with ICU4J.
> The benefit of doing this in a CharFilter is that the tokenizer can work
> on normalized text, while offset correction ensures that the fast vector
> highlighter and other offset-dependent features do not break.
> The implementation is available at the following repository:
> https://github.com/ippeiukai/ICUNormalizer2CharFilter
> Unfortunately, this is my unpaid side project and I cannot spend much time
> merging my work into Lucene to make an appropriate patch. I'd appreciate it if
> anyone could give it a go. I'm happy to relicense it under whatever terms meet
> your needs.
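The offset correction described in the issue can be sketched as follows. This is a deliberately simplified illustration, not the patch's implementation: it normalizes one code point at a time with the JDK's java.text.Normalizer (so it ignores composition across code points, which a real normalizing filter must handle), and the class name OffsetCorrectionSketch is hypothetical. The method name correctOffset is borrowed from Lucene's CharFilter only to show the idea of mapping an offset in the normalized text back to the original text.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

public class OffsetCorrectionSketch {
    private final StringBuilder normalized = new StringBuilder();
    // originalOffsets.get(i) is the offset in the original text of the
    // code point that produced char i of the normalized text.
    private final List<Integer> originalOffsets = new ArrayList<>();

    OffsetCorrectionSketch(String input) {
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);
            // Simplification: normalize each code point in isolation.
            String piece = Normalizer.normalize(
                new String(Character.toChars(cp)), Normalizer.Form.NFKC);
            for (int k = 0; k < piece.length(); k++) {
                originalOffsets.add(i);
            }
            normalized.append(piece);
            i += Character.charCount(cp);
        }
        // Sentinel so the end offset of the last token can be mapped too.
        originalOffsets.add(input.length());
    }

    String normalizedText() {
        return normalized.toString();
    }

    /** Maps an offset in the normalized text back to the original text. */
    int correctOffset(int normalizedOffset) {
        return originalOffsets.get(normalizedOffset);
    }

    public static void main(String[] args) {
        OffsetCorrectionSketch s = new OffsetCorrectionSketch("a\uFB03b");
        System.out.println(s.normalizedText());   // affib
        System.out.println(s.correctOffset(4));   // 2: 'b' sits at offset 2 in the original
    }
}
```

With such a mapping, a token found at offsets [4, 5) in the normalized text is reported as [2, 3) in the original, which is what keeps highlighters pointing at the right characters.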