[jira] [Updated] (LUCENE-4072) CharFilter that Unicode-normalizes input

Robert Muir (JIRA) Wed, 19 Mar 2014 17:12:07 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Robert Muir updated LUCENE-4072:
--------------------------------

    Attachment: LUCENE-4072.patch

Whew, thank you!

I did some minor cleanup: I toned down the tests I had added that were very 
slow (added a multiplier, so they will do more work in Jenkins), added 
testMassiveLigature (just to test the case where normalization increases the 
length), and removed the stuff around reset()... since mark isn't supported, 
the default UOE is the right thing.
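
For illustration only, here is a minimal sketch of the "normalization increases the length" case. It is not the test from the patch: it uses the JDK's java.text.Normalizer instead of ICU4J, and the class and method names (LigatureGrowth, nfkc) are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical demo class; not part of Lucene or of the attached patch.
final class LigatureGrowth {

    // Apply NFKC using the JDK normalizer (an assumption: the patch itself uses ICU4J).
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String ligature = "\uFB01";          // LATIN SMALL LIGATURE FI, a single char
        String normalized = nfkc(ligature);  // compatibility mapping expands it to "fi"
        // The normalized text is longer than the input, so a char filter
        // cannot assume that its output fits in the space of its input.
        System.out.println(ligature.length() + " -> " + normalized.length()); // prints "1 -> 2"
    }
}
```

Because the output can be longer than the input, any buffering in such a filter presumably has to handle output that outgrows the chunk it was read from.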

I'll commit shortly.

> CharFilter that Unicode-normalizes input
> ----------------------------------------
>
>                 Key: LUCENE-4072
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4072
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Ippei UKAI
>         Attachments: 4072.patch, 4072.patch, DebugCode.txt, 
> LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch, 
> LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch, 
> ippeiukai-ICUNormalizer2CharFilter-4752cad.zip
>
>
> I'd like to contribute a CharFilter that Unicode-normalizes input with ICU4J.
> The benefit of doing this in a CharFilter is that the tokenizer can work 
> on normalised text, while offset correction ensures that the fast vector 
> highlighter and other offset-dependent features do not break.
> The implementation is available at the following repository:
> https://github.com/ippeiukai/ICUNormalizer2CharFilter
> Unfortunately, this is my unpaid side project and I cannot spend much time 
> merging my work into Lucene to make an appropriate patch. I'd appreciate it 
> if anyone could give it a go. I'm happy to relicense it under whatever 
> terms meet your needs.
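
The offset-correction idea in the description above can be sketched as follows. This is a simplified illustration under stated assumptions, not the contributed implementation: it uses the JDK's java.text.Normalizer rather than ICU4J, it normalizes one code point at a time (so combining sequences that span several code points are not composed), and OffsetMapSketch and its members are hypothetical names. Only the name correctOffset is borrowed loosely from the way Lucene char filters map offsets back to the original text.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: normalizes text and remembers, for every output char,
// the offset of the input char it came from.
final class OffsetMapSketch {
    final String output;
    private final int[] inputOffsets; // inputOffsets[i] = input offset for output offset i

    OffsetMapSketch(String input) {
        StringBuilder out = new StringBuilder();
        List<Integer> offsets = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);
            // Simplification: normalize each code point on its own.
            String piece = Normalizer.normalize(
                    new String(Character.toChars(cp)), Normalizer.Form.NFKC);
            for (int k = 0; k < piece.length(); k++) {
                offsets.add(i); // every char of the expansion points at its source
            }
            out.append(piece);
            i += Character.charCount(cp);
        }
        offsets.add(input.length()); // end-of-text maps to end-of-input
        this.output = out.toString();
        this.inputOffsets = offsets.stream().mapToInt(Integer::intValue).toArray();
    }

    // Map an offset in the normalized text back to an offset in the original text.
    int correctOffset(int outputOffset) {
        return inputOffsets[outputOffset];
    }

    public static void main(String[] args) {
        OffsetMapSketch m = new OffsetMapSketch("a\uFB01b"); // 'a', fi ligature, 'b'
        System.out.println(m.output);                        // prints "afib"
        // A token "fi" found at [1, 3) in the normalized text...
        System.out.println(m.correctOffset(1) + ", " + m.correctOffset(3)); // prints "1, 2"
        // ...maps back to [1, 2) in the original, which is exactly the ligature.
    }
}
```

With a mapping like this, a highlighter that receives token offsets from the tokenizer can still mark the right span of the original, un-normalized text.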


