[jira] [Updated] (LUCENE-4072) CharFilter that Unicode-normalizes input

David Goldfarb (JIRA) Wed, 12 Mar 2014 07:04:20 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


David Goldfarb updated LUCENE-4072:
-----------------------------------

    Attachment: 4072.patch

Attaching a new patch. All tests pass. 

I'm using Normalizer2.isInert to check whether we need to keep reading into 
the input buffer, since it doesn't return false positives, even though it's 
not as fast as .hasBoundaryBefore().
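
To illustrate why the filter has to keep buffering until it reaches a safe 
boundary, here is a minimal sketch. It is not code from the patch: it uses the 
JDK's java.text.Normalizer instead of ICU4J's Normalizer2, and the class and 
method names (ChunkBoundaryDemo, normalizeChunks) are made up for the example. 
It only shows that normalizing a buffer that ends between a base character and 
its combining mark gives a different result from normalizing the whole input.

```java
import java.text.Normalizer;

public class ChunkBoundaryDemo {

    // Hypothetical helper: NFC-normalizes each chunk on its own and
    // concatenates the results, the way a buffered filter would if it
    // flushed at arbitrary positions.
    static String normalizeChunks(String... chunks) {
        StringBuilder out = new StringBuilder();
        for (String chunk : chunks) {
            out.append(Normalizer.normalize(chunk, Normalizer.Form.NFC));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 'e' followed by a combining acute accent (U+0301), then 's'.
        String input = "cafe\u0301s";

        // Reference result: normalize the whole input at once.
        String whole = Normalizer.normalize(input, Normalizer.Form.NFC);

        // Buffer ends between 'e' and its accent: the two can no longer compose.
        String badSplit = normalizeChunks("cafe", "\u0301s");

        // Buffer ends before 'e': the base character and accent stay together.
        String goodSplit = normalizeChunks("caf", "e\u0301s");

        System.out.println("split inside sequence matches: " + whole.equals(badSplit));
        System.out.println("split at boundary matches:     " + whole.equals(goodSplit));
    }
}
```

A boundary check that never wrongly reports a position as safe avoids the 
first kind of split, at the cost of sometimes reading further than strictly 
necessary.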

> CharFilter that Unicode-normalizes input
> ----------------------------------------
>
>                 Key: LUCENE-4072
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4072
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Ippei UKAI
>         Attachments: 4072.patch, 4072.patch, DebugCode.txt, 
> LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch, LUCENE-4072.patch, 
> LUCENE-4072.patch, LUCENE-4072.patch, 
> ippeiukai-ICUNormalizer2CharFilter-4752cad.zip
>
>
> I'd like to contribute a CharFilter that Unicode-normalizes input with ICU4J.
> The benefit of doing this in a CharFilter is that the tokenizer can work on 
> normalised text, while offset correction ensures that the fast vector 
> highlighter and other offset-dependent features do not break.
> The implementation is available at the following repository:
> https://github.com/ippeiukai/ICUNormalizer2CharFilter
> Unfortunately, this is my unpaid side project and I cannot spend much time 
> merging my work into Lucene to make an appropriate patch. I'd appreciate it 
> if anyone could give it a go. I'm happy to relicense it under whatever 
> license meets your needs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
