[jira] [Commented] (SOLR-5722) Add catenateShingles option to WordDelimiterFilter

Greg Pendlebury (JIRA) Sun, 16 Feb 2014 12:38:30 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13902824#comment-13902824
 ]


Greg Pendlebury commented on SOLR-5722:
---------------------------------------

I don't think it does. It has been a while since we looked into it, and that 
link is currently returning 503 for me, but my understanding is that the 
HyphenatedWordsFilter joins two tokens back together when the first token ends 
with a hyphen. The catenateShingles option we are using addresses the scenario 
where multiple hyphens occur inside a single token.
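
To make the distinction concrete, here is a minimal sketch of the 
trailing-hyphen joining described above. It is plain Java and does not use the 
Lucene/Solr API; the class and method names are hypothetical, and the behaviour 
follows the understanding stated in this comment rather than the actual 
HyphenatedWordsFilter source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the real HyphenatedWordsFilter.
final class TrailingHyphenJoinSketch {

    // Joins each token that ends with '-' onto the token that follows it.
    // Tokens with hyphens only in the middle are passed through unchanged.
    static List<String> joinTrailingHyphens(List<String> tokens) {
        List<String> out = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        for (String token : tokens) {
            if (token.endsWith("-")) {
                // Drop the trailing hyphen and wait for the next token.
                pending.append(token, 0, token.length() - 1);
            } else {
                pending.append(token);
                out.add(pending.toString());
                pending.setLength(0);
            }
        }
        // A dangling fragment at the end of the stream is emitted as-is.
        if (pending.length() > 0) {
            out.add(pending.toString());
        }
        return out;
    }
}
```

Under these assumptions, the tokens "hyphen-" and "ated" are merged into 
"hyphenated", while a single token such as "hyphen-ated-surname" is left alone, 
which is the case catenateShingles is meant to cover.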

> Add catenateShingles option to WordDelimiterFilter
> --------------------------------------------------
>
>                 Key: SOLR-5722
>                 URL: https://issues.apache.org/jira/browse/SOLR-5722
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Greg Pendlebury
>            Priority: Minor
>              Labels: filter, newbie, patch
>         Attachments: WDFconcatShingles.patch
>
>
> Apologies if I put this in the wrong spot. I'm attaching a patch (against 
> current trunk) that adds support for a 'catenateShingles' option to the 
> WordDelimiterFilter. 
> We (National Library of Australia - NLA) are currently maintaining this as an 
> internal modification to the Filter, but I believe it is generic enough to 
> contribute upstream.
> Description:
> =========
> {code}
> /**
>  * NLA Modification to the standard word delimiter to support various
>  * hyphenation use cases. Primarily driven by requirements for
>  * newspapers where words are often broken across line endings.
>  *
>  *  e.g. "hyphenated-surname" is printed across a line ending and
>  *         turns out like "hyphen-ated-surname" or "hyphenated-sur-name".
>  *
>  *  In this scenario the stock filter, with 'catenateAll' turned on, will
>  *  generate individual tokens plus one combined token, but not
>  *  sub-tokens like "hyphenated surname" and "hyphenatedsur name".
>  *
>  *  So we add a new 'catenateShingles' to achieve this.
> */
> {code}
> Includes unit tests. As noted in one of them, CATENATE_WORDS and 
> CATENATE_SHINGLES are logically considered mutually exclusive for sensible 
> usage; enabling both can produce duplicate tokens (although they should have 
> the same positions etc).
> I'm happy to work on it more if anyone finds problems with it.
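
The sub-tokens mentioned in the description can be illustrated with a small 
standalone sketch. This is not the attached patch and does not touch the 
WordDelimiterFilter API. It assumes that a "shingle" is the concatenation of 
any contiguous run of two or more hyphen-separated parts, and that the run 
covering the whole token is left out because 'catenateAll' already produces 
it. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; the real patch may behave differently.
final class CatenateShinglesSketch {

    // Returns the concatenation of every contiguous run of two or more
    // hyphen-separated parts, excluding the run that spans the whole token.
    static List<String> shingles(String token) {
        String[] parts = token.split("-");
        List<String> out = new ArrayList<>();
        for (int start = 0; start < parts.length; start++) {
            StringBuilder run = new StringBuilder(parts[start]);
            for (int end = start + 1; end < parts.length; end++) {
                run.append(parts[end]);
                boolean wholeToken = (start == 0 && end == parts.length - 1);
                if (!wholeToken) {
                    out.add(run.toString());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shingles("hyphen-ated-surname"));
        System.out.println(shingles("hyphenated-sur-name"));
    }
}
```

With these assumptions, "hyphen-ated-surname" yields "hyphenated" and 
"atedsurname", and "hyphenated-sur-name" yields "hyphenatedsur" and "surname", 
so a query for either "hyphenated" or "surname" can match one of the broken 
forms.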



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

