[ https://issues.apache.org/jira/browse/SOLR-4864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sunil Srinivasan updated SOLR-4864:
-----------------------------------
Attachment: SOLR-4864.patch
Here is a patch for the functionality that Hoss/Jack wanted. Please review.
I've added the additional tests to testRegexReplace. Please let me know if
they should be split out into separate tests.
> RegexReplaceProcessorFactory should support pattern capture group
> substitution in replacement string
> ----------------------------------------------------------------------------------------------------
>
> Key: SOLR-4864
> URL: https://issues.apache.org/jira/browse/SOLR-4864
> Project: Solr
> Issue Type: Improvement
> Components: update
> Affects Versions: 4.3
> Reporter: Jack Krupansky
> Attachments: SOLR-4864.patch
>
>
> It is unfortunate that the replacement string for RegexReplaceProcessorFactory
> is a pure, "quoted" (escaped) literal and does not support pattern capture
> group substitution. This processor should be enhanced to support full,
> standard pattern capture group substitution.
> The test case I used:
> {code}
> <updateRequestProcessorChain name="regex-mark-special-words">
>   <processor class="solr.RegexReplaceProcessorFactory">
>     <str name="fieldRegex">.*</str>
>     <str name="pattern">([^a-zA-Z]|^)(cat|dog|fox)([^a-zA-Z]|$)</str>
>     <str name="replacement">$1<<$2>>$3</str>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
> {code}
> Indexing with this command against the standard Solr example with the above
> addition to solrconfig:
> {code}
> curl "http://localhost:8983/solr/update?commit=true&update.chain=regex-mark-special-words" \
> -H 'Content-type:application/json' -d '
> [{"id": "doc-1",
> "title": "Hello World",
> "content": "The cat and the dog jumped over the fox.",
> "other_ss": ["cat","cat bird", "lazy dog", "red fox den"]}]'
> {code}
> Alas, the resulting document consists of:
> {code}
> "id":"doc-1",
> "title":["Hello World"],
> "content":["The$1<<$2>>$3and the$1<<$2>>$3jumped over the$1<<$2>>$3"],
> "other_ss":["$1<<$2>>$3",
> "$1<<$2>>$3bird",
> "lazy$1<<$2>>$3",
> "red$1<<$2>>$3den"],
> {code}
> The Javadoc for RegexReplaceProcessorFactory uses the exact same terminology
> of "replacement string", as does Java's Matcher.replaceAll, but clearly the
> semantics are distinct, with replaceAll supporting pattern capture group
> substitution for its "replacement string", while RegexReplaceProcessorFactory
> interprets "replacement string" as being a literal. At a minimum, the
> RegexReplaceProcessorFactory Javadoc should explicitly state that the string
> is a literal that does not support pattern capture group substitution.
> The relevant code in RegexReplaceProcessorFactory#init:
> {code}
> replacement = Matcher.quoteReplacement(replacementParam.toString());
> {code}
> Possible options for the enhancement:
> 1. Simply skip the quoteReplacement and fully support pattern capture group
> substitution with no additional changes. This has a minor backcompat issue:
> existing replacement strings containing a literal "$" or "\" would change
> meaning.
> 2. Add an alternative to "replacement", say "nonQuotedReplacement" that is
> not quoted as "replacement" is.
> 3. Add an option, say "quotedReplacement" that defaults to "true" for
> backcompat, but can be set to "false" to support full replaceAll pattern
> capture group substitution.
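
For reference, the difference between the quoted and unquoted replacement can be reproduced outside Solr with plain java.util.regex. The following is only a minimal sketch, not the code in the attached patch: the class name ReplacementDemo, the method replace and the literalReplacement flag are hypothetical names chosen for illustration (the flag loosely mirrors the "quotedReplacement" idea in option 3). The pattern, replacement string and input are taken from the test case in the issue description.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Solr or of the attached patch.
class ReplacementDemo {
    // Pattern and replacement copied from the test case in the issue.
    static final Pattern PATTERN =
        Pattern.compile("([^a-zA-Z]|^)(cat|dog|fox)([^a-zA-Z]|$)");
    static final String REPLACEMENT = "$1<<$2>>$3";

    // literalReplacement is an assumed flag for illustration only:
    // true  -> quote the replacement, as RegexReplaceProcessorFactory#init does today
    // false -> pass it through so $1, $2, $3 refer to capture groups
    static String replace(String input, boolean literalReplacement) {
        String replacement = literalReplacement
            ? Matcher.quoteReplacement(REPLACEMENT)
            : REPLACEMENT;
        return PATTERN.matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String content = "The cat and the dog jumped over the fox.";
        // Quoted: prints "The$1<<$2>>$3and the$1<<$2>>$3jumped over the$1<<$2>>$3"
        System.out.println(replace(content, true));
        // Unquoted: prints "The <<cat>> and the <<dog>> jumped over the <<fox>>."
        System.out.println(replace(content, false));
    }
}
```

The quoted call reproduces the "content" value shown in the issue; the unquoted call shows the output the reporter was presumably expecting.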