[
https://issues.apache.org/jira/browse/SOLR-7981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14988595#comment-14988595
]
Jason Gerlowski commented on SOLR-7981:
---------------------------------------
Haha, funny; I've definitely been there.
I also don't have a strong opinion about adding this option. I didn't pick this
up because I wanted the feature in Solr; I just wanted to learn how to work on
Solr. And it's been a good first introduction, so "SUCCESS" on that front. If
there's a consensus that this is something people would like to have, I'm happy
to keep working on it. (Should I assign myself on this JIRA, or is that only for
committers?) If we *do* think this would be useful for people, I could use a
bit of clarification on what the desired behavior actually is. If not, should
I close this JIRA?
Questions about 'Desired' Behavior:
1.) Currently, analysis is only done on things that ValueSourceParser
identifies as being TextFields. Are numeric/date/other fields typically
analyzed? If so, do we want them to be analyzed here too? Even among fields
containing text, this doesn't cover as much as I'd expect. For example, I was
writing some tests for this stuff and tried to use a field like:
{{
<!-- A text field with mismatched analyzers for query/index; used for testing. -->
<fieldType name="text_different_analyzers" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query"> <!-- Whitespace only for query-analysis -->
    <tokenizer class="solr.MockTokenizerFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.MockTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
<field name="text_analysis_mismatch" type="text_different_analyzers" indexed="true" stored="true"/>
}}
but it turns out that it wasn't being analyzed by the current ValueSourceParser
code. Maybe this is just me being new to Solr, but I expected this to be
considered a "TextField" by the code.
2.) Do we care whether the input value gets analyzed into more than one token?
The initial bug description mentioned error handling for that case, but I
didn't see any special error handling for it in the default-to-query-analyzer
path that's already in the code.
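To make question 2 concrete, here is a rough sketch of the kind of check I
have in mind. It deliberately uses no Lucene or Solr classes: the class name
SingleTermCheck, the helper analyzeToSingleTerm, and the toy analyzer
(whitespace split plus lowercasing, passed in as a plain function) are all
made up for illustration and are not what the patch or ValueSourceParser
actually does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.stream.Collectors;

public class SingleTermCheck {

    // Hypothetical helper: run a stand-in "analyzer" over the input and
    // insist that it yields exactly one token, which becomes the term.
    public static String analyzeToSingleTerm(String input,
                                             Function<String, List<String>> analyzer) {
        List<String> tokens = analyzer.apply(input);
        if (tokens.size() != 1) {
            throw new IllegalArgumentException(
                "Expected exactly one term from '" + input + "' but got "
                    + tokens.size() + ": " + tokens);
        }
        return tokens.get(0);
    }

    // Toy analyzer (an assumption, not a real Solr analyzer chain):
    // split on whitespace, drop empty pieces, lowercase each token.
    public static List<String> toyAnalyzer(String input) {
        return Arrays.stream(input.trim().split("\\s+"))
            .filter(s -> !s.isEmpty())
            .map(s -> s.toLowerCase(Locale.ROOT))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // One token comes out, so it is accepted as the term.
        System.out.println(analyzeToSingleTerm("Bicycles", SingleTermCheck::toyAnalyzer));

        // Two tokens come out, so the input is rejected.
        try {
            analyzeToSingleTerm("red bicycles", SingleTermCheck::toyAnalyzer);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Whether rejecting the input like this is the right behavior, as opposed to
silently using the first token, is exactly the thing I'd like clarified.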
Thanks for any clarification anyone can give. Still getting used to the
process of working on these things.
> term based ValueSourceParsers should support an option to run an analyzer for
> the specified field on the input
> --------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-7981
> URL: https://issues.apache.org/jira/browse/SOLR-7981
> Project: Solr
> Issue Type: Improvement
> Reporter: Hoss Man
> Labels: newdev
> Attachments: SOLR-7981.patch
>
>
> The following functions all take exactly 2 arguments: a field name, and a
> term value...
> * idf
> * termfreq
> * tf
> * totaltermfreq
> ...we should consider adding an optional third argument to indicate if an
> analyzer for the specified field should be used on the input to find the real
> "Term" to consider.
> For example, the following might all result in equivalent numeric values for
> all docs assuming simple plural stemming and lowercasing...
> {noformat}
> termfreq(foo_t,'Bicycles',query) // use the query analyzer for field foo_t on input Bicycles
> termfreq(foo_t,'Bicycles',index) // use the index analyzer for field foo_t on input Bicycles
> termfreq(foo_t,'bicycle',none) // no analyzer used to construct Term
> termfreq(foo_t,'bicycle') // legacy 2 arg syntax, same as 'none'
> {noformat}
> (Special error checking needed if analyzer creates more than one term for the
> given input string)