[ https://issues.apache.org/jira/browse/SOLR-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17743596#comment-17743596 ]
Noble Paul edited comment on SOLR-16894 at 7/17/23 10:46 AM:
-------------------------------------------------------------

Instead of blindly adding ResourceLoader support, maybe we should discuss the lifecycle of the component and see what the best way to support the use case is.

BTW, is there working code for this component that I can take a peek at?


was (Author: noble.paul):
Instead of blindly adding ResourceLoader support, maybe we should discuss the lifecycle of the component and see what is the best way to support the usecase.

BYW, is there a working code for this component where I can take a peek?

> Configurable doc freq: Allow StatsCache instances to be ResourceLoaderAware
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-16894
>                 URL: https://issues.apache.org/jira/browse/SOLR-16894
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Doug Turnbull
>            Priority: Major
>
> I had been working on a plugin to allow the document frequency stats to be
> controlled by the user. This has precedent in other search engines where
> another corpus is more representative of a term's true document frequency /
> significance. Specifically, [Vespa lets you pass significance at query
> time|#significance]. This applies not just to how doc freq is represented,
> but to the entire set of stats (total term freq, etc.).
> This is a common pain point in test corpora where you have a smaller sample
> of the documents than the global corpus. It was a frequent bugbear at
> Shopify, and now at my current employer, for doing relevance testing. It's
> also a problem whenever you have a corpus that may include some "outliers"
> that actually aren't outliers in the sense of how your users perceive your
> corpus. For example, "headache" may not be the jargon used in a medical
> textbook, so it is rare there only by happenstance. Yet searchers still
> perceive it as a not very significant term.
> I had made some progress ([here|http://example.com/]). However, I noticed
> that only certain types of classes can be ResourceLoaderAware in order to
> read configuration. Specifically, I see this error when running my tests:
>
> {code:java}
> ./gradlew --stacktrace --info test
> {code}
> {code:java}
> org.apache.solr.common.SolrException: Invalid 'Aware' object:
> manual.idf.stats.ManagedStatsCache@5c19c030 --
> org.apache.lucene.util.ResourceLoaderAware must be an instance of:
> [org.apache.lucene.analysis.CharFilterFactory]
> [org.apache.lucene.analysis.TokenFilterFactory]
> [org.apache.lucene.analysis.TokenizerFactory]
> [org.apache.solr.search.QParserPlugin]
> [org.apache.solr.schema.FieldType]{code}
>
>
> Can I propose we add StatsCache to the list of allowed
> ResourceLoaderAware objects?
> Some alternatives I've thought about:
> * I could probably do some ugly hacks to work around this, but I'd rather do
> the "right thing".
> * I'd prefer not to create a separate fieldtype that changes how the stats
> are managed. For one, in my specific case, I don't want to have a
> radically different test config compared to my setup. This is still "text"
> with text-like configurability.
> ** Second, I like the ability of the stats cache to "fall back" to an
> internal stat if one is missing.
> * Pass at query time - this is a more radical change, similar to what it
> would take to make BM25 params configurable at query time.
> * It's possible I could create a Similarity to change doc freq; however, it
> too would apparently not be ResourceLoaderAware.
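
To make the "fall back to an internal stat if one is missing" idea from the description concrete, here is a minimal plain-Java sketch. It is not Solr's StatsCache API and not the reporter's plugin: the class name FallbackDocFreqSketch, its constructor, and the lambda standing in for the index's own document frequency are all hypothetical, chosen only for illustration. The idf helper uses the common BM25-style formula ln(1 + (N - n + 0.5) / (n + 0.5)) to show how an overridden doc freq changes a term's weight.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

/**
 * Hypothetical sketch (not a Solr class): look up a user-managed document
 * frequency first, and fall back to the index's own value when none is set.
 */
final class FallbackDocFreqSketch {
    // User-supplied doc freqs, e.g. loaded from a config resource (assumption).
    private final Map<String, Long> overrides;
    // Stand-in for whatever reports the index's internal doc freq (assumption).
    private final ToLongFunction<String> indexDocFreq;

    FallbackDocFreqSketch(Map<String, Long> overrides, ToLongFunction<String> indexDocFreq) {
        this.overrides = new HashMap<>(overrides);
        this.indexDocFreq = indexDocFreq;
    }

    /** Returns the managed doc freq if present, otherwise the index's value. */
    long docFreq(String term) {
        Long managed = overrides.get(term);
        return managed != null ? managed : indexDocFreq.applyAsLong(term);
    }

    /** BM25-style idf: ln(1 + (N - n + 0.5) / (n + 0.5)). */
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        // "headache" is rare in this (made-up) test index, but the managed
        // value treats it as a common term.
        FallbackDocFreqSketch stats =
            new FallbackDocFreqSketch(Map.of("headache", 5000L), term -> 3L);
        long docCount = 10_000L;
        for (String term : new String[] {"headache", "migraine"}) {
            long df = stats.docFreq(term);
            System.out.println(term + " df=" + df + " idf=" + idf(df, docCount));
        }
    }
}
```

With the override in place, "headache" gets a lower idf than a term that falls through to the index's small doc freq, which is the effect the description is after. How such a map would actually be loaded (ResourceLoaderAware or otherwise) is exactly the open question in this issue and is left out of the sketch.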