[jira] [Comment Edited] (SOLR-4619) Improve PreAnalyzedField query analysis

Steve Rowe (JIRA) Wed, 13 Jan 2016 17:36:07 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-4619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15097417#comment-15097417
 ]


Steve Rowe edited comment on SOLR-4619 at 1/14/16 1:34 AM:
-----------------------------------------------------------

Patch that brings Andrzej's patch up to date with trunk, and adds tests for 
query-time functionality.

I had assumed that {{PreAnalyzedField}}-s would use the 
{{PreAnalyzedTokenizer}} at query time, but that is not (currently) the case: 
{{FieldType.DefaultAnalyzer}} is used instead.  When no analyzer is specified, 
this patch changes that behavior to use {{PreAnalyzedTokenizer}}.

However, there is a chicken-and-egg interaction between 
{{PreAnalyzedTokenizer}} and {{QueryBuilder.createFieldQuery()}}, which aborts 
before performing any tokenization if the supplied analyzer's attribute factory 
doesn't contain a {{TermToBytesRefAttribute}}.  But {{PreAnalyzedTokenizer}} 
doesn't have any attributes defined until the input stream is consumed, in 
{{reset()}}. [~rcmuir] added a comment as part of LUCENE-5388 to 
{{PreAnalyzedTokenizer}}'s ctor, where 
{{AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY}} is set as the attribute factory 
rather than the default packed implementation: "we don't pack attributes: since 
we are used for (de)serialization and dont want bloat."

This patch moves the {{stream.reset()}} call in 
{{QueryBuilder.createFieldQuery()}} in front of the {{TermToBytesRefAttribute}} 
check, so that {{PreAnalyzedTokenizer}} (and other tokenizers that don't have a 
pre-added set of attributes) has a chance to populate its attributes, and also 
moves the {{addAttribute(PositionIncrementAttribute.class)}} call to after the 
{{TermToBytesRefAttribute}} check, since that won't be needed if no 
tokenization will be performed.
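
To illustrate the ordering issue described above, here is a minimal, 
self-contained sketch.  It is not Lucene code: {{LazyStream}}, the string 
attribute names "term" and "posInc", and both method names are hypothetical 
stand-ins for {{PreAnalyzedTokenizer}}, {{TermToBytesRefAttribute}}, 
{{PositionIncrementAttribute}}, and the two orderings inside 
{{QueryBuilder.createFieldQuery()}}.  It only assumes what the comment states: 
the stream declares no attributes until {{reset()}} consumes the input.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the check/reset ordering; all names here are hypothetical. */
class AttributeCheckOrderSketch {

    /** Stand-in for a stream that only learns its attributes in reset(). */
    static class LazyStream {
        private final Set<String> attributes = new HashSet<>();
        private final String input;
        private List<String> tokens = Collections.emptyList();

        LazyStream(String input) {
            this.input = input;
        }

        boolean hasAttribute(String name) {
            return attributes.contains(name);
        }

        void addAttribute(String name) {
            attributes.add(name);
        }

        /** Consumes the input; only now does the "term" attribute exist. */
        void reset() {
            String trimmed = input.trim();
            tokens = trimmed.isEmpty()
                    ? Collections.<String>emptyList()
                    : Arrays.asList(trimmed.split("\\s+"));
            attributes.add("term");
        }

        List<String> tokens() {
            return tokens;
        }
    }

    /** Old order: the attribute check runs before reset(), so it aborts. */
    static List<String> checkThenReset(LazyStream stream) {
        stream.addAttribute("posInc"); // added even if we bail out below
        if (!stream.hasAttribute("term")) {
            return null; // nothing is tokenized
        }
        stream.reset();
        return stream.tokens();
    }

    /** Patched order: reset() first, then the check, then "posInc". */
    static List<String> resetThenCheck(LazyStream stream) {
        stream.reset();
        if (!stream.hasAttribute("term")) {
            return null;
        }
        stream.addAttribute("posInc"); // only added when tokens will be read
        return stream.tokens();
    }

    public static void main(String[] args) {
        System.out.println(checkThenReset(new LazyStream("quick brown fox")));
        System.out.println(resetThenCheck(new LazyStream("quick brown fox")));
    }
}
```

In this toy model the first call returns null because the check runs before 
any attribute exists, while the second returns the three tokens.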

An alternative approach to fixing the chicken-and-egg problem might be to have 
{{PreAnalyzedTokenizer}} always include a dummy {{TermToBytesRefAttribute}} 
implementation, and then remove it when {{reset()}} is called, but that seems 
hackish.

I haven't run the full test suite with this patch yet, but the included 
query-time {{PreAnalyzedField}} tests succeed.

I welcome feedback.


was (Author: steve_rowe):
Patch that brings Andrzej's patch up to date with trunk, and adds tests for 
query-time functionality.

I had assumed that {{PreAnalyzedField}}-s would use the 
{{PreAnalyzedTokenizer}} at query time, but that is not (currently) the case: 
instead {{FieldType.DefaultAnalyzer}} is used.  This patch changes the behavior 
when no analyzer is specified to instead use {{PreAnalyzedTokenizer}}.

However, there is a chicken-and-egg interaction between 
{{PreAnalyzedTokenizer}} and {{QueryBuilder.createFieldQuery()}}, which aborts 
before performing any tokenization if the supplied analyzer's attribute factory 
doesn't contain a {{TermToBytesRefAttribute}}.  But {{PreAnalyzedTokenizer}} 
doesn't have any attributes defined until the input stream is consumed, in 
{{reset()}}. [~rcmuir] added a comment as part of LUCENE-5388 to 
{{PreAnalyzedTokenizer}}'s ctor, where 
{{AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY}} is set as the attribute factory 
rather than the default packed implementation: "we don't pack attributes: since 
we are used for (de)serialization and dont want bloat."

This patch moves the {{stream.reset()}} call in 
{{QueryBuilder.createFieldQuery()}} in front of the {{TermToBytesRefAttribute}} 
check, so that {{PreAnalyzedTokenizer}} (and other tokenizers that don't have a 
pre-added set of attributes) and also moves the 
{{addAttribute(PositionIncrementAttribute.class)}} call to after the the 
{{TermToBytesRefAttribute}} check.

An alternate approach to fix the chicken-and-egg problem might be to have 
{{PreAnalyzedTokenizer}} always include a dummy {{TermToBytesRefAttribute}} 
implementation, and then remove it when {{reset()}} is called, but that seems 
hackish.

I haven't run the full tests yet with this patch, but the included query-time 
{{PreAnalyzedField}} tests success.

I welcome feedback.

> Improve PreAnalyzedField query analysis
> ---------------------------------------
>
>                 Key: SOLR-4619
>                 URL: https://issues.apache.org/jira/browse/SOLR-4619
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 4.0, 4.1, 4.2, 4.2.1, Trunk
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>             Fix For: Trunk
>
>         Attachments: SOLR-4619.patch, SOLR-4619.patch
>
>
> PreAnalyzed field extends plain FieldType and mistakenly uses the 
> DefaultAnalyzer as query analyzer, and doesn't allow for customization via 
> <analyzer> schema elements.
> Instead it should extend TextField and support all query analysis supported 
> by that type.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
