[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069504#comment-13069504
 ] 

Uwe Schindler commented on LUCENE-2309:
---------------------------------------

bq. I think Robert has stated here that he's comfortable continuing to use 
TokenStream as the API for IW to get the terms it indexes, is that what others 
feel too? I agree the inverted API I proposed is a little convoluted and I'm 
sure we can come up with a simple Consumable-like abstraction (which Robert did 
also suggest above). But if people are content with TokenStream then there's no 
need.

I feel the same. The API of TokenStream is so stupid-simple, why replace it with 
another push-like API that is neither simpler nor more complicated, just 
different? I see no reason for this. IW should simply request a TokenStream from 
the field and consume it.

{quote}
Likewise, for multi-valued fields, IW shouldn't "see" the separate
values; it should just receive a single token stream, and under the
hood (in Document/Field impl) it's concatenating separate token
streams, adding posIncr/offset gaps, etc. This too is now hardwired
in indexer but shouldn't be. Maybe an app wants to insert custom
"separator" tokens between the values...
{quote}

I agree with that, too. There is one problem with this: concatenating 
TokenStreams is not easy, as they have different attribute instances. IW, 
having fetched all attributes at the start, would then somehow have to switch 
to different attribute instances in the middle of the TS.

To implement this fast (without wrapping and copying), we would need some 
notification telling the consumer of a TokenStream to "request" the 
attribute instances again, but this is a "bad" idea. For me the only simple 
solution to this problem is to make the Field return an iterator of 
TokenStreams; IW consumes them one after another, calling addAttribute 
before each separate instance.

About the PosIncr gap: The field can change the final offsets/posIncr in end() 
before handing over to a new TokenStream. IW would only consume TokenStreams 
one by one.
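
To make the idea concrete, here is a minimal, self-contained sketch of that 
consumption loop. It deliberately does not use Lucene's classes: TermAttr, 
ValueStream and consume() are hypothetical stand-ins invented for this 
illustration, and the gap of 10 is an arbitrary example value. The point is 
only that the consumer fetches the attribute instance again for every stream, 
and that each stream reports the gap through its attribute in end().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class MultiValuedFieldSketch {

    /** Hypothetical stand-in for an attribute; every stream owns its own instance. */
    static final class TermAttr {
        String term;
        int posIncr = 1;
    }

    /** Hypothetical stand-in for a TokenStream over one value of a field. */
    static final class ValueStream {
        private final String[] words;
        private final int gap;
        private final TermAttr attr = new TermAttr();
        private int next = 0;

        ValueStream(String value, int gap) {
            this.words = value.trim().split("\\s+");
            this.gap = gap;
        }

        /** Plays the role of addAttribute(): hands out this stream's instance. */
        TermAttr getAttr() {
            return attr;
        }

        boolean incrementToken() {
            if (next >= words.length) {
                return false;
            }
            attr.term = words[next++];
            attr.posIncr = 1;
            return true;
        }

        /** The field's stream publishes the gap in end(), before the next stream starts. */
        void end() {
            attr.posIncr = gap;
        }
    }

    /** The "IW" side: consumes the streams one by one. */
    static List<String> consume(Iterator<ValueStream> streams) {
        List<String> postings = new ArrayList<>();
        int position = -1;
        while (streams.hasNext()) {
            ValueStream ts = streams.next();
            TermAttr attr = ts.getAttr(); // re-request attributes for each stream
            while (ts.incrementToken()) {
                position += attr.posIncr;
                postings.add(attr.term + "@" + position);
            }
            ts.end();
            position += attr.posIncr; // apply the gap reported by end()
        }
        return postings;
    }

    public static void main(String[] args) {
        Iterator<ValueStream> values = Arrays.asList(
                new ValueStream("quick fox", 10),
                new ValueStream("lazy dog", 10)).iterator();
        System.out.println(consume(values));
    }
}
```

With these stand-ins the consumer never needs a notification that the 
attributes changed mid-stream; it simply asks each new stream for its own 
instance. In this sketch the two values above yield 
[quick@0, fox@1, lazy@12, dog@13].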

> Fully decouple IndexWriter from analyzers
> -----------------------------------------
>
>                 Key: LUCENE-2309
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2309
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>             Fix For: 4.0
>
>         Attachments: LUCENE-2309-analyzer-based.patch, LUCENE-2309.patch
>
>
> IndexWriter only needs an AttributeSource to do indexing.
> Yet, today, it interacts with Field instances, holds a private
> analyzer, invokes analyzer.reusableTokenStream, and has to deal with a
> wide variety of cases (it's not analyzed; it is analyzed but it's a
> Reader or a String; it's pre-analyzed).
> I'd like to have IW only interact with attr sources that already
> arrived with the fields.  This would be a powerful decoupling -- it
> means others are free to make their own attr sources.
> They need not even use any of Lucene's analysis impls; eg they can
> integrate to other things like [OpenPipeline|http://www.openpipeline.org].
> Or make something completely custom.
> LUCENE-2302 is already a big step towards this: it makes IW agnostic
> about which attr is "the term", and only requires that it provide a
> BytesRef (for flex).
> Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the
> FieldType knows the analyzer to use, then we could simply create a
> getAttrSource() method (say) on it and move all the logic IW has today
> onto there.  (We'd still need existing IW code for back-compat).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
