Re: Multi-valued fields and TokenStream

[email protected] Thu, 06 Nov 2014 12:44:14 -0800

On Thu, Nov 6, 2014 at 3:19 PM, Robert Muir <[email protected]> wrote:


> Do the concatenation yourself with your own TokenStream. You can index
> a field with a tokenstream for expert cases (the individual stored
> values can be added separately)
>

Yes, but that’s quite awkward and takes a fair amount of surrounding code
when, in the end, it could be so much simpler if somehow the TokenStream
could be notified.  I’d feel a little better about it if Lucene included
the TokenStream-concatenating code (I’ve done a prototype for this; I could
work on it more and contribute) and if the Solr layer had a nice way of
presenting all the values to the Solr FieldType at once instead of
separately (see SOLR-4329).
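
A rough sketch of what "concatenating" the values amounts to, written in plain Java rather than against Lucene's TokenStream API. Every class and method name below is made up for illustration, and the gap handling is an assumption about one reasonable behaviour, not a description of the prototype mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model only: no Lucene classes are used, and every name here is hypothetical.
final class ConcatenatedValues {

    /** A term together with its absolute position in the combined field. */
    static final class PositionedToken {
        final String term;
        final int position;

        PositionedToken(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    /**
     * Walks the already-tokenized values in order and keeps one running
     * position counter across all of them, skipping {@code gap} extra
     * positions whenever a new value starts after tokens have been emitted.
     */
    static List<PositionedToken> concatenate(List<List<String>> values, int gap) {
        List<PositionedToken> out = new ArrayList<>();
        int position = -1; // so that the very first token lands on position 0
        for (List<String> value : values) {
            if (position >= 0) {
                position += gap; // tokens already emitted: leave a hole before this value
            }
            for (String term : value) {
                position++;
                out.add(new PositionedToken(term, position));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> values = Arrays.asList(
                Arrays.asList("quick", "fox"),
                Arrays.asList("lazy", "dog"));
        for (PositionedToken t : concatenate(values, 100)) {
            System.out.println(t.term + " @ " + t.position);
        }
    }
}
```

Because a single loop owns the position counter, it always knows which value it is on, which is exactly the information an individual per-value stream lacks.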


> No need to make the tokenstream API more complicated: it’s already very
> complicated.
>

Ehh, that’s arguable.  Steve’s suggestion amounts to one line of production
code (javadoc and tests are separate).  If that’s too much, then adding a
boolean argument to reset() would feel cleaner and be 0 lines of new code,
but would be backwards-incompatible.  Shrug.
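
To make the "one line" concrete, here is a minimal plain-Java model of the variant discussed further down the thread: before calling reset() on a non-first value, the consumer sets the position increment to Integer.MAX_VALUE, and the stream checks for that sentinel. The classes below are hypothetical stand-ins, not Lucene's PositionIncrementAttribute or DefaultIndexingChain.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model only; all types here are hypothetical stand-ins for the Lucene ones.
final class FirstValueHint {

    /** Stand-in for a position-increment attribute shared by the whole chain. */
    static final class PosIncr {
        int value = 1;
    }

    /** A stream that looks at the attribute in reset() to learn where it is. */
    static final class HintAwareStream {
        private final PosIncr posIncr;
        boolean firstValue = true;

        HintAwareStream(PosIncr posIncr) {
            this.posIncr = posIncr;
        }

        void reset() {
            // The sentinel means "this is a subsequent value of the same field".
            firstValue = posIncr.value != Integer.MAX_VALUE;
            posIncr.value = 1; // clear it so it is never read as a real increment
        }
    }

    /** Models the consumer loop over the values of one multi-valued field. */
    static List<Boolean> invertAll(int valueCount) {
        PosIncr attr = new PosIncr();
        HintAwareStream stream = new HintAwareStream(attr);
        List<Boolean> seen = new ArrayList<>();
        for (int i = 0; i < valueCount; i++) {
            if (i > 0) {
                attr.value = Integer.MAX_VALUE; // the single extra line on the consumer side
            }
            stream.reset();
            seen.add(stream.firstValue);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(invertAll(3));
    }
}
```

The stream has to clear the sentinel itself, which is part of why this reads as a convention to be documented in javadocs rather than a new API.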

Another idea is if Field.tokenStream(Analyzer analyzer, TokenStream reuse)
had another boolean to indicate first value or not.  I think I like the
other ideas better though.
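
For reference, the double-reset concern from the original post (quoted at the bottom) can be modeled in a few lines of plain Java. This is one plausible reading of how a naive reset(boolean) overload would behave in a chain; the Stage class is hypothetical and only mimics the way a filter's reset() delegates to its input.

```java
// Plain-Java model only; Stage is a hypothetical stand-in for a stream in a chain.
final class DoubleResetDemo {

    static class Stage {
        final Stage input; // null for the source of the chain
        int resetCount;

        Stage(Stage input) {
            this.input = input;
        }

        /** Existing no-arg contract: reset yourself and everything upstream. */
        void reset() {
            resetCount++;
            if (input != null) {
                input.reset();
            }
        }

        /**
         * Naive overload: pass the flag upstream, then fall back to the
         * no-arg version so existing reset() logic still runs.
         */
        void reset(boolean first) {
            if (input != null) {
                input.reset(first);
            }
            reset(); // this walks the upstream chain a second time
        }
    }

    public static void main(String[] args) {
        Stage source = new Stage(null);
        Stage filter = new Stage(source);
        filter.reset(true);
        System.out.println("filter resets: " + filter.resetCount);
        System.out.println("source resets: " + source.resetCount);
    }
}
```

In this model the filter is reset once but the source is reset twice, and each additional stage in the chain makes the upstream stages reset even more often.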


>
> On Thu, Nov 6, 2014 at 3:13 PM, [email protected]
> <[email protected]> wrote:
> > Are you suggesting that DefaultIndexingChain.PerField.invert(boolean
> > firstValue) would, prior to calling reset(), call
> > setPositionIncrement(Integer.MAX_VALUE), but only when ‘firstValue’ is
> > false?  Hmmmm.  I guess that would work, although it seems a bit hacky and
> > it’s tying this to a specific attribute when ideally we notify the chain as
> > a whole what’s going on.  But it doesn’t require any new API, save for some
> > javadocs.  And it’s extremely unlikely there would be a
> > backwards-incompatible problem, so that’s good.  And I find this use is
> > related to positions so it’s not so bad to abuse the position increment for
> > this.  Nice idea Steve; this works for me.
> >
> > Does anyone else have an opinion before I create an issue?
> >
> > ~ David Smiley
> > Freelance Apache Lucene/Solr Search Consultant/Developer
> > http://www.linkedin.com/in/davidwsmiley
> >
> > On Thu, Nov 6, 2014 at 2:13 PM, Steve Rowe <[email protected]> wrote:
> >>
> >> Maybe the position increment gap would be useful?  If set to a value
> >> larger than likely max position for any individual value, it could be used
> >> to infer (non-)first-value-ness.
> >>
> >> > On Nov 5, 2014, at 1:03 PM, [email protected] wrote:
> >> >
> >> > Several times now, I’ve had to come up with work-arounds for a
> >> > TokenStream not knowing it’s processing the first value or a
> >> > subsequent-value of a multi-valued field.  Two of these times, the use-case
> >> > was ensuring the first position of each value started at a multiple of 1000
> >> > (or some other configurable value), and the third was encoding sentence
> >> > paragraph counters (similar to a do-it-yourself position increment).
> >> >
> >> > The work-arounds are awkward and hacky.  For example, if you’re in
> >> > control of your Tokenizer, you can prefix subsequent values with a special
> >> > flag, and then do the right thing in reset().  But then the highlighter or
> >> > value retrieval in general is impacted.  It’s also possible to create the
> >> > fields with the constructor that accepts a TokenStream that you’ve told
> >> > whether it’s the first or a subsequent value, but it’s awkward going that
> >> > route, and sometimes (e.g. Solr) it’s hard to know all the values you have
> >> > up-front to even do that.
> >> >
> >> > It would be nice if TokenStream.reset() took a boolean ‘first’ argument.
> >> > Such a change would obviously be backwards incompatible.  Simply overloading
> >> > the method to call the no-arg version is problematic because TokenStreams
> >> > are a chain, and it would likely result in the chain getting doubly-reset.
> >> >
> >> > Any ideas?
> >> >
> >> > ~ David Smiley
> >> > Freelance Apache Lucene/Solr Search Consultant/Developer
> >> > http://www.linkedin.com/in/davidwsmiley
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
> >
>
