Re: Getting multi-values to use in filter?

Shai Erera Tue, 29 Apr 2014 00:44:16 -0700

Hi Rob,

While the demo code uses a fixed number of 3 values, you don't need to
encode the number of values up front. Since you read a document's whole
byte[] at once, you can keep decoding in a while loop as long as
in.position() < in.length().
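
A minimal sketch of that loop, using only the JDK so it runs without Lucene
on the classpath. The class name MultiValueCodec and the fixed 8-byte
encoding are assumptions for illustration; the demo code may encode values
differently, and java.nio.ByteBuffer stands in here for the "in" above
(its limit() plays the role of in.length()).

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: packs a document's values into
// one byte[] and reads them back without storing a count.
final class MultiValueCodec {

    // Each value is written as a fixed-width 8-byte big-endian long.
    static byte[] encode(long... values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Long.BYTES);
        for (long v : values) {
            buf.putLong(v);
        }
        return buf.array();
    }

    // Reads until the buffer is exhausted, so the number of values is
    // implied by the length of the byte[]. A length that is not a multiple
    // of 8 makes getLong() throw BufferUnderflowException.
    static List<Long> decode(byte[] bytes) {
        ByteBuffer in = ByteBuffer.wrap(bytes);
        List<Long> values = new ArrayList<>();
        while (in.position() < in.limit()) {
            values.add(in.getLong());
        }
        return values;
    }
}
```

An empty byte[] simply decodes to an empty list, which covers the
"0 values" case without a special marker.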


Shai


On Tue, Apr 29, 2014 at 10:04 AM, Rob Audenaerde
<rob.audenae...@gmail.com> wrote:

> Hi Shai,
>
> I read the article on your blog, thanks for it! It seems to be a natural
> fit to do multi-values like this, and it is helpful indeed. For my specific
> problem, I have multiple values that do not have a fixed number, so it can
> be either 0 or 10 values. I think the best way to solve this is to encode
> the number of values as first entry in the BDV. This is not that hard so I
> will take this road.
>
> -Rob
>
>
> > On 27 Apr 2014, at 21:27, Shai Erera <ser...@gmail.com> wrote:
> >
> > Hi Rob,
> >
> > Your question got me interested, so I wrote a quick prototype of what I
> > think solves your problem (and if not, I hope it solves someone else's!
> > :)). The idea is to write a special ValueSource, e.g. MaxValueSource,
> > which reads a BinaryDocValues, decodes the values and returns the
> > maximum one. It can then be embedded in an expression quite easily.
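
The decode-and-take-the-maximum step can be sketched in plain Java as
follows. This is not the prototype from the blog post: the class name
MaxOfEncodedValues, the "missing" default and the fixed 8-byte encoding are
all assumptions, and a real ValueSource would obtain the byte[] from the
document's BinaryDocValues rather than take it as a parameter.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for the per-document work of such a ValueSource.
final class MaxOfEncodedValues {

    // Returns the largest 8-byte big-endian long packed into bytes, or
    // the caller-supplied default when the document has no values.
    static long max(byte[] bytes, long missing) {
        ByteBuffer in = ByteBuffer.wrap(bytes);
        if (!in.hasRemaining()) {
            return missing;
        }
        long max = Long.MIN_VALUE;
        while (in.hasRemaining()) {
            max = Math.max(max, in.getLong());
        }
        return max;
    }
}
```

The result is a single number per document, which is the shape an
expression binding expects.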
> >
> > I published a post on Lucene expressions and included some prototype code
> > which demonstrates how to do it. Hope it's still helpful to you:
> > http://shaierera.blogspot.com/2014/04/expressions-with-lucene.html.
> >
> > Shai
> >
> >
> >> On Thu, Apr 24, 2014 at 1:20 PM, Shai Erera <ser...@gmail.com> wrote:
> >>
> >> I don't think that you should use the facet module. If all you want is
> >> to encode a bunch of numbers under a 'foo' field, you can encode them
> >> into a byte[] and index them as a BDV. Then at search time you get the
> >> BDV and decode the numbers back. The facet module adds complexity here:
> >> yes, you get the encoding/decoding for free, but at the cost of adding
> >> mock categories to the taxonomy, or using associations, for no good
> >> reason IMO.
> >>
> >> Once you do that, you need to figure out how to extend the expressions
> >> module to support a function like maxValues(fieldName) (cannot use 'max'
> >> since it's reserved). I read about it some, and still haven't figured
> >> out exactly how to do it. The JavascriptCompiler can take custom
> >> functions to compile expressions, but the methods should take only
> >> double values. So I think it should be some sort of binding, but I'm not
> >> sure yet how to do it. Perhaps it should be a name like max_fieldName,
> >> which you add a custom Expression to as a binding ... I will try to look
> >> into it later.
> >>
> >> Shai
> >>
> >>
> >> On Wed, Apr 23, 2014 at 6:49 PM, Rob Audenaerde <rob.audenae...@gmail.com> wrote:
> >>
> >>> Thanks for all the questions, gives me an opportunity to clarify it :)
> >>>
> >>> I want the user to be able to give a (simple) formula (so I don't know
> >>> it beforehand) and use that formula in the search. The Javascript
> >>> expressions are really powerful in this use case, but have the
> >>> single-value limitation. Ideally, I would like to make it really
> >>> flexible by, for example, allowing (in-document aggregating)
> >>> expressions like: max(fieldA) - fieldB > fieldC.
> >>>
> >>> Currently, using single values, I can handle expressions in the form of
> >>> "fieldA - fieldB - fieldC > 0" and evaluate the long value that I
> >>> receive from the FunctionValues and the ValueSource. I also optimize
> >>> the query by ensuring the field exists and has a value, etc., to keep
> >>> the search fast enough. This works well, but for single values only.
> >>>
> >>> I also looked into the facets Association Fields, as they somewhat look
> >>> like the thing that I want. However, in the faceting module all
> >>> ordinals and values are stored in one field, so there is no easy way to
> >>> extract the fields that are used in the expression.
> >>>
> >>> I like the solution you suggested: to add all the numeric fields as an
> >>> encoded byte[] like the facets do, but on a per-field basis, so that
> >>> each numeric field has a BDV field that contains all the values for
> >>> that field for that document.
> >>>
> >>> Now that I am typing this, I think there is another way. I could use
> >>> the faceting module and add a different facet field ($facetFIELDA,
> >>> $facetFIELDB) in the FacetsConfig for each field. That way it would be
> >>> relatively straightforward to get all the values for a field, as they
> >>> are exactly the values in the BDV for that document's facet field. Only
> >>> aggregating all facets will be harder, as the
> >>> TaxonomyFacetSum*Associations would need to do this for all fields that
> >>> I need facet counts/sums for.
> >>>
> >>> What do you think?
> >>>
> >>> -Rob
> >>>
> >>>
> >>>> On Wed, Apr 23, 2014 at 5:13 PM, Shai Erera <ser...@gmail.com> wrote:
> >>>>
> >>>> A NumericDocValues field can only hold one value. Have you thought
> >>>> about encoding the values in a BinaryDocValues field? Or are you
> >>>> talking about multiple fields (different names), each has its own
> >>>> single value, and at search time you sum the values from a different
> >>>> set of fields?
> >>>>
> >>>> If it's one field, multiple values, then why do you need to separate
> >>>> the values? Is it because you sometimes sum and sometimes e.g. avg? Do
> >>>> you always include all values of a document in the formula, but the
> >>>> formula changes between searches, or do you sometimes use only a
> >>>> subset of the values?
> >>>>
> >>>> If you always use all values, but change the formula between queries,
> >>>> then perhaps you can just encode the pre-computed value under
> >>>> different NDV fields? If you only use a handful of functions (and they
> >>>> are known in advance), it may not be too heavy on the index, and will
> >>>> definitely perform better during search.
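
A rough sketch of what "pre-computed values under different fields" could
look like at index time, in plain Java. The class name
PrecomputedAggregates and the field-name suffixes (_sum, _max, _avg) are
invented for illustration; each map entry would be indexed as its own
single-valued field.

```java
import java.util.LinkedHashMap;
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.stream.LongStream;

// Hypothetical index-time helper: one aggregate per known function.
final class PrecomputedAggregates {

    // Maps a derived field name to the aggregate to store under it.
    // A document without values gets no derived fields at all.
    static Map<String, Double> aggregate(String field, long... values) {
        Map<String, Double> out = new LinkedHashMap<>();
        if (values.length == 0) {
            return out;
        }
        LongSummaryStatistics stats = LongStream.of(values).summaryStatistics();
        out.put(field + "_sum", (double) stats.getSum());
        out.put(field + "_max", (double) stats.getMax());
        out.put(field + "_avg", stats.getAverage());
        return out;
    }
}
```

The trade-off is the one described above: the set of functions has to be
known in advance, but at search time each aggregate is a plain
single-valued lookup.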
> >>>>
> >>>> Otherwise, I believe I'd consider indexing them as a BDV field. For
> >>>> facets, we basically need the same multi-valued numeric field, and
> >>>> given that NDV is single valued, we went w/ BDV.
> >>>>
> >>>> If I misunderstood the scenario, I'd appreciate if you clarify it :)
> >>>>
> >>>> Shai
> >>>>
> >>>>
> >>>> On Wed, Apr 23, 2014 at 5:49 PM, Rob Audenaerde <rob.audenae...@gmail.com> wrote:
> >>>>
> >>>>> Hi Shai, all,
> >>>>>
> >>>>> I am trying to write that Filter :). But I'm a bit at a loss as to
> >>>>> how to efficiently grab the multi-values. I can access the
> >>>>> context.reader().document() that accesses the stored fields, but that
> >>>>> seems slow.
> >>>>>
> >>>>> For single-value fields I use a compiled JavaScript Expression with
> >>>>> SimpleBindings as ValueSource, which seems to work quite well. The
> >>>>> downside is that I cannot find a way to implement multi-value through
> >>>>> that solution.
> >>>>>
> >>>>> These create, for example, a LongFieldSource, which uses the
> >>>>> FieldCache.LongParser. These parsers only seem to parse one field.
> >>>>>
> >>>>> Is there an efficient way to get -all- of the (numeric) values for a
> >>>>> field in a document?
> >>>>>
> >>>>>
> >>>>>> On Wed, Apr 23, 2014 at 4:38 PM, Shai Erera <ser...@gmail.com> wrote:
> >>>>>>
> >>>>>> You can do that by writing a Filter which returns matching documents
> >>>>>> based on a sum of the field's value. However I suspect that is going
> >>>>>> to be slow, unless you know that you will need several such filters
> >>>>>> and can cache them.
> >>>>>>
> >>>>>> Another approach would be to write a Collector which serves as a
> >>>>>> Filter, but computes the sum only for documents that match the query.
> >>>>>> Hopefully that would mean you compute the sum for fewer documents
> >>>>>> than you would have w/ the Filter approach.
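
The difference can be illustrated without any Lucene classes. In this
sketch the class name SumPostFilter, the int[] of matching doc IDs and the
valuesOf lookup are all stand-ins: a real Collector receives matching
documents one at a time and would read the values from the index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical post-filter: the per-document sum is computed only for
// documents the query already matched, never for the whole index.
final class SumPostFilter {

    static List<Integer> collect(int[] matchingDocs,
                                 IntFunction<long[]> valuesOf,
                                 long target) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : matchingDocs) {
            long sum = 0;
            for (long v : valuesOf.apply(doc)) {
                sum += v;
            }
            if (sum == target) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```

With four documents holding {60, 40}, {10}, {} and {50, 50}, a query that
matches documents 0, 1 and 3 and a target of 100 keeps documents 0 and 3,
and document 2 is never looked at.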
> >>>>>>
> >>>>>> Shai
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Apr 23, 2014 at 5:11 PM, Michael Sokolov <
> >>>>>> msoko...@safaribooksonline.com> wrote:
> >>>>>>
> >>>>>>> This isn't really a good use case for an index like Lucene. The
> >>>>>>> most essential property of an index is that it lets you look up
> >>>>>>> documents very quickly based on *precomputed* values.
> >>>>>>>
> >>>>>>> -Mike
> >>>>>>>
> >>>>>>>
> >>>>>>>> On 04/23/2014 06:56 AM, Rob Audenaerde wrote:
> >>>>>>>>
> >>>>>>>> Hi all,
> >>>>>>>>
> >>>>>>>> I'm looking for a way to use multi-values in a filter.
> >>>>>>>>
> >>>>>>>> I want to be able to search on sum(field)=100, where field has
> >>>>>>>> multiple values in one document:
> >>>>>>>>
> >>>>>>>> field=60
> >>>>>>>> field=40
> >>>>>>>>
> >>>>>>>> In this case 'field' is a LongField. I examined the code in the
> >>>>>>>> FieldCache,
> >>>>>>>> but that seems to focus on single-valued fields only, or
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Is this something that can be done in Lucene? And what would be a
> >>>>>>>> good approach?
> >>>>>>>>
> >>>>>>>> Thanks in advance,
> >>>>>>>>
> >>>>>>>> -Rob
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>> ---------------------------------------------------------------------
> >>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >>
