Re: Numeric Ranges Faceting

2017-02-15 Thread Michael McCandless
Hi, have a look at the RangeFacetsExample.java under the lucene/demo
module... it shows how to do this.
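In a nutshell, that example boils down to something like the sketch below (condensed and adapted here; the "timestamp" field, the ranges, and the surrounding searcher are illustrative, and it assumes the field was indexed both as a NumericDocValuesField and as a LongPoint):

import java.io.IOException;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.facet.DrillDownQuery;
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.facet.range.LongRange;
import org.apache.lucene.facet.range.LongRangeFacetCounts;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TopDocs;

public class RangeFacetsSketch {

  // Assumes "timestamp" was indexed both as a NumericDocValuesField (for the
  // range counts) and as a LongPoint (for the drill-down range query).
  static void facetAndDrillDown(IndexSearcher searcher, long nowSec) throws IOException {
    FacetsConfig config = new FacetsConfig();
    LongRange pastHour = new LongRange("Past hour", nowSec - 3600L, true, nowSec, true);
    LongRange pastDay  = new LongRange("Past day",  nowSec - 86400L, true, nowSec, true);

    // First level: count hits per range over the docvalues field.
    FacetsCollector fc = new FacetsCollector();
    FacetsCollector.search(searcher, new MatchAllDocsQuery(), 10, fc);
    Facets facets = new LongRangeFacetCounts("timestamp", fc, pastHour, pastDay);
    System.out.println(facets.getTopChildren(10, "timestamp"));

    // Next level: the traversed range is added as a sub-query on the dimension,
    // not as a drill-down term under "$facets".
    DrillDownQuery ddq = new DrillDownQuery(config);
    ddq.add("timestamp", LongPoint.newRangeQuery("timestamp", nowSec - 3600L, nowSec));
    TopDocs hits = searcher.search(ddq, 10);
    System.out.println(hits.totalHits + " hits in the past hour");
  }
}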

Mike McCandless

http://blog.mikemccandless.com


On Tue, Feb 14, 2017 at 12:07 PM, Chitra R  wrote:
> Hi,
> We have planned to implement both string and numeric faceting using
> docvalues fields.
>
> For string faceting, we have added the path-traversed dimensions to the
> DrillDownQuery. But for numeric faceting, how and where can we add the
> path-traversed ranges during the next-level faceted search?
> And which is the better way to add path-traversed ranges:
> adding them to a NumericRangeQuery, or
> adding them in a filter? Or is there another solution?
>
> Thanks & Regards,
> Chitra
>
>
> Sent from my iPhone
>




Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

2017-02-15 Thread Oliver Mannion
Hi Mike,

Thanks for the suggestion; I've tried Operations.run on an Automaton and
it's fast enough for my use case.

However, the real problem I have is in building the Automaton
via DaciukMihovAutomatonBuilder. On my input string set it consumes quite a
bit of CPU, a lot of which seems to be GC activity, cleaning up about 800 MB
of garbage produced during the build (DaciukMihovAutomatonBuilder.State
instances, I think). I'm using this in an app that also serves real-time
requests, which time out while the Automaton is being built. It's an
inherited application design and not the best (ideally I'd build the
Automaton offline in another process), and I don't expect it's a use case
DaciukMihovAutomatonBuilder was designed for. Unfortunately it means I've
started looking into other data structures for prefix matching. The
dk.brics Automaton has a similar performance profile, while the
PatriciaTrie from Apache Commons Collections seems to consume less CPU and
produce less garbage during the build, although it has a less than ideal
interface (it's a Map).
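
(For reference, a minimal sketch of the build-and-match path under discussion -- a tiny made-up string set, sorted input, and the set-all-states-accepting step Mike suggested earlier:)

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.CharacterRunAutomaton;
import org.apache.lucene.util.automaton.DaciukMihovAutomatonBuilder;
import org.apache.lucene.util.automaton.Operations;

public class PrefixAutomatonSketch {
  public static void main(String[] args) {
    // Made-up input; in the real case this is ~1.7 million strings.
    List<BytesRef> terms = new ArrayList<>();
    terms.add(new BytesRef("apple"));
    terms.add(new BytesRef("application"));
    terms.add(new BytesRef("banana"));
    Collections.sort(terms); // the builder requires sorted input

    Automaton a = DaciukMihovAutomatonBuilder.build(terms);

    // Mark every state as accepting so any prefix of a stored string matches.
    for (int state = 0; state < a.getNumStates(); state++) {
      a.setAccept(state, true);
    }

    // Option 1: run directly on the Automaton (no big transition table to build).
    System.out.println(Operations.run(a, "appl"));  // true: prefix of "apple"/"application"
    System.out.println(Operations.run(a, "applz")); // false

    // Option 2: compile a CharacterRunAutomaton (faster per-step lookups, much more heap).
    CharacterRunAutomaton cra = new CharacterRunAutomaton(a);
    System.out.println(cra.run("appl"));            // true
  }
}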

Any further suggestions, though, would be most welcome!

Regards,

Oliver

On 14 February 2017 at 22:56, Michael McCandless 
wrote:

> Wow, 2G heap, that's horrible!
>
> How much heap does the automaton itself take?
>
> You can use the automaton's step method to transition from a state
> given the next input character to another state (or -1 if that state
> doesn't accept that character); it will be slower than the 2 GB run
> automaton, but perhaps fast enough?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Feb 14, 2017 at 6:50 AM, Oliver Mannion 
> wrote:
> > Thanks Mike for getting back to me, sounds like I'm on the right track.
> >
> > I'm building the automaton from around 1.7 million strings, and it ends up
> > with about 3.8 million states. It turns out that building a
> > CharacterRunAutomaton from that takes up about 2 GB of heap (I was quite
> > surprised!), with negligible performance difference at run time. At your
> > suggestion I tried the ByteRunAutomaton and it was similar to the
> > CharacterRunAutomaton in terms of heap and run time. So for now I'm going
> > to just stick to an Automaton.
> >
> > On 14 February 2017 at 00:41, Michael McCandless <
> luc...@mikemccandless.com>
> > wrote:
> >
> >> On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion 
> >> wrote:
> >>
> >> > I'd like to construct an Automaton to prefix match against a large
> >> > set of strings. I gather a RunAutomaton is immutable, thread-safe and
> >> > faster than an Automaton.
> >>
> >> That's correct.
> >>
> >> > Are there any other differences between the three Automaton
> >> > classes, for example, in memory usage?
> >>
> >> CompiledAutomaton is just a thin wrapper to hold onto both the
> >> original Automaton and the RunAutomaton, plus some other corner-casey
> >> things that are likely not interesting for your usage.
> >>
> >> > Would the general approach for such a problem be to use
> >> > DaciukMihovAutomatonBuilder to create an Automaton from the sorted list
> >> > of my string set, set all the states to accept (to enable prefix
> >> > matching), then pass the Automaton into the constructor of a
> >> > CharacterRunAutomaton, and use the run method on the
> >> > CharacterRunAutomaton to match any queries?
> >>
> >> That sounds right.
> >>
> >> You could also try doing everything in UTF8 space instead, and use
> >> ByteRunAutomaton: it will likely be faster since it will do faster
> >> lookups on each transition.  It should still be safe to set all states
> >> as accept, even though some of those states will be inside a single
> >> Unicode character, as long as the strings you test against are whole
> >> UTF-8 sequences?
> >>
> >> > As it seems like I'm building up the Automaton at least three times,
> >> > and keeping a reference to the Automaton in the CharacterRunAutomaton,
> >> > is this the most memory efficient way of building such an Automaton?
> >>
> >> Yeah, it is.  The RunAutomaton will likely require substantial heap in
> >> your case, probably more than the original automaton.
> >>
> >> I suppose you don't actually need to keep the Automaton around once
> >> the RunAutomaton is built, but Lucene doesn't make this possible
> >> today, since the RunAutomaton holds onto the Automaton...
> >>
> >> > Thanks in advance,
> >>
> >> You're welcome!
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >>
>


Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

2017-02-15 Thread Dawid Weiss
You could try using morfologik's byte-based implementation:

https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-fsa-builders/src/test/java/morfologik/fsa/builders/FSABuilderTest.java

I can't guarantee it'll be fast enough -- you need to sort those input
sequences and even this may take a while. The construction of the
automaton after that is fairly fast. What are the time limits you have
with respect to input data sizes? Perhaps it's just unrealistic to
assume everything is performed as part of a single request?
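
For reference, a build-only sketch along the lines of the linked test (assuming FSABuilder.build over a byte[][] sorted with FSABuilder.LEXICAL_ORDERING, as in FSABuilderTest; the input data here is made up):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import morfologik.fsa.FSA;
import morfologik.fsa.builders.FSABuilder;

public class MorfologikFsaSketch {
  public static void main(String[] args) {
    // Made-up input; in the real case this is ~1.7 million strings.
    byte[][] input = new byte[][] {
        "apple".getBytes(StandardCharsets.UTF_8),
        "application".getBytes(StandardCharsets.UTF_8),
        "banana".getBytes(StandardCharsets.UTF_8),
    };

    // The builder needs byte-lexicographically sorted input; for a large set
    // this sort is part of the cost mentioned above.
    long t0 = System.nanoTime();
    Arrays.sort(input, FSABuilder.LEXICAL_ORDERING);
    FSA fsa = FSABuilder.build(input);
    System.out.printf("built FSA in %.1f ms%n", (System.nanoTime() - t0) / 1e6);
  }
}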

Dawid


Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

2017-02-15 Thread Michael McCandless
We may be able to make DaciukMihovAutomatonBuilder's state registry
more RAM-efficient too ... I think it's essentially the same thing as
the FST.Builder's NodeHash, just minus the outputs that FSTs have vs.
automata.

Mike McCandless

http://blog.mikemccandless.com



Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

2017-02-15 Thread Dawid Weiss
Yep, true. I just wonder whether it's worth complicating the code...
Could be easier to build an FST and then recreate a RunAutomaton
from that directly... :)
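
For what it's worth, a bare-bones sketch of the FST half of that idea: an outputless FST over sorted terms, with a prefix walk that queries it directly instead of going through a RunAutomaton (class names as in Lucene 6.x; the input data is made up):

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRefBuilder;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.NoOutputs;
import org.apache.lucene.util.fst.Util;

public class FstPrefixSketch {

  // Build an outputless FST from terms already in BytesRef (UTF-8) order.
  static FST<Object> build(List<BytesRef> sortedTerms) throws IOException {
    Object noOutput = NoOutputs.getSingleton().getNoOutput();
    Builder<Object> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, NoOutputs.getSingleton());
    IntsRefBuilder scratch = new IntsRefBuilder();
    for (BytesRef term : sortedTerms) {
      builder.add(Util.toIntsRef(term, scratch), noOutput);
    }
    return builder.finish();
  }

  // True if 'query' is a prefix of at least one stored term, i.e. every byte
  // of the query can be followed along some path in the FST.
  static boolean isPrefixOfStoredTerm(FST<Object> fst, BytesRef query) throws IOException {
    FST.BytesReader in = fst.getBytesReader();
    FST.Arc<Object> arc = fst.getFirstArc(new FST.Arc<>());
    for (int i = 0; i < query.length; i++) {
      int label = query.bytes[query.offset + i] & 0xFF;
      if (fst.findTargetArc(label, arc, arc, in) == null) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) throws IOException {
    List<BytesRef> terms =
        Arrays.asList(new BytesRef("apple"), new BytesRef("application"), new BytesRef("banana"));
    FST<Object> fst = build(terms);
    System.out.println(isPrefixOfStoredTerm(fst, new BytesRef("appl")));  // true
    System.out.println(isPrefixOfStoredTerm(fst, new BytesRef("applz"))); // false
  }
}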

Dawid


Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

2017-02-15 Thread Michael McCandless
Actually, that's a great idea to try (Oliver).  It would be a
relatively simple conversion... maybe Lucene could add some sugar on
top, e.g. to convert an FST to an automaton.  Hmm, maybe it even
exists somewhere already...

But even the FST Builder's NodeHash can be non-trivial in its heap
usage, though hopefully less so than DaciukMihovAutomatonBuilder's.

(And yes I do love how simple DaciukMihovAutomatonBuilder is).

Mike McCandless

http://blog.mikemccandless.com



Re: Numeric Ranges Faceting

2017-02-15 Thread Chitra R
Hi,
      Thanks for the suggestion. But in the case of drill-sideways
search, retrieving all dimensions (using Facets.getAllDims()) threw the
exception shown below...

1. While opening the SortedSetDocValuesReaderState, the global ordinals and
the ordinal range map are computed for the '$facets' field only.
2. A NumericDocValuesField is never indexed under '$facets', so the ordinal
range map will be null for the numeric field, i.e. 'time'.

java.lang.IllegalArgumentException: dimension "time" was not indexed
    at org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts.getTopChildren(SortedSetDocValuesFacetCounts.java:91)
    at org.apache.lucene.facet.MultiFacets.getAllDims(MultiFacets.java:74)

In my use case,

   - Both string path-traversed dimensions and numeric path-traversed ranges will occur.
   - And both faceted search and drill-sideways search will be used.

So how can I add the path-traversed numeric ranges?

Have I missed anything?


Kindly post your suggestions.
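
(For concreteness, a hedged sketch of one possible direction: count the numeric "time" dimension with LongRangeFacetCounts over the same hits and map it explicitly in MultiFacets, so getAllDims() never asks the sorted-set counts about it; for drill sideways, the equivalent mapping could presumably be done in an overridden DrillSideways.buildFacetsResult. The ranges and variable names below are illustrative.)

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.MultiFacets;
import org.apache.lucene.facet.range.LongRange;
import org.apache.lucene.facet.range.LongRangeFacetCounts;
import org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts;
import org.apache.lucene.facet.sortedset.SortedSetDocValuesReaderState;

public class MixedFacetsSketch {

  // String dimensions come from the sorted-set ($facets) counts; the numeric
  // "time" dimension is counted over its own docvalues field and mapped
  // explicitly, so getAllDims() never asks the sorted-set counts about it.
  static Facets buildFacets(SortedSetDocValuesReaderState state,
                            FacetsCollector fc,
                            LongRange... timeRanges) throws IOException {
    Map<String, Facets> byDim = new HashMap<>();
    byDim.put("time", new LongRangeFacetCounts("time", fc, timeRanges));
    Facets stringDims = new SortedSetDocValuesFacetCounts(state, fc);
    return new MultiFacets(byDim, stringDims);
  }
}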


Regards,
Chitra



[ANNOUNCE] Apache Lucene 5.5.4 released

2017-02-15 Thread Adrien Grand
15 February 2017, Apache Lucene™ 5.5.4 available

The Lucene PMC is pleased to announce the release of Apache Lucene 5.5.4

Apache Lucene is a high-performance, full-featured text search engine
library written entirely in Java. It is a technology suitable for nearly
any application that requires full-text search, especially cross-platform.

This release contains 8 bug fixes and 4 other changes since the 5.5.3
release, in particular:
 * Made stored fields reclaim native memory more aggressively
 * Fixed a potential memory leak with LRUQueryCache and (Span)TermQuery
 * MmapDirectory's unmapping code is now compatible with Java 9 (EA build
150 and later)

The release is available for immediate download at:

  http://www.apache.org/dyn/closer.lua/lucene/java/5.5.4

Please read CHANGES.txt for a full list of new features and changes:

  https://lucene.apache.org/core/5_5_4/changes/Changes.html

Please report any feedback to the mailing lists
(http://lucene.apache.org/core/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases.  It is possible that the mirror you are using
may not have replicated the release yet.  If that is the case, please
try another mirror.  This also goes for Maven access.

-- 
Adrien Grand


Recommended number of fields in one lucene index

2017-02-15 Thread Kumaran Ramasubramanian
Hi All,

Elasticsearch allows 1000 fields by default. In Lucene, what are the
indexing and searching performance impacts of having 10 fields vs. 3000
fields in a single index?

In my case,
while indexing I index and store all fields so that I can update a single
field: we read back all the stored fields (except the field being updated)
and index everything again (delete the document and re-add it with the
remaining fields).

While searching, I use an _all_ blob field to search across the text of all
fields.


--
Kumaran R


Re: Recommended number of fields in one lucene index

2017-02-15 Thread Adrien Grand
I think it is hard to come up with a general rule, but there is certainly a
per-field overhead. There are some things that we need to store per field
per segment in memory, so if you multiply the number of fields you have,
you could run out of memory. In most cases I have seen where the index had
so many fields, it was due to the fact that the application wanted to index
arbitrary documents and provide search for them, which cannot scale, or to
the fact that the index contained many unrelated documents that should have
been put into different indices. This limit has been very useful to catch
such design problems early instead of waiting for the production server to
go out of memory due to the multiplication of fields.

On Wed, 15 Feb 2017 at 19:44, Kumaran Ramasubramanian 
wrote:

> While searching, I use an _all_ blob field to search across the text of all
> fields.
>

This is interesting: if all your searches go to a catch-all field, then it
means that you do not need those thousands of fields but could just have a
single indexed field that is used for searching, and a binary blob that
stores all the data so that you can perform updates. So this only requires
two fields from a Lucene perspective.
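
A minimal sketch of that layout (the field names, the extra id field used for updates, and how the blob is serialized are application choices assumed here):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class TwoFieldDocSketch {

  // One indexed catch-all field for search, one stored blob carrying everything
  // needed to rebuild the document on update, plus an id key for updateDocument().
  static Document build(String id, String allText, byte[] blob) {
    Document doc = new Document();
    doc.add(new StringField("id", id, Field.Store.YES));
    doc.add(new TextField("_all", allText, Field.Store.NO));
    doc.add(new StoredField("blob", blob));
    return doc;
  }
}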


Re: Recommended number of fields in one lucene index

2017-02-15 Thread Kumaran Ramasubramanian
Hi Adrien Grand,

Thanks for the response.

> a binary blob that
> stores all the data so that you can perform updates.


Could you elaborate on this? Do you mean using a StoredField, as mentioned
below, to store all the other fields that are needed only for updates? Is
there any way to use the updateDocuments API for this kind of update,
instead of reading back the stored fields and deleting and re-adding the
document?


> Use a StoredField. You can pass in either the BytesRef, or the byte array
> itself into the field:
>
>     byte[] myByteArray = new byte[10];
>     document.add(new StoredField("bin1", myByteArray));
>
> As far as retrieving the value, you are on about the right track there
> already. Something like:
>
>     Document resultDoc = searcher.doc(docno);
>     BytesRef bin1ref = resultDoc.getBinaryValue("bin1");
>     byte[] bin1bytes = bin1ref.bytes;

Snippet from: http://stackoverflow.com/a/34324561/1382168
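
(For reference, a sketch of the read-back-and-reindex flow described above, using IndexWriter.updateDocument for the delete-and-re-add step; the field names and surrounding plumbing are assumed:)

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;

public class ReindexUpdateSketch {

  // Rebuild the document from its stored data with the changed text, then
  // delete-and-re-add it in a single call keyed on its id term.
  static void updateSearchText(IndexSearcher searcher, IndexWriter writer,
                               int docId, String id, String newAllText) throws IOException {
    Document old = searcher.doc(docId);

    Document updated = new Document();
    updated.add(new StringField("id", id, Field.Store.YES));
    updated.add(new StoredField("blob", old.getBinaryValue("blob"))); // unchanged stored blob
    updated.add(new TextField("_all", newAllText, Field.Store.NO));   // re-indexed search text

    // updateDocument is itself a delete-by-term followed by an add.
    writer.updateDocument(new Term("id", id), updated);
  }
}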


--
Kumaran R



