Re: Numeric Ranges Faceting
Hi, have a look at the RangeFacetsExample.java under the lucene/demo module... it shows how to do this.

Mike McCandless
http://blog.mikemccandless.com

On Tue, Feb 14, 2017 at 12:07 PM, Chitra R wrote:
> Hi,
> We have planned to implement both string and numeric faceting using
> docvalues field.
>
> For string faceting, we have added pathtraversed dimensions in
> drilldownquery. But for numeric faceting, how and where can we add
> pathtraversed ranges during nextlevel faceted search?
> And which is the better way to add pathtraversed ranges
> (i.e. adding pathtraversed ranges in numericRangeQuery or
> adding pathtraversed ranges in filter)? Or any other solution?
>
> Thanks & Regards,
> Chitra

To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)
Hi Mike,

Thanks for the suggestion. I've tried Operations.run on an Automaton and it's fast enough for my use case.

However, the real problem I have is in building the Automaton via DaciukMihovAutomatonBuilder. On my input string set it consumes quite a bit of CPU, a lot of which seems to be GC activity, cleaning up about 800 MB of garbage produced during the build (DaciukMihovAutomatonBuilder.States, I think). I'm using this in an app that also serves realtime requests, which time out while the Automaton is being built. It's an inherited application design and not the best (ideally I'd be building the Automaton offline in another process), and I don't expect it's a use case considered for DaciukMihovAutomatonBuilder.

Unfortunately it means I've started looking into other data structures for doing prefix matching. The dk.brics Automaton has a similar performance profile, while the PatriciaTrie from Apache Commons Collections seems to consume less CPU and produce less garbage during build, although it has a less than ideal interface (it's a Map).

Any further suggestions would be most welcome!

Regards,
Oliver

On 14 February 2017 at 22:56, Michael McCandless wrote:
> Wow, 2G heap, that's horrible!
>
> How much heap does the automaton itself take?
>
> You can use the automaton's step method to transition from a state
> given the next input character to another state (or -1 if that state
> doesn't accept that character); it will be slower than the 2 GB run
> automaton, but perhaps fast enough?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Tue, Feb 14, 2017 at 6:50 AM, Oliver Mannion wrote:
> > Thanks Mike for getting back to me, sounds like I'm on the right track.
> >
> > I'm building the automaton from around 1.7 million strings, and it ends
> > up with about 3.8 million states, and it turns out building a
> > CharacterRunAutomaton from that takes up about 2 GB of heap (I was quite
> > surprised!), with negligible performance difference at run time. At your
> > suggestion I tried the ByteRunAutomaton and it was similar to the
> > CharacterRunAutomaton in terms of heap and run time. So for now I'm
> > going to just stick to an Automaton.
> >
> > On 14 February 2017 at 00:41, Michael McCandless
> > <luc...@mikemccandless.com> wrote:
> >
> >> On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion wrote:
> >>
> >> > I'd like to construct an Automaton to prefix match against a large
> >> > set of strings. I gather a RunAutomaton is immutable, thread safe
> >> > and faster than Automaton.
> >>
> >> That's correct.
> >>
> >> > Are there any other differences between the three Automaton
> >> > classes, for example, in memory usage?
> >>
> >> CompiledAutomaton is just a thin wrapper to hold onto both the
> >> original Automaton and the RunAutomaton, plus some other corner-casey
> >> things that are likely not interesting for your usage.
> >>
> >> > Would the general approach for such a problem be to use
> >> > DaciukMihovAutomatonBuilder to create an Automaton from the sorted
> >> > list of my string set, set all the states to accept (to enable
> >> > prefix matching), then pass the Automaton into the constructor of a
> >> > CharacterRunAutomaton, and use the run method on the
> >> > CharacterRunAutomaton to match any queries?
> >>
> >> That sounds right.
> >>
> >> You could also try doing everything in UTF8 space instead, and use
> >> ByteRunAutomaton: it will likely be faster since it will do faster
> >> lookups on each transition. It should still be safe to set all states
> >> as accept, even though some of those states will be inside a single
> >> Unicode character, as long as the strings you test against are whole
> >> UTF-8 sequences?
> >>
> >> > As it seems like I'm building up the Automaton at least three times,
> >> > and keeping a reference to the Automaton in the
> >> > CharacterRunAutomaton, is this the most memory efficient way of
> >> > building such an Automaton?
> >>
> >> Yeah, it is. The RunAutomaton will likely require substantial heap in
> >> your case, probably more than the original automaton.
> >>
> >> I suppose you don't actually need to keep the Automaton around once
> >> the RunAutomaton is built, but Lucene doesn't make this possible
> >> today, since the RunAutomaton holds onto the Automaton...
> >>
> >> > Thanks in advance,
> >>
> >> You're welcome!
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
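
For readers following the thread, the approach discussed above (make every state accepting, then walk the automaton one step at a time, with -1 meaning "no transition") can be illustrated with a minimal sketch in plain Java. This is not Lucene's Automaton class: `PrefixAutomatonSketch`, `matchesPrefix` and `stateCount` are hypothetical names, and the structure is a plain trie with no minimization, so it only shows the matching logic, not the memory behaviour discussed in the thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a tiny trie-shaped automaton, not Lucene's Automaton. */
class PrefixAutomatonSketch {
    // transitions.get(state) maps an input char to the id of the next state.
    private final List<Map<Character, Integer>> transitions = new ArrayList<>();

    PrefixAutomatonSketch(Iterable<String> words) {
        transitions.add(new HashMap<>()); // state 0 is the initial state
        for (String word : words) {
            int state = 0;
            for (int i = 0; i < word.length(); i++) {
                char label = word.charAt(i);
                int next = step(state, label);
                if (next == -1) {
                    // No transition yet: allocate a new state and link it.
                    next = transitions.size();
                    transitions.add(new HashMap<>());
                    transitions.get(state).put(label, next);
                }
                state = next;
            }
        }
    }

    /** Returns the target state, or -1 if the state has no transition for the label. */
    int step(int state, char label) {
        Integer next = transitions.get(state).get(label);
        return next == null ? -1 : next;
    }

    /** Every state counts as accepting, so any prefix of a stored word matches. */
    boolean matchesPrefix(String query) {
        int state = 0;
        for (int i = 0; i < query.length(); i++) {
            state = step(state, query.charAt(i));
            if (state == -1) {
                return false;
            }
        }
        return true;
    }

    int stateCount() {
        return transitions.size();
    }
}
```

Because all states accept, the loop only has to check that it never falls off the automaton; there is no final accept check.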
Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)
You could try using morfologik's byte-based implementation:

https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-fsa-builders/src/test/java/morfologik/fsa/builders/FSABuilderTest.java

I can't guarantee it'll be fast enough -- you need to sort those input sequences and even this may take a while. The construction of the automaton after that is fairly fast. What are the time limits you have with respect to input data sizes? Perhaps it's just unrealistic to assume everything is performed as part of a single request?

Dawid

On Wed, Feb 15, 2017 at 11:05 AM, Oliver Mannion wrote:
> However, the real problem I have is in building the Automaton
> via DaciukMihovAutomatonBuilder. On my input string set it consumes quite a
> bit of CPU, a lot of which seems to be GC activity, cleaning up about 800mb
> of garbage produced during build (DaciukMihovAutomatonBuilder.States I
> think). I'm using this in an app that also serves realtime requests, which
> timeout during building of the Automaton.
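
On the sorting step mentioned above: a minimal sketch, assuming the byte-based builder expects its input as UTF-8 byte sequences in unsigned lexicographic order (that requirement is an assumption here, so check the builder's documentation). `Utf8Sort` and its members are hypothetical helper names, not part of morfologik or Lucene. The point of the sketch is that this order is not the same as String.compareTo, which compares UTF-16 code units.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper: encode strings as UTF-8 and sort them byte-wise. */
class Utf8Sort {
    /** Lexicographic comparison that treats each byte as unsigned (0..255). */
    static final Comparator<byte[]> UNSIGNED_LEXICAL = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        // A sequence that is a prefix of another sorts first.
        return Integer.compare(a.length, b.length);
    };

    static List<byte[]> toSortedUtf8(List<String> words) {
        List<byte[]> out = new ArrayList<>();
        for (String w : words) {
            out.add(w.getBytes(StandardCharsets.UTF_8));
        }
        out.sort(UNSIGNED_LEXICAL);
        return out;
    }
}
```

The two orders differ only for supplementary characters versus BMP characters at U+E000 and above, so pure ASCII input sorts the same either way.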
Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)
We may be able to make DaciukMihovAutomatonBuilder's state registry more ram efficient too ... I think it's essentially the same thing as the FST.Builder's NodeHash, just minus the outputs that FSTs have vs automata.

Mike McCandless
http://blog.mikemccandless.com

On Wed, Feb 15, 2017 at 5:14 AM, Dawid Weiss wrote:
> You could try using morfologik's byte-based implementation:
>
> https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-fsa-builders/src/test/java/morfologik/fsa/builders/FSABuilderTest.java
>
> I can't guarantee it'll be fast enough -- you need to sort those input
> sequences and even this may take a while. The construction of the
> automaton after that is fairly fast. What are the time limits you have
> with respect to input data sizes? Perhaps it's just unrealistic to
> assume everything is performed as part of a single request?
>
> Dawid
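
To make the "state registry" idea concrete, here is a small sketch of incremental minimal-automaton construction from sorted input, in the style of the Daciuk-Mihov algorithm. It is a simplified illustration written for this thread, not Lucene's DaciukMihovAutomatonBuilder and not FST.Builder's NodeHash; `MinimalAutomatonSketch` and all of its members are hypothetical names. The registry is a hash map from a state to its canonical equivalent, which is also where most of the temporary objects come from.

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch: minimal automaton built from words added in sorted order. */
class MinimalAutomatonSketch {
    static final class State {
        final TreeMap<Character, State> next = new TreeMap<>();
        boolean accept;

        // Equivalent = same accept flag and the same labels leading to the
        // very same (already registered) target objects.
        @Override public boolean equals(Object o) {
            if (!(o instanceof State)) return false;
            State s = (State) o;
            if (accept != s.accept || next.size() != s.next.size()) return false;
            for (Map.Entry<Character, State> e : next.entrySet()) {
                if (s.next.get(e.getKey()) != e.getValue()) return false;
            }
            return true;
        }

        @Override public int hashCode() {
            int h = accept ? 1 : 0;
            for (Map.Entry<Character, State> e : next.entrySet()) {
                h = 31 * h + e.getKey() * 17 + System.identityHashCode(e.getValue());
            }
            return h;
        }
    }

    private final State root = new State();
    private final Map<State, State> registry = new HashMap<>();

    /** Words must be added in ascending String order. */
    void add(String word) {
        State state = root;
        int i = 0;
        // Follow the prefix shared with the previously added word.
        while (i < word.length() && state.next.containsKey(word.charAt(i))) {
            state = state.next.get(word.charAt(i));
            i++;
        }
        // The previous word's suffix below this point is final: minimize it.
        if (!state.next.isEmpty()) replaceOrRegister(state);
        for (; i < word.length(); i++) {
            State fresh = new State();
            state.next.put(word.charAt(i), fresh);
            state = fresh;
        }
        state.accept = true;
    }

    /** Minimizes the last pending path and returns the root state. */
    State finish() {
        if (!root.next.isEmpty()) replaceOrRegister(root);
        return root;
    }

    private void replaceOrRegister(State state) {
        Map.Entry<Character, State> last = state.next.lastEntry();
        State child = last.getValue();
        if (!child.next.isEmpty()) replaceOrRegister(child);
        State existing = registry.get(child);
        if (existing != null) {
            state.next.put(last.getKey(), existing); // reuse the equivalent state
        } else {
            registry.put(child, child);
        }
    }

    static boolean accepts(State root, String word) {
        State s = root;
        for (int i = 0; i < word.length() && s != null; i++) s = s.next.get(word.charAt(i));
        return s != null && s.accept;
    }

    static int countStates(State root) {
        return count(root, new IdentityHashMap<>());
    }

    private static int count(State s, Map<State, Boolean> seen) {
        if (seen.put(s, Boolean.TRUE) != null) return 0;
        int n = 1;
        for (State t : s.next.values()) n += count(t, seen);
        return n;
    }
}
```

For the sorted input tap, taps, top, tops, a plain trie needs eight states, while the registry lets the "to" branch reuse the states already built for "ta".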
Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)
Yep, true. I just wonder whether it's worth complicating the code... Could be easier to build an FST and then recreate a RunAutomaton from that directly... :)

Dawid

On Wed, Feb 15, 2017 at 11:26 AM, Michael McCandless wrote:
> We may be able to make DaciukMihovAutomatonBuilder's state registry
> more ram efficient too ... I think it's essentially the same thing as
> the FST.Builder's NodeHash, just minus the outputs that FSTs have vs
> automata.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)
Actually, that's a great idea to try (Oliver). It would be a relatively simple conversion... maybe Lucene could add some sugar on top, e.g. to convert an FST to an automaton. Hmm, maybe it even exists somewhere already...

But even the FST Builder's NodeHash can be non-trivial in its heap usage, but hopefully less than DaciukMihovAutomatonBuilder. (And yes I do love how simple DaciukMihovAutomatonBuilder is).

Mike McCandless
http://blog.mikemccandless.com

On Wed, Feb 15, 2017 at 5:39 AM, Dawid Weiss wrote:
> Yep, true. I just wonder whether it's worth complicating the code...
> Could be easier to build an FST and then recreate a RunAutomaton
> from that directly... :)
>
> Dawid
Re: Numeric Ranges Faceting
Hi,

Thanks for the suggestion. But in the case of drill sideways search, retrieving all dimensions (using Facets.getAllDims()) threw the exception shown below.

1. While opening DocValuesReaderState, global ordinals and the ordinals range map are computed for the '$facets' field only.
2. A NumericDocValuesField is never indexed under '$facets', so the ordinal range map will be null for the numeric field, i.e. 'time'.

java.lang.IllegalArgumentException: dimension "time" was not indexed
    at org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts.getTopChildren(SortedSetDocValuesFacetCounts.java:91)
    at org.apache.lucene.facet.MultiFacets.getAllDims(MultiFacets.java:74)

In my use case:
- Both string pathTraversed and numeric pathTraversedRanges will occur.
- Both faceted search and drill sideways search will be used.

So how can I add path-traversed numeric ranges? Have I missed anything? Kindly post your suggestions.

Regards,
Chitra

On Wed, Feb 15, 2017 at 3:28 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
> Hi, have a look at the RangeFacetsExample.java under the lucene/demo
> module... it shows how to do this.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
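
As a mental model for the question above, here is a conceptual sketch in plain Java with no Lucene calls at all: range facet counts can be computed directly from each document's numeric value rather than from ordinals stored under '$facets', and drilling down on a chosen range amounts to filtering on that value. `RangeFacetSketch`, `Range`, `count` and `drillDown` are hypothetical names for illustration only; the demo referenced in the thread is the place to look for the actual Lucene classes.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Conceptual illustration only; none of these types are Lucene classes. */
class RangeFacetSketch {
    /** A labelled numeric range with inclusive bounds. */
    static final class Range {
        final String label;
        final long min;
        final long max;

        Range(String label, long min, long max) {
            this.label = label;
            this.min = min;
            this.max = max;
        }

        boolean contains(long value) {
            return value >= min && value <= max;
        }
    }

    /** Counts how many values fall into each range; ranges may overlap. */
    static Map<String, Integer> count(long[] values, List<Range> ranges) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Range r : ranges) {
            counts.put(r.label, 0);
        }
        for (long v : values) {
            for (Range r : ranges) {
                if (r.contains(v)) {
                    counts.merge(r.label, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Drilling down on a range keeps only the values inside it. */
    static long[] drillDown(long[] values, Range selected) {
        return Arrays.stream(values).filter(selected::contains).toArray();
    }
}
```

Seen this way, the numeric dimension never needs an entry in the string-facet ordinal map, which is consistent with the "was not indexed" error when a sorted-set counter is asked about it.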
[ANNOUNCE] Apache Lucene 5.5.4 released
15 February 2017, Apache Lucene™ 5.5.4 available

The Lucene PMC is pleased to announce the release of Apache Lucene 5.5.4

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

This release contains 8 bug fixes and 4 other changes since the 5.5.3 release, in particular:

* Made stored fields reclaim native memory more aggressively
* Fixed a potential memory leak with LRUQueryCache and (Span)TermQuery
* MmapDirectory's unmapping code is now compatible with Java 9 (EA build 150 and later)

The release is available for immediate download at:
http://www.apache.org/dyn/closer.lua/lucene/java/5.5.4

Please read CHANGES.txt for a full list of new features and changes:
https://lucene.apache.org/core/5_5_4/changes/Changes.html

Please report any feedback to the mailing lists (http://lucene.apache.org/core/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access.

--
Adrien Grand
Recommended number of fields in one lucene index
Hi All,

Elasticsearch allows 1000 fields by default. In Lucene, what are the indexing and searching performance impacts of having 10 fields vs. 3000 fields in an index?

In my case, I index and store all fields while indexing, so that I can support updating a single field: we read back all the stored fields (except the field to be updated) and index everything again (delete the document, then add it back with the remaining fields). While searching, I use an _all_ blob field to search the text of all fields.

--
Kumaran R
Re: Recommended number of fields in one lucene index
I think it is hard to come up with a general rule, but there is certainly a per-field overhead. There are some things that we need to store per field per segment in memory, so if you multiply the number of fields you have, you could run out of memory.

In most cases I have seen where the index had so many fields, it was either because the application wanted to index arbitrary documents and provide search for them, which cannot scale, or because the index contained many unrelated documents that should have been put into different indices. This limit has been very useful to catch such design problems early instead of waiting for the production server to run out of memory due to the multiplication of fields.

On Wed, Feb 15, 2017 at 19:44, Kumaran Ramasubramanian wrote:
> While searching, i use _all_ blob field to search in texts of all fields
> data.

This is interesting: if all your searches go to a catch-all field, then it means that you do not need those thousands of fields but could just have a single indexed field that is used for searching, and a binary blob that stores all the data so that you can perform updates. So this only requires two fields from a Lucene perspective.
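
One way to picture the two-field layout is sketched below in plain Java: all logical fields are packed into a single byte[] (the blob that would be stored), and their values are concatenated into one string (the text that would go into the single indexed field). The encoding format and the names `FieldBlobSketch`, `encode`, `decode`, `catchAll` and `update` are assumptions made for illustration; nothing here is a Lucene API, and how the result is written back to the index is outside this sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: many logical fields kept in one binary blob. */
class FieldBlobSketch {
    /** Packs the fields as: count, then (name, value) pairs. */
    static byte[] encode(Map<String, String> fields) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(fields.size());
            for (Map.Entry<String, String> e : fields.entrySet()) {
                // writeUTF is limited to 65535 encoded bytes per string.
                out.writeUTF(e.getKey());
                out.writeUTF(e.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reverses encode(), preserving the original field order. */
    static Map<String, String> decode(byte[] blob) {
        Map<String, String> fields = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(blob))) {
            int n = in.readInt();
            for (int i = 0; i < n; i++) {
                String name = in.readUTF();
                fields.put(name, in.readUTF());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return fields;
    }

    /** Text for the single indexed catch-all field. */
    static String catchAll(Map<String, String> fields) {
        return String.join(" ", fields.values());
    }

    /** Updating one logical field: decode the blob, change it, re-encode. */
    static byte[] update(byte[] blob, String name, String value) {
        Map<String, String> fields = decode(blob);
        fields.put(name, value);
        return encode(fields);
    }
}
```

With this layout an update touches only the blob and the catch-all text, so the number of Lucene fields stays constant no matter how many logical fields a document has.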
Re: Recommended number of fields in one lucene index
Hi Adrien Grand,

Thanks for the response.

> a binary blob that stores all the data so that you can perform updates.

Could you elaborate on this? Do you mean using a StoredField, as in the snippet below, to store all the other fields that are needed only for updates? Is there any way to use the updateDocuments API for this kind of update, instead of reading back the stored fields and then deleting and re-adding the updated document?

> Use a StoredField. You can pass in either the BytesRef, or the byte array
> itself into the field:
>
>     byte[] myByteArray = new byte[10];
>     document.add(new StoredField("bin1", myByteArray));
>
> As far as retrieving the value, you are on about the right track there
> already. Something like:
>
>     Document resultDoc = searcher.doc(docno);
>     BytesRef bin1ref = resultDoc.getBinaryValue("bin1");
>     // The valid data is bin1ref.length bytes starting at bin1ref.offset.
>     byte[] bin1bytes = bin1ref.bytes;

Snippet from: http://stackoverflow.com/a/34324561/1382168

--
Kumaran R

On Thu, Feb 16, 2017 at 12:38 AM, Adrien Grand wrote:
> This is interesting: if all your searches go to a catch-all field, then it
> means that you do not need those thousands of fields but could just have a
> single indexed field that is used for searching, and a binary blob that
> stores all the data so that you can perform updates. So this only requires
> two fields from a Lucene perspective.