Unfortunately, SpanNearQuery is a very costly query. What slop are you passing?
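For reference, a rough sketch (4.x-era API, untested) of the kind of span query in question, using the grammar field and token spellings from the description quoted below as placeholders; slop=0 with inOrder=true expresses "immediately before", and each extra unit of slop widens the position window the span enumeration has to examine:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SpanNearSketch {
    public static void main(String[] args) {
        // Two grammar tokens that must co-occur: "A,sg" immediately before "N,sg".
        SpanQuery first  = new SpanTermQuery(new Term("grammar", "A,sg"));
        SpanQuery second = new SpanTermQuery(new Term("grammar", "N,sg"));

        // slop = 0, inOrder = true -> strict in-order adjacency.
        // A larger slop admits more candidate positions per document,
        // which is a big part of SpanNearQuery's cost.
        SpanNearQuery near = new SpanNearQuery(new SpanQuery[] { first, second }, 0, true);
        System.out.println(near);
    }
}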
You might want to check out https://issues.apache.org/jira/browse/LUCENE-5288 ... it adds proximity boosting to queries, but it's still in early iterations, and if you need a precise count of only those documents matching the SpanNearQuery, that issue won't help.

Mike McCandless
http://blog.mikemccandless.com

On Thu, Oct 17, 2013 at 6:05 PM, Igor Shalyminov <ishalymi...@yandex-team.ru> wrote:
> Mike,
>
> For now I'm running just a SpanQuery over a ~600MB index segment in a single thread (one segment per thread; the complete setup is 30 segments, 20GB in total).
>
> I'm using Lucene for a morphologically annotated text corpus (namely, the Russian National Corpus).
> The main query type there is co-occurrence search with desired morphological features on each word and a distance between the tokens.
>
> In my test case I work with a single field, grammar (it is word-level: every word in the corpus has one). The full grammar annotation of a word is a set of atomic grammar features.
> For example, the verb "book" has in its grammar:
> - POS tag (V);
> - tense (pres);
>
> and the noun "book":
> - POS tag (N);
> - number (sg).
>
> In general one grammar annotation has approximately 8 atomic features.
>
> Words are treated as initially ambiguous, so for an occurrence of the word "book" in the text we get the grammar tokens:
> V pres N sg
> (2 parses, "V,pres" and "N,sg"); they are just independent tokens with positionIncrement=0 in the index.
>
> Moreover, each such token has a parse bitmask in its payload:
> V|0001 pres|0001 N|0010 sg|0010
>
> Here, V and pres appeared in the 1st parse and N and sg in the 2nd, with a maximum of 4 parse variants. This lets me find the word "book" for the query "V" & "pres" but not for the query "V" & "sg".
>
> So, I'm running a SpanNearQuery ("A,sg" immediately before "N,sg") with position and payload checking over a 600MB segment and getting the precise number of document hits and the overall number of matches by iterating over getSpans().
>
> This takes about 20 seconds, even with everything in RAM.
> The next thing I'm going to explore is compression; I'll try DirectPostingsFormat as you suggested.
>
> --
> Best Regards,
> Igor
>
> 17.10.2013, 20:26, "Michael McCandless" <luc...@mikemccandless.com>:
>> DirectPostingsFormat holds all postings in RAM, uncompressed, as simple Java arrays. But it's quite RAM-heavy...
>>
>> The hotspots may also be in the queries you are running ... maybe you can describe in more detail how you're using Lucene?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Thu, Oct 17, 2013 at 10:56 AM, Igor Shalyminov <ishalymi...@yandex-team.ru> wrote:
>>
>>> Hello!
>>>
>>> I've tried two approaches: 1) RAMDirectory, 2) MMapDirectory + tmpfs. Both perform the same for me (equally badly :( ).
>>> Thus, I think my problem is not disk access (although I always see getPayload() at the top of the VisualVM profile).
>>> So maybe the hard part of the postings traversal is decompression?
>>> Are there Lucene codecs that use light postings compression (or none at all)?
>>>
>>> And, getting back to the in-memory index topic, is lucene.codecs.memory somewhat similar to RAMDirectory?
>>>
>>> --
>>> Best Regards,
>>> Igor
>>>
>>> 10.10.2013, 03:01, "Vitaly Funstein" <vfunst...@gmail.com>:
>>>> I don't think you want to load indexes of this size into a RAMDirectory.
>>>> The reasons have been listed multiple times here... in short, just use MMapDirectory.
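For completeness, a minimal untested sketch of that route in the 4.x API; the index path below is a placeholder. The idea is that the OS page cache, rather than the Java heap, keeps the hot parts of the index in memory:

import java.io.File;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;

public class OpenMmapIndex {
    public static void main(String[] args) throws Exception {
        // Memory-map the index files; the OS page cache keeps hot pages in RAM,
        // and nothing is copied onto the Java heap.
        Directory dir = new MMapDirectory(new File("/path/to/index")); // placeholder path
        DirectoryReader reader = DirectoryReader.open(dir);
        System.out.println("maxDoc=" + reader.maxDoc());
        reader.close();
        dir.close();
    }
}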
>>>>
>>>> On Wed, Oct 9, 2013 at 3:17 PM, Igor Shalyminov <ishalymi...@yandex-team.ru> wrote:
>>>>> Hello!
>>>>>
>>>>> I need to run an experiment: load the entire index into RAM and see how the search performance changes.
>>>>> My index has TermVectors with payload and position info, StoredFields, and DocValues. It takes ~30GB on disk (the server has 48).
>>>>>
>>>>> _indexDirectoryReader = DirectoryReader.open(RAMDirectory.open(new File(_indexDirectory)));
>>>>>
>>>>> Is the line above all I need to do to accomplish this?
>>>>>
>>>>> And also:
>>>>> - will all the data be loaded into RAM right after opening, or during the reading stage?
>>>>> - will the index data be stored in RAM as it is on disk, or will it be uncompressed first?
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Igor
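For reference: as far as I know RAMDirectory has no static open(File), so the line quoted above won't compile as-is; the usual 4.x way to pull an existing on-disk index onto the heap is RAMDirectory's copying constructor, sketched below (untested; the path is a placeholder, and as noted upthread MMapDirectory is usually the better choice at this index size). The copy happens eagerly when the RAMDirectory is constructed, and the files keep their on-disk (still compressed) encoding:

import java.io.File;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.RAMDirectory;

public class LoadIndexIntoRamSketch {
    public static void main(String[] args) throws Exception {
        Directory onDisk = FSDirectory.open(new File("/path/to/index")); // placeholder path

        // Eagerly copies every index file into heap-resident buffers, byte for byte;
        // the data stays in its on-disk (compressed) encoding.
        Directory inRam = new RAMDirectory(onDisk, IOContext.READONCE);
        onDisk.close();

        DirectoryReader reader = DirectoryReader.open(inRam);
        System.out.println("maxDoc=" + reader.maxDoc());
        reader.close();
        inRam.close();
    }
}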