But why is it so costly? In a regular query we walk postings and match document numbers; in a SpanQuery we match position numbers (or position segments). What's the principal difference? I think it's just that #documents << #positions.
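A back-of-envelope count makes the #documents << #positions point concrete. The corpus sizes below are assumed purely for illustration, not taken from the thread:

```java
public class PositionsVsDocs {
    // A doc-level conjunction advances roughly one entry per candidate
    // document, while a positional (span) query must walk every indexed
    // position. The ratio approximates the extra per-query work.
    static long ratio(long docs, long tokensPerDoc, long parsesPerToken) {
        long docEntries = docs;
        long positionEntries = docs * tokensPerDoc * parsesPerToken;
        return positionEntries / docEntries;
    }

    public static void main(String[] args) {
        // Assumed: 1M docs, 500 tokens each, 2 stacked parses per token.
        System.out.println(ratio(1_000_000L, 500L, 2L)); // prints 1000
    }
}
```

Under these assumptions each positional step touches about three orders of magnitude more entries than a doc-ID intersection, which is consistent with SpanNearQuery being much slower than a plain boolean query.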
For "A,sg" and "A,pl" I use unordered SpanNearQueries with slop=-1, and I wrap them into an ordered SpanNearQuery with slop=0.
I see getPayload() at the top of the profiler output. I think I can emulate payload checking with cleverly assigned position increments (the maximum position in a document might then jump up to ~10^9; I hope it won't blow up the whole index).
If I remove payload matching and keep only position checking, will everything speed up, or are positions and payloads read the same way?
My main goal is getting precise results for a query, so proximity boosting won't help, unfortunately.

--
Best Regards,
Igor

18.10.2013, 23:37, "Michael McCandless" <luc...@mikemccandless.com>:
> Unfortunately, SpanNearQuery is a very costly query. What slop are you
> passing?
>
> You might want to check out
> https://issues.apache.org/jira/browse/LUCENE-5288 ... it adds
> proximity boosting to queries, but it's still very early in the
> iteration, and if you need a precise count of only those documents
> matching the SpanNearQuery, then that issue won't help.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Oct 17, 2013 at 6:05 PM, Igor Shalyminov
> <ishalymi...@yandex-team.ru> wrote:
>
>> Mike,
>>
>> For now I'm using just a SpanQuery over a ~600MB index segment,
>> single-threadedly (one segment per thread; the complete setup is 30
>> segments with a total of 20GB).
>>
>> I'm trying to use Lucene for a morphologically annotated text corpus
>> (namely, the Russian National Corpus).
>> The main query type is co-occurrence search with desired word
>> morphological features and distance between tokens.
>>
>> In my test case I work with a single field, grammar (it is word-level:
>> every word in the corpus has one). The full grammar annotation of a
>> word is a set of atomic grammar features.
>> For example, the verb "book" has in its grammar:
>> - POS tag (V);
>> - tense (pres);
>>
>> and the noun "book":
>> - POS tag (N);
>> - number (sg).
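The position-increment trick mentioned at the top of the thread could look roughly like this. `MAX_PARSES` and `encode` are hypothetical names, and the analyzer wiring is omitted; this is a sketch of the idea, not a tested indexing scheme:

```java
public class ParsePositionEncoding {
    // Hypothetical scheme: give every word a slot of MAX_PARSES positions
    // and place each ambiguous reading at wordIndex*MAX_PARSES + parseIndex.
    // Two tokens then belong to the same parse of the same word exactly
    // when their encoded positions are equal, so no payload check is needed.
    static final int MAX_PARSES = 4;

    static int encode(int wordIndex, int parseIndex) {
        return wordIndex * MAX_PARSES + parseIndex;
    }

    public static void main(String[] args) {
        // "book" at word 0: parse 0 = {V, pres}, parse 1 = {N, sg}
        int v = encode(0, 0), pres = encode(0, 0);
        int n = encode(0, 1), sg = encode(0, 1);
        System.out.println(v == pres); // same parse: "V" & "pres" co-occur
        System.out.println(v == sg);   // different parses: no match
    }
}
```

Note the trade-off the thread worries about: positions grow by a factor of `MAX_PARSES`, and span distances between adjacent words become multiples of `MAX_PARSES` rather than 1, so slop values would need adjusting accordingly.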
>>
>> In general one grammar annotation has approximately 8 atomic features.
>>
>> Words are treated as initially ambiguous, so for an occurrence of the
>> word "book" in the text we get the grammar tokens:
>> V pres N sg
>> 2 parses: "V,pres" and "N,sg" are just independent tokens with
>> positionIncrement=0 in the index.
>>
>> Moreover, each such token has a parse bitmask in its payload:
>> V|0001 pres|0001 N|0010 sg|0010
>>
>> Here, V and pres appeared in the 1st parse, N and sg in the 2nd, with a
>> maximum of 4 parse variants. This allows me to find the word "book" for
>> the query "V" & "pres" but not for the query "V" & "sg".
>>
>> So, I'm performing a SpanNearQuery {"A,sg" right before "N,sg"}
>> with position and payload checking over a 600MB segment and getting the
>> precise number of doc hits and overall matches by iterating over
>> getSpans().
>>
>> This takes about 20 seconds, even when everything is in RAM.
>> The next thing I'm going to explore is compression; I'll try
>> DirectPostingsFormat as you suggested.
>>
>> --
>> Best Regards,
>> Igor
>>
>> 17.10.2013, 20:26, "Michael McCandless" <luc...@mikemccandless.com>:
>>> DirectPostingsFormat holds all postings in RAM, uncompressed, as
>>> simple java arrays. But it's quite RAM-heavy...
>>>
>>> The hotspots may also be in the queries you are running... maybe you
>>> can describe more how you're using Lucene?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Thu, Oct 17, 2013 at 10:56 AM, Igor Shalyminov
>>> <ishalymi...@yandex-team.ru> wrote:
>>>> Hello!
>>>>
>>>> I've tried two approaches: 1) RAMDirectory, 2) MMapDirectory + tmpfs.
>>>> Both work the same for me (the same bad :( ).
>>>> Thus, I think my problem is not disk access (although I always see
>>>> getPayload() at the top in VisualVM).
>>>> So, maybe the hard part of the postings traversal is decompression?
>>>> Are there Lucene codecs which use light postings compression (maybe
>>>> none at all)?
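The parse-bitmask check described earlier in the thread can be sketched in isolation. `sameParse` is a hypothetical helper, not a Lucene API; in the real setup this test would run against the bytes returned by getPayload():

```java
public class ParseBitmask {
    // Each token carries a bitmask of the parses it appears in. Two
    // tokens stacked at the same position belong to a common parse
    // exactly when the bitwise AND of their masks is non-zero.
    static boolean sameParse(int maskA, int maskB) {
        return (maskA & maskB) != 0;
    }

    public static void main(String[] args) {
        int v = 0b0001, pres = 0b0001; // 1st parse of "book": V, pres
        int n = 0b0010, sg = 0b0010;   // 2nd parse of "book": N, sg
        System.out.println(sameParse(v, pres)); // query "V" & "pres" matches
        System.out.println(sameParse(v, sg));   // query "V" & "sg" does not
    }
}
```

With at most 4 parse variants the mask fits in half a byte, so the payload itself stays tiny; the reported cost is in fetching it for every candidate position, not in the AND itself.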
>>>>
>>>> And, getting back to the in-memory index topic, is lucene.codecs.memory
>>>> somewhat similar to RAMDirectory?
>>>>
>>>> --
>>>> Best Regards,
>>>> Igor
>>>>
>>>> 10.10.2013, 03:01, "Vitaly Funstein" <vfunst...@gmail.com>:
>>>>> I don't think you want to load indexes of this size into a
>>>>> RAMDirectory. The reasons have been listed multiple times here...
>>>>> in short, just use MMapDirectory.
>>>>>
>>>>> On Wed, Oct 9, 2013 at 3:17 PM, Igor Shalyminov
>>>>> <ishalymi...@yandex-team.ru> wrote:
>>>>>> Hello!
>>>>>>
>>>>>> I need to run an experiment: load the entire index into RAM and
>>>>>> see how the search performance changes.
>>>>>> My index has TermVectors with payload and position info,
>>>>>> StoredFields, and DocValues. It takes ~30GB on disk (the server
>>>>>> has 48).
>>>>>>
>>>>>> _indexDirectoryReader = DirectoryReader.open(RAMDirectory.open(new
>>>>>> File(_indexDirectory)));
>>>>>>
>>>>>> Is the line above the only thing I have to do to achieve this?
>>>>>>
>>>>>> And also:
>>>>>> - will all the data be loaded into RAM right after opening, or
>>>>>> during the reading stage?
>>>>>> - will the index data be stored in RAM as it is on disk, or will
>>>>>> it be uncompressed first?
>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Igor
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org