Re: Lucene in-memory index

2013-10-31 Thread Michael McCandless
On Fri, Oct 25, 2013 at 9:58 AM, Igor Shalyminov wrote: > What is ProxBooleanTermQuery? > I couldn't find it in the trunk and in that ticket's > (https://issues.apache.org/jira/browse/LUCENE-2878) patch. Sorry, this is on https://issues.apache.org/jira/browse/LUCENE-5288 Next time try searchin

Re: Lucene in-memory index

2013-10-25 Thread Igor Shalyminov
What is ProxBooleanTermQuery? I couldn't find it in the trunk or in that ticket's (https://issues.apache.org/jira/browse/LUCENE-2878) patch. And for now it's very fuzzy to me how the searching/scoring works. Are there any tutorials or talks on how Queries, Scorers, and Collectors interoperate?

Re: Lucene in-memory index

2013-10-23 Thread Michael McCandless
On Tue, Oct 22, 2013 at 9:43 AM, Igor Shalyminov wrote: > Thanks for the link, I'll definitely dig into SpanQuery internals very soon. You could also just make a custom query. If you start from the ProxBooleanTermQuery on that issue, but change it so that it rejects hits that didn't have terms

Re: Lucene in-memory index

2013-10-22 Thread Igor Shalyminov
Hello Mike! 19.10.2013, 14:54, "Michael McCandless" : > On Fri, Oct 18, 2013 at 5:50 PM, Igor Shalyminov > wrote: > >> But why is it so costly? > > I think because the matching is inherently complex? But also because > it does high-cost things like allocating new List and Set for every > match

Re: Lucene in-memory index

2013-10-19 Thread Michael McCandless
On Fri, Oct 18, 2013 at 5:50 PM, Igor Shalyminov wrote: > But why is it so costly? I think because the matching is inherently complex? But also because it does high-cost things like allocating new List and Set for every matched doc (e.g. NearSpansOrdered.shrinkToAfterShortestMatch) to hold all p

Re: Lucene in-memory index

2013-10-18 Thread Igor Shalyminov
But why is it so costly? In a regular query we walk postings and match document numbers, in a SpanQuery we match position numbers (or position segments), what's the principal difference? I think it's just that #documents << #positions. For "A,sg" and "A,pl" I use unordered SpanNearQueries with
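The question above (document matching versus position matching) can be illustrated with a toy model. This is a minimal sketch, not Lucene code: the `Posting` class, the method names, and the `maxDistance` parameter are all hypothetical, and `maxDistance` is a plain positional distance rather than Lucene's exact definition of slop. It only shows that a proximity match repeats the document-level merge and then runs a second merge over positions inside every shared document.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model, not Lucene code: a posting is a doc ID plus its sorted term positions.
final class PostingsMatchSketch {

    static final class Posting {
        final int doc;
        final int[] positions; // sorted ascending

        Posting(int doc, int... positions) {
            this.doc = doc;
            this.positions = positions;
        }
    }

    // Document-level AND: one merge over doc IDs; positions are never read.
    static List<Integer> matchDocs(Posting[] a, Posting[] b) {
        List<Integer> hits = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i].doc < b[j].doc) i++;
            else if (a[i].doc > b[j].doc) j++;
            else { hits.add(a[i].doc); i++; j++; }
        }
        return hits;
    }

    // Unordered proximity: the same doc merge, plus a position merge per shared doc.
    static List<Integer> matchNear(Posting[] a, Posting[] b, int maxDistance) {
        List<Integer> hits = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i].doc < b[j].doc) i++;
            else if (a[i].doc > b[j].doc) j++;
            else {
                if (withinDistance(a[i].positions, b[j].positions, maxDistance)) {
                    hits.add(a[i].doc);
                }
                i++;
                j++;
            }
        }
        return hits;
    }

    // Two-pointer scan: is any position in p within maxDistance of a position in q?
    static boolean withinDistance(int[] p, int[] q, int maxDistance) {
        int i = 0, j = 0;
        while (i < p.length && j < q.length) {
            if (Math.abs(p[i] - q[j]) <= maxDistance) return true;
            if (p[i] < q[j]) i++; else j++;
        }
        return false;
    }

    public static void main(String[] args) {
        Posting[] a = { new Posting(1, 0, 10), new Posting(3, 5), new Posting(7, 2, 40) };
        Posting[] b = { new Posting(1, 50), new Posting(3, 6), new Posting(7, 41), new Posting(9, 1) };
        System.out.println(matchDocs(a, b));    // [1, 3, 7]
        System.out.println(matchNear(a, b, 1)); // [3, 7]
    }
}
```

In this toy model the extra cost grows with the number of positions in the shared documents, which fits the observation that #documents << #positions. It does not model the per-match List and Set allocations mentioned elsewhere in the thread.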

Re: Lucene in-memory index

2013-10-18 Thread Michael McCandless
On Fri, Oct 18, 2013 at 1:19 PM, Igor Shalyminov wrote: > OK, it turns out that DirectPostingsFormat is really an extreme thing: 8GB of > index couldn't fit into 20+ java heap. > I wonder if there is a postings format that works from disk the standard way > but uses no compression? Yes, it's v

Re: Lucene in-memory index

2013-10-18 Thread Michael McCandless
Unfortunately, SpanNearQuery is a very costly query. What slop are you passing? You might want to check out https://issues.apache.org/jira/browse/LUCENE-5288 ... it adds proximity boosting to queries, but it's still very early in the iterating, and if you need a precise count of only those docume

Re: Lucene in-memory index

2013-10-18 Thread Igor Shalyminov
Hello! OK, it turns out that DirectPostingsFormat is really an extreme thing: 8GB of index couldn't fit into 20+ java heap. I wonder if there is a postings format that works from disk the standard way but uses no compression? -- Best Regards, Igor 18.10.2013, 02:06, "Igor Shalyminov" : > Mik

Re: Lucene in-memory index

2013-10-17 Thread Igor Shalyminov
Mike, For now I'm using just a SpanQuery over a ~600MB index segment in a single thread (one segment - one thread, the complete setup is 30 segments with a total of 20GB). I'm trying to use Lucene for a morphologically annotated text corpus (namely, Russian National Corpus). The main query

Re: Lucene in-memory index

2013-10-17 Thread Michael McCandless
DirectPostingsFormat holds all postings in RAM, uncompressed, as simple java arrays. But it's quite RAM heavy... The hotspots may also be in the queries you are running ... maybe you can describe more how you're using Lucene? Mike McCandless http://blog.mikemccandless.com On Thu, Oct 17, 2013
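To make the trade-off concrete (compressed postings that must be decoded on every traversal versus plain arrays that cost more RAM), here is a minimal sketch. It uses a toy delta plus variable-byte encoding chosen only for illustration; the class and method names are hypothetical, and it is not claimed to match the encoding of any Lucene postings format.

```java
import java.io.ByteArrayOutputStream;

// Toy model, not Lucene code: sorted doc IDs stored as variable-byte gaps.
final class PostingsEncodingSketch {

    // Store each doc ID as the gap to the previous one, 7 bits per byte;
    // a set high bit means "more bytes of this gap follow".
    static byte[] encode(int[] sortedDocs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int doc : sortedDocs) {
            int gap = doc - prev;
            while ((gap & ~0x7F) != 0) {
                out.write((gap & 0x7F) | 0x80);
                gap >>>= 7;
            }
            out.write(gap);
            prev = doc;
        }
        return out.toByteArray();
    }

    // Every traversal pays for this loop; a plain int[] is read directly instead.
    static int[] decode(byte[] bytes, int count) {
        int[] docs = new int[count];
        int pos = 0, prev = 0;
        for (int n = 0; n < count; n++) {
            int gap = 0, shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                gap |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += gap;
            docs[n] = prev;
        }
        return docs;
    }

    public static void main(String[] args) {
        int[] docs = { 3, 130, 131, 20000 };
        byte[] packed = encode(docs);
        // 6 bytes encoded versus 16 bytes as a plain int[4]
        System.out.println(packed.length + " bytes vs " + (4 * docs.length) + " bytes");
    }
}
```

The plain array is several times larger here but needs no decoding step, which is the same direction of trade-off described for DirectPostingsFormat; the actual ratio for a real index depends on the data.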

Re: Lucene in-memory index

2013-10-17 Thread Igor Shalyminov
Hello! I've tried two approaches: 1) RAMDirectory, 2) MMapDirectory + tmpfs. Both perform the same for me (equally badly :( ). Thus, I think my problem is not disk access (although I always see getPayload() at the top in VisualVM). So, maybe the hard part in the postings traversal is decompression? Are

Re: Lucene in-memory index

2013-10-09 Thread Vitaly Funstein
I don't think you want to load indexes of this size into a RAMDirectory. The reasons have been listed multiple times here... in short, just use MMapDirectory. On Wed, Oct 9, 2013 at 3:17 PM, Igor Shalyminov wrote: > Hello! > > I need to perform an experiment of loading the entire index in RAM an
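For readers unfamiliar with what memory mapping provides, here is a minimal sketch using only `java.nio`, the mechanism (`FileChannel.map`) that MMapDirectory builds on. The class name, method names, and the temporary file are hypothetical and stand in for an index file; this is not how Lucene itself is configured.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Toy example, not Lucene code: read a file through a read-only memory mapping.
final class MmapReadSketch {

    // Maps the whole file and sums its bytes. The mapped pages are served by
    // the OS page cache and do not occupy the Java heap.
    static long sumBytes(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long sum = 0;
            while (buf.hasRemaining()) {
                sum += buf.get() & 0xFF;
            }
            return sum;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes three bytes to a temporary file and reads them back via the mapping.
    static long demo() {
        try {
            Path file = Files.createTempFile("mmap-sketch", ".bin");
            file.toFile().deleteOnExit();
            Files.write(file, new byte[] { 1, 2, 3 });
            return sumBytes(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 6
    }
}
```

Because the data stays outside the heap, a mapped file does not add to garbage-collection pressure the way an equally large on-heap copy would, and once the pages are cached, reads do not touch the disk.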