Re: Getting payload data in search results

2009-09-16 Thread Grant Ingersoll
On Sep 14, 2009, at 2:42 PM, Sherrill, Delsey wrote: I think I have a problem that would benefit from the new term payload feature, but I'm not sure. Every example of payload usage that I can find factors them into the scoring, but doesn't return them with the search results. In my case

RE: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Uwe Schindler
Hi Thomas, I think we found the root of the problem. We opened https://issues.apache.org/jira/browse/LUCENE-1911 . Could you please try the attached patch to see whether it solves your problems? It has to do with how CachingWrapperFilter and QueryWrapperFilter work together, which changed in 2.9. -

RE: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Uwe Schindler
I am currently preparing a patch for CachingWrapperFilter that has a boolean ctor parameter (useOpenBitSetCache) and the method getDocIdSet will then do what QueryWrapperFilter did before LUCENE-1427. I would not do this in QueryWrapperFilter like before, because it would slow down MultiTermQuery i

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Mark Miller
Yeah, thanks Uwe - I had read it quickly, but was just rereading and it was sinking in. I hadn't cross correlated the issues yet. Makes perfect sense. Very nice catch. >Maybe we need some change to CachingWrapperFilter that caches the DocIdSets >as before, but optionally would wrap it into an Ope

RE: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Uwe Schindler
See my mail about the CachingWrapperFilter and QueryWrapperFilter, I think it explains this behaviour (and Thomas ran some warming queries before). - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Mark Mi

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Mark Miller
Never mind. I see advance wasn't around in 2.4. This is part of the DocIdSetIterator changes. Anyway - either these are just not comparable runs, or there is a major bug (which seems unlikely). Just to keep pointing out the obvious: 2.4 calls doc 195,000 times; 2.9 calls docId 1.4 million times. T

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Mark Miller
Notice that while DisjunctionScorer.advance and DisjunctionScorer.advanceAfterCurrent appear to be called in 2.9, in 2.4, I am only seeing DisjunctionScorer.advanceAfterCurrent called. Can someone explain that? Mark Miller wrote: > Something is very odd about this if they both cover the same search

RE: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Uwe Schindler
I found one thing in your debug output: You are using a lot of CachingWrapperFilters around QueryWrapperFilter. According to http://issues.apache.org/jira/browse/LUCENE-1427, QueryWrapperFilter does not copy the scorer's doc ids into an OpenBitSet; instead it returns the scorer itself as DocIdSet (
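
To make the point above concrete, here is a minimal plain-Java sketch of why caching a lazily evaluated doc id set buys nothing, while copying the ids into a bit set once does. It does not use Lucene at all: `LazyDocIds`, `countMatches`, `materialize` and the `evaluations` counter are hypothetical names invented for illustration, and `java.util.BitSet` merely stands in for an OpenBitSet.

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

/** Hypothetical stand-in for a doc id set backed by a live scorer; not a Lucene class. */
final class LazyDocIds {
    private final int maxDoc;
    private final IntPredicate matches;
    int evaluations = 0; // how many times the match predicate has been run

    LazyDocIds(int maxDoc, IntPredicate matches) {
        this.maxDoc = maxDoc;
        this.matches = matches;
    }

    /** Walks all documents and re-runs the predicate on every call. */
    int countMatches() {
        int hits = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            evaluations++;
            if (matches.test(doc)) {
                hits++;
            }
        }
        return hits;
    }

    /** Runs the predicate once per document and copies the hits into a bit set. */
    BitSet materialize() {
        BitSet bits = new BitSet(maxDoc);
        for (int doc = 0; doc < maxDoc; doc++) {
            evaluations++;
            if (matches.test(doc)) {
                bits.set(doc);
            }
        }
        return bits;
    }
}
```

Holding on to a `LazyDocIds` instance and calling `countMatches()` for every request repeats the full evaluation each time, whereas a bit set obtained from `materialize()` can be reused without touching the predicate again. Whether this is exactly the cost seen in the profiles from this thread is an assumption; the sketch only illustrates the general pattern.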

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Mark Miller
Something is very odd about this if they both cover the same search and the environment for both is identical. Even if one search was done twice, and we divide the numbers for the new API by 2 - it's still *very* odd. With 2.4, ScorerDocQueue.topDoc is called half a million times. With 2.9, it's called

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Thomas Becker
No it's only a single segment. But two calls. One doing a getHitsCount first and the other doing the actual search. I'll paste both methods below if someone's interested. Will dig into lucene's sources and compare 2.4 search behaviour for my case with 2.9 tomorrow. It was about time to get more in

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Thomas Becker
This might be a change in my getHitCounts method. I will dig into that by tomorrow. I'm really sorry, but I've to leave now. Otherwise I'll have other issues. Uwe Schindler wrote: >> http://ankeschwarzer.de/tmp/lucene_29_newapi_mmap_singlereq.png >> >> Have to verify that the last one is not by

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Thomas Becker
So here's a debug message showing the query: 2009-09-16 18:53:59,642 [DEBUG] [http-8440-2] [] [2144122] [] service.impl.LuceneBaseService: items search('viewable:(FINDALL 0 1 2 )', BooleanFilter( +CachingWrapperFilter(QueryWrapperFilter(+issalesallowed:true)) +CachingWrapperFilter(QueryWrapperFil

Re: What would be the fastest BooleanQuery possible?

2009-09-16 Thread Mark Miller
The Weight is responsible for creating the Scorer for each Query, so I don't think there is much you can do to get around it in 2.3.2. Not easily anyway. The Weight does its work in the constructor because it cannot hold onto the Searcher due to serialization requirements. -- - Mark http://www.l

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Mark Miller
Ah - that explains a bit. Though if you divide by 2, the new one still appears to overcall each method in comparison to 2.4. - Mark Uwe Schindler wrote: >> http://ankeschwarzer.de/tmp/lucene_29_newapi_mmap_singlereq.png >> >> Have to verify that the last one is not by accident more than one reque

RE: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Uwe Schindler
> > http://ankeschwarzer.de/tmp/lucene_29_newapi_mmap_singlereq.png > > > > Have to verify that the last one is not by accident more than one > request. > > Will > > do the run again and then post the required info. > > The last figure shows, that IndexSearcher.searchWithFilter was called > twice

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Mark Miller
Very interesting. Something can't be going right here. You are searching against a single segment, yet, just for example, while before, DisjunctionSumScorer.advanceAfterCurrent was being called 154,000 times, now it's being called 1.3 million times. Other scoring methods have similar crazy jumps. G

RE: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Uwe Schindler
> http://ankeschwarzer.de/tmp/lucene_29_newapi_mmap_singlereq.png > > Have to verify that the last one is not by accident more than one request. > Will > do the run again and then post the required info. The last figure shows, that IndexSearcher.searchWithFilter was called twice in contrast to th

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Mark Miller
I agree - best to find the cause before making a decision. There are enough smart people in the wings, I can't imagine this should take us that long. We have solved a good chunk of it already, and have only just begun chunk two. -- - Mark http://www.lucidimagination.com Thomas Becker wrote: >

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Yonik Seeley
On Wed, Sep 16, 2009 at 12:33 PM, Uwe Schindler wrote: > How should we proceed? Stop the final artifact build and voting or proceed > with the release of 2.9? We waited so long and for most people it is faster > than slower! I think we know that 2.9 will not be faster for everyone: - Per-segmen

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Thomas Becker
I suggest finding the root cause and then deciding about the release. Tomorrow I will spend the whole working day on the issue if no prio1 pops up. Sadly I have to leave early today, since I'm moving to a new flat... :( Uwe Schindler wrote: > How should we proceed? Stop the final artifact build and v

RE: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Uwe Schindler
How should we proceed? Stop the final artifact build and voting or proceed with the release of 2.9? We waited so long and for most people it is faster than slower! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message-

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Thomas Becker
New profiling sessions with invocation counts. A single Lucene search request with a huge result set (169k items). Quite interesting results, and there's definitely something wrong with Lucene 2.9 and the way I'm using it. But see for yourself: http://ankeschwarzer.de/tmp/lucene_24_oldapi_singlereq.pn

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Mark Miller
bq. I'll do some profiling now again and let you know the results. Great - it will be interesting to see the results. My guess, based on the 2.9 new api profiling, is that your queries may not be agreeing with some of the changes somehow. Along with the profiling, can you fill us in on the query t

RE: New "Stream closed" exception with Java 6 - solved

2009-09-16 Thread Chris Bamford
Hoss, It turns out that the cause of the exceptions is in fact adding an item twice - so you were correct right at the start :-) I ran a test where I attempt to insert the same item twice and guess what ... I get a "Stream closed" exception on the 2nd attempt. Understanding this is a great r

Re: Problems with ItemBasedRecommender with Lucene

2009-09-16 Thread Grant Ingersoll
On Sep 16, 2009, at 9:48 AM, Thomas Rewig wrote: Hello, I build a "real time ItemBasedRecommender" based on a user's history and a (sparse) item similarity matrix with Lucene. Some time ago Ted Dunning recommended this approach to me on the Mahout mailing list to create an ItemBasedRecommende

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Thomas Becker
Tests run on tmpfs: config: impl=SeparateFile serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=2 filelen=18920301 answer=850258539, ms=8090, MB/sec=935.4907787391842 config: impl=ChannelFile serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=2 filelen=18920301 answer=850258903

Re: Run your Lucene Applications on Google AppEngine with GAELucene

2009-09-16 Thread Kerang Lv
Hi Daniel Shane, The GAEFile/GAEIndexInput was a bit like the RAMFile/RAMInputStream, it can hold the file's byte data in memory, so the performance should be acceptable once most file segments were loaded from the google datastore. Wish your test result. - Original Message From: Dan

Re: What would be the fastest BooleanQuery possible?

2009-09-16 Thread Benjamin Pasero
Thanks, I tried this but profiling showed me that I get similar results. Most time is spent in - BooleanQuery.createWeight() - BooleanScorer.next() If I am not interested in scores, do I still need the heavy weight computation? On Wed, Sep 16, 2009 at 4:16 PM, Michael McCandless wrote: > You cou

Re: Run your Lucene Applications on Google AppEngine with GAELucene

2009-09-16 Thread Daniel Shane
I question the performance of such an approach. For Lucene to be fast, disk access needs to be fast, and the transaction stuff with Google is not that good. I'll have to test it out to see, but I anticipate a huge performance hit compared to Lucene running with real HDD access. Daniel Shane

Re: What would be the fastest BooleanQuery possible?

2009-09-16 Thread Michael McCandless
You could get the Scorer and call next() yourself; this would avoid scoring. EG something like this: Weight weight = query.weight(searcher); Scorer scorer = weight.scorer(searcher.getIndexReader()); while(scorer.next()) { final int docID = scorer.doc(); /* do som

Problems with ItemBasedRecommender with Lucene

2009-09-16 Thread Thomas Rewig
Hello, I build a "real time ItemBasedRecommender" based on a user's history and a (sparse) item similarity matrix with Lucene. Some time ago Ted Dunning recommended this approach to me on the Mahout mailing list to create an ItemBasedRecommender: "It is actually very easy to do. The output of the

Re: What would be the fastest BooleanQuery possible?

2009-09-16 Thread Benjamin Pasero
Ah wow that sounds great. I am using 2.3.2 though (and have to use it for now). Anything in that version that could speed things up? On Wed, Sep 16, 2009 at 6:48 PM, Mark Miller wrote: > With the new Collector API in Lucene 2.9, you no longer have to compute the > score. > > Now a Collector is pa

Re: What would be the fastest BooleanQuery possible?

2009-09-16 Thread Mark Miller
With the new Collector API in Lucene 2.9, you no longer have to compute the score. Now a Collector is passed a Scorer if they want to use it, but you can just ignore it. -- - Mark http://www.lucidimagination.com Benjamin Pasero wrote: > Hi, > > I am using Lucene not only for smart fulltext s
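
As a rough illustration of the idea in this reply (a collector that is handed a scorer but never asks it for a score), here is a self-contained plain-Java sketch. It does not use Lucene: `CountingScorer` and `DocIdOnlyCollector` are hypothetical stand-ins, and the method names `setScorer` and `collect` are chosen only to resemble the shape of a collector callback, not to reproduce the real API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a scorer; counts how often a score is requested. */
final class CountingScorer {
    int scoreCalls = 0;

    float score(int doc) {
        scoreCalls++;
        return 1.0f / (doc + 1); // placeholder value, not a real relevance score
    }
}

/** Hypothetical collector that records doc ids and ignores the scorer it is given. */
final class DocIdOnlyCollector {
    private CountingScorer scorer; // available, but intentionally never used
    final List<Integer> docs = new ArrayList<>();

    void setScorer(CountingScorer scorer) {
        this.scorer = scorer;
    }

    void collect(int doc) {
        docs.add(doc); // no call to scorer.score(doc), so no scoring work is done
    }
}
```

Feeding a few doc ids through `collect` leaves `scoreCalls` at zero, which is the whole point: for a DB-like query where only the matching ids matter, the score computation can simply be skipped.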

What would be the fastest BooleanQuery possible?

2009-09-16 Thread Benjamin Pasero
Hi, I am using Lucene not only for smart fulltext searches but also for getting the results for a DB-like query, where I am not tokenizing the terms at all. For this query, I am interested in all results and for that I am using my own HitCollector. Now, while profiling I noticed that quite some t

Re: Finding duplicate records from a result set

2009-09-16 Thread syedfa
Thanks very much Henok for your reply. I would be very much interested in your thesis, and any code that you may provide. Is your thesis published online? Is it in English? Your approach seems very interesting, and I would be very interested in looking at the details. Some ideas I had were us

Re: Finding duplicate records from a result set

2009-09-16 Thread henok sahilu
I did my thesis work on this; it was on legal documents. XML IR systems are very susceptible to producing duplicate or near-duplicate content (not in concept, but in textual content). Here is what I did: I tagged each article's content in the legal documents with its status, and th

Finding duplicate records from a result set

2009-09-16 Thread syedfa
Dear Fellow Java/Lucene developers: One annoyance that people have when searching for information online is the occurrence of duplicate records (i.e. multiple sites that carry news feeds from the SAME news source like Reuters or the Associated Press, and do not provide any additional pieces of inf