On Sep 14, 2009, at 2:42 PM, Sherrill, Delsey wrote:
I think I have a problem that would benefit from the new term
payload feature, but I'm not sure. Every example of payload usage
that I can find factors the payloads into the scoring, but doesn't return
them with the search results.
In my case
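Payloads can be read back alongside results via TermPositions. Here is a
minimal sketch against the 2.9 API; the field name, the term, and the
surrounding searcher are assumptions for illustration:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;

IndexReader reader = searcher.getIndexReader();
TermPositions tp = reader.termPositions(new Term("body", "lucene"));
while (tp.next()) {                      // one entry per matching document
  int docId = tp.doc();
  for (int i = 0; i < tp.freq(); i++) {  // walk the positions in this doc
    tp.nextPosition();
    if (tp.isPayloadAvailable()) {
      byte[] payload = tp.getPayload(new byte[tp.getPayloadLength()], 0);
      // decode the payload and attach it to the result for docId here
    }
  }
}
tp.close();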
Hi Thomas,
I think we found the root of the problem. We opened
https://issues.apache.org/jira/browse/LUCENE-1911 .
Could you please try the attached patch and see if it solves your problems? It
has to do with the interaction of CachingWrapperFilter and QueryWrapperFilter,
which changed in 2.9.
-
I am currently preparing a patch for CachingWrapperFilter that has a boolean
ctor parameter (useOpenBitSetCache) and the method getDocIdSet will then do
what QueryWrapperFilter did before LUCENE-1427.
I would not do this in QueryWrapperFilter like before, because it would
slow down MultiTermQuery i
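To make the proposal concrete, here is a rough sketch of what such a filter
could look like; this is an assumption based on the description above, not
the committed patch (the ctor parameter name comes from the mail, everything
else is made up):

import java.io.IOException;
import java.util.Map;
import java.util.WeakHashMap;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.OpenBitSetDISI;

public class BitSetCachingWrapperFilter extends Filter {
  private final Filter filter;
  private final boolean useOpenBitSetCache;
  private final Map<IndexReader, DocIdSet> cache =
      new WeakHashMap<IndexReader, DocIdSet>();

  public BitSetCachingWrapperFilter(Filter filter, boolean useOpenBitSetCache) {
    this.filter = filter;
    this.useOpenBitSetCache = useOpenBitSetCache;
  }

  public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
    synchronized (cache) {
      DocIdSet cached = cache.get(reader);
      if (cached != null) return cached;
    }
    DocIdSet result = filter.getDocIdSet(reader);
    if (useOpenBitSetCache) {
      // Materialize the iterator into a bit set, as QueryWrapperFilter
      // did before LUCENE-1427 (OpenBitSetDISI is itself a DocIdSet).
      result = new OpenBitSetDISI(result.iterator(), reader.maxDoc());
    }
    synchronized (cache) {
      cache.put(reader, result);
    }
    return result;
  }
}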
Yeah, thanks Uwe - I had read it quickly, but was just rereading and it
was sinking in. I hadn't cross-correlated the issues yet.
Makes perfect sense. Very nice catch.
>Maybe we need some change to CachingWrapperFilter that caches the DocIdSets
>as before, but optionally would wrap it into an Ope
See my mail about the CachingWrapperFilter and QueryWrapperFilter, I think
it explains this behaviour (and Thomas ran some warming queries before).
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
> -Original Message-
> From: Mark Mi
Nevermind. I see advance wasn't around in 2.4. This is part of the
DocIdSetIterator changes.
Anyway - either these are just not comparable runs, or there is a major
bug (which seems unlikely).
Just to keep pointing out the obvious:
2.4 calls doc() 195,000 times
2.9 calls docID() 1.4 million times
T
Notice that while DisjunctionScorer.advance and
DisjunctionScorer.advanceAfterCurrent appear to be called
in 2.9, in 2.4 I am only seeing DisjunctionScorer.advanceAfterCurrent
called.
Can someone explain that?
Mark Miller wrote:
> Something is very odd about this if they both cover the same search
I found one thing in your debug output:
You are using a lot of CachingWrapperFilters around QueryWrapperFilter.
According to http://issues.apache.org/jira/browse/LUCENE-1427,
QueryWrapperFilter does not copy the scorer's doc ids into an OpenBitSet; it
instead returns the scorer itself as DocIdSet (
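In code terms, the 2.9 getDocIdSet is roughly the following; paraphrased from
memory of the LUCENE-1427 change, not copied from the committed source:

public DocIdSet getDocIdSet(final IndexReader reader) throws IOException {
  final Weight weight = query.weight(new IndexSearcher(reader));
  return new DocIdSet() {
    public DocIdSetIterator iterator() throws IOException {
      // The Scorer itself serves as the iterator; no bits are copied.
      return weight.scorer(reader, true, false);
    }
  };
}

So every call to iterator() builds a fresh Scorer, and a CachingWrapperFilter
around it caches only this lazy wrapper rather than a materialized bit set.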
Something is very odd about this if they both cover the same search and
the environment for both is identical. Even if one search was done twice,
and we divide the numbers for the new API by 2, it's still *very* odd.
With 2.4, ScorerDocQueue.topDoc is called half a million times.
With 2.9, it's called
No it's only a single segment. But two calls. One doing a getHitsCount first and
the other doing the actual search. I'll paste both methods below if someone's
interested.
Will dig into lucene's sources and compare 2.4 search behaviour for my case with
2.9 tomorrow. It was about time to get more in
This might be a change in my getHitCounts method. I will dig into that by
tomorrow.
I'm really sorry, but I have to leave now. Otherwise I'll have other issues.
Uwe Schindler wrote:
>> http://ankeschwarzer.de/tmp/lucene_29_newapi_mmap_singlereq.png
>>
>> Have to verify that the last one is not by
So here's a debug message showing the query:
2009-09-16 18:53:59,642 [DEBUG] [http-8440-2] [] [2144122] []
service.impl.LuceneBaseService: items search('viewable:(FINDALL 0 1 2 )',
BooleanFilter( +CachingWrapperFilter(QueryWrapperFilter(+issalesallowed:true))
+CachingWrapperFilter(QueryWrapperFil
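A filter chain like the one in that log line would typically be built along
these lines; this is a guessed reconstruction (BooleanFilter and FilterClause
come from the queries contrib module, the field name is taken from the log,
the rest is assumed):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;

Filter salesAllowed = new CachingWrapperFilter(
    new QueryWrapperFilter(
        new TermQuery(new Term("issalesallowed", "true"))));

BooleanFilter filter = new BooleanFilter();
filter.add(new FilterClause(salesAllowed, BooleanClause.Occur.MUST));
// ...further +CachingWrapperFilter(QueryWrapperFilter(...)) clauses are
// added the same way, each one caching a lazy DocIdSet per reader.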
The Weight is responsible for creating the Scorer for each Query, so I
don't think there is much you can do to get around
it in 2.3.2. Not easily anyway.
The Weight does its work in the constructor because it cannot hold onto
the Searcher due to Serialization requirements.
--
- Mark
http://www.l
Ah - that explains a bit. Though if you divide by 2, the new one still
appears to overcall each method
in comparison to 2.4.
- Mark
Uwe Schindler wrote:
>> http://ankeschwarzer.de/tmp/lucene_29_newapi_mmap_singlereq.png
>>
>> Have to verify that the last one is not by accident more than one reque
> > http://ankeschwarzer.de/tmp/lucene_29_newapi_mmap_singlereq.png
> >
> > Have to verify that the last one is not by accident more than one
> request.
> > Will
> > do the run again and then post the required info.
>
> The last figure shows that IndexSearcher.searchWithFilter was called
> twice
Very interesting. Something is clearly not going right here. You are searching
against a single segment, yet, just for example,
while before, DisjunctionSumScorer.advanceAfterCurrent was being called
154,000 times, now it's being called 1.3 million times.
Other scoring methods have similar crazy jumps.
G
> http://ankeschwarzer.de/tmp/lucene_29_newapi_mmap_singlereq.png
>
> Have to verify that the last one is not by accident more than one request.
> Will
> do the run again and then post the required info.
The last figure shows that IndexSearcher.searchWithFilter was called twice
in contrast to th
I agree - best to find the cause before making a decision. There are
enough smart people in the wings, I can't imagine this should take us
that long. We have solved a good chunk of it already, and have only just
begun chunk two.
--
- Mark
http://www.lucidimagination.com
Thomas Becker wrote:
>
On Wed, Sep 16, 2009 at 12:33 PM, Uwe Schindler wrote:
> How should we proceed? Stop the final artifact build and voting or proceed
> with the release of 2.9? We waited so long, and for most people it is faster
> rather than slower!
I think we know that 2.9 will not be faster for everyone:
- Per-segmen
I suggest we find the root cause and then decide about the release. Tomorrow I
will spend the whole working day on the issue if no prio-1 pops up.
Sadly I have to leave early today, since I'm moving to a new flat... :(
Uwe Schindler wrote:
> How should we proceed? Stop the final artifact build and v
How should we proceed? Stop the final artifact build and voting or proceed
with the release of 2.9? We waited so long, and for most people it is faster
rather than slower!
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
> -Original Message-
New profiling sessions with invocation counts. A single Lucene search request
with a huge result set (169k items).
Quite interesting results, and there's definitely something wrong with Lucene
2.9 and the way I'm using it. But see for yourself:
http://ankeschwarzer.de/tmp/lucene_24_oldapi_singlereq.png
bq. I'll do some profiling now again and let you know the results.
Great - it will be interesting to see the results. My guess, based on
the 2.9 new API profiling, is that your queries may not be interacting well
with some of the changes somehow. Along with the profiling, can you fill us
in on the query t
Hoss,
It turns out that the cause of the exceptions is in fact adding an item twice -
so you were correct right at the start :-) I ran a test where I attempt to
insert the same item twice and guess what ... I get a "Stream closed"
exception on the 2nd attempt.
Understanding this is a great r
On Sep 16, 2009, at 9:48 AM, Thomas Rewig wrote:
Hello,
I built a "real-time ItemBasedRecommender" based on a user's history
and a (sparse) item similarity matrix with Lucene. Some time ago Ted
Dunning recommended this approach to me on the Mahout mailing list to
create an ItemBasedRecommende
Tests run on tmpfs:
config: impl=SeparateFile serial=false nThreads=4 iterations=100 bufsize=1024
poolsize=2 filelen=18920301
answer=850258539, ms=8090, MB/sec=935.4907787391842
config: impl=ChannelFile serial=false nThreads=4 iterations=100 bufsize=1024
poolsize=2 filelen=18920301
answer=850258903
Hi Daniel Shane,
The GAEFile/GAEIndexInput is a bit like the RAMFile/RAMInputStream: it can
hold the file's byte data in memory, so the performance should be acceptable
once most file segments have been loaded from the Google datastore.
Looking forward to your test results.
- Original Message
From: Dan
Thanks, I tried this but profiling showed me that I get similar
results. Most time is spent in
- BooleanQuery.createWeight()
- BooleanScorer.next()
If I am not interested in scores, do I still need the heavy weight computation?
On Wed, Sep 16, 2009 at 4:16 PM, Michael McCandless
wrote:
> You cou
I question the performance of such an approach. For Lucene to be fast,
disk access needs to be fast, and the transaction overhead with Google is
not that good.
I'll have to test it out to see, but I anticipate a huge performance hit
compared to Lucene running with real HDD access.
Daniel Shane
You could get the Scorer and call next() yourself; this would avoid
scoring. E.g. something like this:
Weight weight = query.weight(searcher);
Scorer scorer = weight.scorer(searcher.getIndexReader());
while (scorer.next()) {
  final int docID = scorer.doc();
  /* do something with docID */
}
Hello,
I built a "real-time ItemBasedRecommender" based on a user's history and
a (sparse) item similarity matrix with Lucene. Some time ago Ted Dunning
recommended this approach to me on the Mahout mailing list to create an
ItemBasedRecommender:
"It is actually very easy to do. The output of the
Ah wow that sounds great. I am using 2.3.2 though (and have to use it
for now). Anything
in that version that could speed things up?
On Wed, Sep 16, 2009 at 6:48 PM, Mark Miller wrote:
> With the new Collector API in Lucene 2.9, you no longer have to compute the
> score.
>
> Now a Collector is pa
With the new Collector API in Lucene 2.9, you no longer have to compute the
score.
Now a Collector is passed a Scorer in case it wants to use it, but you can
just ignore it.
--
- Mark
http://www.lucidimagination.com
Benjamin Pasero wrote:
> Hi,
>
> I am using Lucene not only for smart fulltext s
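A minimal sketch of such a score-ignoring Collector against the 2.9 API; the
query, searcher, and hit-list bookkeeping are made up for illustration:

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

final List<Integer> hits = new ArrayList<Integer>();
searcher.search(query, new Collector() {
  private int docBase;
  public void setScorer(Scorer scorer) {
    // Ignored on purpose: we never ask for a score.
  }
  public void collect(int doc) {
    hits.add(docBase + doc);  // doc is relative to the current segment
  }
  public void setNextReader(IndexReader reader, int docBase) {
    this.docBase = docBase;   // remember the segment offset
  }
  public boolean acceptsDocsOutOfOrder() {
    return true;              // order is irrelevant when collecting ids
  }
});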
Hi,
I am using Lucene not only for smart fulltext searches but also for
getting the results of a DB-like query, where I am not tokenizing the
terms at all. For this query, I am interested in all results, and for
that I am using my own HitCollector.
Now, while profiling I noticed that quite some t
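For context, a pre-2.9 HitCollector along those lines would look roughly like
this; a sketch assuming the 2.3.2 API, and note that the score argument is
computed by the scorer even though the collector discards it, which is exactly
the cost under discussion:

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.HitCollector;

// Collects every matching doc id and ignores the score.
class DocIdOnlyCollector extends HitCollector {
  final List<Integer> docs = new ArrayList<Integer>();
  public void collect(int doc, float score) {
    docs.add(doc);  // the score is discarded, but was already computed
  }
}

DocIdOnlyCollector collector = new DocIdOnlyCollector();
searcher.search(query, collector);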
Thanks very much Henok for your reply. I would be very much interested in
your thesis, and any code that you may provide. Is your thesis published
online? Is it in English? Your approach seems very interesting, and I
would be very interested in looking at the details. Some ideas I had were
us
I have done a thesis on this; it was on legal documents. XML IR
systems are very susceptible to producing duplicate or near-duplicate content
(not in concept, but in textual content).
Here is what I did:
I tag each article's content in the legal documents with its status, and th
Dear Fellow Java/Lucene developers:
One annoyance that people have when searching for information online is the
occurrence of duplicate records (i.e. multiple sites that carry news feeds
from the SAME news source like reuters or the associated press, and do not
provide any additional pieces of inf