Getting Payload data from BooleanQuery results
Hello, I have indexed documents with two fields: "ARTICLE" for an article of text and "PUB_DATE" for the article's publication date. Given a specific single word, I want to search my index for all documents that contain this word within the last two weeks, and have them sorted by date:

    TermQuery tq = new TermQuery(new Term("ARTICLE", mySearchWord));
    Calendar cal = Calendar.getInstance();
    cal.add(Calendar.DATE, -14); // date two weeks ago
    ConstantScoreRangeQuery csrq = new ConstantScoreRangeQuery("PUB_DATE",
        DateTools.dateToString(cal.getTime(), DateTools.Resolution.HOUR),
        null, true, true);
    BooleanQuery bq = new BooleanQuery();
    bq.add(tq, BooleanClause.Occur.MUST);
    bq.add(csrq, BooleanClause.Occur.MUST);
    TopFieldDocs docs = searcher.search(bq, null, 10, new Sort("PUB_DATE"));

My goal now is to search through the recovered documents and obtain the Term instances (each term position) within each document and retrieve the payload data associated with each Term instance. The trouble I am having is in getting access to the TermPositions following such a query.

If I only needed to query on a single term (without my date restriction), I could easily do (and have done) this:

    SpanTermQuery query = new SpanTermQuery(new Term("ARTICLE", mySearchWord));
    TermSpans spans = (TermSpans) query.getSpans(indexReader);
    TermPositions tp = spans.getPositions();

and then iterate over each position calling tp.getPayload(dataBuffer, 0), for example. But alas, I cannot seem to get access to any TermPositions from my above BooleanQuery. I have looked into the contributed SpanExtractor class, but ConstantScoreRangeQuery seems unsupported and I am at a loss as to how best to use Spans here. Any help appreciated,

C>T>

--
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999
Re: Getting Payload data from BooleanQuery results
thanks for the tip. I don't see a way to integrate the QueryWrapperFilter (or any Filter) into SpanTermQuery.getSpans(indexReader), however. I can use a SpanQuery with an IndexSearcher as usual, but that leaves me back where I started. Any thoughts? Also, I will need to sort these results by date so that the most recent, say 5, are returned...

thanks again,

C>T>

On Thu, Sep 24, 2009 at 3:22 PM, Chris Hostetter wrote:
>
> : But alas, I cannot seem to get access to any TermPositions from my above
> : BooleanQuery.
>
> I would suggest refactoring your "date" restriction into a Filter (there's
> a fairly easy to use Filter that wraps a Query) and then execute a
> SpanTermQuery just as you describe.
>
> -Hoss
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org

--
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999
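[Editor's note: one pattern that sidesteps the filter/spans mismatch - run the filtered, sorted BooleanQuery first to collect the matching doc ids, then walk the single term's span positions and keep only those whose doc is in that set. The doc-id intersection itself is simple; this is a self-contained toy sketch (SpanHit and restrict are hypothetical names, not Lucene API):]

```java
import java.util.*;

public class SpanFilterSketch {
    // Hypothetical stand-in for a (doc, position) stream such as the one
    // TermSpans exposes; the real class would also carry payload bytes.
    static class SpanHit {
        final int doc, position;
        SpanHit(int doc, int position) { this.doc = doc; this.position = position; }
        public String toString() { return "(" + doc + "," + position + ")"; }
    }

    // Keep only span hits whose document survived the filtered date query.
    static List<SpanHit> restrict(List<SpanHit> spans, Set<Integer> filteredDocs) {
        List<SpanHit> kept = new ArrayList<>();
        for (SpanHit hit : spans) {
            if (filteredDocs.contains(hit.doc)) {
                kept.add(hit); // here you would read the payload for this position
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<SpanHit> spans = List.of(new SpanHit(1, 4), new SpanHit(3, 7), new SpanHit(8, 2));
        Set<Integer> filtered = Set.of(3, 8); // docs matching the PUB_DATE range
        System.out.println(restrict(spans, filtered)); // [(3,7), (8,2)]
    }
}
```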
TermPositions with custom Tokenizer
Hello, I have created a custom Tokenizer and am trying to set and extract my own positions for each Token using:

    reusableToken.reinit(word.getWord(), tokenStart, tokenEnd);

Later, when querying my index using a SpanTermQuery, the start() and end() tags don't correspond to these values but seem to correspond to the order the token was tokenized during the indexing process, e.g. start: 5, end: 6 for a given token. I realize that these values come from TermPositions, but how can I effectively get my custom token start and end offsets into TermPositions for recovery?

thanks -

C>T>

--
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999
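[Editor's note: the confusion here is between two coordinate systems. Token *positions* are ordinal counters - the values SpanTermQuery's start()/end() report - while the start/end values set on a Token are *character offsets* into the source text. A toy whitespace tokenizer (plain Java, not Lucene code) produces both:]

```java
import java.util.*;

public class ToyTokenizer {
    // One emitted token: ordinal position plus character offsets.
    static class Tok {
        final String text;
        final int position, startOffset, endOffset;
        Tok(String t, int pos, int so, int eo) {
            text = t; position = pos; startOffset = so; endOffset = eo;
        }
    }

    static List<Tok> tokenize(String input) {
        List<Tok> out = new ArrayList<>();
        int pos = 0, i = 0;
        while (i < input.length()) {
            while (i < input.length() && input.charAt(i) == ' ') i++; // skip spaces
            int start = i;
            while (i < input.length() && input.charAt(i) != ' ') i++; // consume word
            if (i > start) out.add(new Tok(input.substring(start, i), pos++, start, i));
        }
        return out;
    }

    public static void main(String[] args) {
        for (Tok t : tokenize("the quick fox")) {
            System.out.println(t.text + " pos=" + t.position
                + " offsets=[" + t.startOffset + "," + t.endOffset + ")");
        }
        // "fox" sits at position 2 but character offsets [10,13): two different
        // coordinate systems. Positions are what TermPositions stores; character
        // offsets must be carried some other way (e.g. payloads or term vectors).
    }
}
```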
IndexWriter optimize() deadlock
Hello, I am trying to track down the cause of my code hanging on calling IndexWriter.optimize() at its doWait() method. It appears, thus, that it is waiting on other merges to happen, which is a bit confusing to me:

My application is a simple producer-consumer model where documents are added to a queue by producers, and then one consumer with one IndexWriter (the only one in the application) periodically calls addDocument() on a batch of these jobs and then calls optimize(), commit(), and then close(). There is only one thread running the consumer, so I am confused as to how the IndexWriter might be deadlocking itself. Indeed, this is the only thread active when the deadlock occurs, so it seems to be a problem of reentry.

Importantly, the deadlocking occurs only when the thread is trying to shut down - that is, the Thread running this Lucene consumer has a Future that has had its cancel(true) interrupting method called. Is it possible that an internal Lucene lock is obtained during addDocument() and on interruption is never released, so the subsequent optimize() call hangs? This doesn't appear to be happening...

Any help appreciated.

thanks,

C>T>

what might I be missing here?

--
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999
Re: IndexWriter optimize() deadlock
thanks for getting back. I do not lock on the IndexWriter object itself, but all methods in my consumer class that use IndexWriter are synchronized (locking my singleton consumer object itself). The thread is waiting at IndexWriter.doWait(). What might cause this?

thanks -

C>T>

On Fri, Oct 16, 2009 at 12:58 PM, Uwe Schindler wrote:
> Do you use the IndexWriter as mutex in a synchronized() block? This is not
> supported and may hang. Never lock on IndexWriter instances. IndexWriter
> itself is thread safe.
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de

--
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999
Re: IndexWriter optimize() deadlock
It doesn't look like my Future.cancel(true) is actually interrupting the thread. It only does so "if necessary" and in this case seems to be letting the Thread finish gracefully without need for interruption.

The stack trace leading up to the hanging IndexWriter.optimize() method is below, though not terribly useful I imagine. Here only 5 documents are being added to an index taking up only 39.2 MB on disk. Again, this deadlocking only happens after I use the Future for this task to cancel the task...

    Thread [pool-3-thread-5] (Suspended)
        IndexWriter.doWait() line: 4494
        IndexWriter.optimize(int, boolean) line: 2283
        IndexWriter.optimize(boolean) line: 2218
        IndexWriter.optimize() line: 2198
        LuceneResultsPersister.commit() line: 97
        PersistenceJobQueue.persistAndCommitBatch(PersistenceJobConsumer) line: 105
        PersistenceJobConsumer.consume() line: 46
        PersistenceJobConsumer.run() line: 67
        Executors$RunnableAdapter.call() line: 441
        FutureTask$Sync.innerRun() line: 303
        FutureTask.run() line: 138
        ThreadPoolExecutor$Worker.runTask(Runnable) line: 886
        ThreadPoolExecutor$Worker.run() line: 908
        Thread.run() line: 619

thanks,

C>T>

On Fri, Oct 16, 2009 at 1:58 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
> My guess is it's the invocation of Thread.interrupt (which
> Future.cancel(true) calls if the task is running) that led to the
> deadlock.
>
> Is it possible to get the stack trace of the thrown exception when the
> thread was interrupted?  Maybe indeed something in IW isn't cleaning
> up its state on being interrupted.
>
> Mike
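[Editor's note: Future.cancel(true) does deliver a real Thread.interrupt() to a task that is already running, which is easy to confirm in isolation. A self-contained sketch, unrelated to the Lucene code above:]

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicBoolean;

public class CancelInterruptDemo {
    // Submit a long-running task, cancel it with mayInterruptIfRunning=true,
    // and report whether the task observed the interrupt.
    public static boolean runDemo() throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        AtomicBoolean sawInterrupt = new AtomicBoolean(false);
        Future<?> future = pool.submit(() -> {
            started.countDown();
            try {
                Thread.sleep(60_000); // stand-in for a long-running consumer loop
            } catch (InterruptedException e) {
                sawInterrupt.set(true); // cancel(true) interrupted us mid-sleep
            }
        });
        started.await();     // make sure the task is actually running
        future.cancel(true); // "if necessary" - and here it IS necessary
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return sawInterrupt.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("interrupted: " + runDemo()); // interrupted: true
    }
}
```

So when the task is mid-run, "if necessary" is satisfied and the interrupt is sent; the question is then how the library code reacts to the interrupt status.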
Re: IndexWriter optimize() deadlock
After tracing through the Lucene source more, it seems that what is happening is: after I call Future.cancel(true) on my parent thread, optimize() is called and this method launches its own thread using a ConcurrentMergeScheduler$MergeThread to do the actual merging.

When this Thread comes around to calling mergeInit() on my index writer - a synchronized method - it hangs. For some reason it seems to no longer hold the mutex, perhaps? Trace of this thread's stall below...

    Daemon Thread [Lucene Merge Thread #0] (Suspended)
        IndexWriter.mergeInit(MergePolicy$OneMerge) line: 3971
        IndexWriter.merge(MergePolicy$OneMerge) line: 3879
        ConcurrentMergeScheduler.doMerge(MergePolicy$OneMerge) line: 205
        ConcurrentMergeScheduler$MergeThread.run() line: 260

thanks again,

C>T>

On Fri, Oct 16, 2009 at 3:53 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
> But if the Future.cancel call turns out to be a no-op (simply waits
> instead of interrupting the thread), how could it be that the deadlock
> only happens when you call it?  Weird.  Are you really sure it's not
> actually calling Thread.interrupt?
>
> That stack trace looks like a normal "optimize is waiting for the
> background merges to complete".  Is it possible your background merges
> are hitting exceptions?  You should see them on your error console if
> so...
>
> Mike
Re: IndexWriter optimize() deadlock
Indeed, it looks like the MergeThread started (after passing off to ConcurrentMergeScheduler) by the thread calling IndexWriter.optimize() is waiting on the mutex for the IndexWriter to be free, so it can use the object to call mergeInit().

The IndexWriter, however, has entered a synchronized() waiting loop, waking up every second (in doWait()) and checking if there are running merges left - which of course there are, as the thread responsible for doing the merging can't get in. Deadlocked stack traces are below:

    Thread [pool-2-thread-5] (Suspended)
        owns: IndexWriter (id=71)
        owns: LuceneResultsPersister (id=85)
        IndexWriter.optimize(int, boolean) line: 2283
        IndexWriter.optimize(boolean) line: 2218
        IndexWriter.optimize() line: 2198
        LuceneResultsPersister.commit() line: 97
        PersistenceJobQueue.persistAndCommitBatch(PersistenceJobConsumer) line: 105
        PersistenceJobConsumer.consume() line: 46
        PersistenceJobConsumer.run() line: 67
        Executors$RunnableAdapter.call() line: 441
        FutureTask$Sync.innerRun() line: 303
        FutureTask.run() line: 138
        ThreadPoolExecutor$Worker.runTask(Runnable) line: 886
        ThreadPoolExecutor$Worker.run() line: 908
        Thread.run() line: 619

    Daemon Thread [Lucene Merge Thread #0] (Suspended)
        waiting for: IndexWriter (id=71)
        IndexWriter.mergeInit(MergePolicy$OneMerge) line: 3971
        IndexWriter.merge(MergePolicy$OneMerge) line: 3879
        ConcurrentMergeScheduler.doMerge(MergePolicy$OneMerge) line: 205
        ConcurrentMergeScheduler$MergeThread.run() line: 260

I don't really understand how this code is supposed to have worked before this, and what the problem thus might be here:

In IndexWriter.optimize() at line 2263 we have the synchronized block where doWait becomes true using the parameter-less call to optimize():

    if (doWait) {
      synchronized(this) {
        while(true) {
          if (mergeExceptions.size() > 0) {
            // Forward any exceptions in background merge
            // threads to the current thread:
            final int size = mergeExceptions.size();
            for(int i=0;i<size;i++) {
              final MergePolicy.OneMerge merge = (MergePolicy.OneMerge) mergeExceptions.get(0);
              if (merge.optimize) {
                IOException err = new IOException("background merge hit exception: " + merge.segString(directory));
                final Throwable t = merge.getException();
                if (t != null)
                  err.initCause(t);
                throw err;
              }
            }
          }

          if (optimizeMergesPending())
            doWait();
          else
            break;
        }
      }
    }

It is holding this lock while the thread it started to do the merging is trying to call its mergeInit() method.

Any thoughts?

thanks,

C>T>
Re: IndexWriter optimize() deadlock
I discovered the problem and fixed its effect on my code:

Using the source for Lucene version 2.4.1: in IndexWriter.optimize() there is a call to doWait() on line 2283. This method attempts to wait for a second in order to give the other threads it has spawned a chance to acquire its mutex and complete index merges. It checks if there are any merges left after it wakes up; if there aren't, it proceeds.

However, using a Future object associated with this thread and calling cancel(true) doesn't allow the Thread to enter into Object.wait() (or the ExecutorService observing this Future immediately wakes it up, or something), so it returns immediately, and the thread holding the lock on the IndexWriter never cedes its lock to the threads running a merge.

I re-wrote my code such that it didn't need to call the interruptible version of Future.cancel() to solve it, i.e. Future.cancel(false)

thanks,

C>T>
Re: IndexWriter optimize() deadlock
the doWait() call is synchronized on IndexWriter, but it is also, as you suggest, inside a loop in a block synchronized on IndexWriter. The doWait() call returns immediately, still holding the IndexWriter lock from the loop in the synchronized block, as my stack trace shows, without blocking and giving the merger thread a chance to merge. It keeps repeating this doWait procedure, which unfortunately never actually waits, and the MergeThread is starved.

C>T>

On Fri, Oct 16, 2009 at 6:53 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
> I'm glad you worked around it!  But I don't fully understand the
> issue.  That doWait is inside a sync(writer) block... if the Future
> manages to interrupt it, then that thread will release the lock when
> it exits that sync block.
>
> Actually, if the thread was indeed interrupted, you may be hitting this:
>
>    http://issues.apache.org/jira/browse/LUCENE-1573
>
> (which is fixed in 2.9.)
>
> Mike
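[Editor's note: the behavior described in this thread comes down to a general property of Object.wait(): if a thread's interrupt status is already set (as it is after Future.cancel(true) interrupts it), wait() throws InterruptedException immediately instead of sleeping, and code that catches the exception and restores the flag will spin without ever pausing. A minimal sketch, independent of Lucene:]

```java
public class InterruptedWaitDemo {
    // Returns how many milliseconds three "wait one second" attempts took
    // when the interrupt flag is restored after every attempt.
    public static long spinWait() throws Exception {
        final Object lock = new Object();
        Thread.currentThread().interrupt(); // simulate an already-delivered cancel(true)
        long start = System.nanoTime();
        synchronized (lock) {
            for (int i = 0; i < 3; i++) {
                try {
                    lock.wait(1000); // should block ~1s, but throws immediately
                } catch (InterruptedException e) {
                    // mimic code that swallows the exception and restores the
                    // flag, so every subsequent wait() also returns at once
                    Thread.currentThread().interrupt();
                }
            }
        }
        Thread.interrupted(); // clear the flag before returning
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("3 one-second waits took ~" + spinWait() + " ms");
    }
}
```

The three nominal one-second waits complete almost instantly, which matches the "doWait never actually waits" starvation described above.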
Token character positions
Hello, Hoping someone might clear up a question for me:

When tokenizing, we provide the start and end character offsets for each token, locating it within the source text. If I tokenize the text "word" and then search for the term "word" in the same field, how can I recover this character offset information in the matching documents to precisely locate the word?

I have been storing this character info myself using payload data, but if Lucene stores it, then I am doing so needlessly. If recovering this character offset info isn't possible, what is this character offset info used for?

thanks so much,

C>T>

--
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999
recovering terms hit from wildcard queries
Hello, Firstly, thanks for all the good answers and support from this mailing list.

Would it be possible, and if so, what would be the best way, to recover the terms filled in for a wildcard query following a successful search?

For example: If I parse and execute a query using the string "my*" and get a collection of document ids that match this search, is there a good way to determine whether this query found "myopic", "mylar" or some other term without loading/searching the returned documents?

thanks!

C>T>

--
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999
Phrase query with terms at same location
Hello, I have indexed words in my documents with part of speech tags at the same location as these words using a custom Tokenizer as described, very helpfully, here: http://mail-archives.apache.org/mod_mbox/lucene-java-user/200607.mbox/%3c20060712115026.38897.qm...@web26002.mail.ukl.yahoo.com%3e I would like to do a search that retrieves documents when a given word is used with a specific part of speech, e.g. all docs where "report" is used as a noun. I was hoping I could use something like a PhraseQuery with "report _n" (_n is my noun part of speech tag) with some sort of identifier that describes the words as having to be at the same location - like a null slop or something. Any thoughts on how to do this? thanks so much, C>T> -- TH!NKMAP Christopher Tignor | Senior Software Architect 155 Spring Street NY, NY 10012 p.212-285-8600 x385 f.212-285-8999
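[Editor's note: the same-location indexing described above rests on position increments: a token indexed with increment 0 shares its position with the previous token, which is what would let a query treat "report" and "_n" as co-located. How absolute positions fall out of increments - a toy model in plain Java, not Lucene internals:]

```java
import java.util.*;

public class PositionIncrementDemo {
    // Given (term, positionIncrement) pairs as a tokenizer would emit them,
    // compute each term's absolute position. Increment 0 means "same position
    // as the previous token" - the synonym / part-of-speech-tag trick.
    public static Map<String, Integer> positions(List<String> terms, List<Integer> increments) {
        Map<String, Integer> out = new LinkedHashMap<>();
        int pos = -1; // first token with increment 1 lands at position 0
        for (int i = 0; i < terms.size(); i++) {
            pos += increments.get(i);
            out.put(terms.get(i), pos);
        }
        return out;
    }

    public static void main(String[] args) {
        // "the report arrived", with the "_n" tag stacked on top of "report"
        Map<String, Integer> p = positions(
            List.of("the", "report", "_n", "arrived"),
            List.of(1, 1, 0, 1));
        System.out.println(p); // {the=0, report=1, _n=1, arrived=2}
    }
}
```

Since "report" and "_n" end up at the same position, a phrase-style match of the pair needs no gap between them; how a given query type counts that zero-width step is exactly the slop question raised in the replies below.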
Re: recovering terms hit from wildcard queries
Thanks - that might work, though I believe it would produce many queries instead of just one in order to maintain the specific Term used to match a given hit document. I presume then I would get all the actual terms from the WildcardTermEnum that my wildcard-containing string refers to, and then use each of them in a separate query, so I could know precisely which Term is associated with a given document.

thanks,

C>T>

On Wed, Nov 18, 2009 at 5:16 PM, Simon Willnauer <simon.willna...@googlemail.com> wrote:
> You could use WildcardTermEnum directly and pass your term and the
> reader to it. This will allow you to enumerate all terms that match
> your wildcard term.
> Is that what you are asking for?
>
> simon

--
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999
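[Editor's note: the WildcardTermEnum approach amounts to walking the sorted term dictionary and collecting every concrete term the pattern covers; each of those can then be issued as its own TermQuery so the matching term is known per hit. For the common trailing-* case, the enumeration is just this - a toy sketch over a sorted set, not the Lucene class:]

```java
import java.util.*;

public class WildcardExpandSketch {
    // Enumerate the index terms matched by a trailing-* wildcard, the way
    // an enumeration over a sorted term dictionary can stop early.
    public static List<String> expand(NavigableSet<String> termDictionary, String prefix) {
        List<String> matches = new ArrayList<>();
        for (String term : termDictionary.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) break; // sorted: no more matches possible
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        NavigableSet<String> dict = new TreeSet<>(
            List.of("music", "mylar", "myopic", "nylon"));
        System.out.println(expand(dict, "my")); // [mylar, myopic]
    }
}
```

Each returned term would then back one TermQuery, trading a single wildcard query for a handful of exact ones in exchange for knowing which term produced each hit.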
Re: Phrase query with terms at same location
Thanks, Erick - Indeed every word will have a part of speech token, but is this how the slop actually works? My understanding was that if I have two tokens in the same location, neither will affect searches involving the other in terms of slop, since slop indicates the number of words *between* search terms in a phrase. Are tokens at the same location actually adjacent in their ordinal values, thus affecting the slop as you describe? If so, is there a predictable way to determine which comes before the other - perhaps the order they are inserted when being tokenized? thanks, C>T> On Thu, Nov 19, 2009 at 8:35 AM, Erick Erickson wrote: > If I'm reading this right, your tokenizer creates two tokens. One > "report" and one "_n"... I suspect if so that this will create some > "interesting" > behaviors. For instance, if you put two tokens in place, are you going > to double the slop when you don't care about part of speech? Is every > word going to get a marker? etc. > > I'm not sure payloads would be useful here, but you might check it out... > > What I'd think about, though, is a variant of synonyms. That is, index > report and report_n (note no space) at the same location. Then, when > you wanted to create a part-of-speech-aware query, you'd attach the > various markers to your terms (_n, _v, _adj, _adv etc.) and not have to > worry about unexpected side-effects. > > HTH > Erick
Re: Phrase query with terms at same location
Thanks again for this. I would like to be able to do several things with this data if possible. As per Mark's post, I'd like to be able to query for phrases like "He _v"~1 (where _v is my verb part of speech token) to recover strings like "He later apologized". This already in fact seems to be working. But I'd also like to be able to say: give me all the times "report" is used as a noun, i.e. when "report" and "_n" occur at the same location. But isn't the slop for PhraseQueries the "edit distance" <http://content18.wuala.com/contents/cborealis/Docs/lucene/api/org/apache/lucene/search/PhraseQuery.html#setSlop%28int%29>, and shouldn't "report _n"~1 achieve my above goal, moving "_n" onto the location of "report" in one edit step? If so, it seems I would also need to be able to specify that the query is restricted from interpreting the slop the other way, i.e. also recovering "report to him", allowing one term between the search terms. Perhaps PhraseQuery can't do this? It seems like your suggestion of creating part-of-speech-tag-prefixed tokens might be the only way to accommodate both, e.g. creating a token "_n_reporting" as well as "reporting", and maybe also an additional "_n" token to avoid having to use more expensive Wildcard matches to recover all nouns. The only problem here is that I also have *other* tags at the same location adding semantics to "reporting" as encountered in the text: its stemmed form "^report", for example, as well as a more fine-grained part of speech tag from the NUPOS set, e.g. "_n2_", and I can imagine additional future semantics. Creating new combinatorial terms for all these semantic tags explodes the token count exponentially... thanks - C>T> On Thu, Nov 19, 2009 at 10:30 AM, Erick Erickson wrote: > Ahhh, I should have followed the link. I was interpreting your first note > as > emitting two tokens NOT at the same offset. My mistake, ignore my nonsense > about unexpected consequences. Your original assumption is correct, zero > offsets are pretty transparent. > > What do you really want to do here? Mark's email (at the link) allows > you to create queries expressing "find all phrases > of the form noun-verb-adverb", say. The slop allows for intervening words. > > Your original post seems to want different semantics: > > <<<a search that retrieves documents when a given word is > used with a specific part of speech, e.g. all docs where "report" is used > as > a noun>>>. > > For that, my suggestion seems simpler, which is not surprising since it > addresses a less general problem. So instead of including a general > part of speech token, just suffix your original word with your marker and > use that for your "synonym". > > Then expressing your intent is simply tacking on the part of speech > marker to the words you care about (e.g. report_n when you wanted > report as a noun). No phrases or slop required, at the expense of > more terms. > > Hmmm, if you wanted to, say, "find all the nouns in the index", you > could *prefix* the word (e.g. n_report), which would group all the > nouns together in the term enumerations. > > Sorry for the confusion > Erick
SpanQuery for Terms at same position
Hello, I would like to search for all documents that contain both "plan" and "_v" (my part of speech token for verb) at the same position. I have tokenized the documents accordingly so these tokens exist at the same location. I can achieve this programmatically using PhraseQueries by adding the Terms explicitly at the same position, but I need to be able to recover the Payload data for each term found within the matched instance of my query. Unfortunately the PayloadSpanUtil doesn't seem to return the same results as the PhraseQuery, possibly because it is converting it into Spans first, which do not support searching for Terms at the same document position? Any help appreciated. thanks, C>T>
Re: SpanQuery for Terms at same position
Tested it out. It doesn't work. A slop of zero indicates no words between the provided terms. E.g. my query of "plan" "_n" returns entries like "contingency plan". My work-around for this problem is to use a PhraseQuery, where you can explicitly set Terms to occur at the same location, to recover the desired document ids. Then, because I need the payload data for each match, I create a SpanTermQuery for each of the individual terms used, use a modified version of PayloadSpanUtil to recover only the PayloadSpans for each query from the document ids collected above, and then find the intersection of all these sets, making sure to factor in where each span starts (the end will just be one ordinal value after) within each document to ensure they're at the same position. Definitely more work than it needs to be, I think. Still looking for another way. C>T> On Sat, Nov 21, 2009 at 10:47 PM, Adriano Crestani < adrianocrest...@gmail.com> wrote: > Hi, > > I didn't test, but you might want to try SpanNearQuery and set slop to > zero. > Give it a try and let me know if it worked. > > Regards, > Adriano Crestani
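The intersection step of the work-around above — per-term spans keyed by (doc, position), intersected so payloads are only pulled for co-located matches — can be sketched without Lucene as follows. The `Hit` record and the literal data are illustrative; in the real code the positions and payloads come from each term's PayloadSpans:

```java
import java.util.*;

public class PayloadIntersection {
    // one matched span of one term: (doc, position) plus its payload
    record Hit(int doc, int pos, String term, String payload) {}

    // keep only the spans of A and B that share a (doc, position) pair
    static List<Hit> matchesAtSamePosition(List<Hit> spansA, List<Hit> spansB) {
        Map<List<Integer>, Hit> byDocPos = new HashMap<>();
        for (Hit b : spansB) byDocPos.put(List.of(b.doc(), b.pos()), b);
        List<Hit> result = new ArrayList<>();
        for (Hit a : spansA) {
            Hit b = byDocPos.get(List.of(a.doc(), a.pos()));
            if (b != null) { result.add(a); result.add(b); } // same doc, same position
        }
        return result;
    }

    public static void main(String[] args) {
        List<Hit> plan = List.of(new Hit(7, 3, "plan", "p1"), new Hit(9, 0, "plan", "p2"));
        List<Hit> verb = List.of(new Hit(7, 3, "_v", "v1"), new Hit(9, 5, "_v", "v2"));
        // only doc 7 has "plan" and "_v" at the same position
        for (Hit h : matchesAtSamePosition(plan, verb))
            System.out.println(h.doc() + ":" + h.pos() + " " + h.term() + " payload=" + h.payload());
    }
}
```

Because each surviving `Hit` still names its term, payloads from unwanted terms can be filtered out at this stage, which is the other problem raised later in the thread.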
Re: SpanQuery for Terms at same position
A slop of -1 doesn't work either. I get no results returned. This would be a *really* helpful feature for me if someone might suggest an implementation, as I would really like to be able to do arbitrary span searches where tokens may be at the same position and also in other positions where the ordering of subsequent terms may be restricted as per the normal span API. thanks, C>T> On Sun, Nov 22, 2009 at 7:50 AM, Paul Elschot wrote: > On Sunday 22 November 2009 04:47:50, Adriano Crestani wrote: > > Hi, > > > > I didn't test, but you might want to try SpanNearQuery and set slop to > zero. > > Give it a try and let me know if it worked. > > The slop is the number of positions "in between", so zero would still be > too > much to only match at the same position. > > SpanNearQuery may or may not work for a slop of -1, but one could try > that for both the ordered and unordered cases. > One way to do that is to start from the existing test cases. > > Regards, > Paul Elschot
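Paul's point about slop counting the positions "in between" explains why -1 is the value to try: the measured slop of a candidate match is the span it covers minus the number of term positions in it, so two single-position terms at the very same position measure -1, adjacent terms measure 0, and so on. A small self-contained sketch of that arithmetic (simplified from what NearSpansUnordered actually computes):

```java
public class SlopArithmetic {
    // measured slop = (positions covered by the whole match) - (term count);
    // a candidate is accepted when measured slop <= the allowed slop
    static int matchSlop(int[] starts, int numTerms) {
        int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
        for (int s : starts) { min = Math.min(min, s); max = Math.max(max, s); }
        return (max + 1 - min) - numTerms; // each term occupies one position
    }

    public static void main(String[] args) {
        System.out.println(matchSlop(new int[]{5, 5}, 2)); // same position -> -1
        System.out.println(matchSlop(new int[]{5, 6}, 2)); // adjacent      ->  0
        System.out.println(matchSlop(new int[]{5, 7}, 2)); // one word gap  ->  1
    }
}
```

So slop 0 still admits adjacent terms ("contingency plan"), while only slop -1 restricts the unordered match to co-located terms.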
Re: SpanQuery for Terms at same position
Thanks so much for this. Using an un-ordered query, the -1 slop indeed returns the correct results, matching tokens at the same position. I tried the same query but ordered, both after and before rebuilding the source with Paul's changes to NearSpansOrdered, but the query was still failing, returning no results. C>T> On Mon, Nov 23, 2009 at 11:59 AM, Mark Miller wrote: > You're trying -1 with ordered, right? Try it with non-ordered.
Re: SpanQuery for Terms at same position
Also, I noticed that with the above edit to NearSpansOrdered I am getting erroneous results for normal ordered searches using searches like: "_n" followed by "work", where, because "_n" and "work" are at the same position, the code changes accept their pairing as a valid in-order result now that the equal-to clause has been added to the inequality. C>T>
Re: SpanQuery for Terms at same position
yes that indeed works for me. thanks, C>T> On Mon, Nov 23, 2009 at 5:50 PM, Paul Elschot wrote: > On Monday 23 November 2009 20:07:58, Christopher Tignor wrote: > > Also, I noticed that with the above edit to NearSpansOrdered I am getting > > erroneous results for normal ordered searches using searches like: > > > > "_n" followed by "work" > > > > where because "_n" and "work" are at the same position the code changes > > accept their pairing as a valid in-order result now that the equal-to > clause > > has been added to the inequality. > > Thanks for trying this. Indeed the "followed by" semantics is broken for > the ordered case when spans at the same positions are considered > ordered. > > Did I understand correctly that the unordered case with a slop of -1 > and without the edit works to match terms at the same position? > In that case it may be worthwhile to add that to the javadocs, > and also add a few testcases. > > Regards, > Paul Elschot
customized SpanQuery Payload usage
Hello, For certain span queries I construct programmatically by piecing together my own SpanTermQueries, I would like to enforce that Payload data is not returned for matches on the specific terms used by the constituent SpanTermQueries. For example, if I search for a position match with a SpanQuery referencing the tokens "_n" and "work", and there is Payload data for each (there needs to be, for other types of queries), I would like to be able to screen out the payload data originating from any matched "_n" tokens. I thought, for the tokens I am not interested in receiving payload data from, I might simply create (anonymously) my own subclass of SpanTermQuery which overrides getSpans and returns another custom class which extends TermSpans but simply overrides isPayloadAvailable to return false:

new SpanTermQuery(new Term(myField, myTokenString)) {
    public Spans getSpans(IndexReader reader) throws IOException {
        return new TermSpans(reader.termPositions(term), term) {
            public boolean isPayloadAvailable() {
                return false;
            }
        };
    }
};

This however seems to eliminate payload data for all matches, though I'm not sure why, and I am tracing through the code, looking at NearSpansUnordered. Any thoughts? thanks so much, C>T>
Re: NearSpansUnordered payloads
I am also having a hard time understanding the NearSpansUnordered isPayloadAvailable() method. For my test case, where 2 tokens are at the same position, the code below seems to be failing in traversing the 2 SpansCells. The first SpansCell it retrieves has its next field set to null, so it cannot find the second one. Is this normal behavior?

// TODO: Remove warning after API has been finalized
public boolean isPayloadAvailable() {
    SpansCell pointer = min();
    while (pointer != null) {
        if (pointer.isPayloadAvailable()) {
            return true;
        }
        pointer = pointer.next;
    }
    return false;
}

When the linked list of SpansCells is first created they are linked together normally, but their order is reversed when adding them to the queue in toQueue(), such that the last SpansCell, with its next field set to null, is retrieved first. C>T> On Fri, Nov 20, 2009 at 6:49 PM, Jason Rutherglen < jason.rutherg...@gmail.com> wrote: > I'm interested in getting the payload information from the > matching span, however it's unclear from the javadocs why > NearSpansUnordered is different than NearSpansOrdered in this > regard. > > NearSpansUnordered returns payloads in a hash set that's > computed each method call by iterating over the SpansCells as a > linked list, whereas NearSpansOrdered stores the payloads in a > list (which is ordered) only when collectPayloads is true. > > At first glance I'm not sure how to correlate the payload with > the span match using NSU, nor why they're different.
Re: customized SpanQuery Payload usage
The problem is that I need to be able to match spans resulting from a SpanNearQuery with the Term they came from, so I can eliminate using Payloads from certain Terms on a query-by-query basis. I still need this term to affect the results of a SpanNearQuery as per the usual logic; I just need to know, when iterating over the resulting Spans, not to load the payload data when I hit a span originating from a certain Term. I recently solved the problem fairly simply after doing much research into the source. When I am building the query and encounter a term I don't want to recover payload data from, I add my own anonymous subclass of SpanTermQuery to my developing SpanNearQuery, which itself creates an anonymous subclass of TermSpans that simply returns an empty Collection for its payload data:

new SpanTermQuery(new Term(QueryVocabTracker.CONTENT_FIELD, tagToken)) {
    @Override
    public Spans getSpans(IndexReader reader) throws IOException {
        return new TermSpans(reader.termPositions(term), term) {
            @Override
            public Collection getPayload() throws IOException {
                // no payload data for this TermSpan
                return Collections.emptyList();
            }
        };
    }
}

thanks, C>T> On Wed, Nov 25, 2009 at 8:10 AM, Grant Ingersoll wrote: > I'm not sure I follow. For those terms you don't want payloads, why can't > you just avoid getting payloads? Span queries themselves do not require > payloads for execution. Can you share your code for iterating over the > spans?
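The override pattern used in the fix — a subclass that still participates in matching but reports no payload of its own — can be shown self-contained with plain-Java stand-ins for the Lucene classes (the `Spans` class and its data here are illustrative, not the real API):

```java
import java.util.*;

public class SuppressPayload {
    // a minimal stand-in for TermSpans: it carries payload bytes per match
    static class Spans {
        final List<byte[]> payloads;
        Spans(List<byte[]> payloads) { this.payloads = payloads; }
        Collection<byte[]> getPayload() { return payloads; }
    }

    static int payloadCount(Spans s) { return s.getPayload().size(); }

    public static void main(String[] args) {
        List<byte[]> data = List.of("noun".getBytes());
        Spans normal = new Spans(data);
        // anonymous subclass: same matching data, but its payload is hidden,
        // mirroring the getPayload() override on TermSpans above
        Spans muted = new Spans(data) {
            @Override Collection<byte[]> getPayload() {
                return Collections.emptyList(); // no payload for this term
            }
        };
        System.out.println(payloadCount(normal)); // -> 1
        System.out.println(payloadCount(muted));  // -> 0
    }
}
```

Overriding `getPayload()` rather than `isPayloadAvailable()` is the key difference from the earlier broken attempt: availability checks elsewhere in the span machinery keep working, only the returned collection is emptied.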
Re: SpanQuery for Terms at same position
It's worth noting however that this -1 slop doesn't seem to work for cases where oyu want to discover instances of more than two terms at the same position. Would be nice to be able to explicitly set this in the query construction. thanks, C>T> On Tue, Nov 24, 2009 at 9:17 AM, Christopher Tignor wrote: > yes that indeed works for me. > > thanks, > > C>T> > > > On Mon, Nov 23, 2009 at 5:50 PM, Paul Elschot wrote: > >> Op maandag 23 november 2009 20:07:58 schreef Christopher Tignor: >> > Also, I noticed that with the above edit to NearSpansOrdered I am >> getting >> > erroneous results fo normal ordered searches using searches like: >> > >> > "_n" followed by "work" >> > >> > where because "_n" and "work" are at the same position the code changes >> > accept their pairing as a valid in-order result now that the eqaul to >> clause >> > has been added to the inequality. >> >> Thanks for trying this. Indeed the "followed by" semantics is broken for >> the ordered case when spans at the same positions are considered >> ordered. >> >> Did I understand correctly that the unordered case with a slop of -1 >> and without the edit works to match terms at the same position? >> In that case it may be worthwhile to add that to the javadocs, >> and also add a few testcases. >> >> Regards, >> Paul Elschot >> >> > >> > C>T> >> > >> > On Mon, Nov 23, 2009 at 12:26 PM, Christopher Tignor >> > wrote: >> > >> > > Thanks so much for this. >> > > >> > > Using an un-ordered query, the -1 slop indeed returns the correct >> results, >> > > matching tokens at the same position. >> > > >> > > I tried the same query but ordered both after and before rebuilding >> the >> > > source with Paul's changes to NearSpansOrdered but the query was still >> > > failing, returning no results. >> > > >> > > C>T> >> > > >> > > >> > > On Mon, Nov 23, 2009 at 11:59 AM, Mark Miller > >wrote: >> > > >> > >> Your trying -1 with ordered right? Try it with non ordered. 
>> > >>
>> > >> Christopher Tignor wrote:
>> > >> > A slop of -1 doesn't work either. I get no results returned.
>> > >> >
>> > >> > this would be a *really* helpful feature for me if someone might
>> > >> > suggest an implementation as I would really like to be able to do
>> > >> > arbitrary span searches where tokens may be at the same position
>> > >> > and also in other positions where the ordering of subsequent terms
>> > >> > may be restricted as per the normal span API.
>> > >> >
>> > >> > thanks,
>> > >> >
>> > >> > C>T>
>> > >> >
>> > >> > On Sun, Nov 22, 2009 at 7:50 AM, Paul Elschot wrote:
>> > >> >
>> > >> >> On Sunday, 22 November 2009 at 04:47:50, Adriano Crestani wrote:
>> > >> >>
>> > >> >>> Hi,
>> > >> >>>
>> > >> >>> I didn't test, but you might want to try SpanNearQuery and set
>> > >> >>> slop to zero.
>> > >> >>>
>> > >> >>> Give it a try and let me know if it worked.
>> > >> >>>
>> > >> >> The slop is the number of positions "in between", so zero would
>> > >> >> still be too much to only match at the same position.
>> > >> >>
>> > >> >> SpanNearQuery may or may not work for a slop of -1, but one could
>> > >> >> try that for both the ordered and unordered cases.
>> > >> >> One way to do that is to start from the existing test cases.
>> > >> >>
>> > >> >> Regards,
>> > >> >> Paul Elschot
>> > >> >>
>> > >> >>> Regards,
>> > >> >>> Adri
Re: SpanQuery for Terms at same position
my own tests with my own data show you are correct and the 1-n slop works for matching terms at the same ordinal position.

thanks!

C>T>

On Wed, Nov 25, 2009 at 4:25 PM, Paul Elschot wrote:
> On Wednesday, 25 November 2009 at 21:20:33, Christopher Tignor wrote:
> > It's worth noting however that this -1 slop doesn't seem to work for
> > cases where you want to discover instances of more than two terms at the
> > same position. Would be nice to be able to explicitly set this in the
> > query construction.
>
> I think requiring n terms at the same position would need a slop of 1-n,
> and I'd like to have some test cases added for that.
> Now if I only had some time...
>
> Regards,
> Paul Elschot
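Paul's 1-n rule above can be stated as a tiny helper: SpanNearQuery's slop is the number of positions allowed "in between", so n terms stacked on a single token position need a negative slop of 1 - n. A quick standalone sanity check of the arithmetic (the class and method names below are mine, not part of Lucene):

```java
// Sanity check of the slop rule discussed in this thread: SpanNearQuery's
// slop counts positions strictly between terms, so n terms occupying the
// same position need a slop of 1 - n (e.g. -1 for two terms, -2 for three).
public class SamePositionSlop {

    // Slop needed for an unordered SpanNearQuery to match numTerms
    // terms at the same token position, per the 1-n rule above.
    public static int slopFor(int numTerms) {
        return 1 - numTerms;
    }

    public static void main(String[] args) {
        System.out.println(slopFor(2)); // the -1 value that worked in this thread
        System.out.println(slopFor(3)); // -2 for three co-positioned terms
    }
}
```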
Re: SpanQuery for Terms at same position
It would take a bit of work / learning (haven't used a RAMDirectory yet) to make them into test cases usable by others and I am deep into this project and under the gun right now. But if some time surfaces I will for sure...

thanks -

C>T>

On Wed, Nov 25, 2009 at 7:49 PM, Erick Erickson wrote:
> Hmmm, are they unit tests? Or would you be willing to create stand-alone
> unit tests demonstrating this and submit it as a patch?
>
> Best
> er...@alwaystrollingforworkfromothers.opportunistic.
>
> On Wed, Nov 25, 2009 at 5:38 PM, Christopher Tignor wrote:
>
> > my own tests with my own data show you are correct and the 1-n slop
> > works for matching terms at the same ordinal position.
> >
> > thanks!
> >
> > C>T>
> >
> > On Wed, Nov 25, 2009 at 4:25 PM, Paul Elschot wrote:
> >
> > > On Wednesday, 25 November 2009 at 21:20:33, Christopher Tignor wrote:
> > > > It's worth noting however that this -1 slop doesn't seem to work for
> > > > cases where you want to discover instances of more than two terms at
> > > > the same position. Would be nice to be able to explicitly set this in
> > > > the query construction.
> > >
> > > I think requiring n terms at the same position would need a slop of
> > > 1-n, and I'd like to have some test cases added for that.
> > > Now if I only had some time...
> > >
> > > Regards,
> > > Paul Elschot
minimum range for SpanQueries
Is there a way to implement a minimum range for a SpanQuery or combination thereof?

For example, using: "The boy said hello to the boy"

I'd like to use a SpanNearQuery consisting of the terms "The" and "boy" that returns one span including the entire sentence but not a span for the first two words. Thus, I'd like to specify a minimum range of at least 1 and a maximum of, say, 5 here.

I note that using a SpanNotQuery consisting of two SpanNearQueries with the same terms and these above ranges does not work, as the desired longer span result will include the shorter one and get weeded out.

thanks -

C>T>

--
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999
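As far as I know, SpanNearQuery of this era only exposes a maximum slop, so the minimum asked about above has no direct API expression. The desired semantics can at least be modeled outside Lucene: enumerate candidate term-position pairs and keep only those whose in-between distance falls within [min, max]. A toy, plain-Java sketch over the example sentence (class and method names are mine, not Lucene's; real spans would come from the index, not hard-coded positions):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a "minimum range" span match: given the token positions of
// two terms, keep only the pairs whose in-between distance lies in
// [minSlop, maxSlop]. Illustrates the semantics, not the Lucene API.
public class MinRangeSpans {

    // Returns [start, end] position pairs whose gap (tokens strictly in
    // between) is within the requested range.
    public static List<int[]> match(int[] posA, int[] posB, int minSlop, int maxSlop) {
        List<int[]> spans = new ArrayList<>();
        for (int a : posA) {
            for (int b : posB) {
                int gap = Math.abs(b - a) - 1; // tokens strictly in between
                if (gap >= minSlop && gap <= maxSlop) {
                    spans.add(new int[] { Math.min(a, b), Math.max(a, b) });
                }
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        // "The boy said hello to the boy": the = positions 0 and 5, boy = 1 and 6
        int[] the = { 0, 5 };
        int[] boy = { 1, 6 };
        // minimum gap 1, maximum 5: the adjacent "The boy" pairs are dropped,
        // but the sentence-spanning pair (0..6) survives
        for (int[] s : match(the, boy, 1, 5)) {
            System.out.println(s[0] + ".." + s[1]);
        }
    }
}
```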
Re: recovering payload from fields
Hello,

To my knowledge, the character position of the tokens is not preserved by Lucene - only the ordinal position of tokens within a document / field is preserved. Thus you need to store this character offset information separately, say, as Payload data.

best,

C>T>

On Fri, Feb 26, 2010 at 3:41 PM, Christopher Condit wrote:
> I'm trying to store semantic information in payloads at index time. I
> believe this part is successful - but I'm having trouble getting access to
> the payload locations after the index is created. I'd like to know the
> offset in the original text for the token with the payload - and get this
> information for all payloads that are set in a Field even if they don't
> relate to the query. I tried (from the highlighting filter):
>
> TokenStream tokens = TokenSources.getTokenStream(reader, 0, "body");
> while (tokens.incrementToken()) {
>     TermAttribute term = tokens.getAttribute(TermAttribute.class);
>     if (toker.hasAttribute(PayloadAttribute.class)) {
>         PayloadAttribute payload = tokens.getAttribute(PayloadAttribute.class);
>         OffsetAttribute offset = toker.getAttribute(OffsetAttribute.class);
>     }
> }
>
> But the OffsetAttribute never seems to contain any information.
> In my token filter do I need to do more than:
>
> offsetAtt = addAttribute(OffsetAttribute.class);
>
> during construction in order to store Offset information?
>
> Thanks,
> -Chris
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org

--
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999
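Following up on the suggestion to carry character offsets in payloads: a payload is just a byte[], so the offset has to be serialized by hand. A minimal sketch of one possible encoding (4-byte big-endian int; the helper names are mine - in a real TokenFilter these bytes would be wrapped in a Payload and set on the token's PayloadAttribute at index time):

```java
import java.nio.ByteBuffer;

// Minimal sketch of packing a token's character start offset into the
// byte[] that a Lucene payload carries, and unpacking it at search time.
public class OffsetPayload {

    // Encode a character offset as a 4-byte big-endian payload.
    public static byte[] encode(int startOffset) {
        return ByteBuffer.allocate(4).putInt(startOffset).array();
    }

    // Decode the offset back out of the payload bytes.
    public static int decode(byte[] payload) {
        return ByteBuffer.wrap(payload).getInt();
    }

    public static void main(String[] args) {
        byte[] p = encode(1234);
        System.out.println(decode(p)); // round-trips the original offset
    }
}
```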
Re: recovering payload from fields
What I'd ideally like to do is to take a SpanQuery, loop over the PayloadSpans returned from SpanQuery.getPayloadSpans() and store all PayloadSpans for a given document in a Map by their doc id. Then later, after deciding in memory which documents I need, load the Payload data for just those PayloadSpans pulled out of my Map.

But it seems I can't do this, as loading Payload data is only done through the PayloadSpans iterator, so I must iterate through the entire collection to get to my PayloadSpan. Is there not a way to just save a PayloadSpan and load its payload data later as needed?

thanks,

C>T>

On Sat, Feb 27, 2010 at 5:42 AM, Michael McCandless wrote:
> You can also access payloads through the TermPositions enum, but, this
> is by term and then by doc.
>
> It sounds like you need to iterate through all terms sequentially in a
> given field in the doc, accessing offset & payload? In which case
> reanalyzing at search time may be the best way to go.
>
> You can store term vectors in the index, which will store offsets (if
> you ask it to), but, payloads are not currently stored with term
> vectors.
>
> Mike
>
> On Fri, Feb 26, 2010 at 7:42 PM, Christopher Condit wrote:
> >> Payload Data is accessed through PayloadSpans so using SpanQueries is
> >> the entry point it seems. There are tools like PayloadSpanUtil that
> >> convert other queries into SpanQueries for this purpose if needed, but
> >> the bottom line is that the API for Payloads goes through Spans.
> >
> > So there's no way to iterate through all the payloads for a given field?
> > I can't use the SpanQuery mechanism because in this case the entire field
> > will be displayed - and I can't search for "*". Is there some trick I'm not
> > thinking of?
> >
> >> this is the tip of the iceberg; a big dangerous iceberg...
> >
> > Yes - I'm beginning to see that...
> >
> > Thanks,
> > -Chris

--
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999
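Since payload bytes are only readable while iterating the spans, one workaround for the problem described above is to copy them eagerly into a per-document map during the single pass and consult the map afterwards. A standalone sketch of that caching pattern (plain collections; the actual Spans iteration is elided, and the class name is mine):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Pattern for the problem above: payloads can only be read while the
// spans iterator is positioned on them, so copy them into a per-document
// map in one pass and decide which documents to keep afterwards.
public class PayloadCache {

    private final Map<Integer, List<byte[]>> byDoc = new HashMap<>();

    // Called once per matching span position while iterating the spans.
    public void add(int docId, byte[] payload) {
        // Defensive copy: the iterator may reuse the underlying buffer
        // between positions.
        byte[] copy = payload.clone();
        byDoc.computeIfAbsent(docId, d -> new ArrayList<>()).add(copy);
    }

    // Consulted later, after deciding in memory which documents to keep.
    public List<byte[]> payloadsFor(int docId) {
        return byDoc.getOrDefault(docId, Collections.emptyList());
    }

    public static void main(String[] args) {
        PayloadCache cache = new PayloadCache();
        cache.add(7, new byte[] { 1, 2 });
        cache.add(7, new byte[] { 3 });
        System.out.println(cache.payloadsFor(7).size()); // two payloads for doc 7
    }
}
```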
custom scoring help
Hello,

I'm having a hard time implementing / understanding a very simple custom scoring situation.

I have created my Similarity class for testing which overrides all the relevant (I think) methods below, returning 1 for all but coord(int, int), which returns 1 / maxOverlap so scores are scaled between 0 and 1.

I call writer.setSimilarity(new HashHitSimilarity()) when indexing and searcher.setSimilarity(new HashHitSimilarity()) when searching.

The similarity is definitely affecting the scoring but not how I expect. I am looking for a straight average of the hits calculated, i.e. totalHits for a doc / totalHits in search.

The above score with my test search and index of 6 docs should return the scores below for all 6 documents in my index:

0.8387096774193549
0.3548387096774194
0.3548387096774194
0.25806451612903225
0.1935483870967742
0.12903225806451613

but the scores appear "stretched" and return these instead, though I'm unsure as to where this "stretching" happens:

0.9078212
0.75977653
0.57541895
0.5670391
0.5223464
0.37150836

public class HashHitSimilarity extends Similarity {

    private static final long serialVersionUID = 811419737205284733L;

    public float tf(float freq) {
        return 1f;
    }

    public float lengthNorm(String fieldName, int numTokens) {
        return 1f;
    }

    public float queryNorm(float sumOfSquaredWeights) {
        return 1f;
    }

    @Override
    public float coord(int overlap, int maxOverlap) {
        return 1f / (float) maxOverlap;
    }

    @Override
    public float idf(int docFreq, int numDocs) {
        return 1f;
    }

    @Override
    public float sloppyFreq(int distance) {
        return 0f;
    }
}

--
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999
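For reference, with tf, idf, lengthNorm and queryNorm all forced to 1 and coord returning 1/maxOverlap, the final score should reduce to overlap / maxOverlap. The expected values listed above are consistent with a 31-clause query (my inference from the fractions, e.g. 26/31 ≈ 0.8387); a standalone arithmetic check:

```java
// The custom Similarity above sets tf, idf and both norms to 1 and
// coord = 1/maxOverlap, so the intended score collapses to the fraction
// of query terms a document matches: overlap / maxOverlap.
public class CoordOnlyScore {

    public static double score(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    public static void main(String[] args) {
        // The expected values in the mail line up with maxOverlap = 31
        // (an inference from the fractions, e.g. 26/31, 11/31, 8/31).
        System.out.println(score(26, 31));
        System.out.println(score(11, 31));
        System.out.println(score(8, 31));
    }
}
```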
Re: custom scoring help
This code is in fact working. I had an error in my test case. Things seem to work as advertised.

sorry / thanks -

C>T>

On Fri, Apr 2, 2010 at 10:20 AM, Christopher Tignor wrote:
> Hello,
>
> I'm having a hard time implementing / understanding a very simple custom
> scoring situation.
>
> ...

--
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999