Getting Payload data from BooleanQuery results

2009-09-24 Thread Christopher Tignor
Hello,

I have indexed documents with two fields, "ARTICLE" for an article of text
and "PUB_DATE" for the article's publication date.

Given a specific single word, I want to search my index for all documents
that contain this word within the last two weeks, and have them sorted by
date:

TermQuery tq = new TermQuery(new Term("ARTICLE",mySearchWord));
Calendar cal = Calendar.getInstance();
// Date of last two weeks
cal.add(Calendar.DATE, -14);
ConstantScoreRangeQuery csrq = new
ConstantScoreRangeQuery("PUB_DATE",DateTools.dateToString(cal.getTime(),DateTools.Resolution.HOUR),null,true,true);
BooleanQuery bq = new BooleanQuery();
bq.add(tq, BooleanClause.Occur.MUST);
bq.add(csrq, BooleanClause.Occur.MUST);
TopFieldDocs docs = searcher.search(bq, null, 10, new Sort("PUB_DATE"));

My goal now is to search through the recovered documents and obtain the Term
instances (each term position) within each document, then retrieve the payload
data associated with each Term instance.

The trouble I am having is in getting access to the TermPositions following
such a query.
If I only needed to query on a single term (without my date restriction), I
could easily do (and have done) this:

SpanTermQuery query = new SpanTermQuery(new Term("ARTICLE",mySearchWord));
TermSpans spans = (TermSpans) query.getSpans(indexReader);
tp = spans.getPositions();

and then iterate over each position calling

tp.getPayload(dataBuffer,0);

for example.
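Spelled out, that single-term spans-plus-payloads loop might look like the following. This is a sketch against the Lucene 2.4-era API used in this thread; `indexReader` and `mySearchWord` come from the message above, and the buffer size is an assumption:

```java
// Sketch: iterate a term's positions and read each payload (Lucene 2.4-era API)
SpanTermQuery query = new SpanTermQuery(new Term("ARTICLE", mySearchWord));
TermSpans spans = (TermSpans) query.getSpans(indexReader);
TermPositions tp = spans.getPositions();
byte[] dataBuffer = new byte[256]; // assumes payloads fit in 256 bytes
while (tp.next()) {                // advance to the next matching document
    int freq = tp.freq();
    for (int i = 0; i < freq; i++) {
        int position = tp.nextPosition();
        if (tp.isPayloadAvailable()) {
            int length = tp.getPayloadLength();
            // getPayload() copies the bytes in; it may return a larger
            // replacement array if dataBuffer is too small
            byte[] payload = tp.getPayload(dataBuffer, 0);
            // ... decode `length` bytes of payload for this position ...
        }
    }
}
```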

But alas, I cannot seem to get access to any TermPositions from my above
BooleanQuery.
I have looked into the contributed SpanExtractor class, but
ConstantScoreRangeQuery seems unsupported,
and I am at a loss as to how best to use Spans here.

Any help appreciated,

C>T>

-- 
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999


Re: Getting Payload data from BooleanQuery results

2009-09-24 Thread Christopher Tignor
thanks for the tip.

I don't see a way to integrate the QueryWrapperFilter (or any Filter) into
SpanTermQuery.getSpans(indexReader), however.
I can use a SpanQuery with an IndexSearcher as usual, but that leaves me
back where I started.  Any thoughts?

Also,  I will need to sort these results by date so that the most recent,
say 5 are returned...

thanks again,

C>T>

On Thu, Sep 24, 2009 at 3:22 PM, Chris Hostetter
wrote:

>
> : But alas, I cannot seem to get access to any TermPositions from my above
> : BooleanQuery.
>
> I would suggest refactoring your "date" restriction into a Filter (there's
> a fairly easy-to-use Filter that wraps a Query) and then executing a
> SpanTermQuery just as you describe.
>
>
> -Hoss
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
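Nothing in the thread shows Hoss's suggestion end-to-end, so the following is a hedged sketch: wrap the date restriction in a QueryWrapperFilter, then walk the spans and keep only documents the filter accepts. It uses Lucene 2.4-era APIs; `csrq`, `indexReader`, and `mySearchWord` are the names from the first message, and `Filter.bits()` (deprecated in 2.4 in favor of `getDocIdSet()`) is used only to keep the sketch short:

```java
// Sketch: apply the date restriction as a Filter, then filter spans manually
Filter dateFilter = new QueryWrapperFilter(csrq); // the ConstantScoreRangeQuery from above
BitSet accepted = dateFilter.bits(indexReader);   // deprecated in 2.4, but simple

SpanTermQuery stq = new SpanTermQuery(new Term("ARTICLE", mySearchWord));
TermSpans spans = (TermSpans) stq.getSpans(indexReader);
while (spans.next()) {
    if (!accepted.get(spans.doc())) {
        continue; // outside the date range: skip this occurrence
    }
    // spans.start()/spans.end() and the payloads from
    // spans.getPositions() are valid for this document here
}
```

Sorting by date would still have to be done by the caller, e.g. by collecting the surviving doc ids and ordering them by the stored PUB_DATE field.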




TermPositions with custom Tokenizer

2009-10-01 Thread Christopher Tignor
Hello,

I have created a custom Tokenizer and am trying to set and extract my own
positions for each Token using:

reusableToken.reinit(word.getWord(),tokenStart,tokenEnd);

Later, when querying my index using a SpanTermQuery, the start() and end()
values don't correspond to these offsets but instead seem to correspond to
the order in which the token was tokenized during the indexing process, e.g.

start: 5
end: 6

for a given token.  I realize that these values come from TermPositions,
but how can I effectively get my custom token start and end offsets into
TermPositions for recovery?
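For context (this is not from the thread itself, so treat it as a hedged note): Span start()/end() report token positions, i.e. the ordinals the indexer assigns via position increments, not the character offsets passed to reinit(). Character offsets survive only if the field stores term vectors with offsets, or if you copy them into payloads yourself. A minimal sketch of the term-vector route at indexing time, with `writer` and `articleText` assumed:

```java
// Sketch: keep the tokenizer's character offsets by storing term vectors
Document doc = new Document();
doc.add(new Field("ARTICLE", articleText,
                  Field.Store.NO,
                  Field.Index.ANALYZED,
                  Field.TermVector.WITH_POSITIONS_OFFSETS)); // offsets recoverable later
writer.addDocument(doc);
```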

thanks -

C>T>



IndexWriter optimize() deadlock

2009-10-16 Thread Christopher Tignor
Hello,

I am trying to track down the cause of my code hanging on calling
IndexWriter.optimize() at its doWait() method.
It appears that it is waiting on other merges to happen, which is a bit
confusing to me:

My application is a simple producer-consumer model where documents are added
to a queue by producers, and then one consumer with one IndexWriter (the only
one in the application) periodically calls addDocument() on a batch of these
jobs and then calls optimize(), commit(), and then close().  There is only
one thread running the consumer, so I am confused as to how the IndexWriter
might be deadlocking itself.  Indeed, this is the only thread active when the
deadlock occurs, so it seems to be a problem of reentry.

Importantly, the deadlocking occurs only when the thread is trying to
shut down - that is, the Thread running this Lucene consumer has a Future
whose interrupting cancel(true) method has been called.  Is it possible that
an internal Lucene lock is obtained during addDocument() and, on interruption,
is never released, so that the subsequent optimize() call hangs?  This doesn't
appear to be happening...

Any help appreciated.

thanks,

C>T>

what might I be missing here?



Re: IndexWriter optimize() deadlock

2009-10-16 Thread Christopher Tignor
thanks for getting back.

I do not lock on the IndexWriter object itself, but all methods in my
consumer class that use IndexWriter are synchronized (locking my singleton
consumer object itself).
The thread is waiting at IndexWriter.doWait().  What might cause this?

thanks -

C>T>

On Fri, Oct 16, 2009 at 12:58 PM, Uwe Schindler  wrote:

> Do you use the IndexWriter as mutex in a synchronized() block? This is not
> supported and may hang. Never lock on IndexWriter instances. IndexWriter
> itself is thread safe.
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>




Re: IndexWriter optimize() deadlock

2009-10-16 Thread Christopher Tignor
It doesn't look like my Future.cancel(true) is actually interrupting the
thread.  It only does so "if necessary" and in this case seems to be letting
the Thread finish gracefully without need for interruption.

The stack trace leading up to the hanging IndexWriter.optimize() method is
below, though not terribly useful, I imagine.  Here only 5 documents are
being added to an index taking up only 39.2 MB on disk.  Again, this
deadlocking only happens after I use the Future for this task to cancel the
task...

Thread [pool-3-thread-5] (Suspended)
    IndexWriter.doWait() line: 4494
    IndexWriter.optimize(int, boolean) line: 2283
    IndexWriter.optimize(boolean) line: 2218
    IndexWriter.optimize() line: 2198
    LuceneResultsPersister.commit() line: 97
    PersistenceJobQueue.persistAndCommitBatch(PersistenceJobConsumer) line: 105
    PersistenceJobConsumer.consume() line: 46
    PersistenceJobConsumer.run() line: 67
    Executors$RunnableAdapter.call() line: 441
    FutureTask$Sync.innerRun() line: 303
    FutureTask.run() line: 138
    ThreadPoolExecutor$Worker.runTask(Runnable) line: 886
    ThreadPoolExecutor$Worker.run() line: 908
    Thread.run() line: 619

thanks,

C>T>

On Fri, Oct 16, 2009 at 1:58 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> My guess is it's the invocation of Thread.interrupt (which
> Future.cancel(true) calls if the task is running) that led to the
> deadlock.
>
> Is it possible to get the stack trace of the thrown exception when the
> thread was interrupted?  Maybe indeed something in IW isn't cleaning
> up its state on being interrupted.
>
> Mike
>

Re: IndexWriter optimize() deadlock

2009-10-16 Thread Christopher Tignor
After tracing through the Lucene source more, it seems that what is happening
is: after I call Future.cancel(true) on my parent thread, optimize() is
called, and this method launches its own thread using a
ConcurrentMergeScheduler$MergeThread to do the actual merging.

When this Thread comes around to calling mergeInit() on my index writer - a
synchronized method - it hangs.  For some reason it seems to no longer hold
the mutex, perhaps?  Trace of this thread's stall below...

Daemon Thread [Lucene Merge Thread #0] (Suspended)
IndexWriter.mergeInit(MergePolicy$OneMerge) line: 3971
IndexWriter.merge(MergePolicy$OneMerge) line: 3879
ConcurrentMergeScheduler.doMerge(MergePolicy$OneMerge) line: 205
ConcurrentMergeScheduler$MergeThread.run() line: 260

thanks again,

C>T>


On Fri, Oct 16, 2009 at 3:53 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> But if the Future.cancel call turns out to be a no-op (simply waits
> instead of interrupting the thread), how could it be that the deadlock
> only happens when you call it?  Weird.  Are you really sure it's not
> actually calling Thread.interrupt?
>
> That stack trace looks like a normal "optimize is waiting for the
> background merges to complete".  Is it possible your background merges
> are hitting exceptions?  You should see them on your error console if
> so...
>
> Mike
>

Re: IndexWriter optimize() deadlock

2009-10-16 Thread Christopher Tignor
Indeed, it looks like the MergeThread (started, after handing off to
ConcurrentMergeScheduler, by the thread calling IndexWriter.optimize())
is waiting on the mutex for the IndexWriter to be free so it can use
the object to call mergeInit().

The IndexWriter, however, has entered a synchronized() waiting loop, waking up
every second (in doWait()) and checking whether there are any running merges
left - which of course there are, as the thread responsible for doing the
merging can't get in.  Deadlocked stack traces are below:

Thread [pool-2-thread-5] (Suspended)
    owns: IndexWriter (id=71)
    owns: LuceneResultsPersister (id=85)
    IndexWriter.optimize(int, boolean) line: 2283
    IndexWriter.optimize(boolean) line: 2218
    IndexWriter.optimize() line: 2198
    LuceneResultsPersister.commit() line: 97
    PersistenceJobQueue.persistAndCommitBatch(PersistenceJobConsumer) line: 105
    PersistenceJobConsumer.consume() line: 46
    PersistenceJobConsumer.run() line: 67
    Executors$RunnableAdapter.call() line: 441
    FutureTask$Sync.innerRun() line: 303
    FutureTask.run() line: 138
    ThreadPoolExecutor$Worker.runTask(Runnable) line: 886
    ThreadPoolExecutor$Worker.run() line: 908
    Thread.run() line: 619

Daemon Thread [Lucene Merge Thread #0] (Suspended)
    waiting for: IndexWriter (id=71)
    IndexWriter.mergeInit(MergePolicy$OneMerge) line: 3971
    IndexWriter.merge(MergePolicy$OneMerge) line: 3879
    ConcurrentMergeScheduler.doMerge(MergePolicy$OneMerge) line: 205
    ConcurrentMergeScheduler$MergeThread.run() line: 260

I don't really understand how this code is supposed to work (and has worked)
before this, and what the problem thus might be here:

In IndexWriter.optimize(), at line 2263, we have the synchronized block where
doWait becomes true using the parameter-less call to optimize():

if (doWait) {
  synchronized(this) {
    while(true) {
      if (mergeExceptions.size() > 0) {
        // Forward any exceptions in background merge
        // threads to the current thread:
        final int size = mergeExceptions.size();
        for(int i=0;i<size;i++) {
          final MergePolicy.OneMerge merge = (MergePolicy.OneMerge)
            mergeExceptions.get(0);
          if (merge.optimize) {
            IOException err = new IOException("background merge hit
              exception: " + merge.segString(directory));
            final Throwable t = merge.getException();
            if (t != null)
              err.initCause(t);
            throw err;
          }
        }
      }

      if (optimizeMergesPending())
        doWait();
      else
        break;
    }
  }

It is holding this lock while the thread it started to do the merging is
trying to call its mergeInit() method.

Any thoughts?

thanks,

C>T>


Re: IndexWriter optimize() deadlock

2009-10-16 Thread Christopher Tignor
I discovered the problem and fixed its effect on my code:

Using the source for Lucene version 2.4.1:
in IndexWriter.optimize() there is a call to doWait() on line 2283.
This method attempts to wait for a second in order to give the other threads
it has spawned a chance to acquire its mutex and complete index merges.
It checks whether there are any merges left after it wakes up; if there
aren't, it proceeds.

However, using a Future object associated with this thread and calling
cancel(true) doesn't allow the Thread to enter Object.wait() (or the
ExecutorService observing this Future immediately wakes it up, or something),
so it returns immediately, and the thread holding the lock on the IndexWriter
never cedes its lock to the threads running a merge.

I re-wrote my code so that it doesn't need to call the interrupting
version of Future.cancel() to solve it, i.e. it uses Future.cancel(false)
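The mechanism can be seen in miniature with plain java.util.concurrent, no Lucene involved. This is an illustrative sketch, not code from the thread: a thread whose interrupt status is already set (as after Future.cancel(true)) cannot block in Object.wait() - the wait throws InterruptedException immediately, which is exactly the "doWait() never actually waits" behavior described above.

```java
// Illustrative sketch (not from the thread): a pending interrupt, such as
// one set by Future.cancel(true), makes Object.wait() throw immediately
// instead of blocking.
public class InterruptedWaitDemo {
    public static void main(String[] args) throws Exception {
        final Object lock = new Object();
        Thread t = new Thread(new Runnable() {
            public void run() {
                // Simulate Future.cancel(true): set this thread's interrupt status.
                Thread.currentThread().interrupt();
                synchronized (lock) {
                    try {
                        lock.wait(1000); // would normally block for up to 1s
                        System.out.println("waited normally");
                    } catch (InterruptedException e) {
                        // Thrown at once because the interrupt was already pending.
                        System.out.println("wait() threw immediately: interrupt pending");
                    }
                }
            }
        });
        t.start();
        t.join();
    }
}
```

A loop that calls wait() in this state spins without ever sleeping, starving any thread that needs the same monitor - the starvation seen in the merge-thread traces above.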

thanks,

C>T>


Re: IndexWriter optimize() deadlock

2009-10-16 Thread Christopher Tignor
The doWait() call is synchronized on IndexWriter, but it is also, as you
suggest, in a loop in a block synchronized on IndexWriter.

The doWait() call returns immediately - still holding the IndexWriter lock
from the loop in the synchronized block, as my stack trace shows - without
blocking and giving the merger thread a chance to merge.  It keeps repeating
this doWait procedure, which unfortunately never actually waits, and the
MergerThread is starved.
C>T>

On Fri, Oct 16, 2009 at 6:53 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> I'm glad you worked around it!  But I don't fully understand the
> issue.  That doWait is inside a sync(writer) block... if the Future
> manages to interrupt it, then that thread will release the lock when
> it exits that sync block.
>
> Actually, if the thread was indeed interrupted, you may be hitting this:
>
>http://issues.apache.org/jira/browse/LUCENE-1573
>
> (which is fixed in 2.9).
>
> Mike
>

Token character positions

2009-11-17 Thread Christopher Tignor
Hello,

Hoping someone might clear up a question for me:

When Tokenizing we provide the start and end character offsets for each
token locating it within the source text.

If I tokenize the text "word" and then search for the term "word" in the
same field, how can I recover this character offset information in the
matching documents to precisely locate the word?  I have been storing this
character info myself using payload data, but if Lucene stores it, then I am
doing so needlessly.  If recovering this character offset info isn't
possible, what is this character offset info used for?
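No answer appears in this archive, so the following is a hedged note: the character offsets are not recoverable from TermPositions, but they are recoverable if the field was indexed with term vectors that include offsets (Field.TermVector.WITH_POSITIONS_OFFSETS). A sketch of the retrieval side against the Lucene 2.4-era API, with `indexReader` and a hit `docId` assumed:

```java
// Sketch: read back character offsets stored in a term vector
TermPositionVector tpv =
    (TermPositionVector) indexReader.getTermFreqVector(docId, "ARTICLE");
int idx = tpv.indexOf("word");                        // the term searched for
TermVectorOffsetInfo[] offsets = tpv.getOffsets(idx); // one entry per occurrence
for (int i = 0; i < offsets.length; i++) {
    int start = offsets[i].getStartOffset(); // character offsets from tokenization
    int end   = offsets[i].getEndOffset();
    // locate the word precisely in the source text via [start, end)
}
```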

thanks so much,

C>T>





recovering terms hit from wildcard queries

2009-11-18 Thread Christopher Tignor
Hello,

Firstly, thanks for all the good answers and support from this mailing list.

Would it be possible and if so, what would be the best way to recover the
terms filled in for a wildcard query following a successful search?

For example:
If I parse and execute a query using the string "my*" and get a collection
of document ids that match this search,
is there a good way to determine whether this query found "myopic", "mylar"
or some other term without loading/searching the returned documents?

thanks!

C>T>



Phrase query with terms at same location

2009-11-18 Thread Christopher Tignor
Hello,

I have indexed words in my documents with part of speech tags at the same
location as these words using a custom Tokenizer as described, very
helpfully, here:

http://mail-archives.apache.org/mod_mbox/lucene-java-user/200607.mbox/%3c20060712115026.38897.qm...@web26002.mail.ukl.yahoo.com%3e

I would like to do a search that retrieves documents when a given word is
used with a specific part of speech, e.g. all docs where "report" is used as
a noun.

I was hoping I could use something like a PhraseQuery with "report _n" (_n
is my noun part of speech tag) with some sort of identifier that describes
the words as having to be at the same location - like a null slop or
something.

Any thoughts on how to do this?

thanks so much,

C>T>

-- 
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999


Re: recovering terms hit from wildcard queries

2009-11-18 Thread Christopher Tignor
Thanks - that might work, though I believe it would produce many queries
instead of just one in order to track the specific Term used to match a given
hit document.

I presume, then, that I would get all the actual terms that my
wildcard-containing string refers to from the WildcardTermEnum, and then use
each of them in a separate query so I could know precisely which Term is
associated with a given document.

thanks,

C>T>

On Wed, Nov 18, 2009 at 5:16 PM, Simon Willnauer <
simon.willna...@googlemail.com> wrote:

> You could use WildcardTermEnum directly and pass your term and the
> reader to it. This will allow you to enumerate all terms that match
> your wildcard term.
> Is that what are you asking for?
>
> simon
>
> On Wed, Nov 18, 2009 at 10:39 PM, Christopher Tignor
>  wrote:
> > Hello,
> >
> > Firstly, thanks for all the good answers and support form this mailing
> list.
> >
> > Would it be possible and if so, what would be the best way to recover the
> > terms filled in for a wildcard query following a successful search?
> >
> > For example:
> > If I parse and execute a query using the string "my*" and get a
> collection
> > of document ids that match this search,
> > is there a good way to determine whether this query found "myopic",
> "mylar"
> > or some other term without loading/searching the returned documents?
> >
> > thanks!
> >
> > C>T>
> >
> > --
> > TH!NKMAP
> >
> > Christopher Tignor | Senior Software Architect
> > 155 Spring Street NY, NY 10012
> > p.212-285-8600 x385 f.212-285-8999
> >
>
> -----
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-- 
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999


Re: Phrase query with terms at same location

2009-11-19 Thread Christopher Tignor
Thanks, Erick -

Indeed every word will have a part of speech token, but is this how the slop
actually works?  My understanding was that if I have two tokens at the same
location, then each will not affect searches involving the other in terms of
slop, as slop indicates the number of words *between* search terms in a
phrase.

Are tokens at the same location actually adjacent in their ordinal values,
thus affecting the slop as you describe?

If so, is there a predictable way to determine which comes before the other
- perhaps the order in which they are inserted when being tokenized?

thanks,

C>T>

On Thu, Nov 19, 2009 at 8:35 AM, Erick Erickson wrote:

> If I'm reading this right, your tokenizer creates two tokens. One
> "report" and one "_n"... I suspect if so that this will create some
> "interesting"
> behaviors. For instance, if you put two tokens in place, are you going
> to double the slop when you don't care about part of speech? Is every
> word going to get a marker? etc.
>
> I'm not sure payloads would be useful here, but you might check it out...
>
> What I'd think about, though, is a variant of synonyms. That is, index
> report and report_n (note no space) at the same location. Then, when
> you wanted to create a part-of-speech-aware query, you'd attach the
> various markers to your terms (_n, _v, _adj, _adv etc.) and not have to
> worry about unexpected side-effects.
>
> HTH
> Erick
>
> On Wed, Nov 18, 2009 at 5:20 PM, Christopher Tignor  >wrote:
>
> > Hello,
> >
> > I have indexed words in my documents with part of speech tags at the same
> > location as these words using a custom Tokenizer as described, very
> > helpfully, here:
> >
> >
> >
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200607.mbox/%3c20060712115026.38897.qm...@web26002.mail.ukl.yahoo.com%3e
> >
> > I would like to do a search that retrieves documents when a given word is
> > used with a specific part of speech, e.g. all docs where "report" is used
> > as
> > a noun.
> >
> > I was hoping I could use something like a PhraseQuery with "report _n"
> (_n
> > is my noun part of speech tag) with some sort of identifier that
> describes
> > the words as having to be at the same location - like a null slop or
> > something.
> >
> > Any thoughts on how to do this?
> >
> > thanks so much,
> >
> > C>T>
> >
> > --
> > TH!NKMAP
> >
> > Christopher Tignor | Senior Software Architect
> > 155 Spring Street NY, NY 10012
> > p.212-285-8600 x385 f.212-285-8999
> >
>



-- 
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999


Re: Phrase query with terms at same location

2009-11-19 Thread Christopher Tignor
Thanks again for this.

I would like to be able to do several things with this data, if possible.
As per Mark's post, I'd like to be able to query for phrases like "He _v"~1
(where _v is my verb part of speech token) to recover strings like "He later
apologized".

This already, in fact, seems to be working.  But I'd also like to be able to
say: give me all the times "report" is used as a noun, i.e. when "report" and
"_n" occur at the same location.

But isn't the slop for PhraseQueries the "edit distance"
<http://content18.wuala.com/contents/cborealis/Docs/lucene/api/org/apache/lucene/search/PhraseQuery.html#setSlop%28int%29>
and shouldn't "report _n"~1 achieve my above goal, moving "_n" onto the
location of "report" in one edit step?  If so, it seems I would need to be
able to specify that the query is restricted from also interpreting the
slop the other way, i.e. also recovering "report to him", allowing one
term between report and him.  Perhaps PhraseQuery can't do this?

It seems like your suggestion of creating part-of-speech tag prefixed tokens
might be the only way to accommodate both, e.g. creating a token
"_n_reporting" as well as "reporting" and maybe also an additional "_n"
token to avoid having to use more expensive Wildcard matches to recover all
nouns.  The only problem here is that I also have *other* tags at the same
location adding semantics to "reporting" as encountered in the text: its
stemmed form "^report", for example, as well as more fine-grained part of
speech tags from the NUPOS set, e.g. "_n2_", and I can imagine additional
future semantics.  Creating new combinatorial terms for all these semantic
tags explodes the token count exponentially...

thanks -

C>T>


On Thu, Nov 19, 2009 at 10:30 AM, Erick Erickson wrote:

> Ahhh, I should have followed the link. I was interpreting your first note
> as
> emitting two tokens NOT at the same offset. My mistake, ignore my nonsense
> about unexpected consequences. Your original assumption is correct, zero
> offsets are pretty transparent.
>
> What do you really want to do here? Mark's email (at the link) allows
> you to create queries queries expressing "find all phrases
> of the form noun-verb-adverb" say. The slop allows for intervening words.
>
> Your original post seems to want different semantics.
>
> << is
> used with a specific part of speech, e.g. all docs where "report" is used
> as
> a noun>>>.
>
> For that, my suggestion seems simpler, which is not surprising since it
> addresses a less general problem. So instead of including a general
> part of speech token, just suffix your original word with your marker and
> use that for your "synonym.
>
> Then expressing your intent is simply tacking on the part of speech
> marker to the words you care about (e.g. report_n when you wanted
> report as a noun). No phrases or slop required, at the expense of
> more terms.
>
> H, if you wanted to, say, "find all the nouns in the index", you
> could *prefix* the word (e.g. n_report) which would group all the
> nouns together in the term enumerations
>
> Sorry for the confusion
> Erick
>
>
> On Thu, Nov 19, 2009 at 9:38 AM, Christopher Tignor  >wrote:
>
> > Thanks, Erick -
> >
> > Indeed every word will have a part of speech token but Is this how the
> slop
> > actually works?  My understanding was that if I have two tokens in the
> same
> > location then each will not effect searches involving other in terms of
> the
> > slop as slop indicates the number of words *between* search terms in a
> > phrase.
> >
> >
> Are tokens at the same location actually adjacent in their ordinal values,
> > thus affecting the slop as you describe?
> >
> > If so, Is there a predictable way to determine which comes before the
> other
> > - perhaps the order they are inserted when being tokenized?
> >
> > thanks,
> >
> > C>T>
> >
> > On Thu, Nov 19, 2009 at 8:35 AM, Erick Erickson  > >wrote:
> >
> > > If I'm reading this right, your tokenizer creates two tokens. One
> > > "report" and one "_n"... I suspect if so that this will create some
> > > "interesting"
> > > behaviors. For instance, if you put two tokens in place, are you going
> > > to double the slop when you don't care about part of speech? Is every
> > > word going to get a marker? etc.
> > >
> > > I'm not sure payloads would be useful here, but you might check it
> out.

SpanQuery for Terms at same position

2009-11-19 Thread Christopher Tignor
Hello,

I would like to search for all documents that contain both "plan" and "_v"
(my part of speech token for verb) at the same position.
I have tokenized the documents accordingly so these tokens exist at the
same location.

I can achieve this programmatically using PhraseQueries by adding the Terms
explicitly at the same position, but I need to be able to recover the Payload
data for each term found within the matched instance of my query.

Unfortunately the PayloadSpanUtil doesn't seem to return the same results as
the PhraseQuery, possibly because it is converting it into Spans first,
which do not support searching for Terms at the same document position?

Any help appreciated.

thanks,

C>T>

-- 
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999


Re: SpanQuery for Terms at same position

2009-11-23 Thread Christopher Tignor
Tested it out.  It doesn't work.  A slop of zero indicates no words between
the provided terms; e.g. my query of "plan" "_n" returns entries like
"contingency plan".

My work-around for this problem is to use a PhraseQuery, where you can
explicitly set Terms to occur at the same location, to recover the desired
document ids. Then, because I need the payload data for each match, I create
a SpanTermQuery for each of the individual terms used, use a modified version
of PayloadSpanUtil to recover only the PayloadSpans for each query from the
document ids collected above, and then find the intersection of all these
sets, making sure to factor in where each span starts (the end will just be
one ordinal value after) within each document to ensure they're at the same
position.
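The intersection step of this work-around can be sketched with plain collections: collect each term's span start positions within a document, then keep only the positions shared by all terms. The names below are illustrative, not Lucene API:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

public class SamePositionIntersect {
    // Given each term's span start positions within one document, keep only
    // the positions where every term starts - the "same position" condition.
    static SortedSet<Integer> samePositions(List<Set<Integer>> perTermStarts) {
        SortedSet<Integer> common = new TreeSet<>(perTermStarts.get(0));
        for (Set<Integer> starts : perTermStarts.subList(1, perTermStarts.size())) {
            common.retainAll(starts);
        }
        return common;
    }

    public static void main(String[] args) {
        Set<Integer> plan = new HashSet<>(Arrays.asList(3, 8, 15));    // "plan" starts
        Set<Integer> verbTag = new HashSet<>(Arrays.asList(8, 12, 15)); // "_v" starts
        System.out.println(samePositions(Arrays.asList(plan, verbTag))); // prints [8, 15]
    }
}
```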

Definitely more work than it needs to be I think.  Still looking for another
way.

C>T>


On Sat, Nov 21, 2009 at 10:47 PM, Adriano Crestani <
adrianocrest...@gmail.com> wrote:

> Hi,
>
> I didn't test, but you might want to try SpanNearQuery and set slop to
> zero.
> Give it a try and let me know if it worked.
>
> Regards,
> Adriano Crestani
>
> On Thu, Nov 19, 2009 at 7:28 PM, Christopher Tignor  >wrote:
>
> > Hello,
> >
> > I would like to search for all documents that contain both "plan" and
> "_v"
> > (my part of speech token for verb) at the same position.
> > I have tokenized the documents accordingly so these tokens exists at the
> > same location.
> >
> > I can achieve programaticaly using PhraseQueries by adding the Terms
> > explicitly at the same position but I need to be able to recover the
> > Payload
> > data for each
> > term found within the matched instance of my query.
> >
> > Unfortunately the PayloadSpanUtil doesn't seem to return the same results
> > as
> > the PhraseQuery, possibly becuase it is converting it inoto Spans first
> > which do not support searching for Terms at the same document position?
> >
> > Any help appreciated.
> >
> > thanks,
> >
> > C>T>
> >
> > --
> > TH!NKMAP
> >
> > Christopher Tignor | Senior Software Architect
> > 155 Spring Street NY, NY 10012
> > p.212-285-8600 x385 f.212-285-8999
> >
>



-- 
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999


Re: SpanQuery for Terms at same position

2009-11-23 Thread Christopher Tignor
A slop of -1 doesn't work either.  I get no results returned.

This would be a *really* helpful feature for me if someone might suggest an
implementation, as I would really like to be able to do arbitrary span
searches where tokens may be at the same position, as well as ones where the
ordering of subsequent terms is restricted as per the normal span API.

thanks,

C>T>

On Sun, Nov 22, 2009 at 7:50 AM, Paul Elschot wrote:

> Op zondag 22 november 2009 04:47:50 schreef Adriano Crestani:
> > Hi,
> >
> > I didn't test, but you might want to try SpanNearQuery and set slop to
> zero.
> > Give it a try and let me know if it worked.
>
> The slop is the number of positions "in between", so zero would still be
> too
> much to only match at the same position.
>
> SpanNearQuery may or may not work for a slop of -1, but one could try
> that for both the ordered and unordered cases.
> One way to do that is to start from the existing test cases.
>
> Regards,
> Paul Elschot
>
> >
> > Regards,
> > Adriano Crestani
> >
> > On Thu, Nov 19, 2009 at 7:28 PM, Christopher Tignor <
> ctig...@thinkmap.com>wrote:
> >
> > > Hello,
> > >
> > > I would like to search for all documents that contain both "plan" and
> "_v"
> > > (my part of speech token for verb) at the same position.
> > > I have tokenized the documents accordingly so these tokens exists at
> the
> > > same location.
> > >
> > > I can achieve programaticaly using PhraseQueries by adding the Terms
> > > explicitly at the same position but I need to be able to recover the
> > > Payload
> > > data for each
> > > term found within the matched instance of my query.
> > >
> > > Unfortunately the PayloadSpanUtil doesn't seem to return the same
> results
> > > as
> > > the PhraseQuery, possibly becuase it is converting it inoto Spans first
> > > which do not support searching for Terms at the same document position?
> > >
> > > Any help appreciated.
> > >
> > > thanks,
> > >
> > > C>T>
> > >
> > > --
> > > TH!NKMAP
> > >
> > > Christopher Tignor | Senior Software Architect
> > > 155 Spring Street NY, NY 10012
> > > p.212-285-8600 x385 f.212-285-8999
> > >
> >
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-- 
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999


Re: SpanQuery for Terms at same position

2009-11-23 Thread Christopher Tignor
Thanks so much for this.

Using an un-ordered query, the -1 slop indeed returns the correct results,
matching tokens at the same position.

I tried the same query but ordered both after and before rebuilding the
source with Paul's changes to NearSpansOrdered but the query was still
failing, returning no results.
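For what it's worth, the unordered result is consistent with the usual unordered slop arithmetic, which measures the gap between spans as the width they cover minus their combined lengths. The following is a sketch of that check under those assumptions, not the actual NearSpansUnordered source:

```java
public class UnorderedSlopCheck {
    // Sketch of an unordered near-span slop test: matchLength is the width
    // the spans cover minus the total length of the sub-spans; the spans
    // match when matchLength <= slop.
    static boolean matches(int[] starts, int[] ends, int slop) {
        int minStart = Integer.MAX_VALUE;
        int maxEnd = Integer.MIN_VALUE;
        int totalLength = 0;
        for (int i = 0; i < starts.length; i++) {
            minStart = Math.min(minStart, starts[i]);
            maxEnd = Math.max(maxEnd, ends[i]);
            totalLength += ends[i] - starts[i];
        }
        int matchLength = (maxEnd - minStart) - totalLength;
        return matchLength <= slop;
    }

    public static void main(String[] args) {
        // "plan" and "_v" both at position 8: width 1, total length 2 -> matchLength -1
        System.out.println(matches(new int[]{8, 8}, new int[]{9, 9}, -1)); // true
        // adjacent terms ("contingency plan"): matchLength 0, which is not <= -1
        System.out.println(matches(new int[]{7, 8}, new int[]{8, 9}, -1)); // false
    }
}
```

So with slop -1, only spans that actually overlap can satisfy the check, which is why same-position tokens match while adjacent ones do not.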

C>T>

On Mon, Nov 23, 2009 at 11:59 AM, Mark Miller  wrote:

> Your trying -1 with ordered right? Try it with non ordered.
>
> Christopher Tignor wrote:
> > A slop of -1 doesn't work either.  I get no results returned.
> >
> > this would be a *really* helpful feature for me if someone might suggest
> an
> > implementation as I would really like to be able to do arbitrary span
> > searches where tokens may be at the same position and also in other
> > positions where the ordering of subsequent terms may be restricted as per
> > the normal span API.
> >
> > thanks,
> >
> > C>T>
> >
> > On Sun, Nov 22, 2009 at 7:50 AM, Paul Elschot  >wrote:
> >
> >
> >> Op zondag 22 november 2009 04:47:50 schreef Adriano Crestani:
> >>
> >>> Hi,
> >>>
> >>> I didn't test, but you might want to try SpanNearQuery and set slop to
> >>>
> >> zero.
> >>
> >>> Give it a try and let me know if it worked.
> >>>
> >> The slop is the number of positions "in between", so zero would still be
> >> too
> >> much to only match at the same position.
> >>
> >> SpanNearQuery may or may not work for a slop of -1, but one could try
> >> that for both the ordered and unordered cases.
> >> One way to do that is to start from the existing test cases.
> >>
> >> Regards,
> >> Paul Elschot
> >>
> >>
> >>> Regards,
> >>> Adriano Crestani
> >>>
> >>> On Thu, Nov 19, 2009 at 7:28 PM, Christopher Tignor <
> >>>
> >> ctig...@thinkmap.com>wrote:
> >>
> >>>> Hello,
> >>>>
> >>>> I would like to search for all documents that contain both "plan" and
> >>>>
> >> "_v"
> >>
> >>>> (my part of speech token for verb) at the same position.
> >>>> I have tokenized the documents accordingly so these tokens exists at
> >>>>
> >> the
> >>
> >>>> same location.
> >>>>
> >>>> I can achieve programaticaly using PhraseQueries by adding the Terms
> >>>> explicitly at the same position but I need to be able to recover the
> >>>> Payload
> >>>> data for each
> >>>> term found within the matched instance of my query.
> >>>>
> >>>> Unfortunately the PayloadSpanUtil doesn't seem to return the same
> >>>>
> >> results
> >>
> >>>> as
> >>>> the PhraseQuery, possibly becuase it is converting it inoto Spans
> first
> >>>> which do not support searching for Terms at the same document
> position?
> >>>>
> >>>> Any help appreciated.
> >>>>
> >>>> thanks,
> >>>>
> >>>> C>T>
> >>>>
> >>>> --
> >>>> TH!NKMAP
> >>>>
> >>>> Christopher Tignor | Senior Software Architect
> >>>> 155 Spring Street NY, NY 10012
> >>>> p.212-285-8600 x385 f.212-285-8999
> >>>>
> >>>>
> >> -
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >>
> >>
> >
> >
> >
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-- 
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999


Re: SpanQuery for Terms at same position

2009-11-23 Thread Christopher Tignor
Also, I noticed that with the above edit to NearSpansOrdered I am getting
erroneous results for normal ordered searches, using searches like:

"_n" followed by "work"

where, because "_n" and "work" are at the same position, the code changes
accept their pairing as a valid in-order result now that the equal-to clause
has been added to the inequality.
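This breakage is what one would expect once a strict in-order test is relaxed to admit equality. A minimal sketch of the two variants (illustrative only, not the actual NearSpansOrdered code):

```java
public class OrderedCheck {
    // Strict in-order test between consecutive spans: A must come before B,
    // with ties on start broken by end position.
    static boolean strictlyOrdered(int startA, int endA, int startB, int endB) {
        return startA < startB || (startA == startB && endA < endB);
    }

    // Relaxed test (the equal-to edit discussed above): spans at the very
    // same position now count as "ordered".
    static boolean relaxedOrdered(int startA, int endA, int startB, int endB) {
        return startA <= startB;
    }

    public static void main(String[] args) {
        // "_n" and "work" both at position 8:
        System.out.println(strictlyOrdered(8, 9, 8, 9)); // false - correctly rejected
        System.out.println(relaxedOrdered(8, 9, 8, 9));  // true - the false positive
        // "_n" at 7 genuinely followed by "work" at 8:
        System.out.println(strictlyOrdered(7, 8, 8, 9)); // true
    }
}
```

In other words, one inequality cannot serve both semantics: either same-position pairs are ordered (and "followed by" over-matches) or they are not (and same-position queries need the unordered path).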

C>T>

On Mon, Nov 23, 2009 at 12:26 PM, Christopher Tignor
wrote:

> Thanks so much for this.
>
> Using an un-ordered query, the -1 slop indeed returns the correct results,
> matching tokens at the same position.
>
> I tried the same query but ordered both after and before rebuilding the
> source with Paul's changes to NearSpansOrdered but the query was still
> failing, returning no results.
>
> C>T>
>
>
> On Mon, Nov 23, 2009 at 11:59 AM, Mark Miller wrote:
>
>> Your trying -1 with ordered right? Try it with non ordered.
>>
>> Christopher Tignor wrote:
>> > A slop of -1 doesn't work either.  I get no results returned.
>> >
>> > this would be a *really* helpful feature for me if someone might suggest
>> an
>> > implementation as I would really like to be able to do arbitrary span
>> > searches where tokens may be at the same position and also in other
>> > positions where the ordering of subsequent terms may be restricted as
>> per
>> > the normal span API.
>> >
>> > thanks,
>> >
>> > C>T>
>> >
>> > On Sun, Nov 22, 2009 at 7:50 AM, Paul Elschot > >wrote:
>> >
>> >
>> >> Op zondag 22 november 2009 04:47:50 schreef Adriano Crestani:
>> >>
>> >>> Hi,
>> >>>
>> >>> I didn't test, but you might want to try SpanNearQuery and set slop to
>> >>>
>> >> zero.
>> >>
>> >>> Give it a try and let me know if it worked.
>> >>>
>> >> The slop is the number of positions "in between", so zero would still
>> be
>> >> too
>> >> much to only match at the same position.
>> >>
>> >> SpanNearQuery may or may not work for a slop of -1, but one could try
>> >> that for both the ordered and unordered cases.
>> >> One way to do that is to start from the existing test cases.
>> >>
>> >> Regards,
>> >> Paul Elschot
>> >>
>> >>
>> >>> Regards,
>> >>> Adriano Crestani
>> >>>
>> >>> On Thu, Nov 19, 2009 at 7:28 PM, Christopher Tignor <
>> >>>
>> >> ctig...@thinkmap.com>wrote:
>> >>
>> >>>> Hello,
>> >>>>
>> >>>> I would like to search for all documents that contain both "plan" and
>> >>>>
>> >> "_v"
>> >>
>> >>>> (my part of speech token for verb) at the same position.
>> >>>> I have tokenized the documents accordingly so these tokens exists at
>> >>>>
>> >> the
>> >>
>> >>>> same location.
>> >>>>
>> >>>> I can achieve programaticaly using PhraseQueries by adding the Terms
>> >>>> explicitly at the same position but I need to be able to recover the
>> >>>> Payload
>> >>>> data for each
>> >>>> term found within the matched instance of my query.
>> >>>>
>> >>>> Unfortunately the PayloadSpanUtil doesn't seem to return the same
>> >>>>
>> >> results
>> >>
>> >>>> as
>> >>>> the PhraseQuery, possibly becuase it is converting it inoto Spans
>> first
>> >>>> which do not support searching for Terms at the same document
>> position?
>> >>>>
>> >>>> Any help appreciated.
>> >>>>
>> >>>> thanks,
>> >>>>
>> >>>> C>T>
>> >>>>
>> >>>> --
>> >>>> TH!NKMAP
>> >>>>
>> >>>> Christopher Tignor | Senior Software Architect
>> >>>> 155 Spring Street NY, NY 10012
>> >>>> p.212-285-8600 x385 f.212-285-8999
>> >>>>
>> >>>>
>> >> -
>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >>
>> >>
>> >>
>> >
>> >
>> >
>>
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
>
> --
> TH!NKMAP
>
> Christopher Tignor | Senior Software Architect
> 155 Spring Street NY, NY 10012
> p.212-285-8600 x385 f.212-285-8999
>



-- 
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999


Re: SpanQuery for Terms at same position

2009-11-24 Thread Christopher Tignor
yes that indeed works for me.

thanks,

C>T>

On Mon, Nov 23, 2009 at 5:50 PM, Paul Elschot wrote:

> Op maandag 23 november 2009 20:07:58 schreef Christopher Tignor:
> > Also, I noticed that with the above edit to NearSpansOrdered I am getting
> > erroneous results fo normal ordered searches using searches like:
> >
> > "_n" followed by "work"
> >
> > where because "_n" and "work" are at the same position the code changes
> > accept their pairing as a valid in-order result now that the eqaul to
> clause
> > has been added to the inequality.
>
> Thanks for trying this. Indeed the "followed by" semantics is broken for
> the ordered case when spans at the same positions are considered
> ordered.
>
> Did I understand correctly that the unordered case with a slop of -1
> and without the edit works to match terms at the same position?
> In that case it may be worthwhile to add that to the javadocs,
> and also add a few testcases.
>
> Regards,
> Paul Elschot
>
> >
> > C>T>
> >
> > On Mon, Nov 23, 2009 at 12:26 PM, Christopher Tignor
> > wrote:
> >
> > > Thanks so much for this.
> > >
> > > Using an un-ordered query, the -1 slop indeed returns the correct
> results,
> > > matching tokens at the same position.
> > >
> > > I tried the same query but ordered both after and before rebuilding the
> > > source with Paul's changes to NearSpansOrdered but the query was still
> > > failing, returning no results.
> > >
> > > C>T>
> > >
> > >
> > > On Mon, Nov 23, 2009 at 11:59 AM, Mark Miller  >wrote:
> > >
> > >> Your trying -1 with ordered right? Try it with non ordered.
> > >>
> > >> Christopher Tignor wrote:
> > >> > A slop of -1 doesn't work either.  I get no results returned.
> > >> >
> > >> > this would be a *really* helpful feature for me if someone might
> suggest
> > >> an
> > >> > implementation as I would really like to be able to do arbitrary
> span
> > >> > searches where tokens may be at the same position and also in other
> > >> > positions where the ordering of subsequent terms may be restricted
> as
> > >> per
> > >> > the normal span API.
> > >> >
> > >> > thanks,
> > >> >
> > >> > C>T>
> > >> >
> > >> > On Sun, Nov 22, 2009 at 7:50 AM, Paul Elschot <
> paul.elsc...@xs4all.nl
> > >> >wrote:
> > >> >
> > >> >
> > >> >> Op zondag 22 november 2009 04:47:50 schreef Adriano Crestani:
> > >> >>
> > >> >>> Hi,
> > >> >>>
> > >> >>> I didn't test, but you might want to try SpanNearQuery and set
> slop to
> > >> >>>
> > >> >> zero.
> > >> >>
> > >> >>> Give it a try and let me know if it worked.
> > >> >>>
> > >> >> The slop is the number of positions "in between", so zero would
> still
> > >> be
> > >> >> too
> > >> >> much to only match at the same position.
> > >> >>
> > >> >> SpanNearQuery may or may not work for a slop of -1, but one could
> try
> > >> >> that for both the ordered and unordered cases.
> > >> >> One way to do that is to start from the existing test cases.
> > >> >>
> > >> >> Regards,
> > >> >> Paul Elschot
> > >> >>
> > >> >>
> > >> >>> Regards,
> > >> >>> Adriano Crestani
> > >> >>>
> > >> >>> On Thu, Nov 19, 2009 at 7:28 PM, Christopher Tignor <
> > >> >>>
> > >> >> ctig...@thinkmap.com>wrote:
> > >> >>
> > >> >>>> Hello,
> > >> >>>>
> > >> >>>> I would like to search for all documents that contain both "plan"
> and
> > >> >>>>
> > >> >> "_v"
> > >> >>
> > >> >>>> (my part of speech token for verb) at the same position.
> > >> >>>> I have tokenized the documents accordingly so these tokens exists
> at
> > >> >>>>
> > >

customized SpanQuery Payload usage

2009-11-24 Thread Christopher Tignor
Hello,

For certain span queries I construct programmatically by piecing together my
own SpanTermQueries, I would like to enforce that Payload data is not
returned for matches on the specific terms used by the constituent
SpanTermQueries.

For example, if I search for a position match with a SpanQuery referencing
the tokens "_n" and "work", and there is Payload data for each (there needs
to be for other types of queries), I would like to be able to screen out the
payload data originating from any matched "_n" tokens.

I thought that for the tokens I am not interested in receiving payload data
from, I might simply create (anonymously) my own subclass of SpanTermQuery
which overrides getSpans and returns a custom class that extends TermSpans
but overrides isPayloadAvailable to return false:

new SpanTermQuery(new Term(myField, myTokenString)) {
    @Override
    public Spans getSpans(IndexReader reader) throws IOException {
        return new TermSpans(reader.termPositions(term), term) {
            @Override
            public boolean isPayloadAvailable() {
                return false;
            }
        };
    }
}

This, however, seems to be eliminating payload data for all matches, though
I'm not sure why, and am tracing through the code, looking at NearSpansUnordered.

Any thoughts?

thanks so much,

C>T>


-- 
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999


Re: NearSpansUnordered payloads

2009-11-24 Thread Christopher Tignor
I am also having a hard time understanding the NearSpansUnordered
isPayloadAvailable() method.

For my test case where 2 tokens are at the same position, the code below
seems to be failing to traverse the 2 SpansCells.  The first SpansCell it
retrieves has its next field set to null, so it cannot find the second one.
Is this normal behavior?

  // TODO: Remove warning after API has been finalized
  public boolean isPayloadAvailable() {
    SpansCell pointer = min();
    while (pointer != null) {
      if (pointer.isPayloadAvailable()) {
        return true;
      }
      pointer = pointer.next;
    }
    return false;
  }

When the linked list of SpansCells is first created they are linked together
normally, but their order is reversed when they are added to the queue in
toQueue(), such that the last SpansCell, with its next field set to null, is
retrieved first.

C>T>

On Fri, Nov 20, 2009 at 6:49 PM, Jason Rutherglen <
jason.rutherg...@gmail.com> wrote:

> I'm interested in getting the payload information from the
> matching span, however it's unclear from the javadocs why
> NearSpansUnordered is different than NearSpansOrdered in this
> regard.
>
> NearSpansUnordered returns payloads in a hash set that's
> computed each method call by iterating over the SpanCell as a
> linked list, whereas NearSpansOrdered stores the payloads in a
> list (which is ordered) only when collectPayloads is true.
>
> At first glance I'm not sure how to correlate the payload with
> the span match using NSU, nor why they're different.
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-- 
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999


Re: customized SpanQuery Payload usage

2009-11-25 Thread Christopher Tignor
The problem is that I need to be able to match spans resulting from a
SpanNearQuery with the Term they came from, so I can eliminate using Payloads
from certain Terms on a query-by-query basis.

I still need this term to affect the results of a SpanNearQuery as per the
usual logic; I just need to know, when iterating over the resulting Spans,
not to load the payload data of one originating from a certain Term.

I recently solved the problem fairly simply after doing much research into
the source.  When I am building the query and encounter a term I don't want
to recover payload data from, I add my own anonymous sub-type of
SpanTermQuery to my developing SpanNearQuery, which itself creates an
anonymous sub-type of TermSpans that simply returns an empty Collection for
its payload data:

new SpanTermQuery(new Term(QueryVocabTracker.CONTENT_FIELD, tagToken)) {
    @Override
    public Spans getSpans(IndexReader reader) throws IOException {
        return new TermSpans(reader.termPositions(term), term) {
            @Override
            public Collection getPayload() throws IOException {
                // no payload data for this TermSpans
                return Collections.emptyList();
            }
        };
    }
}

thanks,

C>T>



On Wed, Nov 25, 2009 at 8:10 AM, Grant Ingersoll wrote:

>
> On Nov 24, 2009, at 9:56 AM, Christopher Tignor wrote:
>
> > Hello,
> >
> > For certain span queries I construct problematically by piecing together
> my
> > own SpanTermQueries I would like to enforce that Payload data is not
> > returned for matches on those specific terms used by the constituent
> > SapnTermQueries.
>
> I'm not sure I follow.  For those terms you don't want payloads, why can't
> you just avoid getting payloads?  Span queries themselves do not require
> payloads for execution.  Can you share your code for iterating over the
> spans?
>
> >
> > For exmaple if I search for a position match with a SpanQuery referencing
> > the tokens "_n" and "work" and there is Payload data for each (there
> needs
> > to be for other types of queries) I would like to be able to screen out
> the
> > payload data originating from any matched "_n" tokens.
> >
> > I thought for the tokens I am not interested in receiving payload data
> from
> > I might simply create (anonymously) my own subclass of SpanTermQuery
> which
> > overrides getSpans and returns another custom class which extends
> TermSpans
> > but there simply overrides isPayloadAvailable to return false:
> >
> > new SpanTermQuery(new Term(myField, myTokenString)) {
> >     @Override
> >     public Spans getSpans(IndexReader reader) throws IOException {
> >         return new TermSpans(reader.termPositions(term), term) {
> >             @Override
> >             public boolean isPayloadAvailable() {
> >                 return false;
> >             }
> >         };
> >     }
> > });
> >
> > This however seems to eliminate payload data for all matches though I'm
> > not sure why and am tracing through the code, looking at
> NearSpansUnordered.
> >
> > Any thoughts?
> >
> > thanks so much,
> >
> > C>T>
> >
> >
> > --
> > TH!NKMAP
> >
> > Christopher Tignor | Senior Software Architect
> > 155 Spring Street NY, NY 10012
> > p.212-285-8600 x385 f.212-285-8999
>
> ----------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-- 
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999


Re: SpanQuery for Terms at same position

2009-11-25 Thread Christopher Tignor
It's worth noting however that this -1 slop doesn't seem to work for cases
where you want to discover instances of more than two terms at the same
position.  It would be nice to be able to explicitly set this in the query
construction.

thanks,

C>T>
On Tue, Nov 24, 2009 at 9:17 AM, Christopher Tignor wrote:

> yes that indeed works for me.
>
> thanks,
>
> C>T>
>
>
> On Mon, Nov 23, 2009 at 5:50 PM, Paul Elschot wrote:
>
>> Op maandag 23 november 2009 20:07:58 schreef Christopher Tignor:
>> > Also, I noticed that with the above edit to NearSpansOrdered I am
>> getting
>> > erroneous results for normal ordered searches using searches like:
>> >
>> > "_n" followed by "work"
>> >
>> > where because "_n" and "work" are at the same position the code changes
>> > accept their pairing as a valid in-order result now that the equal to
>> clause
>> > has been added to the inequality.
>>
>> Thanks for trying this. Indeed the "followed by" semantics is broken for
>> the ordered case when spans at the same positions are considered
>> ordered.
>>
>> Did I understand correctly that the unordered case with a slop of -1
>> and without the edit works to match terms at the same position?
>> In that case it may be worthwhile to add that to the javadocs,
>> and also add a few testcases.
>>
>> Regards,
>> Paul Elschot
>>
>> >
>> > C>T>
>> >
>> > On Mon, Nov 23, 2009 at 12:26 PM, Christopher Tignor
>> > wrote:
>> >
>> > > Thanks so much for this.
>> > >
>> > > Using an un-ordered query, the -1 slop indeed returns the correct
>> results,
>> > > matching tokens at the same position.
>> > >
>> > > I tried the same query but ordered both after and before rebuilding
>> the
>> > > source with Paul's changes to NearSpansOrdered but the query was still
>> > > failing, returning no results.
>> > >
>> > > C>T>
>> > >
>> > >
>> > > On Mon, Nov 23, 2009 at 11:59 AM, Mark Miller > >wrote:
>> > >
>> > >> You're trying -1 with ordered, right? Try it with non-ordered.
>> > >>
>> > >> Christopher Tignor wrote:
>> > >> > A slop of -1 doesn't work either.  I get no results returned.
>> > >> >
>> > >> > this would be a *really* helpful feature for me if someone might
>> suggest
>> > >> an
>> > >> > implementation as I would really like to be able to do arbitrary
>> span
>> > >> > searches where tokens may be at the same position and also in other
>> > >> > positions where the ordering of subsequent terms may be restricted
>> as
>> > >> per
>> > >> > the normal span API.
>> > >> >
>> > >> > thanks,
>> > >> >
>> > >> > C>T>
>> > >> >
>> > >> > On Sun, Nov 22, 2009 at 7:50 AM, Paul Elschot <
>> paul.elsc...@xs4all.nl
>> > >> >wrote:
>> > >> >
>> > >> >
>> > >> >> Op zondag 22 november 2009 04:47:50 schreef Adriano Crestani:
>> > >> >>
>> > >> >>> Hi,
>> > >> >>>
>> > >> >>> I didn't test, but you might want to try SpanNearQuery and set
>> slop to
>> > >> >>>
>> > >> >> zero.
>> > >> >>
>> > >> >>> Give it a try and let me know if it worked.
>> > >> >>>
>> > >> >> The slop is the number of positions "in between", so zero would
>> still
>> > >> be
>> > >> >> too
>> > >> >> much to only match at the same position.
>> > >> >>
>> > >> >> SpanNearQuery may or may not work for a slop of -1, but one could
>> try
>> > >> >> that for both the ordered and unordered cases.
>> > >> >> One way to do that is to start from the existing test cases.
>> > >> >>
>> > >> >> Regards,
>> > >> >> Paul Elschot
>> > >> >>
>> > >> >>
>> > >> >>> Regards,
>> > >> >>> Adri

Re: SpanQuery for Terms at same position

2009-11-25 Thread Christopher Tignor
my own tests with my own data show you are correct and the 1-n slop works
for matching terms at the same ordinal position.
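The 1-n slop rule can be sanity-checked with a trivial sketch (`slopFor` is a hypothetical helper, not a Lucene API): an unordered SpanNearQuery's slop counts positions "in between", and n terms stacked on a single position span a total width of 1, so the slop needed is 1 - n:

```java
public class SamePositionSlop {
    // Slop for an unordered SpanNearQuery to match n terms at the same
    // position: the match spans 1 position in total while the n terms
    // nominally occupy n, so the positions "in between" come to 1 - n.
    static int slopFor(int nTerms) {
        return 1 - nTerms;
    }

    public static void main(String[] args) {
        System.out.println(slopFor(2)); // -1, the two-term case in this thread
        System.out.println(slopFor(3)); // -2, for three terms at one position
    }
}
```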

thanks!

C>T>

On Wed, Nov 25, 2009 at 4:25 PM, Paul Elschot wrote:

> Op woensdag 25 november 2009 21:20:33 schreef Christopher Tignor:
> > It's worth noting however that this -1 slop doesn't seem to work for
> cases
> > where you want to discover instances of more than two terms at the same
> > position.  Would be nice to be able to explicitly set this in the query
> > construction.
>
> I think requiring n terms at the same position would need a slop of 1-n,
> and I'd like to have some test cases added for that.
> Now if I only had some time...
>
> Regards,
> Paul Elschot
>
> >
> > thanks,
> >
> > C>T>
> > On Tue, Nov 24, 2009 at 9:17 AM, Christopher Tignor <
> ctig...@thinkmap.com>wrote:
> >
> > > yes that indeed works for me.
> > >
> > > thanks,
> > >
> > > C>T>
> > >
> > >
> > > On Mon, Nov 23, 2009 at 5:50 PM, Paul Elschot  >wrote:
> > >
> > >> Op maandag 23 november 2009 20:07:58 schreef Christopher Tignor:
> > >> > Also, I noticed that with the above edit to NearSpansOrdered I am
> > >> getting
> > >> > erroneous results for normal ordered searches using searches like:
> > >> >
> > >> > "_n" followed by "work"
> > >> >
> > >> > where because "_n" and "work" are at the same position the code
> changes
> > >> > accept their pairing as a valid in-order result now that the equal
> to
> > >> clause
> > >> > has been added to the inequality.
> > >>
> > >> Thanks for trying this. Indeed the "followed by" semantics is broken
> for
> > >> the ordered case when spans at the same positions are considered
> > >> ordered.
> > >>
> > >> Did I understand correctly that the unordered case with a slop of -1
> > >> and without the edit works to match terms at the same position?
> > >> In that case it may be worthwhile to add that to the javadocs,
> > >> and also add a few testcases.
> > >>
> > >> Regards,
> > >> Paul Elschot
> > >>
> > >> >
> > >> > C>T>
> > >> >
> > >> > On Mon, Nov 23, 2009 at 12:26 PM, Christopher Tignor
> > >> > wrote:
> > >> >
> > >> > > Thanks so much for this.
> > >> > >
> > >> > > Using an un-ordered query, the -1 slop indeed returns the correct
> > >> results,
> > >> > > matching tokens at the same position.
> > >> > >
> > >> > > I tried the same query but ordered both after and before
> rebuilding
> > >> the
> > >> > > source with Paul's changes to NearSpansOrdered but the query was
> still
> > >> > > failing, returning no results.
> > >> > >
> > >> > > C>T>
> > >> > >
> > >> > >
> > >> > > On Mon, Nov 23, 2009 at 11:59 AM, Mark Miller <
> markrmil...@gmail.com
> > >> >wrote:
> > >> > >
> > >> > >> You're trying -1 with ordered, right? Try it with non-ordered.
> > >> > >>
> > >> > >> Christopher Tignor wrote:
> > >> > >> > A slop of -1 doesn't work either.  I get no results returned.
> > >> > >> >
> > >> > >> > this would be a *really* helpful feature for me if someone
> might
> > >> suggest
> > >> > >> an
> > >> > >> > implementation as I would really like to be able to do
> arbitrary
> > >> span
> > >> > >> > searches where tokens may be at the same position and also in
> other
> > >> > >> > positions where the ordering of subsequent terms may be
> restricted
> > >> as
> > >> > >> per
> > >> > >> > the normal span API.
> > >> > >> >
> > >> > >> > thanks,
> > >> > >> >
> > >> > >> > C>T>
> > >> > >> >
> > >> > >> > On Sun, Nov 22, 2009 at 7:50 AM, Paul Elschot <
> > >> paul.elsc...@xs4all.nl
> > &

Re: SpanQuery for Terms at same position

2009-11-30 Thread Christopher Tignor
It would take a bit of work / learning (I haven't used a RAMDirectory
yet) to make them into test cases usable by others, and I am deep into this
project and under the gun right now.  But if some time surfaces I will for
sure...

thanks -

C>T>

On Wed, Nov 25, 2009 at 7:49 PM, Erick Erickson wrote:

> Hmmm, are they unit tests? Or would you be wiling to create stand-alone
> unit tests demonstrating this and submit it as a patch?
>
> Best
> er...@alwaystrollingforworkfromothers.opportunistic.
>
> On Wed, Nov 25, 2009 at 5:38 PM, Christopher Tignor  >wrote:
>
> > my own tests with my own data show you are correct and the 1-n slop works
> > for matching terms at the same ordinal position.
> >
> > thanks!
> >
> > C>T>
> >
> > On Wed, Nov 25, 2009 at 4:25 PM, Paul Elschot  > >wrote:
> >
> > > Op woensdag 25 november 2009 21:20:33 schreef Christopher Tignor:
> > > > It's worth noting however that this -1 slop doesn't seem to work for
> > > cases
> > > > where you want to discover instances of more than two terms at the
> same
> > > > position.  Would be nice to be able to explicitly set this in the
> query
> > > > construction.
> > >
> > > I think requiring n terms at the same position would need a slop of
> 1-n,
> > > and I'd like to have some test cases added for that.
> > > Now if I only had some time...
> > >
> > > Regards,
> > > Paul Elschot
> > >
> > > >
> > > > thanks,
> > > >
> > > > C>T>
> > > > On Tue, Nov 24, 2009 at 9:17 AM, Christopher Tignor <
> > > ctig...@thinkmap.com>wrote:
> > > >
> > > > > yes that indeed works for me.
> > > > >
> > > > > thanks,
> > > > >
> > > > > C>T>
> > > > >
> > > > >
> > > > > On Mon, Nov 23, 2009 at 5:50 PM, Paul Elschot <
> > paul.elsc...@xs4all.nl
> > > >wrote:
> > > > >
> > > > >> Op maandag 23 november 2009 20:07:58 schreef Christopher Tignor:
> > > > >> > Also, I noticed that with the above edit to NearSpansOrdered I
> am
> > > > >> getting
> > > > >> > erroneous results for normal ordered searches using searches
> like:
> > > > >> >
> > > > >> > "_n" followed by "work"
> > > > >> >
> > > > >> > where because "_n" and "work" are at the same position the code
> > > changes
> > > > >> > accept their pairing as a valid in-order result now that the
> equal
> > > to
> > > > >> clause
> > > > >> > has been added to the inequality.
> > > > >>
> > > > >> Thanks for trying this. Indeed the "followed by" semantics is
> broken
> > > for
> > > > >> the ordered case when spans at the same positions are considered
> > > > >> ordered.
> > > > >>
> > > > >> Did I understand correctly that the unordered case with a slop of
> -1
> > > > >> and without the edit works to match terms at the same position?
> > > > >> In that case it may be worthwhile to add that to the javadocs,
> > > > >> and also add a few testcases.
> > > > >>
> > > > >> Regards,
> > > > >> Paul Elschot
> > > > >>
> > > > >> >
> > > > >> > C>T>
> > > > >> >
> > > > >> > On Mon, Nov 23, 2009 at 12:26 PM, Christopher Tignor
> > > > >> > wrote:
> > > > >> >
> > > > >> > > Thanks so much for this.
> > > > >> > >
> > > > >> > > Using an un-ordered query, the -1 slop indeed returns the
> > correct
> > > > >> results,
> > > > >> > > matching tokens at the same position.
> > > > >> > >
> > > > >> > > I tried the same query but ordered both after and before
> > > rebuilding
> > > > >> the
> > > > >> > > source with Paul's changes to NearSpansOrdered but the query
> was
> > > still
> > > > >> > > failing, returning no results.
> > > > >> > >
> > > > >&g

minimum range for SpanQueries

2009-12-21 Thread Christopher Tignor
Is there a way to implement a minimum range for a SpanQuery or combination
thereof?

For example, using:

"The boy said hello to the boy"

I'd like to use a SpanNearQuery consisting of the terms "The" and "boy" that
returns one span including the entire sentence but not a span for the first
two words.
Thus, I'd like to specify a minimum range of at least 1 and a maximum of,
say, 5 here.

I note that using a SpanNotQuery consisting of two SpanNearQueries with the
same terms and these above ranges does not work, as the desired longer
SpanResult will include the shorter one and get weeded out.
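Lacking a minimum-slop parameter in the span API, one workaround is to iterate the Spans yourself and post-filter matches by width. A minimal sketch (plain Java; `Match` is a hypothetical stand-in for the positions Lucene reports via Spans.start() and Spans.end() while iterating):

```java
import java.util.ArrayList;
import java.util.List;

public class SpanWidthFilter {
    // A span match reduced to its positions; in Lucene these would come
    // from Spans.start() and Spans.end() while iterating with Spans.next().
    static final class Match {
        final int start;
        final int end; // exclusive, as in Lucene's Spans
        Match(int start, int end) { this.start = start; this.end = end; }
    }

    // Keep only matches whose width (end - start) lies in [minWidth, maxWidth].
    static List<Match> filterByWidth(List<Match> matches, int minWidth, int maxWidth) {
        List<Match> kept = new ArrayList<>();
        for (Match m : matches) {
            int width = m.end - m.start;
            if (width >= minWidth && width <= maxWidth) {
                kept.add(m);
            }
        }
        return kept;
    }
}
```

For "The boy said hello to the boy", a near query on "The" and "boy" would yield spans (0, 2) and (0, 7); requiring a width of at least 3 (two terms plus at least one position between them) keeps only the whole-sentence match.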

thanks -

C>T>

-- 
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999


Re: recovering payload from fields

2010-02-26 Thread Christopher Tignor
Hello,

To my knowledge, the character position of the tokens is not preserved by
Lucene - only the ordinal position of tokens within a document / field is
preserved.  Thus you need to store this character offset information
separately, say, as Payload data.
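If you do store character offsets as payload data, you need a byte encoding for them. A minimal sketch, assuming a 4-byte big-endian int (an arbitrary choice for illustration, not a Lucene convention):

```java
public class OffsetPayload {
    // Encode a character offset as a 4-byte big-endian payload.
    static byte[] encode(int offset) {
        return new byte[] {
            (byte) (offset >>> 24), (byte) (offset >>> 16),
            (byte) (offset >>> 8), (byte) offset
        };
    }

    // Decode the payload bytes back into the character offset.
    static int decode(byte[] payload) {
        return ((payload[0] & 0xFF) << 24) | ((payload[1] & 0xFF) << 16)
             | ((payload[2] & 0xFF) << 8) | (payload[3] & 0xFF);
    }
}
```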

best,

C>T>

On Fri, Feb 26, 2010 at 3:41 PM, Christopher Condit  wrote:

> I'm trying to store semantic information in payloads at index time. I
> believe this part is successful - but I'm having trouble getting access to
> the payload locations after the index is created. I'd like to know the
> offset in the original text for the token with the payload - and get this
> information for all payloads that are set in a Field even if they don't
> relate to the query. I tried (from the highlighting filter):
> TokenStream tokens = TokenSources.getTokenStream(reader, 0, "body");
> while (tokens.incrementToken()) {
>   TermAttribute term = tokens.getAttribute(TermAttribute.class);
>   if (tokens.hasAttribute(PayloadAttribute.class)) {
>     PayloadAttribute payload =
>       tokens.getAttribute(PayloadAttribute.class);
>     OffsetAttribute offset = tokens.getAttribute(OffsetAttribute.class);
>   }
> }
> But the OffsetAttribute never seems to contain any information.
> In my token filter do I need to do more than:
> offsetAtt = addAttribute(OffsetAttribute.class);
> during construction in order to store Offset information?
>
> Thanks,
> -Chris
>
>
>
>


-- 
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999


Re: recovering payload from fields

2010-03-05 Thread Christopher Tignor
What I'd ideally like to do is to take a SpanQuery, loop over the PayloadSpans
returned from SpanQuery.getPayloadSpans(), and store all PayloadSpans for a
given document in a Map by their doc id.

Then later, after deciding in memory which documents I need, load the Payload
data for just those PayloadSpans pulled out of my Map.

But it seems I can't do this, as loading Payload data is only done through
the PayloadSpans iterator, so I must iterate through the entire collection to
get to my PayloadSpan.  Is there not a way to just save a PayloadSpan and
load its payload data later as needed?
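Absent such an API, one workaround is to materialize the payload bytes into the map during the single pass, rather than storing the PayloadSpans themselves. A sketch (plain Java; the doc ids and payload arrays are hypothetical stand-ins for what PayloadSpans yields while iterating):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PayloadCache {
    // In Lucene the loop would be driven by PayloadSpans:
    //   while (spans.next()) {
    //       if (spans.isPayloadAvailable()) { ... spans.getPayload() ... }
    //   }
    // Here each hit is simulated as a doc id plus its payload bytes.
    static Map<Integer, List<byte[]>> collect(int[] docIds, byte[][] payloads) {
        Map<Integer, List<byte[]>> byDoc = new HashMap<>();
        for (int i = 0; i < docIds.length; i++) {
            // Copy eagerly: payload buffers are only valid during iteration.
            byDoc.computeIfAbsent(docIds[i], d -> new ArrayList<>())
                 .add(payloads[i].clone());
        }
        return byDoc;
    }
}
```

The map can then be consulted after the search without revisiting the iterator, at the cost of holding all payload bytes in memory.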

thanks,

C>T>

On Sat, Feb 27, 2010 at 5:42 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> You can also access payloads through the TermPositions enum, but, this
> is by term and then by doc.
>
> It sounds like you need to iterate through all terms sequentially in a
> given field in the doc, accessing offset & payload?  In which case
> reanalyzing at search time may be the best way to go.
>
> You can store term vectors in the index, which will store offsets (if
> you ask it to), but, payloads are not currently stored with term
> vectors.
>
> Mike
>
> On Fri, Feb 26, 2010 at 7:42 PM, Christopher Condit 
> wrote:
> >> Payload Data is accessed through PayloadSpans so using SpanQueries is
> the
> >> entry point it seems.  There are tools like PayloadSpanUtil that convert
> other
> >> queries into SpanQueries for this purpose if needed but the api for
> Payloads
> >> looks it like it goes through Spans is the bottom line.
> >
> > So there's no way to iterate through all the payloads for a given field?
> I can't use the SpanQuery mechanism because in this case the entire field
> will be displayed - and I can't search for "*". Is there some trick I'm not
> thinking of?
> >
> >> this is the tip of the iceberg; a big dangerous iceberg...
> >
> > Yes - I'm beginning to see that...
> >
> > Thanks,
> > -Chris
> >
> >
> >
>
>
>


-- 
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999


custom scoring help

2010-04-02 Thread Christopher Tignor
Hello,

I'm having a hard time implementing / understanding a very simple custom
scoring situation.

I have created my Similarity class for testing which overrides all the
relevant (I think) methods below, returning 1 for all but coord(int, int),
which returns 1 / maxOverlap so scores are scaled between 0 and 1.

I call writer.setSimilarity(new HashHitSimilarity()) when indexing
and searcher.setSimilarity(new HashHitSimilarity()) when searching.

The similarity is definitely affecting the scoring but not how I expect.  I
am looking for a straight average of the hits calculated, i.e.
totalHits for a doc / totalHits in search.

The above score with my test search and index of 6 docs should return the
scores below for all 6 documents in my index:

0.8387096774193549
0.3548387096774194
0.3548387096774194
0.25806451612903225
0.1935483870967742
0.12903225806451613

but the scores appear "stretched" and return these instead though I'm unsure
as to where this "stretching" happens:

0.9078212
0.75977653
0.57541895
0.5670391
0.5223464
0.37150836
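For reference, the expected values above are straight averages hits / maxOverlap with a denominator of 31; the per-document hit counts in the sketch below (26, 11, 11, 8, 6, 4) are inferred from the decimals, not stated anywhere in this post:

```java
public class StraightAverage {
    // Expected score under the custom Similarity: each matching clause
    // contributes 1, and coord() multiplies the sum by 1 / maxOverlap,
    // giving a straight average hits / maxOverlap.
    static double score(int hits, int maxOverlap) {
        return (double) hits / maxOverlap;
    }

    public static void main(String[] args) {
        int maxOverlap = 31;                 // inferred total clause count
        int[] hits = {26, 11, 11, 8, 6, 4};  // inferred per-document hit counts
        for (int h : hits) {
            System.out.println(score(h, maxOverlap));
        }
    }
}
```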

public class HashHitSimilarity extends Similarity {

    private static final long serialVersionUID = 811419737205284733L;

    @Override
    public float tf(float freq) {
        return 1f;
    }

    @Override
    public float lengthNorm(String fieldName, int numTokens) {
        return 1f;
    }

    @Override
    public float queryNorm(float sumOfSquaredWeights) {
        return 1f;
    }

    @Override
    public float coord(int overlap, int maxOverlap) {
        return 1f / (float) maxOverlap;
    }

    @Override
    public float idf(int docFreq, int numDocs) {
        return 1f;
    }

    @Override
    public float sloppyFreq(int distance) {
        return 0f;
    }
}




-- 
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999


Re: custom scoring help

2010-04-02 Thread Christopher Tignor
This code is in fact working.  I had an error in my test case.  Things seem
to work as advertised.

sorry / thanks -

C>T>

On Fri, Apr 2, 2010 at 10:20 AM, Christopher Tignor wrote:

> Hello,
>
> I'm having a hard time implementing / understanding a very simple custom
> scoring situation.
>
> I have created my Similarity class for testing which overrides all the
> relevant (I think) methods below, returning 1 for all but coord(int, int),
> which returns 1 / maxOverlap so scores are scaled between 0 and 1.
>
> I call writer.setSimilarity(new HashHitSimilarity()) when indexing
> and searcher.setSimilarity(new HashHitSimilarity()) when searching.
>
> The similarity is definitely affecting the scoring but not how I expect.  I
> am looking for a straight average of the hits calculated, i.e.
> totalHits for a doc / totalHits in search.
>
> The above score with my test search and index of 6 docs should return the
> scores below for all 6 documents in my index:
>
> 0.8387096774193549
> 0.3548387096774194
> 0.3548387096774194
> 0.25806451612903225
> 0.1935483870967742
> 0.12903225806451613
>
> but the scores appear "stretched" and return these instead though I'm
> unsure as to where this "stretching" happens:
>
> 0.9078212
> 0.75977653
> 0.57541895
> 0.5670391
> 0.5223464
> 0.37150836
>
> public class HashHitSimilarity extends Similarity {
>
> /**
>  *
>  */
> private static final long serialVersionUID = 811419737205284733L;
>
> public float tf(float freq) {
> return 1f;
> }
>
> public float lengthNorm(String fieldName, int numTokens) {
> return 1f;
> }
>
> public float queryNorm(float sumOfSquaredWeights) {
> return 1f;
> }
>
> @Override
> public float coord(int overlap, int maxOverlap) {
> return 1f / (float) maxOverlap;
> }
>
>     @Override
>     public float idf(int docFreq, int numDocs) {
> return 1f;
> }
>
> @Override
> public float sloppyFreq(int distance) {
> return 0f;
> }
>
> }
>
>
>
>
> --
> TH!NKMAP
>
> Christopher Tignor | Senior Software Architect
> 155 Spring Street NY, NY 10012
> p.212-285-8600 x385 f.212-285-8999
>



-- 
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999