Re: get term neighbours

Grant Ingersoll Thu, 07 May 2009 13:20:01 -0700


On May 7, 2009, at 9:11 AM, Adrian Dimulescu wrote:

Thank you for these clarifications. As I had to do something fast, I coded the thing as illustrated by the following pseudocode:
IndexReader index;
TermPositions iterator = this.index.termPositions(t);
// for each doc where this term appears
while (iterator.next()) {
    int docNr = iterator.doc();
    int freq = iterator.freq();
    // these are the positions of the current term in the current doc
    int[] apparitionPositions = new int[freq];
    for (int i = 0; i < freq; i++) {
        apparitionPositions[i] = iterator.nextPosition();
    }
    ...
    TermPositionVector tpv = (TermPositionVector) this.index.getTermFreqVector(docNr, "text");
    ...
    // for all possible terms, see if it is close to one of the elements in apparitionPositions
    for (int i = 0; i < terms.length; i++) {
        int[] pos = tpv.getTermPositions(i);
        ... // for each element in pos, check close distance to the current term
    }
}
My understanding is that this is a less object-oriented way of doing the same thing as what you proposed, but please correct me if I'm wrong.

This use case is in fact why I added the TermVectorMapper stuff into Lucene. In my case, I used SpanQuery to get me the position of the term within the doc, then I materialized the vector via a TermVectorMapper such that the TVM only stored the window around the span. It's not much different CPU-wise, but it does save on memory, I think.
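
For illustration only, here is a rough sketch of the "only store the window" idea in plain Java. It does not extend the real TermVectorMapper; WindowCollector and its simplified map(term, positions) callback are made-up names that only mimic the shape of a per-term callback, and the centre position is assumed to come from somewhere else (a SpanQuery, for instance).

```java
import java.util.TreeMap;

// Hypothetical stand-in for a mapper: it receives each term of a document
// together with that term's positions, and keeps only what falls inside
// [centre - range, centre + range]. Not a Lucene class.
class WindowCollector {
    private final int lo;
    private final int hi;
    // position -> term, kept sorted so the window can be read back in order
    private final TreeMap<Integer, String> window = new TreeMap<Integer, String>();

    WindowCollector(int centre, int range) {
        this.lo = centre - range;
        this.hi = centre + range;
    }

    // Called once per term; positions outside the window are dropped,
    // so memory use is bounded by the window size, not the document size.
    void map(String term, int[] positions) {
        for (int p : positions) {
            if (p >= lo && p <= hi) {
                window.put(p, term);
            }
        }
    }

    // Terms inside the window, joined in position order.
    String text() {
        StringBuilder sb = new StringBuilder();
        for (String t : window.values()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(t);
        }
        return sb.toString();
    }
}
```

Every term of the document is still visited, which is why the CPU cost stays about the same; only the retained state shrinks.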

I finally managed to retrieve what I wanted with this code. The problem is that it is not really parallelizable. If several threads call getTermFreqVector at the same time, they have to wait for each other. My multithreaded scenario involved a single IndexReader on which all threads ask for term vectors. I wonder if it is possible to avoid this problem (perhaps by having a pool of IndexReaders; is this a good practice, wouldn't there be memory problems?). I welcome any ideas on this subject.

Are you saying there are synchronizations happening? Even with multiple Readers, don't you end up with the disk access being a problem? Or are you all in memory?

Thank you,
Adrian.

Grant Ingersoll wrote:
There isn't a very clean way to do this just yet, but it is doable. Index with positions (you might find offsets useful too) and then use the TermVectorMapper and TermVector API call on the IndexReader (not the termPositions). Then, you will need to implement a TermVectorMapper that takes in your position and then reads in the term vector and gets just those positions around the position of interest. Once you are outside of your window, you can then short-circuit out of the TermVM (I think).
HTH,
Grant

On May 3, 2009, at 2:39 PM, Adrian Dimulescu wrote:
Hello,
I am post-processing a positional index -- with a field like the following:
doc.add(new Field(Constants.FIELD_TEXT, txt, Store.NO, Index.ANALYZED, TermVector.WITH_POSITIONS));
At post-processing, I want to retrieve the neighbours of a given term within a given range. That is, if document x contains the sequence:
"Alabama experienced significant /recovery as the economy of the state/ transitioned from agriculture to diversified interests in heavy manufacturing"
for range = 3 and term = "economy", I want to retrieve "recovery as the *economy* of the state".
I see there is an API call:

IndexReader.termPositions(term)
which retrieves the actual positions of the given term. Is there a quick way to retrieve its neighbours too, instead of browsing all terms for all documents and seeing if their position is close to the position of the central term?
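
To make the expected result concrete, here is a small sketch of the window arithmetic on the example sentence. It assumes plain whitespace tokenization as a stand-in for whatever analyzer built the index, and NeighbourWindowDemo and windows() are illustrative names, not Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NeighbourWindowDemo {

    // Returns one window per occurrence of `term`: up to `range` tokens on
    // each side, plus the term itself. Whitespace splitting is an assumption
    // that only approximates real analysis (no lowercasing, no stop words).
    static List<String> windows(String text, String term, int range) {
        String[] tokens = text.trim().split("\\s+");
        List<String> result = new ArrayList<String>();
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(term)) {
                continue;
            }
            int from = Math.max(0, i - range);                // clip at document start
            int to = Math.min(tokens.length, i + range + 1);  // exclusive, clip at end
            result.add(String.join(" ", Arrays.asList(tokens).subList(from, to)));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Alabama experienced significant recovery as the economy of the "
                + "state transitioned from agriculture to diversified interests in "
                + "heavy manufacturing";
        // prints [recovery as the economy of the state]
        System.out.println(windows(text, "economy", 3));
    }
}
```

With range = 3, "economy" sits at token position 6, so the window covers positions 3 through 9.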
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:

http://www.lucidimagination.com/search

