See my comments:
Yes, for this specific part, I have this prior knowledge, which is based on a training set.
About the things you raise here, there are two things you might mean, I am not sure which:
1. If you don't have that "prior" knowledge, then all it means is that you need to modify the scoring formula, no? To give more weight to the factors you think are more significant.
The TermFreqVector will have the term frequencies, and you will be able to edit the score formula so that it fits your needs.
Although I like the TermFreqVector approach, the API and everything, it is slower than using TermEnum and TermDocs. I don't have solid statistics, but I confirmed this during my development work on an 80+ core SG server. I wish the TermFreqVector approach could load a bit faster.
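For context, here is a minimal sketch of the two ways to read per-document term frequencies with the 2.x API (the field and term names are illustrative): the TermFreqVector route materialises one whole vector per document, while TermEnum/TermDocs streams the postings per term:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermFreqVector;

public class FreqAccessSketch
{
    //Option 1: per-document view via the stored term vector
    //(requires Field.TermVector.YES at indexing time).
    static void viaTermFreqVector(IndexReader reader, int docId) throws IOException
    {
        TermFreqVector tfv = reader.getTermFreqVector(docId, "text");
        if (tfv != null)
        {
            String[] terms = tfv.getTerms();
            int[] freqs = tfv.getTermFrequencies();
            for (int i = 0; i < terms.length; i++)
            {
                System.out.println(terms[i] + " -> " + freqs[i]);
            }
        }
    }

    //Option 2: per-term view via the postings (TermEnum would enumerate
    //the terms themselves); usually faster, and needs no stored vectors.
    static void viaTermDocs(IndexReader reader, String word) throws IOException
    {
        TermDocs td = reader.termDocs(new Term("text", word));
        while (td.next())
        {
            System.out.println("doc " + td.doc() + " -> " + td.freq());
        }
        td.close();
    }
}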
Or
2. Enable us to add statistical factors while indexing.
This is what I am talking about: *indexing time*.
Question - why do you want to edit these while indexing rather than using them at "search" time in the way you desire? At this stage you already have all the statistics - the frequencies of all terms within and outside documents.
First of all, you don't have all the statistics, because there are some statistics that you simply can't calculate (or, more correctly, shouldn't, for performance reasons) during query scoring. It would be overkill.
As for my solution:
I tried to add documents to the index, and for every document I have a different map of factors for every term.
However, I get an exception: Exception in thread "main" java.util.ConcurrentModificationException
It seems like two threads - one reading the map, one editing it (probably for the next document) - bump into each other.
Are you using a single thread to do all the indexing, or multiple threads adding documents to the index? You need to explain the situation a bit more.
Murat
I tried to put the read part and the write part in two different synchronized methods, but it still throws the same exception.
Any idea how it can be solved?
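One workaround I am considering (a minimal sketch, assuming the exception comes from the indexing thread still reading a map while it is being modified for the next document; panalyzer, iwriter, d and d2 are as in the example further down): give every document its own map and never touch a map again once it has been handed to the analyzer, or switch to a ConcurrentHashMap, whose iterators never throw ConcurrentModificationException:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

//Option 1: a brand-new map per document; a map is never mutated again
//once it has been passed to the analyzer.
Map<String, Integer> mapScores = new HashMap<String, Integer>();
mapScores.put("word1", 3);
panalyzer.setMapScores(mapScores);
iwriter.addDocument(d, panalyzer);

mapScores = new HashMap<String, Integer>(); //fresh map for the next document
mapScores.put("word1", 1);
panalyzer.setMapScores(mapScores);
iwriter.addDocument(d2, panalyzer);

//Option 2: a map that tolerates concurrent readers and writers.
Map<String, Integer> safeScores = new ConcurrentHashMap<String, Integer>();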
Best,
Liat
2009/4/26 Murat Yakici <murat.yak...@cis.strath.ac.uk>
Yes, this is more or less what I had in mind. However, for this approach one requires some *prior knowledge* of the vocabulary of the document (or the collection) to produce that score before it even gets analyzed, doesn't it? And this is the paradox that I have been thinking about. If you have that knowledge, that's fine. In addition, for applications that only require a small term window to generate a score (such as a term-in-context score), this can be implemented very easily.
It is possible to inject the document-dependent boost/score generation *logic* (an interface would do, see the sketch below) into the Tokenizer/TokenStream. However, I am afraid this may carry an indexing-time penalty. If your window size is the document itself, you will be doing the same job twice (calculating the number of times a term occurs in doc X, index-time weights, etc.). IndexWriter already does these somewhere down deep.
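A rough sketch of the kind of injection I mean (TermScoreProvider and ScoringTokenStream are purely illustrative names, not an existing Lucene API):

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.index.Payload;

//Callback supplied by the indexing code; it knows the current document.
public interface TermScoreProvider {
    byte scoreFor(String term);
}

public class ScoringTokenStream extends TokenStream {
    private final Tokenizer tok;
    private final TermScoreProvider provider;

    public ScoringTokenStream(Tokenizer tokenizer, TermScoreProvider provider) {
        this.tok = tokenizer;
        this.provider = provider;
    }

    public Token next(Token t) throws IOException {
        t = tok.next(t);
        if (t != null) {
            //Ask the injected logic for this term's score, store it as payload.
            String term = new String(t.termBuffer(), 0, t.termLength());
            t.setPayload(new Payload(new byte[] { provider.scoreFor(term) }));
        }
        return t;
    }

    public void close() throws IOException {
        tok.close();
    }
}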
Simply put, I want to add some scores to documents/terms, but I can't generate that score before I observe the document/terms. If I do that, I would replicate some of the work that is already done by IndexWriter.
If I remember correctly, there is also some intention to add document-payload functionality. I have the same concerns about this, so I think we need a clear view on the topic. Where is the payload work moving? How can we generate a score without duplicating some of the work that IndexWriter is doing? I guess Michael Busch is working on document payloads for release 3.0. I would appreciate it if someone could enlighten us on how that would work and could be utilised, particularly during the analysis phase.
Cheers,
Murat
Thanks, Murat.
It was very useful - I also tried to override IndexWriter and DocumentsWriter instead, but it didn't work well. DocumentsWriter can't be overridden.
So, I didn't find a better way to make the changes.
My needs are to have different values for every term in different documents. So, just as you set the boost at the document level, I would like to set the boost for different terms within different documents.
For that matter, I made some changes in the code you sent (I coloured the changes in red originally; since the colour is lost in plain text, they are marked with "//CHANGED" comments below).
Below you can find an example of its use.
**********
private class PayloadAnalyzer extends Analyzer
{
    private PayloadTokenStream payToken = null;
    private int score;
    //CHANGED: per-term scores for the document currently being indexed.
    private Map<String, Integer> scoresMap = new HashMap<String, Integer>();

    public synchronized void setScore(int s)
    {
        score = s;
    }

    //CHANGED: called before each addDocument() with that document's map.
    public synchronized void setMapScores(Map<String, Integer> scoresMap)
    {
        this.scoresMap = scoresMap;
    }

    public final TokenStream tokenStream(String field, Reader reader)
    {
        payToken = new PayloadTokenStream(new WhitespaceTokenizer(reader)); //was: new LowerCaseTokenizer(reader)
        payToken.setScore(score);
        payToken.setMapScores(scoresMap);
        return payToken;
    }
}
private class PayloadTokenStream extends TokenStream
{
    private Tokenizer tok = null;
    private int score;
    //CHANGED: per-term scores for the document currently being indexed.
    private Map<String, Integer> scoresMap = new HashMap<String, Integer>();

    public PayloadTokenStream(Tokenizer tokenizer)
    {
        tok = tokenizer;
    }

    public void setScore(int s)
    {
        score = s;
    }

    //CHANGED: called by the analyzer with the current document's map.
    public synchronized void setMapScores(Map<String, Integer> scoresMap)
    {
        this.scoresMap = scoresMap;
    }

    public Token next(Token t) throws IOException
    {
        t = tok.next(t);
        if (t != null)
        {
            //t.setTermBuffer("can change");
            //Do something with the data
            // t.setPayload(new Payload(("score:" + score).getBytes()));
            //CHANGED: look up this term's score and store it as a one-byte
            //payload; default to 0 if the term is missing from the map
            //(a plain scoresMap.get(word) would NPE on auto-unboxing).
            String word = new String(t.termBuffer(), 0, t.termLength());
            Integer termScore = scoresMap.get(word);
            byte payload = (termScore == null) ? 0 : termScore.byteValue();
            t.setPayload(new Payload(new byte[] { payload }));
        }
        return t;
    }

    public void reset(Reader input) throws IOException
    {
        tok.reset(input);
    }

    public void close() throws IOException
    {
        tok.close();
    }
}
**********************************
*Example for the use of payloads:*
PayloadAnalyzer panalyzer = new PayloadAnalyzer();
File index = new File("TestSearchIndex");
IndexWriter iwriter = new IndexWriter(index, panalyzer, IndexWriter.MaxFieldLength.UNLIMITED);

Document d = new Document();
d.add(new Field("text", "word1 word2 word3", Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES));
d.add(new Field("id", "1^3", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.NO));

//We set the scores for the terms of the document that will be analyzed.
/*I was worried about this part - document dependent score which may be utilized*/
Map<String, Integer> mapScores = new HashMap<String, Integer>();
mapScores.put("word1", 3);
mapScores.put("word2", 1);
mapScores.put("word3", 1);
panalyzer.setMapScores(mapScores);
iwriter.addDocument(d, panalyzer);

d = new Document();
d.add(new Field("text", "word1 word2 word3", Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES));
d.add(new Field("id", "2^3", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.NO));

//A fresh map for the second document: same terms, different scores.
mapScores = new HashMap<String, Integer>();
mapScores.put("word1", 1);
mapScores.put("word2", 3);
mapScores.put("word3", 1);
panalyzer.setMapScores(mapScores);
iwriter.addDocument(d, panalyzer);
/*-----------------*/

// iwriter.commit();
iwriter.optimize();
iwriter.close();

BooleanQuery bq = new BooleanQuery();
BoostingTermQuery tq = new BoostingTermQuery(new Term("text", "word1"));
tq.setBoost(1.0f);
bq.add(tq, BooleanClause.Occur.MUST);
tq = new BoostingTermQuery(new Term("text", "word2"));
tq.setBoost(3.0f);
bq.add(tq, BooleanClause.Occur.SHOULD);
tq = new BoostingTermQuery(new Term("text", "word3"));
tq.setBoost(1.0f);
bq.add(tq, BooleanClause.Occur.SHOULD);

IndexSearcher searcher1 = new IndexSearcher("TestSearchIndex");
searcher1.setSimilarity(new WordsSimilarity());
TopDocs topDocs = searcher1.search(bq, null, 3);
Hits hits1 = searcher1.search(bq);
for (int j = 0; j < hits1.length(); j++)
{
    //explain() expects the internal document id, not the position in Hits.
    Explanation explanation = searcher1.explain(bq, hits1.id(j));
    System.out.println("**** " + hits1.score(j) + " " + hits1.doc(j).getField("id").stringValue() + " *****");
    System.out.println(explanation.toString());
    System.out.println("********************************************************");
    System.out.println("Score " + topDocs.scoreDocs[j].score + " doc " + searcher1.doc(topDocs.scoreDocs[j].doc).getField("id").stringValue());
    System.out.println("********************************************************");
}
If you try the same query with different boosts, you will get a different order for the documents.
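(WordsSimilarity itself is not included above; roughly, all it needs to do for BoostingTermQuery is override scorePayload so that the payload byte written at index time feeds into the score. A minimal version would look something like this:)

import org.apache.lucene.search.DefaultSimilarity;

public class WordsSimilarity extends DefaultSimilarity
{
    public float scorePayload(String fieldName, byte[] payload, int offset, int length)
    {
        if (payload == null || length == 0)
        {
            return 1.0f; //no payload stored for this position
        }
        //The single byte stored by PayloadTokenStream becomes a score factor.
        return (float) payload[offset];
    }
}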
Does it look ok?
Thanks again!
Liat
2009/4/25 Murat Yakici <murat.yak...@cis.strath.ac.uk>
Here is what I am doing, not so magical... There are two classes, an analyzer and a TokenStream, into which I can inject my document-dependent data to be stored as a payload.
private PayloadAnalyzer panalyzer = new PayloadAnalyzer();

private class PayloadAnalyzer extends Analyzer {
    private PayloadTokenStream payToken = null;
    private int score;

    public synchronized void setScore(int s) {
        score = s;
    }

    public final TokenStream tokenStream(String field, Reader reader) {
        payToken = new PayloadTokenStream(new LowerCaseTokenizer(reader));
        payToken.setScore(score);
        return payToken;
    }
}

private class PayloadTokenStream extends TokenStream {
    private Tokenizer tok = null;
    private int score;

    public PayloadTokenStream(Tokenizer tokenizer) {
        tok = tokenizer;
    }

    public void setScore(int s) {
        score = s;
    }

    public Token next(Token t) throws IOException {
        t = tok.next(t);
        if (t != null) {
            //t.setTermBuffer("can change");
            //Do something with the data
            byte[] bytes = ("score:" + score).getBytes();
            t.setPayload(new Payload(bytes));
        }
        return t;
    }

    public void reset(Reader input) throws IOException {
        tok.reset(input);
    }

    public void close() throws IOException {
        tok.close();
    }
}
public void doIndex() {
    try {
        File index = new File("./TestPayloadIndex");
        IndexWriter iwriter = new IndexWriter(index, panalyzer,
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document d = new Document();
        d.add(new Field("content", "Everyone, someone, myTerm, yourTerm",
                Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
        //We set the score for the terms of the document that will be analyzed.
        /*I was worried about this part - document dependent score
          which may be utilized*/
        panalyzer.setScore(5);
        iwriter.addDocument(d, panalyzer);
        /*-----------------*/
        ...
        iwriter.commit();
        iwriter.optimize();
        iwriter.close();

        //Now read the index
        IndexReader ireader = IndexReader.open(index);
        TermPositions tpos = ireader.termPositions(
                new Term("content", "myterm")); //note: LowerCaseTokenizer lowercased "myTerm"
        while (tpos.next()) {
            int pos;
            for (int i = 0; i < tpos.freq(); i++) {
                pos = tpos.nextPosition();
                if (tpos.isPayloadAvailable()) {
                    byte[] data = new byte[tpos.getPayloadLength()];
                    tpos.getPayload(data, 0);
                    //Utilise payloads;
                }
            }
        }
        tpos.close();
    } catch (CorruptIndexException ex) {
        //
    } catch (LockObtainFailedException ex) {
        //
    } catch (IOException ex) {
        //
    }
}
I wish it were designed better... Please let me know if you guys have a better idea.
Cheers,
Murat
Dear Murat,
I saw your question and wondered how you implemented these changes?
The requirements below are the same ones I am trying to code now.
Did you modify the source code itself, or did you only use Lucene's jar and override code?
I would very much appreciate it if you could give me a short explanation of how it was done.
Thanks a lot,
Liat
2009/4/21 Murat Yakici <murat.yak...@cis.strath.ac.uk>
Hi,
I started playing with the experimental payload functionality. I have written an analyzer which adds a payload (some sort of a score/boost) for each term occurrence. The payload/score for each term is dependent on the document that the term comes from (I guess this is the typical use case). So, say, term t1 may have a payload of 5 in doc1 and 34 in doc5. The parameter for calculating the payload changes after each indexWriter.addDocument(..) method call in a while loop. I am assuming that the indexWriter.addDocument(..) methods are thread safe. Can I confirm this?
Cheers,
--
Murat Yakici
Department of Computer & Information Sciences
University of Strathclyde
Glasgow, UK
-------------------------------------------
The University of Strathclyde is a charitable body, registered in
Scotland,
with registration number SC015263.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org