Re: How to get mapping of query terms to number of their occurrences in a doc?

Erik Hatcher Thu, 09 Feb 2006 01:11:19 -0800

This is a real gotcha with Lucene in its out-of-the-box configuration.
In the several applications I've built to index documents I've always
hit this and had to set the maxFieldLength to its maximum possible
value. Is there still an argument to be made to keep the default at 10K,
or would it be reasonable to bump this up even if there are a few cases
where setting it lower is desirable? We made the compound file index the
default to prevent file handle limitations at the expense of some
(generally irrelevant) performance, so maybe we could make this more
common setting the default as well?


        Erik


On Feb 8, 2006, at 2:17 PM, Dmitry Goldenberg wrote:

Duh! Bingo! Mystery solved. I should have thought of this :)
The discrepancies come in with larger documents, definitely > 10K terms,
which is Lucene's default maxFieldLength.
Thanks for your help, Chris

- Dmitry
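
For readers following along, here is a minimal plain-Java sketch of the effect described above. It does not use Lucene: `countWithLimit` and `syntheticDoc` are hypothetical helpers that only mimic dropping every token past a field-length cap, which is how a long document can report fewer occurrences than a manual count.

```java
import java.util.ArrayList;
import java.util.List;

public class TruncationDemo {

    // Hypothetical stand-in for a field-length cap: only the first
    // maxFieldLength tokens are counted, everything after is ignored.
    static int countWithLimit(List<String> tokens, String term, int maxFieldLength) {
        int limit = Math.min(tokens.size(), maxFieldLength);
        int count = 0;
        for (int i = 0; i < limit; i++) {
            if (tokens.get(i).equals(term)) {
                count++;
            }
        }
        return count;
    }

    // Builds a synthetic document of `size` tokens in which every
    // 500th token (starting at position 0) is "biologi".
    static List<String> syntheticDoc(int size) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            tokens.add(i % 500 == 0 ? "biologi" : "filler");
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> doc = syntheticDoc(20000);
        System.out.println("capped at 10000: " + countWithLimit(doc, "biologi", 10000));
        System.out.println("uncapped:        " + countWithLimit(doc, "biologi", Integer.MAX_VALUE));
    }
}
```

With a 20,000-token document, the capped count is 20 and the uncapped count is 40: the second half of the document never contributes.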

________________________________

From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Wed 2/8/2006 10:04 AM
To: java-user@lucene.apache.org
Subject: RE: How to get mapping of query terms to number of their occurrences in a doc?

: That's what I did, for debugging. The query is "biology", and here's
: what the API tells me for term frequencies:
: biolog 15
: biologi 31
: biologist 4
:
: I actually see 13 occurrences of "biologist" and "biologists", 64
: occurrences of "biology", 27 occurrences of "biological".
:
: I see "inform 22" but the actual count of the word "information" in the
: document is 33.  But "ioniz 7" is correct.

I think I misunderstood what you meant when you said the counts don't
match up. Are you comparing the number you get from that code with the
number of times you personally see the word in the source document before
it has been analyzed?

If so, then there could be a couple of things going on ... I would start
by using a tool like Luke to see the actual lists of Terms for each doc --
there may be something else your analyzer is doing that you don't realize.

It's also possible that you are hitting the maxFieldLength in the
IndexWriter ... when that happens IndexWriter throws away any remaining
tokens, so if your documents are really large, terms past that limit
would never be counted.

Lastly, I would add a *lot* more debugging to your code. Print out the
contents of "terms", when you loop over "tfvs" print out the field and the
full list of strTerms, in the innermost loop when you increment the
count, print out the field/text/and count.

That's the best advice I have for spotting what's wrong.
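
A minimal sketch of the kind of logging suggested above, in plain Java without Lucene. `FieldVector` is a hypothetical stand-in for one field's term vector (field name plus parallel arrays of terms and frequencies), `QueryTerm` is a hypothetical stand-in for a query term, and the field name "body" is made up for the example. It logs every match and keeps a per-term count rather than one running total, which makes it easier to see which term comes up short.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TermCountDebug {

    // Hypothetical stand-in for one field's term vector:
    // parallel arrays of analyzed terms and their frequencies.
    static final class FieldVector {
        final String field;
        final String[] terms;
        final int[] freqs;

        FieldVector(String field, String[] terms, int[] freqs) {
            this.field = field;
            this.terms = terms;
            this.freqs = freqs;
        }
    }

    // Hypothetical stand-in for a query term; text is assumed to be
    // already analyzed (stemmed) the same way the document was.
    static final class QueryTerm {
        final String field;
        final String text;

        QueryTerm(String field, String text) {
            this.field = field;
            this.text = text;
        }
    }

    // Returns a map from "field:text" to that term's frequency in the document.
    static Map<String, Integer> countPerTerm(List<QueryTerm> queryTerms, List<FieldVector> vectors) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (FieldVector v : vectors) {
            // Index this field's terms once so each query term is a lookup, not a scan.
            Map<String, Integer> freqByText = new HashMap<>();
            for (int j = 0; j < v.terms.length; j++) {
                freqByText.put(v.terms[j], v.freqs[j]);
            }
            System.out.println("field=" + v.field + " terms=" + freqByText.keySet());

            for (QueryTerm q : queryTerms) {
                Integer freq = freqByText.get(q.text);
                if (q.field.equals(v.field) && freq != null) {
                    System.out.println("match field=" + q.field + " text=" + q.text + " freq=" + freq);
                    counts.merge(q.field + ":" + q.text, freq, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Frequencies taken from the numbers quoted earlier in the thread.
        List<FieldVector> vectors = Collections.singletonList(
            new FieldVector("body", new String[] {"biolog", "biologi", "biologist"}, new int[] {15, 31, 4}));
        List<QueryTerm> queryTerms = Collections.singletonList(new QueryTerm("body", "biologi"));
        System.out.println(countPerTerm(queryTerms, vectors));
    }
}
```

For the sample data, the returned map holds a single entry, body:biologi with a count of 31. Comparing the logged term list against the raw document is what would reveal terms that are missing or undercounted.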



: ________________________________
:
: From: Chris Hostetter [mailto:[EMAIL PROTECTED]
: Sent: Tue 2/7/2006 4:10 PM
: To: java-user@lucene.apache.org
: Subject: Re: How to get mapping of query terms to number of their occurrences in a doc?
:
:
:
:
: A cursory reading of your code looks ok ... stemming shouldn't be an issue
: as long as your measure of success is comparing docs that match your
: original query with the counts you get out.
:
: What I mean by that is that any stemming should have already been taken
: care of when your query object was constructed (either by you manually, or
: by QueryParser). The direct equals comparisons you are doing should be
: fine.
:
: Have you tried adding logging of the raw term field/text and the freq
: counts you get back to see if that helps you spot the problem?
:
:
: : Date: Mon, 6 Feb 2006 14:34:05 -0800
: : From: Dmitry Goldenberg <[EMAIL PROTECTED]>
: : Reply-To: java-user@lucene.apache.org
: : To: java-user@lucene.apache.org
: : Subject: How to get mapping of query terms to number of their occurrences
: :     in a doc?
: :
: : Given a query, I want to be able to, for each query term, get the
: : number of occurrences of the term. I have tried what I'm including
: : below and it does not seem to provide reliable results. Seems to work
: : fine with exact matching but as soon as stemming kicks in, all bets
: : are off as to the value of the number of occurrences returned.
: :
: : Any ideas, anyone? Can this be written in a simpler and/or more
: : efficient way?
: : Thanks -
: :
: :       int totalOccurrences = 0;
: :
: :       reader = IndexReader.open(getDirectory(indexDirPath));
: :       HashSet terms = new HashSet();
: :       query.extractTerms(terms);
: :
: :       TermFreqVector[] tfvs = reader.getTermFreqVectors(docId);
: :       if (tfvs != null) {
: :
: :         // For each term frequency vector (i.e. for each field)
: :         for (int i = 0; i < tfvs.length; i++) {
: :           String field = tfvs[i].getField();
: :           String[] strTerms = tfvs[i].getTerms();
: :           int[] tfs = tfvs[i].getTermFrequencies();
: :
: :           if (strTerms != null) {
: :
: :             // For each term in the query
: :             for (Iterator iter = terms.iterator(); iter.hasNext();) {
: :
: :               Term term = (Term) iter.next();
: :               // For each term in the vector
: :               for (int j = 0; j < strTerms.length; j++) {
: :
: :                 // If found the query term among the vector terms
: :                 if (field.equals(term.field()) && strTerms[j].equals(term.text())) {
: :
: :                   // Add the term frequency to the total
: :                   totalOccurrences += tfs[j];
: :
: :                 }
: :               }
: :             }
: :           }
: :         }
: :       }
: :
:
:
:
: -Hoss
:
:
:---------------------------------------------------------------------
: To unsubscribe, e-mail: [EMAIL PROTECTED]
: For additional commands, e-mail: [EMAIL PROTECTED]
:
:
:
:
:



-Hoss

