hmm ok i tried it but to no avail. It would have confused me even more to be honest. actually i would not have used a Document Collector at all, because I was supposed to give all results even when queried "the". What i mean is that i would not need the score at all. I just didn't know how ;) Anyway a scoring system should not "invent" token i think.
but thanks jan Am Donnerstag, den 18.11.2010, 13:05 -0500 schrieb Pulkit Singhal: > Briefly looked at your code and there is no way that I'm right about > this but I'll say it anyway: > Every single field you index doesn't have any NORMS so how will the > scoring happen? > It probably happens based on the matches at query time but its not > like you are specifying any boosts in you query. > Lucene has a complex scoring formula that I don't claim to fully > understand ... but what if somehow (stay with me, don't shoot the > messenger) due to the fact that you have no NORMS at all, the results > being collected somehow give a score to the document that doesn't have > a match at all and therefore present it in the results? > > Just a theory (a bad one perhaps) ... but one which can be easily > blown away by using ANALYZED in your indexer and then trying again. > > - Pulkit > > On Thu, Nov 18, 2010 at 12:55 PM, Pulkit Singhal > <pulkitsing...@gmail.com> wrote: > > Wow, you live in a really great country and attend an awesome > > university where they have classes like "Text Analytics" I'm gonna > > send my kid there to study :) > > > > In all seriousness I think the problem may be with how you are > > collecting your results. > > > > I find this very amusing: > >> 80. 896889 phrase occurs 0 times > > > > How can it claim there are zero hits and still be returning you a result? > > Weird. > > > > Have you tried removing all other docs and then only leaving the one > > problem child in there indexing just that and seeing what comes back? > > > > On Wed, Nov 17, 2010 at 1:19 PM, Jan <fajer...@informatik.hu-berlin.de> > > wrote: > >> thats what i figured...i can't find out what i'm doing wrong though ;) > >> > >> so the query is "experiment" (i know not really a phrase...but the > >> assignment requested precisely so). The program constructs the following > >> query > >> > >> +(AbstractText:"experiment" ArticleTitle:"experiment") > >> > >> which looks good to me. the results look like this: > >> > >> Found 95 hits. > >> 1. 19810 phrase occurs 3 times > >> 2. 587340 phrase occurs once > >> ... > >> 80. 896889 phrase occurs 0 times > >> ... > >> 95. 900325 phrase occurs once > >> > >> so here is the document 896889 > >> PMID > >> 896889 > >> ArticleTitle > >> Estrogen-induced sexual receptivity and localization of > >> 3H-estradiol in > >> brains of female mice: effects of 5 alpha-reduced androgens, progestins > >> and cyproterone acetate. > >> AbstractText > >> Sexual receptivity induced in ovariectomized CD-1 mice with chronic > >> daily administration of estradiol benzoate (E2 B) was blocked by > >> concurrent administration of the 5 alpha-reduced androgen, > >> dihydrotestosterone (DHT). Receptivity was restored in these females > >> with progesterone-, but not with dihydroprogesterone-priming 6 hr prior > >> to testing. Delaying the DHT injections until 12 hr after the E2 B > >> injections greatly reduced its inhibitory properties. Receptivity in E2 > >> B-primed females was also blocked by concurrent treatment with > >> cyproterone acetate and 3 alpha-, but not 3 beta-adrostanediol. > >> Pretreatment with DHT, or 3 alpha- or 3 beta-androstanediol failed to > >> consistently affects 3H-estradiol accumulation in crude nuclear and > >> supernatant fractions from brain and pituitary > >> > >> so apart from doing something wrong while indexing/analyzing (the text > >> above is from the xml, but i double checked...it is put in teh index > >> with these textfragments) or so, the token "experiment" does not even > >> occur. thats what baffles me. > >> > >> thanks for the very quick reaction > >> jan > >> > >> Am Mittwoch, den 17.11.2010, 12:57 -0500 schrieb Donna L Gresh: > >>> As it is probably more likely that you're doing something incorrect than > >>> that Lucene is reporting incorrect results :), it might help if you > >>> reported the exact query that is being submitted to the IndexSearcher, and > >>> then showing us the document that was incorrectly returned. My guess is > >>> that either looking at the query itself will immediately reveal the > >>> problem to you, or that the query in combination with the document and > >>> knowledge of which analyzers you are using will reveal the problem- > >>> > >>> Donna > >>> > >>> > >>> Jan <fajer...@informatik.hu-berlin.de> wrote on 11/17/2010 11:47:49 AM: > >>> > >>> > [image removed] > >>> > > >>> > uncorrect results > >>> > > >>> > Jan > >>> > > >>> > to: > >>> > > >>> > java-user > >>> > > >>> > 11/17/2010 11:51 AM > >>> > > >>> > Please respond to java-user > >>> > > >>> > Hi, > >>> > i have an assignment in my Text Analytics class. I am supposed to create > >>> > an index and search it. The corpus is a PubMed-like XML file. it is > >>> > possible to query terms (programcall a few terms) and phrases > >>> > (programcall "a phrase"). > >>> > When a phrase is queried the program should answer how often the phrase > >>> > occured. > >>> > The problem is, on certain queries the IndexSearcher returns some > >>> > documents that do not have that particular query in its fields. > >>> > I'd be delighted if someone could tell me what i am doing wrong. > >>> > See the source code at my github repo > >>> > > >>> https://github.com/jangingnicht/TextAnalytics2/tree/master/src/textanalytics2/ > >>> > >>> > > >>> > Thanks in advance > >>> > jan > >>> > > >>> > PS: I use Lucene 3.0.2 and the OpenJDK Runtime Environment (IcedTea6 > >>> > 1.8.2) on an 64 bit Linux machine. > >>> > [attachment "signature.asc" deleted by Donna L Gresh/Watson/IBM] > >> > >> > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org >
signature.asc
Description: Dies ist ein digital signierter Nachrichtenteil