Computing document frequencies for specific queries in Lucene

aengle1429 Thu, 23 Jun 2011 13:07:31 -0700

Hello, I am trying to get the following results. Let's say I have
3 XML files that I parse using SAX:
<?xml version="1.0" encoding="UTF-8"?>
<person>
   <name>bob bob bob
   </name>
   <name>3m
   </name>
   <height>3m
   </height>
   <height>bob
   </height>
</person>


<?xml version="1.0" encoding="UTF-8"?>
<person>
   <name>bob
   </name>
   <name>bob
   </name>
   <name>bob bob
   </name>      
   <height>3m
   </height>
   <height>bob
   </height>
</person>

<?xml version="1.0" encoding="UTF-8"?>
<person>
   <name>bob
   </name>
   <name>bob
   </name>
   <height>bob
   </height>
</person>

I currently index each repeated <name> tag under a separate field, so I have
3 /person/name fields in total: /person/name0, /person/name1, and
/person/name2.

I want to compute how many times a query appears in a given logical field
(/person/name). Let's say the query is "bob".

I want to see the total number of times it appears: 9
I also want to see how many times it appears across all documents, counting
each name field at most once: 6
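
To make the two target numbers concrete, here is a minimal plain-Java sketch
that recomputes them from the <name> values of the three example files. It
does not use Lucene at all. The class and method names are made up for
illustration, and splitting on whitespace is an assumption standing in for
whatever analyzer the real index uses.

```java
import java.util.Arrays;
import java.util.List;

public class NameFieldCounts {

    // Each inner list holds the <name> values of one <person> file above.
    static List<List<String>> exampleDocs() {
        return Arrays.asList(
            Arrays.asList("bob bob bob", "3m"),
            Arrays.asList("bob", "bob", "bob bob"),
            Arrays.asList("bob", "bob"));
    }

    // Number of whitespace-separated tokens in one value that equal the term.
    static int occurrences(String value, String term) {
        int count = 0;
        for (String token : value.trim().split("\\s+")) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // First quantity: every occurrence of the term in any <name> value.
    static int totalOccurrences(List<List<String>> docs, String term) {
        int total = 0;
        for (List<String> names : docs) {
            for (String value : names) {
                total += occurrences(value, term);
            }
        }
        return total;
    }

    // Second quantity: number of <name> values that contain the term at least once.
    static int valuesContaining(List<List<String>> docs, String term) {
        int total = 0;
        for (List<String> names : docs) {
            for (String value : names) {
                if (occurrences(value, term) > 0) {
                    total++;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> docs = exampleDocs();
        System.out.println(totalOccurrences(docs, "bob")); // prints 9
        System.out.println(valuesContaining(docs, "bob")); // prints 6
    }
}
```

The second count is the same number you get by summing docFreq() over
/person/name0, /person/name1 and /person/name2 (3 + 2 + 1).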

My current solution is to call TermDocs for the first question and, for the
second, to iterate through the numbered fields (/person/nameX), adding up the
docFreq() of each one (so there are two loops).

This gets very slow. Ideally, I would like to index them all under
/person/name, but I still really need these answers. Does anyone have any
ideas? I can offer more clarification and some source code, but my current
method is very slow: I need to index ~4 million files and compute these
quantities, which takes a long time when you have 150 fields of
/person/actor/movie_acted_in and 4 million documents.

Thank you very much!

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Computing-document-frequencies-for-specific-queries-in-Lucene-tp3101450p3101450.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
