constructing a mini-index with just the number of hits for a term

Sven Thu, 13 Nov 2008 08:06:53 -0800

First - I apologize for the double post on my earlier email. The first time I sent it I received an error message from [EMAIL PROTECTED] saying that I should instead send email to [EMAIL PROTECTED] so I thought it did not go through.

My question is this - is there a way to use the Lucene/Solr infrastructure to create a mini-index that simply contains a lookup table of terms and the number of times they have appeared?

I do not need to record which documents have them nor do I need to know where in the documents they appear. There could be (and probably will be) more than 2^32 terms, however. I'm not sure if that makes a difference to the Lucene backend, but thought it might be relevant.

This question coincides with my earlier question about counting the times a given term is associated with another term. I figure that this would be more easily accomplished by making the mini-index described above alongside the normal index when a document is indexed. For example, when scanning:

Bravely bold Sir Robin, brought forth from Camelot. He was not afraid to die! Oh, brave Sir Robin!

In addition to the normal indexing function of Lucene, I would like to write something on the backend to also index:


bravely|bold
bravely|sir
bravely|robin
bravely|brought
bravely|forth
bold|sir
bold|robin
bold|brought
bold|forth
bold|camelot  ("from" being a stop word)
...and so on
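
To make the pairing rule concrete, here is a rough Python sketch of it. It is not Lucene code. The window of five following terms is inferred from the listing above ("bravely" pairs with the next five terms, and "bold" reaches "camelot" once "from" is skipped), and the stop-word list and the names tokenize and count_pairs are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real analyzer would supply its own.
STOP_WORDS = {"from", "he", "was", "not", "to", "oh", "the", "a", "an"}


def tokenize(text):
    """Lowercase the text, split on non-letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def count_pairs(text, window=5, counts=None):
    """Add one count for each term paired with the next `window` terms.

    The window size of 5 is an assumption read off the example listing.
    Pass an existing Counter as `counts` to keep a running total
    across documents.
    """
    counts = Counter() if counts is None else counts
    tokens = tokenize(text)
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            counts[left + "|" + right] += 1
    return counts


text = ("Bravely bold Sir Robin, brought forth from Camelot. "
        "He was not afraid to die! Oh, brave Sir Robin!")
pair_counts = count_pairs(text)
```

Only the combined term and its running total are kept here; no document IDs or positions are stored.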

I only need to keep a running total of each "bravely|bold" term, however, since the number of terms will be quite large and keeping track of the document/term positions would translate to a lot of wasted HD space.

If such a thing is not already in place, could someone let me know if there are some tutorials, documentation, or presentations that describe the inner workings of Lucene and the theories/implementation at work for the actual file formats, structures, data manipulations, etc? (The javadocs don't go into this kind of detail.) I'm sure I can sift through the code and eventually make sense of it, but if there is documentation out there, I'd prefer to peruse that first. My thought being that I can simply generate my own kind of hash for each combined term and write it out to a custom file structure similar to Lucene - but the specifics of how to (optimally) do so are not plain to me yet.
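
As a rough illustration of the "own hash plus custom file" idea, the Python sketch below hashes each combined term to a 64-bit key and writes fixed-width (key, count) records sorted by key, so a lookup can binary-search the file. This is an assumed layout for discussion only, not Lucene's file format, and the names term_key, write_table and lookup are hypothetical. A 64-bit key leaves room for more than 2^32 terms, but distinct terms can still collide, in which case their counts are merged.

```python
import hashlib
import struct

# One record: 8-byte hash key followed by an 8-byte count, big-endian.
RECORD = struct.Struct(">QQ")


def term_key(term):
    """Return a 64-bit key for a combined term (collisions are possible)."""
    digest = hashlib.blake2b(term.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def write_table(counts, path):
    """Write (key, count) records sorted by key so lookups can bisect."""
    merged = {}
    for term, n in counts.items():
        key = term_key(term)
        merged[key] = merged.get(key, 0) + n  # colliding terms share a count
    with open(path, "wb") as f:
        for key in sorted(merged):
            f.write(RECORD.pack(key, merged[key]))


def lookup(path, term):
    """Binary-search the record file for a term's count; 0 if absent."""
    target = term_key(term)
    with open(path, "rb") as f:
        f.seek(0, 2)  # jump to end of file to learn the record count
        lo, hi = 0, f.tell() // RECORD.size
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD.size)
            key, count = RECORD.unpack(f.read(RECORD.size))
            if key == target:
                return count
            if key < target:
                lo = mid + 1
            else:
                hi = mid
    return 0
```

Each entry costs 16 bytes regardless of term length. The sketch rewrites the whole table in one pass; updating counts incrementally at indexing time would need a merge step that is not shown.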

Thanks again!
-Sven

