First, I apologize for the double post on my earlier email. The first time I sent it, I received an error message from [EMAIL PROTECTED] saying that I should instead send email to [EMAIL PROTECTED], so I thought it had not gone through.

My question is this: is there a way to use the Lucene/Solr infrastructure to create a mini-index that is simply a lookup table of terms and the number of times each has appeared? I do not need to record which documents contain them, nor where in those documents they appear. There could be (and probably will be) more than 2^32 distinct terms, however. I'm not sure whether that matters to the Lucene backend, but it seemed relevant to mention.

This question ties into my earlier question about counting how many times a given term is associated with another term. I figure that would be most easily accomplished by building the mini-index described above alongside the normal index as each document is indexed. For example, when scanning:

Bravely bold Sir Robin, brought forth from Camelot. He was not afraid to die! Oh, brave Sir Robin!

In addition to the normal indexing function of Lucene, I would like to write something on the backend to also index:

bravely|bold
bravely|sir
bravely|robin
bravely|brought
bravely|forth
bold|sir
bold|robin
bold|brought
bold|forth
bold|camelot  ("from" being a stop word)
...and so on
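To make the pairing rule above concrete, here is a minimal Java sketch of what I have in mind: each token is paired with the next few non-stop tokens after it, producing "a|b" terms. The class and method names, the window size, and the tiny stop-word set are all hypothetical; a real implementation would reuse the analyzer's tokenization and stop-word list rather than the crude regex split used here.

```java
import java.util.*;

public class PairEmitter {
    // Hypothetical stop-word set just for this sketch; a real
    // implementation would use the analyzer's stop-word list.
    static final Set<String> STOP = new HashSet<>(Arrays.asList(
        "from", "he", "was", "not", "to", "oh"));

    /**
     * Emit "a|b" pair terms: each token paired with up to `window`
     * non-stop tokens that follow it. Tokenization is a crude
     * lowercase split on non-letters, stand-in for a real analyzer.
     */
    public static List<String> pairs(String text, int window) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty() && !STOP.contains(t)) tokens.add(t);
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            for (int j = i + 1; j < tokens.size() && j <= i + window; j++) {
                out.add(tokens.get(i) + "|" + tokens.get(j));
            }
        }
        return out;
    }
}
```

With a window of 5, the first sentence of the Sir Robin example yields exactly the list above: "bravely" pairs with the five tokens after it, and "bold" skips the stop word "from" to reach "camelot".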

I only need to keep a running total for each combined term like "bravely|bold", however; since the number of terms will be quite large, tracking documents and term positions would waste a lot of disk space.

If such a thing is not already in place, could someone point me to tutorials, documentation, or presentations that describe the inner workings of Lucene: the theory and implementation behind the actual file formats, data structures, data manipulations, etc.? (The javadocs don't go into that level of detail.) I'm sure I could sift through the code and eventually make sense of it, but if documentation exists, I'd prefer to peruse that first. My thought is that I could generate my own kind of hash for each combined term and write it out to a custom file structure similar to Lucene's, but the specifics of how to do so (optimally) are not yet plain to me.
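The "running total only" part is the simplest piece to sketch. Assuming the pair terms arrive one at a time from the indexing side, something like the hypothetical class below keeps just a count per term, with no doc IDs or positions. For more than 2^32 distinct terms this in-memory map would of course have to be partitioned, spilled to sorted on-disk runs, and merged, which is roughly what Lucene itself does with segment merging; none of that is shown here.

```java
import java.util.HashMap;
import java.util.Map;

public class PairCounter {
    // Running total per pair term; deliberately no document IDs
    // or term positions, only the aggregate count.
    private final Map<String, Long> counts = new HashMap<>();

    /** Increment the running total for one pair term, e.g. "bravely|bold". */
    public void add(String pairTerm) {
        counts.merge(pairTerm, 1L, Long::sum);
    }

    /** Current total for a pair term, or 0 if it has never been seen. */
    public long count(String pairTerm) {
        return counts.getOrDefault(pairTerm, 0L);
    }
}
```

The point of the sketch is just that the payload per term collapses to a single long, which is why skipping document/position tracking saves so much space.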
Thanks again!
-Sven


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]