First - I apologize for the double post on my earlier email. The first
time I sent it I received an error message from [EMAIL PROTECTED]
saying that I should instead send email to [EMAIL PROTECTED] so I
thought it did not go through.
My question is this - is there a way to use the Lucene/Solr
infrastructure to create a mini-index that simply contains a lookup
table of terms and the number of times they have appeared?
I do not need to record which documents have them nor do I need to know
where in the documents they appear. There could be (and probably will
be) more than 2^32 terms, however. I'm not sure if that makes a
difference to the Lucene backend, but thought it might be relevant.
This question coincides with my earlier question about counting the
times a given term is associated with another term. I figure that this
would be more easily accomplished by making the mini-index described
above alongside the normal index when a document is indexed. For
example, when scanning:
Bravely bold Sir Robin, brought forth from Camelot. He was not afraid
to die! Oh, brave Sir Robin!
In addition to the normal indexing function of Lucene, I would like to
write something on the backend to also index:
bravely|bold
bravely|sir
bravely|robin
bravely|brought
bravely|forth
bold|sir
bold|robin
bold|brought
bold|forth
bold|camelot ("from" being a stop word)
...and so on
I only need to keep a running total of each "bravely|bold" term,
however, since the number of terms will be quite large and keeping track
of the document/termpositions would translate to a lot of wasted HD space.
If such a thing is not already in place, could someone let me know if
there are some tutorials, documentation, or presentations that describe
the inner workings of Lucene and the theories/implementation at work for
the actual file formats, structures, data manipulations, etc? (The
javadocs don't go into this kind of detail.) I'm sure I can sift
through the code and eventually make sense of it, but if there is
documentation out there, I'd prefer to peruse that first. My thought
being that I can simply generate my own kind of hash for each combined
term and write it out to a custom file structure similar to Lucene - but
the specifics of how to (optimally) do so are not plain to me yet.
Thanks again!
-Sven
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]