Flexible indexing (LUCENE-1458) should make this possible.

I.e., you could use your own codec which discards doc/freq/prox/payload data during indexing (for this one field) and simply stores the term frequency in the terms dict. However, one problem will be deletions (in case they matter to your app): in order to properly update the terms dict counts, SegmentMerger has to walk through the docIDs for the term and skip the deleted ones.
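Just to illustrate the deletions point with today's (2.x) API: TermDocs only visits non-deleted documents, so getting the "true" total for a term means doing roughly the walk below, which is the same walk such a codec (or SegmentMerger) would have to do to keep a stored count correct. This is an untested sketch; the field/term names are only placeholders:

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class TermTotal {
  // Sum occurrences of one term across all live (non-deleted) docs.
  // TermDocs skips deleted documents, so deletions are handled by
  // walking the postings -- the walk SegmentMerger has to do.
  public static long totalFreq(IndexReader reader, String field, String text)
      throws IOException {
    long total = 0;
    TermDocs td = reader.termDocs(new Term(field, text));
    try {
      while (td.next()) {
        total += td.freq();
      }
    } finally {
      td.close();
    }
    return total;
  }
}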

But it will be some time before this is real, though there's an initial patch on LUCENE-1458.

Mike

Grant Ingersoll wrote:

Can you share what the actual problem is that you are trying to solve? It might help put things in context for me. I'm guessing you are doing some type of co-occurrence analysis, but...

More below.

On Nov 13, 2008, at 11:08 AM, Sven wrote:

First - I apologize for the double post on my earlier email. The first time I sent it, I received an error message from [EMAIL PROTECTED] saying that I should instead send email to [EMAIL PROTECTED], so I thought it did not go through. My question is this - is there a way to use the Lucene/Solr infrastructure to create a mini-index that simply contains a lookup table of terms and the number of times they have appeared?

This could be possible. I think I would create documents with Index.ANALYZED and Store.NO. Then you just need to use the TermEnum and TermDocs to access the information that you need. In a sense, you are just creating the term dictionary. You could also turn off storing of NORMS, which will save space too.
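Something along these lines (an untested sketch against the 2.4 APIs; the directory path and field name are only placeholders) - index with ANALYZED_NO_NORMS and Store.NO, then walk the term dictionary and sum freq() per term:

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class TermCountDemo {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.getDirectory(new File("/tmp/term-index"));

    // Analyzed, not stored, no norms: essentially just the term dictionary
    // plus minimal postings.
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true,
        IndexWriter.MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    doc.add(new Field("body", "brave brave Sir Robin ran away",
        Field.Store.NO, Field.Index.ANALYZED_NO_NORMS));
    writer.addDocument(doc);
    writer.close();

    // Walk the term dictionary; sum freq() over TermDocs for total occurrences.
    IndexReader reader = IndexReader.open(dir);
    TermEnum terms = reader.terms();
    while (terms.next()) {
      long total = 0;
      TermDocs td = reader.termDocs(terms.term());
      while (td.next()) {
        total += td.freq();
      }
      td.close();
      System.out.println(terms.term().text() + " docFreq=" + terms.docFreq()
          + " totalFreq=" + total);
    }
    terms.close();
    reader.close();
  }
}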


I do not need to record which documents have them nor do I need to know where in the documents they appear. There could be (and probably will be) more than 2^32 terms, however.

2^32 unique terms or 2^32 total terms?

I'm not sure if that makes a difference to the Lucene backend, but thought it might be relevant. This question coincides with my earlier question about counting the times a given term is associated with another term. I figure that this would be more easily accomplished by making the mini-index described above alongside the normal index when a document is indexed. For example, when scanning:

Bravely bold Sir Robin, brought forth from Camelot. He was not afraid to die! Oh, brave Sir Robin!

In addition to the normal indexing function of Lucene, I would like to write something on the backend to also index:

bravely|bold
bravely|sir
bravely|robin
bravely|brought
bravely|forth
bold|sir
bold|robin
bold|brought
bold|forth
bold|camelot  ("from" being a stop word)
...and so on

I only need to keep a running total of each "bravely|bold" term, however, since the number of terms will be quite large and keeping track of the document/term positions would translate to a lot of wasted HD space.

For this, I think you will have to hook into the Analyzer process. The other thing to do is just try keeping the document/term positions; it may not actually be as bad as you think in terms of space.
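If you do go the pair-term route, one simple shape for it (an untested sketch, not a true TokenFilter; the window size and field names are just placeholders) is to run your Analyzer over the text yourself, emit "a|b" pairs for tokens within a small window, and add each pair as a NOT_ANALYZED field value so each pair indexes as exactly one term:

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class PairIndexer {
  private static final int WINDOW = 5; // placeholder: max distance between paired tokens

  public static Document buildPairDoc(Analyzer analyzer, String text) throws IOException {
    // Reuse whatever Analyzer the normal index uses, so stop words etc. match.
    List<String> terms = new ArrayList<String>();
    TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
    for (Token tok = ts.next(); tok != null; tok = ts.next()) {
      terms.add(tok.termText());
    }
    ts.close();

    // Emit "a|b" for every token b within WINDOW positions after a.
    Document doc = new Document();
    for (int i = 0; i < terms.size(); i++) {
      for (int j = i + 1; j < terms.size() && j <= i + WINDOW; j++) {
        // NOT_ANALYZED indexes the pair verbatim as one term; Store.NO keeps it small.
        doc.add(new Field("pairs", terms.get(i) + "|" + terms.get(j),
            Field.Store.NO, Field.Index.NOT_ANALYZED));
      }
    }
    return doc;
  }
}

The term dictionary of the "pairs" field then gives you the running totals via the same TermEnum/TermDocs walk as above.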


If such a thing is not already in place, could someone let me know if there are some tutorials, documentation, or presentations that describe the inner workings of Lucene and the theories/implementation at work for the actual file formats, structures, data manipulations, etc? (The javadocs don't go into this kind of detail.) I'm sure I can sift through the code and eventually make sense of it, but if there is documentation out there, I'd prefer to peruse that first. My thought being that I can simply generate my own kind of hash for each combined term and write it out to a custom file structure similar to Lucene - but the specifics of how to (optimally) do so are not plain to me yet.
Thanks again!
-Sven


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ