Re: One (large) field shared by many documents

Peter Bloem Sat, 19 May 2007 15:17:26 -0700

I'm sorry, I should have explained the intended behavior more clearly.

The basic idea (without the collection fields) is that there are very simple documents in the index with one content field each. All I do with this index is a standard search on this text field. To improve the search results, I also want to add the concatenation of all documents in a collection as a field to every single document. I then search the index using both fields, while diminishing the effect of the collection field. This should improve the search results.

As an example, say I have the documents a:"look a cat" b:"my chimpanzee is hairy" c:"dogs are playful" and many others. These three documents are grouped into one collection (of many). The term vectors for the documents would then be

a: {look, a, cat}
b: {my, chimpanzee, is, hairy}
c: {dogs, are, playful}

If I create a term vector for the whole collection: {look, a, cat, my, chimpanzee, is, hairy, dogs, are, playful} and add it to each of the documents as a separate field, the query "my hairy cat" scores well against document a because of the match on cat, but also because of the match on both cat and hairy in the collection field. Documents about the Linux command 'cat' do not have the word "hairy" in their collection field (because they're part of a different collection), and so would not get this benefit. It's essentially a smoothing technique, since it allows query words that aren't in the document to still have some effect.
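As a rough illustration of the effect described above, here is a minimal sketch in plain Java (no Lucene involved). It scores a query by counting term matches in the document's own text and adding a down-weighted count of matches in the collection text. The class name, the method names, the weight of 0.25, and the text of the second collection are all assumptions made up for this sketch; real Lucene scoring uses tf-idf style weights and field boosts rather than a raw overlap count.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy model of collection smoothing; not Lucene's actual scoring.
class CollectionSmoothing {

    // Split text into a set of lower-case terms (a crude stand-in for an analyzer).
    static Set<String> terms(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().trim().split("\\s+")));
    }

    // Count how many query terms occur in the given field.
    static int overlap(Set<String> query, Set<String> field) {
        int matches = 0;
        for (String term : query) {
            if (field.contains(term)) {
                matches++;
            }
        }
        return matches;
    }

    // Matches on the document field count fully; matches on the
    // collection field are scaled down by collectionWeight.
    static double score(String query, String docText,
                        String collectionText, double collectionWeight) {
        Set<String> q = terms(query);
        return overlap(q, terms(docText))
                + collectionWeight * overlap(q, terms(collectionText));
    }

    public static void main(String[] args) {
        String animals = "look a cat my chimpanzee is hairy dogs are playful";
        // Hypothetical collection about shell commands.
        String commands = "cat prints a file ls lists a directory";

        // Document a: one match on its own text, three on its collection.
        System.out.println(score("my hairy cat", "look a cat", animals, 0.25));
        // prints 1.75

        // A document about the 'cat' command: only "cat" matches anywhere.
        System.out.println(score("my hairy cat", "cat prints a file", commands, 0.25));
        // prints 1.25
    }
}
```

Both documents match "cat" in their own text, but only document a picks up the extra credit from its collection, which is the intended smoothing effect.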

The problem, of course, is that storing these collection term vectors for each document greatly increases the size of the index and the indexing time. It would be a lot faster if I could somehow use a second index to store the collections as documents, so I would only have to store one term vector per collection. (This isn't my own idea btw; I'm trying to replicate the results from some other research that used this method.)
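To make the storage trade-off concrete, the following sketch keeps one term set per collection and lets each document point at its collection by id, instead of copying the collection terms into every document. It uses plain in-memory maps as a stand-in for the two indexes; it is not Lucene code, and the class name, method names, and ids are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// In-memory stand-in for "main index + collection index"; not Lucene code.
class SharedCollectionField {
    // "Main index": the terms of each document's own content field.
    private final Map<String, Set<String>> docTerms = new HashMap<>();
    // Many-to-one link from a document to its collection.
    private final Map<String, String> docToCollection = new HashMap<>();
    // "Second index": one term set per collection, stored once.
    private final Map<String, Set<String>> collectionTerms = new HashMap<>();

    void addDocument(String docId, String collectionId, String text) {
        Set<String> terms = new HashSet<>(Arrays.asList(text.toLowerCase().split("\\s+")));
        docTerms.put(docId, terms);
        docToCollection.put(docId, collectionId);
        collectionTerms.computeIfAbsent(collectionId, id -> new HashSet<>()).addAll(terms);
    }

    // Look up the shared collection terms for a search hit.
    Set<String> collectionTermsFor(String docId) {
        return collectionTerms.get(docToCollection.get(docId));
    }

    // Terms stored when each collection's term set is kept only once.
    int sharedStorage() {
        int total = 0;
        for (Set<String> terms : docTerms.values()) total += terms.size();
        for (Set<String> terms : collectionTerms.values()) total += terms.size();
        return total;
    }

    // Terms stored if every document carried its own copy of the collection field.
    int duplicatedStorage() {
        int total = 0;
        for (String docId : docTerms.keySet()) {
            total += docTerms.get(docId).size() + collectionTermsFor(docId).size();
        }
        return total;
    }

    public static void main(String[] args) {
        SharedCollectionField index = new SharedCollectionField();
        index.addDocument("a", "animals", "look a cat");
        index.addDocument("b", "animals", "my chimpanzee is hairy");
        index.addDocument("c", "animals", "dogs are playful");
        System.out.println(index.sharedStorage());      // prints 20
        System.out.println(index.duplicatedStorage());  // prints 40
    }
}
```

With only three documents the duplicated layout already stores twice as many terms; the gap grows with the number of documents per collection. The cost is an extra lookup per search hit to fetch the collection terms.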


I hope this is more clear,
Peter

Erick Erickson wrote:

This seems kind of kludgy, but that may just mean I don't understand
your problem very well.

What is it that you're trying to accomplish? Searching constrained
by topic or groups?

If you're trying to search by groups, search the archive for the
word "facet" or "faceted search".

Otherwise, could you describe what behavior you're after and maybe
there'd be more ideas....

Best
Erick

On 5/19/07, Peter Bloem <[EMAIL PROTECTED]> wrote:


Hi,

I have the following problem. I'm indexing documents that belong to some
collection (i.e. the dataset is divided into collections, which are
divided into documents). These documents become my lucene documents,
with some relatively small string that becomes the field I want to
search. However, I would also like to add to document d the
concatenation of all documents in d's collection as a field (mainly as a
smoothing technique, because documents correspond roughly to topics).
I'm currently doing just that, adding an extra field for the entire
concatenated collection to each document in that collection. Of course
this increases the index size and indexing time greatly (about five-fold).


There must be a better way to do this. My idea was to create a second
index where the collections are indexed as (lucene) documents. This
index would have the text as a field, and a list of document id's
referring back to the main index. I could then retrieve the term vector
for each collection from this second index for each search result from
the original index.

My question is if this is a smart approach. And if it is, which of
Lucene's classes should I use for this. The best I could find was the
FilterIndexReader. If extending the FilterIndexReader is really the best
way to go, could I simply override the document(int, FieldSelector)
method, or is there more to it? I doubt I'm the first person that's ever
wanted a many to one relation between fields and documents, so I hope
there's a simpler way about this.

Thank you,
Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


