On Jun 5, 2005, at 1:11 AM, Phillip Rhodes wrote:
I understand that "Documents are the primary retrievable units from a Lucene query". But I don't know whether I want 12 documents in the Lucene index that represent the same business object, or whether I should place 12 different business documents within the Lucene index.

Deciding how to slice a domain into Documents is one of the most important decisions you make when using Lucene, and not one that Lucene itself answers. There are precedents that have been set and advice that users here can give, but ultimately how to represent your domain in Lucene is up to you.

Here is the background:
I want to index a product catalog (some data is in a database and some is on the filesystem, with cross-references between the two). Each product is associated with attributes, categories and one or more PDF/MS Word documents, HTML descriptions, images, etc.
A product could have 12 different files associated with it.

Is it okay if I create as many documents as there are assets that I want to return from a search, and add information to each document tying it back to the product that it is associated with? Is that the right approach?

Do users of your search system need to know about the PDF/Word/HTML documents? Or should they simply know about "products"? If all you need back is the product, then the simplest approach would be to create one Lucene Document per product, parse all the files and data associated with it, and add the result as text to fields. If the search system is simple enough that fielded search is not needed, create just two fields per Document: "id" and "text". Field "id" is the product id, and "text" is an aggregation of all the text associated with the product, regardless of where it came from. (If you build that text by string concatenation, be careful to put whitespace between the pieces so that the last word of one piece does not run into the first word of the next.)
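As a rough sketch of the aggregation step only: the class and method names below (ProductText, aggregate, toFields) are made up for illustration, and a plain Map stands in for the Lucene Document, since the exact Field calls depend on the Lucene version you are running. It assumes the text has already been extracted from the database rows and files.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: builds the values for the
// two fields ("id" and "text") of a one-Document-per-product index.
public class ProductText {

    // Joins the text extracted from each source with a single space so that
    // the last word of one piece and the first word of the next stay separate.
    // Null and blank pieces (e.g. an image with no extractable text) are skipped.
    static String aggregate(List<String> pieces) {
        List<String> kept = new ArrayList<>();
        for (String piece : pieces) {
            if (piece == null) {
                continue;
            }
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                kept.add(trimmed);
            }
        }
        return String.join(" ", kept);
    }

    // Returns the field name -> value pairs for one product. In a real indexer
    // these two values would be added to a Lucene Document instead of a Map.
    static Map<String, String> toFields(String productId, List<String> pieces) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", productId);
        fields.put("text", aggregate(pieces));
        return fields;
    }
}
```

You would call toFields once per product, passing the product id and the list of text pieces gathered for it, then copy the two values into the Document you hand to the index writer. The "id" value would normally be stored and indexed as a single untokenized term, and "text" would be analyzed.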

There are many other ways to approach this and my recommendation is just the simplest one based on the description of your needs.

    Erik

