On Jun 5, 2005, at 1:11 AM, Phillip Rhodes wrote:
I understand that "Documents are the primary retrievable units
from a Lucene query", but I don't know if I want to have 12
documents in the Lucene index that represent the same business
object, or if I should place 12 different business documents within
the Lucene index.
Deciding how to slice a domain into Documents is one of the most
important decisions to make with Lucene usage, and not one that
Lucene itself gives an answer to. There are precedents that have
been set and advice that users here can give, but ultimately how to
represent your domain in Lucene is up to you.
Here is the background:
I want to index a product catalog (some data in database and some
data on the filesystem; I have a cross-reference between the two).
Each product is associated with attributes, categories and one or
more PDF/MS Word documents, HTML descriptions, images, etc...
A product could have 12 different files associated with it.
Is it okay if I create as many documents as assets that I want to
return from a search and add information to each document tying it
back to the product that it is associated with? Is that the right
approach?
Do users of your search system need to know about the PDF/Word/HTML
documents? Or should they simply know about "products"? If all you
need back is the product, then the simplest approach would be to
create one Lucene Document per product, parse all the files and data
associated with it and add it as text to fields. If the search
system is simple in that fielded search is not needed, simply create
two fields per Document: id and text. Field "id" is the product id,
and "text" is an aggregation of all the text associated with the
product regardless of where it came from (if you build it by string
concatenation, put whitespace between the pieces so that words from
adjacent sources don't blur together).
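
As a rough illustration of the one-Document-per-product idea, here is a
minimal sketch in plain Java that builds the two field values, "id" and
"text", for a single product. It deliberately does not call Lucene; the
class name ProductDocumentSketch, the helper names aggregateText and
toFields, and the sample product data are all made up for this example.
The resulting values would then be added to a Lucene Document as two
fields using whatever Field API your Lucene version provides.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: prepares the "id" and "text"
// values for one product before they are handed to the indexer.
class ProductDocumentSketch {

    // Joins text fragments with a single space so that the last word of
    // one source and the first word of the next do not run together.
    static String aggregateText(List<String> fragments) {
        StringBuilder sb = new StringBuilder();
        for (String fragment : fragments) {
            if (fragment == null) {
                continue; // a source that yielded no text
            }
            String trimmed = fragment.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(trimmed);
        }
        return sb.toString();
    }

    // Builds the two field values for one product: its id and the
    // aggregated text from every associated source.
    static Map<String, String> toFields(String productId, List<String> fragments) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", productId);
        fields.put("text", aggregateText(fragments));
        return fields;
    }

    public static void main(String[] args) {
        // Sample data only: text assumed to be already extracted from
        // the database, a PDF and an HTML description.
        List<String> fragments = Arrays.asList(
                "Red widget",
                "Installation manual",
                "Fits all standard mounts");
        System.out.println(toFields("P-1001", fragments));
    }
}
```

Extracting the text from PDF, Word and HTML files is a separate step
that this sketch assumes has already happened.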
There are many other ways to approach this and my recommendation is
just the simplest one based on the description of your needs.
Erik