I'm not sure I understand what your field arrangement would be when you say
"[T]he items I'm pulling in from the web contain large bodies of text 
(descriptions) whereas the products in my catalog consist of shorter fields 
such as product name, manufacturer, product code, etc. So using the smaller 
fields from my catalog to build queries against the larger fields in the items 
I pull in seems to be the only way to do things (that I can think of)."

I would want to take a vanilla crawl, parse, index approach: (1) find a 
candidate document, (2) parse the web document as best I could to locate all 
the fields of your existing documents ("product name, manufacturer, product 
code, etc.").  But instead of creating a new document, I would form a very 
general query against your document set.
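
To make the "very general query" idea concrete, here is a minimal sketch in 
plain Java. It assumes the parser hands back a map of field name to extracted 
value; the class name, the helper names and the field names (productName, 
manufacturer, productCode) are all made up for illustration and are not part 
of Lucene. The output is a string in Lucene's classic query syntax that you 
could then pass to a query parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Lucene: turns the fields parsed from one
// web page into a single broad query string (classic Lucene query syntax).
final class GeneralQuerySketch {

    // Wrap a value in quotes, escaping backslashes and embedded quotes.
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // OR together one phrase clause per field the parser managed to locate.
    static String buildQuery(Map<String, String> parsedFields) {
        StringJoiner query = new StringJoiner(" OR ");
        for (Map.Entry<String, String> field : parsedFields.entrySet()) {
            String value = field.getValue();
            if (value == null || value.trim().isEmpty()) {
                continue; // the parser could not locate this field
            }
            query.add(field.getKey() + ":" + quote(value.trim()));
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // Field names and values below are assumptions for the example.
        Map<String, String> parsed = new LinkedHashMap<>();
        parsed.put("productName", "Acme Widget 3000");
        parsed.put("manufacturer", "Acme");
        parsed.put("productCode", ""); // not found on the page
        System.out.println(buildQuery(parsed));
        // prints: productName:"Acme Widget 3000" OR manufacturer:"Acme"
    }
}
```

Joining the clauses with OR keeps the query general: a page where only one 
field was parsed correctly can still hit the right catalog entry, and the 
score should favour entries that match on several fields.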

That sounds good, but if the web documents are tricky to parse, I could see why 
you might want to index each web document as a single "text body" field and 
search it for any of your existing fields. You'd get good throughput searching 
for as many of your documents in as many of the web documents as possible, but 
of course, you'd NOT 
want to wait until you've crawled Amazon before checking for any matches.  This 
leads me to think about a multi-phase approach where a crawler creates 
"useful"-size indices, then closes each index, hands it off to the 
query-for-my-products phase, and starts another one.  Note how this approach 
doesn't require your products to be in a Lucene index, just the web documents.
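
The hand-off between the two phases could look roughly like the sketch below. 
It is plain Java with no Lucene calls: a List<String> of page texts stands in 
for one closed index, and a case-insensitive substring test stands in for the 
real search, so treat both as placeholders. The class name, method names and 
the batch size are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;

// Hypothetical sketch: each batch (a list of page texts) stands in for one
// "useful"-size index of web documents.
final class BatchedCrawlSketch {
    private final int batchSize;
    private final Consumer<List<String>> matchPhase;
    private List<String> openBatch = new ArrayList<>();

    BatchedCrawlSketch(int batchSize, Consumer<List<String>> matchPhase) {
        this.batchSize = batchSize;
        this.matchPhase = matchPhase;
    }

    // Called by the crawler for every fetched page.
    void add(String pageText) {
        openBatch.add(pageText);
        if (openBatch.size() >= batchSize) {
            flush();
        }
    }

    // Close the current batch, start a fresh one, and hand off the closed one.
    void flush() {
        if (openBatch.isEmpty()) {
            return;
        }
        List<String> closed = openBatch;
        openBatch = new ArrayList<>();
        matchPhase.accept(closed);
    }

    // Query-for-my-products phase: which product terms occur in this batch?
    // Substring matching is only a placeholder for a real index search.
    static List<String> termsFound(List<String> batch, List<String> productTerms) {
        List<String> found = new ArrayList<>();
        for (String term : productTerms) {
            String needle = term.toLowerCase(Locale.ROOT);
            for (String page : batch) {
                if (page.toLowerCase(Locale.ROOT).contains(needle)) {
                    found.add(term);
                    break; // one hit per term is enough for this batch
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> products = Arrays.asList("Acme Widget 3000", "ZX-81");
        BatchedCrawlSketch crawler = new BatchedCrawlSketch(2,
                batch -> System.out.println(termsFound(batch, products)));
        crawler.add("Review of the Acme Widget 3000");
        crawler.add("Unrelated page");
        crawler.add("Used zx-81 for sale");
        crawler.flush(); // hand off the final, partly filled batch
    }
}
```

The products live in an ordinary list here, which mirrors the point above: 
only the web documents need to be indexed. With the real thing, the batch 
size would be whatever gives you matches soon enough without paying the cost 
of opening and closing an index too often.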

That sounds like a fun and interesting problem.  Good luck.

-Paul
