Best design for an use case which is going to stress Lucene

Terenzio Treccani Wed, 15 Mar 2006 09:42:39 -0800

Hi all,

I'm required to develop an application for searching over news items.
There will be thousands of news items, each one will be assigned
directly to a list of millions of customerIDs. The query will be done
by passing a customerID and will return all news items associated to
it. Furthermore, a news item will be added or deleted by itemID. No
queries on other fields (news metadata etc) will be performed.
So, the result for a query similar to "customerID:0000001" will return
hits containing news items.
The index structure is very simple, but the number of news items and
customers will be HUGE. I see three possible ways of designing the
index, which I describe in the following. Which one would you choose?
If you have any advice, see any pro/cons etc any suggestion would be
appreciated.


Thanks a lot
Terenzio

a) One document per news item.
Each document will have the following fields:

- CustomerID (indexed, not stored) : a list of space-separated ids
like : "0000001 0000002 0000003"
- Title (not indexed, stored) : a text
- Content (not indexed, stored) : a potentially long (a few kilobites) text
- Meta 1 (not indexed, stored) : a meta tag
- Meta 2....

b) One documet per customer id.
Each document will have the following fields:

- CustomerID (indexed, not stored) : a single ID like: "0000001"
- NewsID (not indexed, stored) : a list of space-separated file names
like : "news01.xml news02.xml news03.xml"

The file names will then contain the news data in XML format, which
can be quickly read and cached. Maybe the quickest solution for
queries, but I see some problems in adding and deleting news by
itemID.

c) In a RDBMS.
Given the structured nature of this index, it could be implemented in
a normal table on a (Oracle) database.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Best design for an use case which is going to stress Lucene

Reply via email to