On 23 Mar 2007, at 02:12, Melanie Langlois wrote:

I want to manage user subscriptions to specific documents. I would like to store each subscription (a query) in a Lucene directory and, whenever I receive a new document, search for all matching subscriptions so that I can send the document to every subscriber. For instance, if a user subscribes to all documents whose text contains (WORD1 and WORD2) or WORD3, how can I match an incoming document against the stored subscriptions? I was thinking of having two subfields for each field of the subscription: the AND conditions and the OR conditions.

-OR. I will tokenize the document field content, insert OR between the tokens, and run the resulting query against the OR conditions of the subscriptions.

-AND. This is where I have an issue, because the incoming text may contain more words than the sequence I want to search for.

For instance, suppose I subscribe to documents whose content contains lucene and java. If the incoming document's content is "lucene is a great API which has been developed in java", then once I have removed stopwords my query would look like: lucene and great and API and developed and java.

Because the query is composed of more words than the stored subscription, I will fail to retrieve the subscription. But if I put only OR between the words, the results will not be accurate, since a subscription could be returned on the strength of java alone.


I wrote such a thing way back, where I used the new document as the query and the user subscriptions as the index. Similar to what you describe, I had AND, OR and NOT fields. This really limited the type of queries users could store. It does, however, work, particularly well on systems with /huge/ numbers of subscriptions (many millions).
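
To make the AND/OR/NOT layout concrete, here is a minimal plain-Java sketch. It is not the original code: the class and method names are hypothetical, it scans a list instead of querying a Lucene index, and it assumes that a subscription matches when all of its AND terms are present, none of its NOT terms are present, and at least one OR term is present (if it has any).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: every name here is illustrative only.
class FieldSubscriptions {

    // One stored subscription, split into three term groups.
    static final class Subscription {
        final String id;
        final Set<String> and, or, not;

        Subscription(String id, Set<String> and, Set<String> or, Set<String> not) {
            this.id = id;
            this.and = and;
            this.or = or;
            this.not = not;
        }

        // Assumed semantics: all AND terms, no NOT term, any OR term (if given).
        boolean matches(Set<String> docTerms) {
            if (!docTerms.containsAll(and)) return false;
            for (String t : not) if (docTerms.contains(t)) return false;
            if (or.isEmpty()) return true;
            for (String t : or) if (docTerms.contains(t)) return true;
            return false;
        }
    }

    static Set<String> set(String... terms) {
        return new HashSet<>(Arrays.asList(terms));
    }

    // Crude tokenizer: lowercase, split on non-word characters.
    static Set<String> terms(String text) {
        return set(text.toLowerCase().split("\\W+"));
    }

    // Return the ids of all subscriptions that the document satisfies.
    static List<String> matching(List<Subscription> subs, String document) {
        Set<String> docTerms = terms(document);
        List<String> hits = new ArrayList<>();
        for (Subscription s : subs) if (s.matches(docTerms)) hits.add(s.id);
        return hits;
    }

    public static void main(String[] args) {
        List<Subscription> subs = Arrays.asList(
            new Subscription("lucene-and-java", set("lucene", "java"), set(), set()),
            new Subscription("java-not-lucene", set("java"), set(), set("lucene")));
        System.out.println(matching(subs,
            "lucene is a great API which has been developed in java"));
    }
}
```

Because the test is "are the subscription's AND terms a subset of the document's terms", extra words in the document do no harm. The limitation is also visible: a nested query such as (WORD1 and WORD2) or WORD3 does not fit into three flat term groups.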

Today I would use something else. If you insert one document at a time into your index, take a look at MemoryIndex in contrib. If you insert documents in batches of more than one, take a look at LUCENE-550 in Jira. Add the new documents to such an index and run the subscribed queries against it. Depending on the queries, this should be some 20-100 times faster than using a RAMDirectory. One million queries should take some 20 seconds to assemble and run against a 25-document index on my laptop. See <https://issues.apache.org/jira/secure/attachment/12353601/12353601_HitCollectionBench.jpg> for the performance of LUCENE-550.
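
The shape of that approach, one small in-memory document with every stored query evaluated against it, can be sketched without Lucene on the classpath. The following is only a plain-Java stand-in, not the MemoryIndex or LUCENE-550 API: queries are modelled as java.util.function.Predicate objects over the document's term set, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical stand-in for "index the one new document, then run all queries on it".
class PerDocumentMatching {

    // A single-term query: true if the document contains the term.
    static Predicate<Set<String>> term(String t) {
        return doc -> doc.contains(t);
    }

    // Evaluate every stored query against the one document's terms.
    static List<String> matching(Map<String, Predicate<Set<String>>> queries,
                                 Set<String> docTerms) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Predicate<Set<String>>> e : queries.entrySet()) {
            if (e.getValue().test(docTerms)) hits.add(e.getKey());
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, Predicate<Set<String>>> queries = new LinkedHashMap<>();
        // (word1 AND word2) OR word3, the example from the question.
        queries.put("alice", term("word1").and(term("word2")).or(term("word3")));
        // lucene AND java
        queries.put("bob", term("lucene").and(term("java")));

        Set<String> doc = new HashSet<>(Arrays.asList(
            "lucene is a great api which has been developed in java".split(" ")));

        System.out.println(matching(queries, doc)); // prints [bob]
    }
}
```

Since each query is evaluated as a whole against the document, arbitrary nesting works and nothing has to be flattened into separate AND, OR and NOT fields. With Lucene itself, the stored queries would be ordinary Query objects and the document would go into the in-memory index instead of a term set.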

--
karl
