Re: Reverse search

karl wettin Fri, 23 Mar 2007 04:55:06 -0800


23 mar 2007 kl. 09.57 skrev Melanie Langlois:

Well, I though to use the PerFieldAnalyzerWrapper which contains asbasic the snowballAnalyzer with English stopwords and usesnowballAnalyzer with language specific keywords for the fieldswhich will be in different languages. But I'm seeing that in yourMemoryIndexTest you commented the use of SnowballAnalyzer, is itbecause it's too slow. In this case, I think I could use theStandardAnalyzer... what do you think?

I think that creating an index with a couple of documents takes afraction of the time it will take to place a million queries on thatindex. There is no real need to optimize something that takesmilliseconds when you in the same process do something that takeshalf a minute.


--
karl


Mélanie

-----Original Message-----
From: karl wettin [mailto:[EMAIL PROTECTED]
Sent: Friday, March 23, 2007 12:46 PM
To: [email protected]
Subject: Re: Reverse search


23 mar 2007 kl. 03.07 skrev Melanie Langlois:

Thanks Karl, the performances graph is really amazing!
I have to say that it would not have think this way around would be
faster, but sounds nice if I can use this, make everything easier
to manage. I'm just wondering what did you consider when you build
your graph, only the time to run the queries? Because, I should add
the time for creating the index anytime a new document comes in (or
a subset of documents if several comes in same time), and the
indexing of these documents. The documents should not be big,
around 2KB. Did you measure this part ?


Adding a document to a MemoryIndex or InstantiatedIndex takes more or
less the same time it would take to add it to an empty RAMDirectory.

How many clock ticks is spent really depends on what analysers youuse.


--
karl


Mélanie

-----Original Message-----
From: karl wettin [mailto:[EMAIL PROTECTED]
Sent: Friday, March 23, 2007 10:35 AM
To: [email protected]
Subject: Re: Reverse search


23 mar 2007 kl. 02.12 skrev Melanie Langlois:

I want to manage user subscriptions to specific documents. So I
would like to store the subscription (query) into the lucene
directory, and whenever I receive a new document, I will search all
the matching subscriptions to send the documents to all subcribers.
For instance if a user subscribes to all documents with text
containing (WORD1 and WORD2) or WORD3, how can I match the incoming
document based on stored subscriptions? I was thinking to have two
subfields for each field of the subscription: the AND conditions
and the OR conditions.

-OR. I will tokenized the document field content and insert OR
between each of them, and run the query against OR condition of
subscription

-It's for the AND that I will have an issue, because if the
incoming text may contains more words than the sequence I want to
search.

For instance, if I subscribe for documents contents lucene and java
for instance , if the incoming document contents is lucene is a
great API which has been developed in java, once I removed
stopwords my query would look like lucene and great and API and
developed and java.

As query is composed of more words than the stored subscription I
will fail to retrieve the subscription. But if I put only or words,
the results will not be accurate, as I can obtain subscription only
for java for instance.


I wrote such a thing way back, where I used the new document as the
query and the user subscriptions as the index. Similar to what you
describe, I had an AND, OR and NOT field. This really limited the
type of queries users could store. It does however work, particullary
well on systems with /huge/ amounts of subscriptions (many millions).

Today I would have used something else. If you insert one document at
the time to your index, take a look at MemoryIndex in contrib. If you
insert documents in batches larger than one document at the time,
take a look at LUCENE-550 in the Jira. Add new documents to such an
index and place the subscribed queries on it. Depening on the
queries, the speed should be some 20-100 times faster than using a
RAMDirectory. One million queries should take some 20 seconds to
assemble and place on a 25 document index on my laptop. See <https://
issues.apache.org/jira/secure/attachment/
12353601/12353601_HitCollectionBench.jpg> for performace of
LUCENE-550.

--
karl

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Reverse search

Reply via email to