Re: Reverse search

markharw00d Sun, 25 Mar 2007 07:36:26 -0800


On app startup:
1) parse all Queries and place in an array.

2) Create a RAMIndex containing a doc for each query with content consisting of the query's terms (see Query.extractTerms). For optimal performance only index the most rare term for queries with multiple mandatory criteria e.g. PhraseQuerys. "Most rare" can be determined by looking at IndexReader.docFreq(t) using an existing index which is representative of your type of content.

3) For any queries that can't be handled by 2) e.g. FuzzyQueries - add to list of "run always queries".
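The startup steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses plain Java maps as a stand-in for the query RAMIndex, and the class QueryIndexSketch, its methods, and the docFreq map (standing in for IndexReader.docFreq on a representative index) are hypothetical names, not Lucene API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the query index: plain maps, no Lucene classes.
class QueryIndexSketch {
    // term -> ids of the queries filed under it (id = position in the parsed queries array)
    final Map<String, List<Integer>> postings = new HashMap<>();
    // ids of queries that cannot be keyed by a term, e.g. fuzzy queries
    final List<Integer> runAlways = new ArrayList<>();

    // Returns the term with the lowest document frequency (first one wins on ties).
    // Terms missing from docFreq are treated as frequency 0, i.e. as the rarest.
    static String rarestTerm(List<String> terms, Map<String, Integer> docFreq) {
        String best = null;
        int bestFreq = Integer.MAX_VALUE;
        for (String term : terms) {
            int freq = docFreq.getOrDefault(term, 0);
            if (best == null || freq < bestFreq) {
                best = term;
                bestFreq = freq;
            }
        }
        return best;
    }

    // All terms are mandatory (e.g. a phrase): file the query under its rarest term only.
    void addMandatoryQuery(int queryId, List<String> terms, Map<String, Integer> docFreq) {
        if (terms.isEmpty()) {
            runAlways.add(queryId); // nothing to key on, so always run it
            return;
        }
        file(rarestTerm(terms, docFreq), queryId);
    }

    // Any single term can produce a match: file the query under every term.
    void addOptionalQuery(int queryId, List<String> terms) {
        if (terms.isEmpty()) {
            runAlways.add(queryId);
            return;
        }
        for (String term : terms) {
            file(term, queryId);
        }
    }

    // Queries that cannot be handled by term lookup go on the "run always" list.
    void addRunAlways(int queryId) {
        runAlways.add(queryId);
    }

    private void file(String term, int queryId) {
        postings.computeIfAbsent(term, k -> new ArrayList<>()).add(queryId);
    }
}
```

With a docFreq of, say, 3 for "xyz" and 5000 for "limited", a phrase query for "xyz limited" would be filed under "xyz" only, so documents that merely contain "limited" never trigger it.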


Whenever you receive a new document:
1) Put it in a MemoryIndex

2) Get a list of the document's terms by calling memoryIndex.getReader().terms();

3) For each term hit your query RAMIndex and get queryIndexReader.termDocs(term) - this will give you the ids of queries that need to be run - you can use the doc id to index straight into your parsed queries array.

4) Run all queries found in 3) and all those held in your "run always" list against the MemoryIndex containing your new document.
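A rough sketch of steps 3) and 4), again with plain Java collections rather than Lucene classes. Everything here is an assumption made for illustration: ReverseSearchSketch and its methods are hypothetical, the document is reduced to its set of unique terms, and each parsed query is modelled as a Predicate over that set (which ignores term positions, so it only approximates a real phrase query).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.function.Predicate;

class ReverseSearchSketch {
    // Step 3: use the document's terms to "query the queries".
    // queryPostings maps a term to the ids of the queries filed under it.
    static SortedSet<Integer> shortlist(Set<String> docTerms,
                                        Map<String, List<Integer>> queryPostings,
                                        List<Integer> runAlways) {
        // Start from the "run always" list, then add every query keyed by a document term.
        SortedSet<Integer> ids = new TreeSet<>(runAlways);
        for (String term : docTerms) {
            List<Integer> hits = queryPostings.get(term);
            if (hits != null) {
                ids.addAll(hits);
            }
        }
        return ids;
    }

    // Step 4: run only the shortlisted queries against the document.
    // The query id indexes straight into the parsed queries list.
    static List<Integer> matches(Set<String> docTerms,
                                 List<Predicate<Set<String>>> parsedQueries,
                                 SortedSet<Integer> shortlist) {
        List<Integer> matched = new ArrayList<>();
        for (int id : shortlist) {
            if (parsedQueries.get(id).test(docTerms)) {
                matched.add(id);
            }
        }
        return matched;
    }
}
```

The point of the shortlist is that the cost per incoming document depends on the number of unique terms in the document plus the size of the "run always" list, not on the total number of stored queries.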


Hope this helps,
Mark


Melanie Langlois wrote:

Hi Mark,
If I follow you, I should list the key terms in my incoming document, then
select the queries which contain these key terms, and then run those queries
on my index? If this is correct there are two things I don't understand:
- how do I know which term is a key term in my document?
- how can I select the queries? Should I index them in a separate index?

Thanks,
Mélanie Langlois

-----Original Message-----
From: mark harwood [mailto:[EMAIL PROTECTED]
Sent: Friday, March 23, 2007 11:19 PM
To: java-user@lucene.apache.org
Subject: Re: Reverse search

Bear in mind that the million queries you run on the MemoryIndex can be shortlisted if 
you place those queries in a RAMIndex and use the source document's terms to "query 
the queries". The list of unique terms for your document is readily available in the 
MemoryIndex's TermEnum.
You can take this list and find "likely related queries" to execute from your 
Query index.
Note that for phrase queries or other forms of query with multiple mandatory terms you should only index one of the terms (preferably the rarest) to ensure that your query is not needlessly executed. For example - using this approach I need only run the phrase query for "XYZ limited" whenever I encounter a document with the rare term "XYZ" in it, rather than the much more commonplace "limited".
Cheers
Mark

----- Original Message ----
From: karl wettin <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Friday, 23 March, 2007 12:54:36 PM
Subject: Re: Reverse search


On 23 Mar 2007, at 09:57, Melanie Langlois wrote:
Well, I thought to use the PerFieldAnalyzerWrapper which contains as basic the snowballAnalyzer with English stopwords and use snowballAnalyzer with language specific keywords for the fields which will be in different languages. But I'm seeing that in your MemoryIndexTest you commented out the use of SnowballAnalyzer, is it because it's too slow? In this case, I think I could use the StandardAnalyzer... what do you think?

I think that creating an index with a couple of documents takes a fraction of the time it will take to place a million queries on that index. There is no real need to optimize something that takes milliseconds when you in the same process do something that takes half a minute.



