Hi Ryan,

Well, at 100 million+ keywords, Lucene might be the right tool.

One thing you might check out for the query side is Karl Wettin's recently 
committed ShingleMatrixFilter (not in any Lucene release yet - only on the 
trunk).

The JUnit test class TestShingleMatrixFilter has an example of splitting an 
input string into "shingles" (a.k.a. "token n-grams") - in this example the 
input string "please divide this sentence into shingles" is converted into the 
following terms by requesting a minimum shingle size of one token and a maximum 
of two tokens, and using the space character to join the tokens together:

"please", "please divide", "divide", "divide this", "this", "this sentence", 
"sentence",  "sentence into", "into", "into shingles", "shingles"

You could index your keywords list as-is, with no tokenization; break up your 
queries using a WhitespaceTokenizer connected to a ShingleMatrixFilter, with 
the minimum shingle size set to one and the maximum set to the number of tokens 
in the keyword with the most tokens; and then build a BooleanQuery with one 
clause per shingle, each set to BooleanClause.Occur.SHOULD.
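Lucene's query-building details aside (each shingle just becomes one SHOULD 
clause), the matching this buys you reduces to intersecting the query's 
shingles with the keyword set. A toy sketch with plain collections (names are 
mine; a real index replaces the HashSet), assuming the longest keyword is two 
tokens:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class KeywordMatch {

    // Shingle the query (min size 1, max size maxShingle) and return
    // every shingle that exactly matches an indexed keyword - the same
    // keywords a BooleanQuery of SHOULD clauses would hit.
    static List<String> match(Set<String> keywords, String query,
                              int maxShingle) {
        String[] tokens = query.split("\\s+"); // WhitespaceTokenizer stand-in
        List<String> matches = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            for (int size = 1;
                 size <= maxShingle && start + size <= tokens.length;
                 size++) {
                String shingle = String.join(" ",
                    Arrays.copyOfRange(tokens, start, start + size));
                if (keywords.contains(shingle)) {
                    matches.add(shingle);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Stand-in for the keyword index, stored as-is (untokenized).
        Set<String> keywords = new HashSet<>(
            Arrays.asList("big boy", "red ball", "computer"));
        System.out.println(match(keywords, "the girl likes her red ball", 2));
        // [red ball]
    }
}
```

Note that this is exact matching: catching "red balls" with the keyword 
"red ball" would additionally need a stemmer in the analysis chain on both the 
index and query sides.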

Steve

On 07/23/2008 at 4:05 PM, Ryan D wrote:
> Heh, actually I'm using Perl, but I've always associated text search with
> Lucene; I'm not sure if it's the best solution or not. On the small side
> there are 1.6 million keywords; on the large side there are well over
> 100 million, but I might find another way to break the searches down into
> smaller ones (send A-G to server1, H-R to server2, etc.).
> 
> Is there another search tool that might be better suited for this? The
> only thing I can relate this to is how AdWords works: a user enters a
> query in the Google search box, and Google searches its database for
> people who've purchased those keywords, to serve the appropriate ads.
> What I'm doing is similar, but without the payday. :-{
> 
> Currently I'm using a (huge) hash table and regular expressions
> ($query =~ /$keyword/), going down the list from largest to smallest,
> but I know this is not a long-term solution, especially if I have to
> load the large 100 million+ list in.
> 
> Thanks.
> 
> 
> On Jul 23, 2008, at 3:54 PM, Steven A Rowe wrote:
> 
> > Hi Ryan,
> > 
> > I'm not sure Lucene's the right tool for this job.
> > 
> > I have used regular expressions and ternary search trees in the past to
> > do similar things.
> > 
> > Is the set of keywords too large for an in-memory solution like these? 
> > If not, consider using a tool like the Perl package Regex::PreSuf
> > <http://search.cpan.org/dist/Regex-PreSuf/> - it can convert a list of
> > strings into a compact set of alternations, which you can then import
> > into a Java program.  (I'm not aware of any similar Java tools.)
> > 
> > Steve
> > 
> > On 07/23/2008 at 3:30 PM, Ryan Detzel wrote:
> > > Everything I've read and seen about Lucene is searching for keywords in
> > > documents; I want to do the reverse. I have a huge list of
> > > keywords ("big boy", "red ball", "computer"), and I have phrases that I
> > > want to check the keywords against. For example, using the small
> > > keyword list above (stored as documents in Lucene), what's the best
> > > approach to pass in a query "the girl likes red balls" and have it
> > > match the keyword "red ball"?
