Hi Ryan,

Well, at 100 million+ keywords, Lucene might be the right tool.
One thing that you might check out for the query side is Karl Wettin's recently committed ShingleMatrixFilter (not in any Lucene release yet - only on the trunk).

The JUnit test class TestShingleMatrixFilter has an example of splitting an input string into "shingles" (a.k.a. "token n-grams").  In this example, the input string "please divide this sentence into shingles" is converted into the following terms, by requesting a minimum shingle size of one token and a maximum of two tokens, and using the space character to join the tokens together:

   "please", "please divide", "divide", "divide this", "this",
   "this sentence", "sentence", "sentence into", "into",
   "into shingles", "shingles"

You could index your keywords list as-is, with no tokenization; break up your queries using a WhitespaceTokenizer connected to a ShingleMatrixFilter, with the minimum shingle size set to one and the maximum set to the number of tokens in the keyword with the most tokens; and then build a BooleanQuery with one clause per shingle, each set to BooleanClause.Occur.SHOULD.  (A rough sketch in code is at the bottom of this message.)

Steve

On 07/23/2008 at 4:05 PM, Ryan D wrote:
> Heh, actually I'm using Perl, but I've always associated text search with
> Lucene; I'm not sure if it's the best solution or not.  On the small side
> there are 1.6 million keywords; on the large side there are well over
> 100 million, but I might find another way to break the searches down into
> smaller searches (send A-G to server1, H-R to server2, etc.).
>
> Is there another search tool that might be better suited for this?  The
> only thing I can relate this to is how AdWords works.  A user enters a
> query in the Google search box, and Google searches its database of
> advertisers who've purchased those keywords to find the appropriate ads.
> What I'm doing is similar, but without the payday. :-{
>
> Currently I'm using a (huge) hash table and regular expressions
> ($query =~ /$keyword/), going down the list from largest to smallest,
> but I know this is not a long-term solution, especially if I have to
> load the large 100 million+ list in.
>
> Thanks.
>
>
> On Jul 23, 2008, at 3:54 PM, Steven A Rowe wrote:
>
> > Hi Ryan,
> >
> > I'm not sure Lucene's the right tool for this job.
> >
> > I have used regular expressions and ternary search trees in the past
> > to do similar things.
> >
> > Is the set of keywords too large for an in-memory solution like these?
> > If not, consider using a tool like the Perl package Regex::PreSuf
> > <http://search.cpan.org/dist/Regex-PreSuf/> - it can convert a list of
> > strings into a compact set of alternations, which you can then import
> > into a Java program.  (I'm not aware of any similar Java tools.)
> >
> > Steve
> >
> > On 07/23/2008 at 3:30 PM, Ryan Detzel wrote:
> > > Everything I've read and seen about Lucene is about searching for
> > > keywords in documents; I want to do the reverse.  I have a huge
> > > list of keywords ("big boy", "red ball", "computer"), and I have
> > > phrases that I want to check against those keywords.  For example,
> > > using the small keyword list above (stored as documents in Lucene),
> > > what's the best approach to pass in a query "the girl likes red
> > > balls" and have it match the keyword "red ball"?
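P.S. Here's a rough, self-contained sketch of the recipe above, written against the Lucene 2.x-era API.  Since ShingleMatrixFilter is trunk-only, the shingling below is hand-rolled (a plain whitespace split standing in for WhitespaceTokenizer + ShingleMatrixFilter); the class name, the "keyword" field name, and the maximum shingle size of two are illustrative choices, not anything prescribed by Lucene or the test class.

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class KeywordMatcher {

  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();

    // Index each keyword as a single untokenized term in a "keyword" field.
    IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
    String[] keywords = { "big boy", "red ball", "computer" };
    for (int i = 0; i < keywords.length; i++) {
      Document doc = new Document();
      doc.add(new Field("keyword", keywords[i],
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
      writer.addDocument(doc);
    }
    writer.close();

    // One SHOULD clause per shingle of the incoming query: any keyword
    // that exactly equals one of the shingles will match.  Max shingle
    // size is 2 here because the longest keyword above has two tokens.
    BooleanQuery query = new BooleanQuery();
    for (String shingle : shingles("the girl likes a red ball", 2)) {
      query.add(new TermQuery(new Term("keyword", shingle)),
                BooleanClause.Occur.SHOULD);
    }

    IndexSearcher searcher = new IndexSearcher(dir);
    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
      System.out.println("matched keyword: " + hits.doc(i).get("keyword"));
    }
    searcher.close();
  }

  // All token n-grams of size 1..maxSize, joined with single spaces;
  // a stand-in for WhitespaceTokenizer + ShingleMatrixFilter.
  static List<String> shingles(String text, int maxSize) {
    String[] tokens = text.split("\\s+");
    List<String> result = new ArrayList<String>();
    for (int start = 0; start < tokens.length; start++) {
      StringBuilder sb = new StringBuilder();
      for (int len = 1; len <= maxSize && start + len <= tokens.length; len++) {
        if (len > 1) sb.append(' ');
        sb.append(tokens[start + len - 1]);
        result.add(sb.toString());
      }
    }
    return result;
  }
}

Note that exact term matching won't catch the plural case in Ryan's example ("red balls" vs. "red ball"); to handle that you'd want to run the same normalization (e.g. stemming) over both the indexed keywords and the query shingles.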