On Tue, Jan 3, 2012 at 1:44 PM, Robert Muir <rcm...@gmail.com> wrote:
> On Tue, Jan 3, 2012 at 4:30 PM, Ryan McKinley <ryan...@gmail.com> wrote:
>>
>> Just brainstorming, it seems like an FST could be a good/efficient way
>> to match documents. My plan would be to:
>>
>> 1. Use an Analyzer to create a TokenStream for each place name. From
>> the TokenStream, create an FST<docid> -- this would have to pick some
>> impossible character for the token separator.
>> 2. While indexing, create a TokenStream from the input text. For each
>> token, try to follow the Arc to a match. If there is a match, add it
>> to the document.
>>
>> Does this approach seem reasonable?
>> Is there some standard way to do this that I am missing?
>>
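(A rough sketch of what steps 1 and 2 could look like against the
org.apache.lucene.util.fst API -- untested, written against the 4.x
trunk signatures, which move around between releases; the class name,
the doc ids, and the choice of '\u0000' as the separator are all just
placeholders:)

import java.util.Map;
import java.util.TreeMap;

import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

public class PlaceNameFst {

  // '\u0000' should never survive a normal analysis chain, so it is a
  // reasonable "impossible" token separator for the FST keys.
  static final char SEP = '\u0000';

  // Step 1: map each analyzed place name (its tokens joined with SEP,
  // e.g. "new" + SEP + "york") to a doc id.  FST inputs must be added
  // in sorted order, hence the TreeMap.
  static FST<Long> build(TreeMap<String, Long> nameToDocId) throws Exception {
    PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton(true);
    Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE4, outputs);
    IntsRef scratch = new IntsRef();
    for (Map.Entry<String, Long> e : nameToDocId.entrySet()) {
      builder.add(Util.toUTF32(e.getKey(), scratch), e.getValue());
    }
    return builder.finish();
  }

  // Step 2, simplified: rather than walking Arcs by hand, join the
  // candidate tokens from the incoming TokenStream with SEP and do an
  // exact lookup.  Returns null when no place name matches.
  static Long lookup(FST<Long> fst, String... tokens) throws Exception {
    StringBuilder key = new StringBuilder();
    for (int i = 0; i < tokens.length; i++) {
      if (i > 0) key.append(SEP);
      key.append(tokens[i]);
    }
    return Util.get(fst, Util.toUTF32(key, new IntsRef()));
  }
}

(A real streaming matcher would walk fst.getFirstArc()/findTargetArc()
as tokens arrive instead of materializing candidate phrases, but the
exact-lookup version is enough to sanity-check the idea.)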
>
> I'm not really sure this will fit well inside a tokenstream at all, as
> it seems more like the kind of thing you would do before analysis,

For sure -- any pointers on how best to do this?

It seems like the existing Lucene infrastructure would work well to:
- normalize latin characters
- lowercase
- remove stopwords
- break camelcase
- etc.

Is there something else I should be looking at that is better suited
to do this?

> and at analysis you would be worried about how you are going to index the
> text for search, what you are going to do with the location (separate
> field or whatever), etc.
>
> apart from that - as far as whether or not to use an FST, it seems ok
> to me, especially if the data used for geocoding is pretty static.
>
> if you want to prototype using an FST inside a tokenstream to do this,
> just convert your geocoding data into a synonyms file (mapping to the
> location), use SynonymFilter, and you are done.
>

Excellent - I will look there.

thanks
ryan
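(To make the SynonymFilter route concrete -- an untested sketch using
Lucene 4.x class names; the place names, the "location_*" marker tokens,
and the minimal analysis chain are all made up for the example.
SynonymFilter itself is backed by an FST, which is what makes it a
quick way to prototype this:)

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.Version;

public class GeocodeSynonymsDemo {
  public static void main(String[] args) throws Exception {
    // Map the analyzed form of each place name to a marker token that
    // carries the location; keepOrig=true leaves the original tokens
    // in the stream alongside the marker.
    SynonymMap.Builder b = new SynonymMap.Builder(true); // dedup
    b.add(SynonymMap.Builder.join(new String[]{"new", "york"}, new CharsRef()),
          new CharsRef("location_new_york"), true);
    b.add(SynonymMap.Builder.join(new String[]{"san", "francisco"}, new CharsRef()),
          new CharsRef("location_san_francisco"), true);
    SynonymMap map = b.build();

    // A minimal version of the chain from the reply above: fold latin
    // characters, lowercase; StopFilter / WordDelimiterFilter would
    // slot in here the same way.
    TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_40,
        new StringReader("Flying from New York to San Francisco"));
    ts = new ASCIIFoldingFilter(ts);
    ts = new LowerCaseFilter(Version.LUCENE_40, ts);
    ts = new SynonymFilter(ts, map, true); // ignoreCase

    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term); // marker tokens appear inline with the text
    }
    ts.end();
    ts.close();
  }
}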