On Tue, Jan 3, 2012 at 1:44 PM, Robert Muir <rcm...@gmail.com> wrote:
> On Tue, Jan 3, 2012 at 4:30 PM, Ryan McKinley <ryan...@gmail.com> wrote:
>>
>> Just brainstorming, it seems like an FST could be a good/efficient way
>> to match documents. My plan would be to:
>>
>> 1. Use an Analyzer to create a TokenStream for each place name. From
>> the TokenStream, create an FST<docid> -- this would have to pick some
>> impossible character for the token separator.
>> 2. While indexing, create a TokenStream from the input text. For each
>> token, try to follow the Arc to a match. If there is a match, add it
>> to the document.
>>
>> Does this approach seem reasonable?
>> Is there some standard way to do this that I am missing?
>>
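(A rough sketch of what steps 1 and 2 could look like against the
org.apache.lucene.util.fst API -- untested, written against the 4.x
trunk signatures, which move around between releases; the class name,
the doc ids, and the choice of '\u0000' as the separator are all just
placeholders:)

import java.util.Map;
import java.util.TreeMap;

import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

public class PlaceNameFst {

  // '\u0000' should never survive a normal analysis chain, so it is a
  // reasonable "impossible" token separator for the FST keys.
  static final char SEP = '\u0000';

  // Step 1: map each analyzed place name (its tokens joined with SEP,
  // e.g. "new" + SEP + "york") to a doc id.  FST inputs must be added
  // in sorted order, hence the TreeMap.
  static FST<Long> build(TreeMap<String, Long> nameToDocId) throws Exception {
    PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton(true);
    Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE4, outputs);
    IntsRef scratch = new IntsRef();
    for (Map.Entry<String, Long> e : nameToDocId.entrySet()) {
      builder.add(Util.toUTF32(e.getKey(), scratch), e.getValue());
    }
    return builder.finish();
  }

  // Step 2, simplified: rather than walking Arcs by hand, join the
  // candidate tokens from the incoming TokenStream with SEP and do an
  // exact lookup.  Returns null when no place name matches.
  static Long lookup(FST<Long> fst, String... tokens) throws Exception {
    StringBuilder key = new StringBuilder();
    for (int i = 0; i < tokens.length; i++) {
      if (i > 0) key.append(SEP);
      key.append(tokens[i]);
    }
    return Util.get(fst, Util.toUTF32(key, new IntsRef()));
  }
}

(A real streaming matcher would walk fst.getFirstArc()/findTargetArc()
as tokens arrive instead of materializing candidate phrases, but the
exact-lookup version is enough to sanity-check the idea.)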
>
> I'm not really sure this will fit well inside a tokenstream at all, as
> it seems more like the kind of thing you would do before analysis,

For sure -- any pointers on how best to do this?

It seems like the existing Lucene infrastructure would work well to:
- normalize latin characters
- lowercase
- remove stopwords
- break camelcase
- etc.

Is there something else I should be looking at that is better suited
to do this?

> and at analysis you would be worried about how you are going to index the
> text for search, what you are going to do with the location (separate
> field or whatever), etc.
>
> apart from that - as far as whether or not to use an FST, it seems ok
> to me, especially if the data used for geocoding is pretty static.
>
> if you want to prototype using an FST inside a tokenstream to do this,
> just convert your geocoding data into a synonyms file (mapping to the
> location), use SynonymFilter, and you are done.
>

Excellent - I will look there.

thanks
ryan
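(To make the SynonymFilter route concrete -- an untested sketch using
Lucene 4.x class names; the place names, the "location_*" marker tokens,
and the minimal analysis chain are all made up for the example.
SynonymFilter itself is backed by an FST, which is what makes it a
quick way to prototype this:)

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.Version;

public class GeocodeSynonymsDemo {
  public static void main(String[] args) throws Exception {
    // Map the analyzed form of each place name to a marker token that
    // carries the location; keepOrig=true leaves the original tokens
    // in the stream alongside the marker.
    SynonymMap.Builder b = new SynonymMap.Builder(true); // dedup
    b.add(SynonymMap.Builder.join(new String[]{"new", "york"}, new CharsRef()),
          new CharsRef("location_new_york"), true);
    b.add(SynonymMap.Builder.join(new String[]{"san", "francisco"}, new CharsRef()),
          new CharsRef("location_san_francisco"), true);
    SynonymMap map = b.build();

    // A minimal version of the chain from the reply above: fold latin
    // characters, lowercase; StopFilter / WordDelimiterFilter would
    // slot in here the same way.
    TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_40,
        new StringReader("Flying from New York to San Francisco"));
    ts = new ASCIIFoldingFilter(ts);
    ts = new LowerCaseFilter(Version.LUCENE_40, ts);
    ts = new SynonymFilter(ts, map, true); // ignoreCase

    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term); // marker tokens appear inline with the text
    }
    ts.end();
    ts.close();
  }
}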