Olivier Boudry wrote:

On 11/28/07, Grzegorz Chrupala <[EMAIL PROTECTED]> wrote:

    You may have better luck checking out methods used in parsing natural
    language. To use statistical parsing techniques such as Probabilistic
    Context-Free Grammars ([1], [2]), the standard approach is to extract
    rule probabilities from an annotated corpus, that is, a collection of
    strings with associated parse trees. Maybe you could use the 2/3 of
    your addresses that you know are correctly parsed as your training
    material.

    A PCFG parser can output all (or the n best) parses ordered by
    probability, so that would seem to fit your requirements.
    [1] http://en.wikipedia.org/wiki/Stochastic_context-free_grammar
    [2] http://www.cs.colorado.edu/~martin/slp2.html#Chapter14


Wow, Natural Language Processing looks quite complex! But it also seems to be closely related to my problem. If someone finds an "NLP for dummies" article or book, I'm interested. ;-)
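
To make Grzegorz's PCFG suggestion a bit more concrete, here is a toy sketch in Haskell. Everything in it is invented for illustration: the tokens are assumed to be pre-classified into crude categories like "word" and "num", and the rules and probabilities are made up; in a real system you would induce them from your correctly parsed addresses.

import Data.List (sortBy)
import Data.Ord (Down (..), comparing)

-- Parse trees over address parts.
data Tree = Leaf String | Node String [Tree] deriving Show

data Sym = NT String | T String

-- (left-hand side, right-hand side, rule probability)
type Rule = (String, [Sym], Double)

-- Toy grammar: tokens are assumed pre-classified as "word" or "num".
grammar :: [Rule]
grammar =
  [ ("Addr",   [NT "Street", NT "City"], 0.7)
  , ("Addr",   [NT "City", NT "Street"], 0.3)
  , ("Street", [T "word", T "num"],      0.6)
  , ("Street", [T "num", T "word"],      0.4)
  , ("City",   [T "word"],               1.0)
  ]

-- All derivations of non-terminal nt from a prefix of the input,
-- with their probability and the leftover tokens.
derive :: String -> [String] -> [(Tree, Double, [String])]
derive nt toks =
  [ (Node nt kids, p * q, rest)
  | (lhs, rhs, p) <- grammar, lhs == nt
  , (kids, q, rest) <- deriveSeq rhs toks ]

deriveSeq :: [Sym] -> [String] -> [([Tree], Double, [String])]
deriveSeq [] toks = [([], 1.0, toks)]
deriveSeq (T t : syms) (tok : toks)
  | t == tok = [ (Leaf tok : kids, q, rest)
               | (kids, q, rest) <- deriveSeq syms toks ]
deriveSeq (NT nt : syms) toks =
  [ (tree : kids, p * q, rest')
  | (tree, p, rest)  <- derive nt toks
  , (kids, q, rest') <- deriveSeq syms rest ]
deriveSeq _ _ = []

-- All complete parses of the input, most probable first.
parses :: [String] -> [(Double, Tree)]
parses toks =
  sortBy (comparing (Down . fst))
         [ (p, t) | (t, p, []) <- derive "Addr" toks ]

main :: IO ()
main = mapM_ print (parses ["word", "num", "word"])

Running it on ["word", "num", "word"] (think "Main 42 Springfield") yields two parses, with the Street-then-City reading winning at probability 0.42 over 0.12. That is the "n-best parses ordered by probability" behaviour in miniature.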

Especially in fuzzy cases like this one, NLP often turns to machine learning models. One could try to train a hidden Markov model or a support vector machine to label parts of the string as "name", "street", "number", "city", etc. These techniques work very well for part-of-speech tagging in natural language, and your problem seems similar. However, you need a manually annotated set of examples to train the models. If you really have a large amount of data and this seems like a good solution, you could use an off-the-shelf part-of-speech tagger like SVMTool (http://www.lsi.upc.edu/~nlp/SVMTool/) to do it.
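
For what it's worth, the hidden Markov model approach is not much code once you have the probabilities. Below is a minimal Viterbi-decoding sketch in Haskell; the tag set and all the numbers are made up for illustration, and a real tagger would estimate them by counting over the correctly parsed addresses.

import Data.List (maximumBy)
import Data.Ord (comparing)

type Tag = String

tags :: [Tag]
tags = ["street", "num", "city"]

-- Toy model parameters; in practice, estimate these by counting
-- over the already-parsed 2/3 of the addresses.
start :: Tag -> Double          -- P(first tag)
start "street" = 0.6
start "num"    = 0.3
start _        = 0.1

trans :: Tag -> Tag -> Double   -- P(next tag | previous tag)
trans "street" "num"  = 0.7
trans "num"    "city" = 0.8
trans _        _      = 0.05

emit :: Tag -> String -> Double -- P(token | tag), from a crude digit test
emit "num" tok | all (`elem` "0123456789") tok = 0.9
emit "num" _   = 0.01
emit _ tok
  | all (`elem` "0123456789") tok = 0.05
  | otherwise                     = 0.5

-- Viterbi decoding: the most probable tag sequence for the tokens.
-- Each cell holds (probability, tag path in reverse).
viterbi :: [String] -> [Tag]
viterbi []       = []
viterbi (t : ts) = reverse (snd (best (foldl step initial ts)))
  where
    initial = [ (start tag * emit tag t, [tag]) | tag <- tags ]
    step cells tok =
      [ best [ (p * trans prev tag * emit tag tok, tag : path)
             | (p, path@(prev : _)) <- cells ]
      | tag <- tags ]
    best = maximumBy (comparing fst)

main :: IO ()
main = print (viterbi ["Main", "42", "Springfield"])
-- prints ["street","num","city"]

The machinery itself is small; the annotated training data is where the real work is.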

Reinier
