Olivier Boudry wrote:
On 11/28/07, Grzegorz Chrupala <[EMAIL PROTECTED]> wrote:
You may have better luck checking out methods used in parsing natural
language. In order to use statistical parsing techniques such as
Probabilistic Context-Free Grammars ([1], [2]), the standard approach is
to extract rule probabilities from an annotated corpus, that is, a
collection of strings with associated parse trees. Maybe you could use
the two thirds of your addresses that you know are correctly parsed as
your training material.
A PCFG parser can output all (or the n best) parses ordered by
probability, so that would seem to fit your requirements.
[1] http://en.wikipedia.org/wiki/Stochastic_context-free_grammar
[2] http://www.cs.colorado.edu/~martin/slp2.html#Chapter14
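To make the estimation step concrete, here is a minimal Haskell sketch of
the relative-frequency idea: count how often each rule fires in the
annotated trees, normalize by the counts of the left-hand sides, and score
a candidate tree as the product of its rule probabilities. The category
names (ADDR, NAME, NUMBER) are invented for illustration, and a real
system would still need a parser (e.g. CKY) to propose candidate trees.

import Data.List (sortBy)
import Data.Ord (comparing)
import qualified Data.Map as M

-- A parse tree: a phrase with a category and children, or a leaf token.
data Tree = Node String [Tree] | Leaf String deriving Show

-- The rules used in a tree, e.g. ("ADDR", ["NAME", "NUMBER"]).
rules :: Tree -> [(String, [String])]
rules (Leaf _)        = []
rules (Node cat kids) = (cat, map label kids) : concatMap rules kids
  where
    label (Node c _) = c
    label (Leaf w)   = w

-- Estimate rule probabilities from an annotated corpus by relative
-- frequency: count(A -> beta) / count(A).
estimate :: [Tree] -> M.Map (String, [String]) Double
estimate corpus = M.mapWithKey norm ruleCounts
  where
    allRules   = concatMap rules corpus
    ruleCounts = M.fromListWith (+) [ (r, 1) | r <- allRules ]
    lhsCounts  = M.fromListWith (+) [ (lhs, 1) | (lhs, _) <- allRules ]
    norm (lhs, _) c = c / M.findWithDefault 1 lhs lhsCounts

-- Probability of a tree: the product of its rule probabilities.
treeProb :: M.Map (String, [String]) Double -> Tree -> Double
treeProb probs t = product [ M.findWithDefault 0 r probs | r <- rules t ]

-- Rank candidate parses, best first (the "n-best" list).
nBest :: M.Map (String, [String]) Double -> [Tree] -> [(Double, Tree)]
nBest probs = sortBy (flip (comparing fst))
            . map (\t -> (treeProb probs t, t))

With a one-tree corpus every rule gets probability 1.0, so
treeProb (estimate [t]) t == 1.0; trained on your two thirds of correctly
parsed addresses, the probabilities become informative.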
Wow, Natural Language Processing looks quite complex! But it also
seems to be closely related to my problem. If someone finds an "NLP for
dummies" article or book, I'm interested. ;-)
Especially in fuzzy cases like this one, NLP often turns to machine
learning models. One could try to train a hidden Markov model or a
support vector machine to label parts of the string as "name", "street",
"number", "city", etc. These techniques work very well for part-of-speech
tagging in natural language, and your problem seems similar. However, you
need a manually annotated set of examples to train the models. If you
really have a large amount of data and this seems like a good solution,
you could use an off-the-shelf part-of-speech tagger such as SVMTool
(http://www.lsi.upc.edu/~nlp/SVMTool/) to do it.
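For the HMM route, a tiny Viterbi decoder is enough to see the shape of
it. Everything below (the tag set, the transition and emission numbers)
is invented for illustration; in practice you would estimate these from
the manually annotated examples.

import Data.List (maximumBy)
import Data.Ord (comparing)

type Tag   = String
type Token = String

tags :: [Tag]
tags = ["NAME", "NUMBER", "STREET", "CITY"]

-- Invented start, transition and emission probabilities; a real model
-- would estimate these from the annotated training data.
start :: Tag -> Double
start "NAME"   = 0.5
start "NUMBER" = 0.4
start _        = 0.05

trans :: Tag -> Tag -> Double
trans "NAME"   "NUMBER" = 0.7
trans "NUMBER" "STREET" = 0.8
trans "STREET" "STREET" = 0.3
trans "STREET" "CITY"   = 0.5
trans _        _        = 0.05

emit :: Tag -> Token -> Double
emit "NUMBER" w | all (`elem` "0123456789") w = 0.9
emit "STREET" "St" = 0.8
emit _ _           = 0.1

-- Viterbi: for each tag, keep the best (probability, reversed path)
-- ending in that tag, and extend it one token at a time.
viterbi :: [Token] -> [Tag]
viterbi []     = []
viterbi (w:ws) = reverse (snd best)
  where
    init0 = [ (start t * emit t w, [t]) | t <- tags ]
    step paths w' =
      [ maximumBy (comparing fst)
          [ (p * trans prev t * emit t w', t : path)
          | (p, path@(prev:_)) <- paths ]
      | t <- tags ]
    best = maximumBy (comparing fst) (foldl step init0 ws)

With these toy numbers, viterbi ["John", "42", "Main", "St",
"Springfield"] yields ["NAME","NUMBER","STREET","STREET","CITY"]. An SVM
tagger like SVMTool replaces the generative model with a discriminative
per-token classifier over features of the surrounding tokens, but the
labelling task is the same.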
Reinier