Re: instaparse: composing smaller rules into a bigger one

Jim - FooBar(); Tue, 19 Nov 2013 03:13:52 -0800

Hi Mark,

Your comments were spot on! Changing the SPACE tag makes it work and I can also get rid of all the '?' after 'SPACE'. Also, hiding the actual spaces makes it look a lot nicer. Many, many thanks for this... :)

I applied to register at the instaparse Google group and my registration is pending. While it is pending, would you mind answering an additional question? I have reworked my grammar to this (again, observe the bold bits, marked here with asterisks):




"S  = PHRASE+ SPACE END
   PHRASE = DDIPK | DDI | ENCLOSED |  TOKEN (SPACE TOKEN PUNCT?)*
*DDIPK =  PK SPACE TO SPACE DRUG SPACE EFF?*
   DDI =  MECH SPACE DRUG SPACE TOKEN
*EFF =* *BE? SPACE (SIGN | XFOLD)? SPACE MECH SPACE (ADV | XFOLD)?*

   <WORD> = #'\\w+'
   <PUNCT> = #'\\p{Punct}'
   XFOLD = NUM SPACE '-'? SPACE 'fold' SPACE #'[a-z]+ses?'?
   ROUTE = #'(?i)oral|intravenous'
   UNIT = 'mg' | 'g'
   DOSE = NUM SPACE UNIT SPACE INTERVAL?

INTERVAL = #'[a-z]+ce'? SPACE ADV | NUM SPACE 'times per' TIME | '/'TIME

   TIME =  'hour' | 'day' | 'week'
   PERCENTAGE = NUM SPACE ('%' | #'per(\\s|\\-)?cent')
   ENCLOSED = PAREN | SQBR
   <PAREN> =  #'\\(.*\\)'
   <SQBR> =   #'\\[.*\\]'
    NUM =  #'[0-9]+'
    CYP =  #'CYP[A-Z0-9]*'
    ADV =   #'[a-z]+ly'
   <SPACE> = <#'\\s*'>
    DRUG = ROUTE? SPACE
      (#'(?i)\\b+\\w+a[z|st|p]ine?\\b+' |
       #'(?i)\\b+\\w+[i|u]dine?\\b+'    |
       #'(?i)\\b+\\w+azo[l|n]e?\\b+'    |
       #'(?i)\\b+\\w+tamine?\\b+'       |
       #'(?i)\\b+\\w+zepam\\b+'         |
       #'(?i)\\b+\\w+zolam\\b+'         |
       #'(?i)\\b+\\w+[y|u]lline?\\b+'   |
       #'(?i)\\b+\\w+artane?\\b+'       |
       #'(?i)\\b+\\w+retine?\\b+'       |
       #'(?i)\\b+\\w+navir\\b+'         |
       #'(?i)\\b+\\w+ocaine\\b+'        |
       #'(?i)didanosine|tenofovir|vaprisol|conivaptan|amlodipine')

    MECH =  #'[a-z]+e(s|d)'
    SIGN =  ADV | NEG
    NEG = 'not' | #'un[a-z]*ed'
    <TO> = 'to' | 'of'
    BE = 'is' | 'are' | 'was' | 'were'
  (*  DO = 'does' | 'do' | 'did' *)
   <COMMA> = ','
  (* <OTHER> = 'as' | 'its' | 'by' *)
    END =  '.' "

Now consider the sentence: "Exposure to oral didanosine is _significantly increased_ when coadministered with tenofovir disoproxil fumarate."


My very first tag is this, which is perfect:
 [:PHRASE
  [:DDIPK
   [:PK "Exposure"]
   "to"
   [:DRUG [:ROUTE "oral"] "didanosine"]
   [:EFF
    [:BE "is"]
    [:SIGN [:ADV "significantly"]]
    [:MECH "increased"]]]]

But now consider the same sentence, slightly different: "Exposure to oral didanosine is _increased significantly_ when coadministered with tenofovir disoproxil fumarate."


 [:PHRASE
  [:DDIPK
   [:PK "Exposure"]
   "to"
   [:DRUG [:ROUTE "oral"] "didanosine"]
   [:EFF [:MECH "increased"]]]]
 [:PHRASE
  [:TOKEN [:SIGN [:ADV "significantly"]]]

Shouldn't the EFF rule have caught the [:SIGN [:ADV "significantly"]] tag? Why did it start a new PHRASE? The same thing happens with XFOLD. If the 'x-fold' is before the adverb (2-fold increases) it shows in the DDIPK tag; otherwise (increases 2-fold) it appears after it in a new PHRASE tag. I'm pretty sure the rule covers both cases, and in fact it reaches the EFF rule but never matches the *(ADV | XFOLD)?* part. I am presuming this is something quite simple...


As always, thanks in advance,

Jim


On 19/11/13 00:38, Mark Engelberg wrote:

Seems like there are (at least) two issues here.
1. You have a preference in mind that is not expressed by the grammar. The parse that was output is a valid parse that fits all the rules of the grammar. If you want the parser to prefer DRUGPK and EFF interpretations over other interpretations, you need to specify that, for example:

   TOKEN = DRUGPK / EFF / (NUM | DRUG | PK | MECH | SIGN | ENCLOSED) / WORD

2. Your rule for space is "<SPACE> = #'\\s+'", i.e., one or more spaces. But the way your other rules utilize the SPACE rule, this causes a problem. For example, you define DRUGPK as ending with SPACE (and that ending SPACE is part of the DRUGPK token), but your S rule also says that tokens (including DRUGPK) must be /followed/ by a SPACE. So the DRUGPK rule will never be satisfied, because it is including the ending whitespace as part of the token, and then there's no whitespace following the token as required by the S rule. As another example, your EFF rule begins "BE? SPACE SIGN? SPACE MECH" and if the optional BE and SIGN aren't present, it's looking for two mandatory spaces in a row.

I suggest changing your rule to "<SPACE> = #'\\s*'", i.e., zero or more spaces. Or if you don't actually care about seeing the spaces in your parse output, you can change it to "<SPACE> = <#'\\s*'>".
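
The two-mandatory-spaces problem can be reproduced without instaparse. The following is a minimal sketch in plain Clojure, assuming the start of the EFF rule ("BE? SPACE SIGN? SPACE MECH") is flattened into one regex. The names `eff-pattern`, `strict-eff`, `lenient-eff` and `matches?` are made up for illustration, and BE, SIGN and MECH are simplified copies of the rules in this thread.

```clojure
;; Simplified stand-ins for the BE, SIGN and MECH rules (illustrative only).
(def be   "(?:is|are|was|were)")
(def sign "(?:[a-z]+ly|not)")
(def mech "[a-z]+e(?:s|d)")

(defn eff-pattern
  "Builds a regex for BE? SPACE SIGN? SPACE MECH using the given SPACE regex."
  [space]
  (re-pattern (str be "?" space sign "?" space mech)))

(def strict-eff  (eff-pattern "\\s+"))  ; SPACE = one or more whitespace chars
(def lenient-eff (eff-pattern "\\s*"))  ; SPACE = zero or more whitespace chars

(defn matches?
  "True when the whole text matches the pattern."
  [pattern text]
  (boolean (re-matches pattern text)))

(println (matches? strict-eff  "is significantly increased")) ; true
(println (matches? strict-eff  "is increased"))               ; false
(println (matches? lenient-eff "is increased"))               ; true
```

With the strict SPACE, the rule only matches when both optional parts are present; once SIGN is missing, the two SPACE occurrences end up adjacent and one blank cannot satisfy both.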
If you make both those changes, you'll get:

=> (parsePK "Exposure to didanosine is increased when coadministered with tenofovir disoproxil fumarate [Table 5 and see Clinical Pharmacokinetics (12.3, Tables 9 and 10)].")
[:S
 [:TOKEN [:DRUGPK [:PK "Exposure"] "to" [:DRUG "didanosine"] [:EFF "is" [:MECH "increased"]]]]
 [:TOKEN "when"]
 [:TOKEN [:EFF [:MECH "coadministered"]]]
 [:TOKEN "with"]
 [:TOKEN [:DRUG "tenofovir"]]
 [:TOKEN "disoproxil"]
 [:TOKEN "fumarate"]
 [:TOKEN [:ENCLOSED "[Table 5 and see Clinical Pharmacokinetics (12.3, Tables 9 and 10)]"]]
 [:END "."]]

which I think is what you want.
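
For the first point (preference between alternatives), here is a small sketch of how `/` differs from `|`. It assumes instaparse is on the classpath and uses a deliberately tiny grammar rather than the one in this thread; the parser names `unordered` and `ordered` are hypothetical.

```clojure
(require '[instaparse.core :as insta])

;; With plain alternation, "Exposure" can be read as a PK or as a bare WORD,
;; and the grammar says nothing about which reading is preferred.
(def unordered
  (insta/parser
   "TOKEN = PK | WORD
    PK = #'(?i)exposure'
    <WORD> = #'\\w+'"))

;; With ordered choice, PK is tried in preference to WORD.
(def ordered
  (insta/parser
   "TOKEN = PK / WORD
    PK = #'(?i)exposure'
    <WORD> = #'\\w+'"))

;; insta/parses returns every parse, which makes the ambiguity visible.
(doseq [p (insta/parses unordered "Exposure")]
  (println p))

;; The ordered grammar should hand back the PK reading.
(println (ordered "Exposure"))
```

Listing all parses with `insta/parses` is a quick way to check whether an unexpected result is a valid alternative reading rather than a failure of a rule.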
If you have follow-up questions, I recommend posting to the instaparse Google group: https://groups.google.com/forum/#!forum/instaparse
--Mark
P.S. I've been experimenting with a feature to make it easier to express grammars where you find yourself inserting an optional whitespace rule everywhere, documented here:
https://github.com/Engelberg/instaparse/blob/master/docs/ExperimentalFeatures.md#auto-whitespace
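
As a rough sketch of what that feature looks like in use: the example below assumes the `:auto-whitespace :standard` option as described in later instaparse documentation, which may differ from the experimental version linked above. The grammar is a cut-down EFF rule and `eff-parser` is a hypothetical name.

```clojure
(require '[instaparse.core :as insta])

;; No SPACE rule anywhere: the option asks instaparse to allow optional
;; whitespace between tokens (assumed option name, see the note above).
(def eff-parser
  (insta/parser
   "EFF = BE? MECH ADV?
    BE = 'is' | 'are' | 'was' | 'were'
    MECH = #'[a-z]+e(s|d)'
    ADV = #'[a-z]+ly'"
   :auto-whitespace :standard))

(println (eff-parser "is increased significantly"))
```

The grammar stays readable because whitespace handling is no longer interleaved with every rule.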
On Mon, Nov 18, 2013 at 5:47 AM, Jim - FooBar(); <jimpil1...@gmail.com> wrote:
    Hi all,

    I'm having a small problem composing smaller matches in
    instaparse. Here is what I'm trying...just observe the bold bits:

    (def parsePK
      (insta/parser
       "S  = TOKEN (SPACE TOKEN PUNCT?)* END
       TOKEN = (NUM | DRUG | PK | DRUGPK | MECH | SIGN | EFF | ENCLOSED) / WORD
       <WORD> = #'\\w+' | PUNCT
       <PUNCT> = #'\\p{Punct}'
       ENCLOSED = PAREN | SQBR
       <PAREN> = #'\\[.*\\]'
       <SQBR> =  #'\\(.*\\)'
        NUM =  #'[0-9]+'
        ADV =   #'[a-z]+ly'
       <SPACE> = #'\\s+'
        DRUG =  #'(?i)didanosine|quinidine|tenofovir'
        PK = #'(?i)exposure|bioavailability|lower?[\\s|\\-]?clearance'
    *DRUGPK =  PK SPACE TO SPACE DRUG SPACE EFF? SPACE *
        MECH =  #'[a-z]+e(s|d)'
    *EFF = BE? SPACE SIGN? SPACE MECH | BE? SPACE MECH SPACE ADV? *
        SIGN =  ADV | NEG
        NEG = 'not'
        <TO> = 'to' | 'of'
        <BE> = 'is' | 'are' | 'was' | 'were'
        END =  '.' " ))

    Running the parser returns the following. It seems that the 2
    bigger composite rules DRUGPK & EFF are not recognised at all.
    Only the smaller pieces are actually shown. I would expect [:TOKEN
    [:DRUGPK "Exposure to didanosine is increased"]] and  [:TOKEN
    [:EFF "is increased"]] entries.
    (pprint
    (parsePK "Exposure to didanosine is increased when coadministered
    with tenofovir disoproxil fumarate [Table 5 and see Clinical
    Pharmacokinetics (12.3, Tables 9 and 10)]."))


    [:S
     [:TOKEN [:PK "Exposure"]]
     " "
     [:TOKEN "to"]
     " "
     [:TOKEN [:DRUG "didanosine"]]
     " "
     [:TOKEN "is"]
     " "
     [:TOKEN [:MECH "increased"]]
     " "
     [:TOKEN "when"]
     " "
     [:TOKEN [:MECH "coadministered"]]
     " "
     [:TOKEN "with"]
     " "
     [:TOKEN [:DRUG "tenofovir"]]
     ","
     " "
     [:TOKEN "disoproxil"]
     " "
     [:TOKEN "fumarate"]
     [:END "."]]

     Am I thinking about it the wrong way? Can anyone shed some light?

    many thanks in advance,

    Jim
----You received this message because you are subscribed to the Google
    Groups "Clojure" group.
    To post to this group, send email to clojure@googlegroups.com
    <mailto:clojure@googlegroups.com>
    Note that posts from new members are moderated - please be patient
    with your first post.
    To unsubscribe from this group, send email to
    clojure+unsubscr...@googlegroups.com
    <mailto:clojure%2bunsubscr...@googlegroups.com>
    For more options, visit this group at
    http://groups.google.com/group/clojure?hl=en
    ---
    You received this message because you are subscribed to the Google
    Groups "Clojure" group.
    To unsubscribe from this group and stop receiving emails from it,
    send an email to clojure+unsubscr...@googlegroups.com
    <mailto:clojure%2bunsubscr...@googlegroups.com>.
    For more options, visit https://groups.google.com/groups/opt_out.


