Hi Mark,

Your comments were spot on! Changing the SPACE tag makes it work and I can also get rid of all the '?' after 'SPACE'. Also hiding the actual spaces makes it look a lot nicer. Many many thanks for this...:)

I applied to register at the instaparse google group and my registration is pending. While it is pending would you mind answering an additional question? I have reworked my grammar to this (again observe the bold bits):



"S  = PHRASE+ SPACE END
   PHRASE = DDIPK | DDI | ENCLOSED |  TOKEN (SPACE TOKEN PUNCT?)*
*DDIPK =  PK SPACE TO SPACE DRUG SPACE EFF?*
   DDI =  MECH SPACE DRUG SPACE TOKEN
*EFF =* *BE? SPACE (SIGN | XFOLD)? SPACE MECH SPACE (ADV | XFOLD)?*
TOKEN = (DOSE | NUM | DRUG | PK | PERCENTAGE | XFOLD | CYP | MECH | SIGN | TO | ENCLOSED | COMMA) / WORD
   <WORD> = #'\\w+'
   <PUNCT> = #'\\p{Punct}'
   XFOLD = NUM SPACE '-'? SPACE 'fold' SPACE #'[a-z]+ses?'?
   ROUTE = #'(?i)oral|intravenous'
   UNIT = 'mg' | 'g'
   DOSE = NUM SPACE UNIT SPACE INTERVAL?
INTERVAL = #'[a-z]+ce'? SPACE ADV | NUM SPACE 'times per' TIME | '/' TIME
   TIME =  'hour' | 'day' | 'week'
   PERCENTAGE = NUM SPACE ('%' | #'per(\\s|\\-)?cent')
   ENCLOSED = PAREN | SQBR
   <PAREN> =  #'\\(.*\\)'
   <SQBR> =   #'\\[.*\\]'
    NUM =  #'[0-9]+'
    CYP =  #'CYP[A-Z0-9]*'
    ADV =   #'[a-z]+ly'
   <SPACE> = <#'\\s*'>
    DRUG = ROUTE? SPACE
      (#'(?i)\\b+\\w+a[z|st|p]ine?\\b+' |
       #'(?i)\\b+\\w+[i|u]dine?\\b+'    |
       #'(?i)\\b+\\w+azo[l|n]e?\\b+'    |
       #'(?i)\\b+\\w+tamine?\\b+'       |
       #'(?i)\\b+\\w+zepam\\b+'         |
       #'(?i)\\b+\\w+zolam\\b+'         |
       #'(?i)\\b+\\w+[y|u]lline?\\b+'   |
       #'(?i)\\b+\\w+artane?\\b+'       |
       #'(?i)\\b+\\w+retine?\\b+'       |
       #'(?i)\\b+\\w+navir\\b+'         |
       #'(?i)\\b+\\w+ocaine\\b+'        |
       #'(?i)didanosine|tenofovir|vaprisol|conivaptan|amlodipine')

PK = MECH? #'(?i)exposure|bioavailability|lower?(\\s|\\-)?clearance|AUC|half\\-life|Cmax'
    MECH =  #'[a-z]+e(s|d)'
    SIGN =  ADV | NEG
    NEG = 'not' | #'un[a-z]*ed'
    <TO> = 'to' | 'of'
    BE = 'is' | 'are' | 'was' | 'were'
  (*  DO = 'does' | 'do' | 'did' *)
   <COMMA> = ','
  (* <OTHER> = 'as' | 'its' | 'by' *)
    END =  '.' "

Now consider the sentence: "Exposure to oral didanosine is _significantly increased_ when coadministered with tenofovir disoproxil fumarate."

My very first tag is this, which is perfect:
 [:PHRASE
  [:DDIPK
   [:PK "Exposure"]
   "to"
   [:DRUG [:ROUTE "oral"] "didanosine"]
   [:EFF
    [:BE "is"]
    [:SIGN [:ADV "significantly"]]
    [:MECH "increased"]]]]

but now consider the same sentence slightly different: "Exposure to oral didanosine is _increased significantly_ when coadministered with tenofovir disoproxil fumarate."

 [:PHRASE
  [:DDIPK
   [:PK "Exposure"]
   "to"
   [:DRUG [:ROUTE "oral"] "didanosine"]
   [:EFF [:MECH "increased"]]]]
 [:PHRASE
  [:TOKEN [:SIGN [:ADV "significantly"]]]

Shouldn't the EFF rule have caught the [:SIGN [:ADV "significantly"]] tag? Why did it start a new PHRASE? The same thing happens with XFOLD. If the 'x-fold' is before the adverb (2-fold increases), it shows up in the DDIPK tag; otherwise (increases 2-fold) it appears after it in a new PHRASE tag. I'm pretty sure the rule covers both cases, and in fact it reaches the EFF rule but it never matches the *(ADV | XFOLD)?* part. I am presuming this is something quite simple...

As always, thanks in advance,

Jim


On 19/11/13 00:38, Mark Engelberg wrote:
Seems like there are (at least) two issues here.

1. You have a preference in mind that is not expressed by the grammar. The parse that was outputted is a valid parse that fits all the rules of the grammar. If you want the parser to prefer DRUGPK and EFF interpretations over other interpretations, you need to specify that, for example: TOKEN = DRUGPK / EFF / (NUM | DRUG | PK | MECH | SIGN | ENCLOSED) / WORD

2. Your rule for space is "<SPACE> = #'\\s+'", i.e., one or more spaces. But the way your other rules utilize the SPACE rule, this causes a problem. For example, you define DRUGPK as ending with SPACE (and that ending SPACE is part of the DRUGPK token), but your S rule also says that tokens (including DRUGPK) must be /followed/ by a SPACE. So the DRUGPK rule will never be satisfied, because it is including the ending whitespace as part of the token, and then there's no whitespace following the token as required by the S rule. As another example, your EFF rule begins "BE? SPACE SIGN? SPACE MECH" and if the optional BE and SIGN aren't present, it's looking for two mandatory spaces in a row.

I suggest changing your rule to "<SPACE> = #'\\s*'", i.e., zero or more spaces. Or if you don't actually care about seeing the spaces in your parse output, you can change it to "<SPACE> = <#'\\s*'>".
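The double-SPACE problem from point 2 can be reproduced outside instaparse. The sketch below is only an analogy: it uses plain Clojure regexes (java.util.regex) in place of the grammar rules, and the names eff-plus, eff-star and matches? are made up for illustration.

```clojure
;; Analogy only: plain regexes standing in for the instaparse rules.
;; eff-plus mirrors "BE? SPACE SIGN? SPACE MECH" with SPACE = \s+ (one or more),
;; eff-star mirrors the same rule with SPACE = \s* (zero or more).
(def eff-plus #"(?:is|are|was|were)?\s+(?:[a-z]+ly|not)?\s+[a-z]+e(?:s|d)")
(def eff-star #"(?:is|are|was|were)?\s*(?:[a-z]+ly|not)?\s*[a-z]+e(?:s|d)")

(defn matches?
  "True when the whole string s matches the regex re."
  [re s]
  (some? (re-matches re s)))

;; With \s+, both whitespace runs are mandatory even when the optional
;; words around them are absent, so these inputs are rejected.
(matches? eff-plus "increased")      ; false
(matches? eff-plus "is increased")   ; false (one space, two runs required)

;; With \s*, the whitespace collapses when the optional words are missing.
(matches? eff-star "increased")                   ; true
(matches? eff-star "is increased")                ; true
(matches? eff-star "is significantly increased")  ; true
```

The same reasoning applies to any rule that both ends with SPACE and is followed by SPACE in its parent rule: with `\s+` each occurrence needs its own whitespace character.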

If you make both those changes, you'll get:

=> (parsePK "Exposure to didanosine is increased when coadministered with tenofovir disoproxil fumarate [Table 5 and see Clinical Pharmacokinetics (12.3, Tables 9 and 10)].")
[:S
 [:TOKEN
  [:DRUGPK
   [:PK "Exposure"]
   "to"
   [:DRUG "didanosine"]
   [:EFF "is" [:MECH "increased"]]]]
 [:TOKEN "when"]
 [:TOKEN [:EFF [:MECH "coadministered"]]]
 [:TOKEN "with"]
 [:TOKEN [:DRUG "tenofovir"]]
 [:TOKEN "disoproxil"]
 [:TOKEN "fumarate"]
 [:TOKEN
  [:ENCLOSED
   "[Table 5 and see Clinical Pharmacokinetics (12.3, Tables 9 and 10)]"]]
 [:END "."]]

which I think is what you want.
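To see the effect of ordered choice from point 1 in isolation, here is a deliberately tiny grammar. It is not the grammar from this thread: the rules and the parser names `unordered` and `ordered` are illustrative assumptions, and the sketch assumes instaparse is on the classpath.

```clojure
(require '[instaparse.core :as insta])

;; Unordered choice (|): "is increased" can be read either as one EFF
;; or as two plain WORDs, and the grammar expresses no preference.
(def unordered
  (insta/parser
   "S = TOKEN (<' '> TOKEN)*
    TOKEN = EFF | WORD
    EFF = 'is' <' '> 'increased'
    <WORD> = #'[a-z]+'"))

;; Ordered choice (/): the composite EFF reading is tried before WORD.
(def ordered
  (insta/parser
   "S = TOKEN (<' '> TOKEN)*
    TOKEN = EFF / WORD
    EFF = 'is' <' '> 'increased'
    <WORD> = #'[a-z]+'"))

;; insta/parses returns every parse; for the unordered grammar both
;; readings should show up among the results.
(def all-unordered (insta/parses unordered "is increased"))

;; insta/parse returns a single parse; with '/' it should be the EFF one.
(def preferred (insta/parse ordered "is increased"))
```

Calling `insta/parses` on a real grammar is also a quick way to check whether an unexpected result is simply one of several valid parses.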

If you have follow-up questions, I recommend posting to the instaparse google group: https://groups.google.com/forum/#!forum/instaparse

--Mark

P.S. I've been experimenting with a feature to make it easier to express grammars where you find yourself inserting an optional whitespace rule everywhere, documented here:
https://github.com/Engelberg/instaparse/blob/master/docs/ExperimentalFeatures.md#auto-whitespace


On Mon, Nov 18, 2013 at 5:47 AM, Jim - FooBar(); <jimpil1...@gmail.com> wrote:

    Hi all,

    I'm having a small problem composing smaller matches in
    instaparse. Here is what I'm trying...just observe the bold bits:

    (def parsePK
      (insta/parser
       "S  = TOKEN (SPACE TOKEN PUNCT?)* END
       TOKEN = (NUM | DRUG | PK | DRUGPK | MECH | SIGN | EFF | ENCLOSED) / WORD
       <WORD> = #'\\w+' | PUNCT
       <PUNCT> = #'\\p{Punct}'
       ENCLOSED = PAREN | SQBR
       <PAREN> = #'\\[.*\\]'
       <SQBR> =  #'\\(.*\\)'
        NUM =  #'[0-9]+'
        ADV =   #'[a-z]+ly'
       <SPACE> = #'\\s+'
        DRUG =  #'(?i)didanosine|quinidine|tenofovir'
        PK = #'(?i)exposure|bioavailability|lower?[\\s|\\-]?clearance'
    *DRUGPK =  PK SPACE TO SPACE DRUG SPACE EFF? SPACE *
        MECH =  #'[a-z]+e(s|d)'
    *EFF = BE? SPACE SIGN? SPACE MECH | BE? SPACE MECH SPACE ADV? *
        SIGN =  ADV | NEG
        NEG = 'not'
        <TO> = 'to' | 'of'
        <BE> = 'is' | 'are' | 'was' | 'were'
        END =  '.' " ))

    Running the parser returns the following. It seems that the 2
    bigger composite rules DRUGPK & EFF are not recognised at all.
    Only the smaller pieces are actually shown. I would expect [:TOKEN
    [:DRUGPK "Exposure to didanosine is increased"]] and  [:TOKEN
    [:EFF "is increased"]] entries.
    (pprint
    (parsePK "Exposure to didanosine is increased when coadministered
    with tenofovir disoproxil fumarate [Table 5 and see Clinical
    Pharmacokinetics (12.3, Tables 9 and 10)]."))


    [:S
     [:TOKEN [:PK "Exposure"]]
     " "
     [:TOKEN "to"]
     " "
     [:TOKEN [:DRUG "didanosine"]]
     " "
     [:TOKEN "is"]
     " "
     [:TOKEN [:MECH "increased"]]
     " "
     [:TOKEN "when"]
     " "
     [:TOKEN [:MECH "coadministered"]]
     " "
     [:TOKEN "with"]
     " "
     [:TOKEN [:DRUG "tenofovir"]]
     ","
     " "
     [:TOKEN "disoproxil"]
     " "
     [:TOKEN "fumarate"]
     [:END "."]]

     Am I thinking about it the wrong way? Can anyone shed some light?

    many thanks in advance,

    Jim





-- -- You received this message because you are subscribed to the Google
    Groups "Clojure" group.
    To post to this group, send email to clojure@googlegroups.com
    <mailto:clojure@googlegroups.com>
    Note that posts from new members are moderated - please be patient
    with your first post.
    To unsubscribe from this group, send email to
    clojure+unsubscr...@googlegroups.com
    <mailto:clojure%2bunsubscr...@googlegroups.com>
    For more options, visit this group at
    http://groups.google.com/group/clojure?hl=en
    ---
    You received this message because you are subscribed to the Google
    Groups "Clojure" group.
    To unsubscribe from this group and stop receiving emails from it,
    send an email to clojure+unsubscr...@googlegroups.com
    <mailto:clojure%2bunsubscr...@googlegroups.com>.
    For more options, visit https://groups.google.com/groups/opt_out.


