Hi Mark,
Your comments were spot on! Changing the SPACE tag makes it work and I
can also get rid of all the '?' after 'SPACE'. Also hiding the actual
spaces makes it look a lot nicer. Many many thanks for this...:)
I applied to register at the instaparse Google group and my registration
is pending. While it is pending, would you mind answering an additional
question? I have reworked my grammar to this (again, observe the bold bits):
"S = PHRASE+ SPACE END
PHRASE = DDIPK | DDI | ENCLOSED | TOKEN (SPACE TOKEN PUNCT?)*
*DDIPK = PK SPACE TO SPACE DRUG SPACE EFF?*
DDI = MECH SPACE DRUG SPACE TOKEN
*EFF = BE? SPACE (SIGN | XFOLD)? SPACE MECH SPACE (ADV | XFOLD)?*
TOKEN = (DOSE | NUM | DRUG | PK | PERCENTAGE | XFOLD | CYP | MECH |
SIGN | TO | ENCLOSED | COMMA) / WORD
<WORD> = #'\\w+'
<PUNCT> = #'\\p{Punct}'
XFOLD = NUM SPACE '-'? SPACE 'fold' SPACE #'[a-z]+ses?'?
ROUTE = #'(?i)oral|intravenous'
UNIT = 'mg' | 'g'
DOSE = NUM SPACE UNIT SPACE INTERVAL?
INTERVAL = #'[a-z]+ce'? SPACE ADV | NUM SPACE 'times per' TIME | '/'
TIME
TIME = 'hour' | 'day' | 'week'
PERCENTAGE = NUM SPACE ('%' | #'per(\\s|\\-)?cent')
ENCLOSED = PAREN | SQBR
<PAREN> = #'\\(.*\\)'
<SQBR> = #'\\[.*\\]'
NUM = #'[0-9]+'
CYP = #'CYP[A-Z0-9]*'
ADV = #'[a-z]+ly'
<SPACE> = <#'\\s*'>
DRUG = ROUTE? SPACE
(#'(?i)\\b+\\w+a[z|st|p]ine?\\b+' |
#'(?i)\\b+\\w+[i|u]dine?\\b+' |
#'(?i)\\b+\\w+azo[l|n]e?\\b+' |
#'(?i)\\b+\\w+tamine?\\b+' |
#'(?i)\\b+\\w+zepam\\b+' |
#'(?i)\\b+\\w+zolam\\b+' |
#'(?i)\\b+\\w+[y|u]lline?\\b+' |
#'(?i)\\b+\\w+artane?\\b+' |
#'(?i)\\b+\\w+retine?\\b+' |
#'(?i)\\b+\\w+navir\\b+' |
#'(?i)\\b+\\w+ocaine\\b+' |
#'(?i)didanosine|tenofovir|vaprisol|conivaptan|amlodipine')
PK = MECH?
#'(?i)exposure|bioavailability|lower?(\\s|\\-)?clearance|AUC|half\\-life|Cmax'
MECH = #'[a-z]+e(s|d)'
SIGN = ADV | NEG
NEG = 'not' | #'un[a-z]*ed'
<TO> = 'to' | 'of'
BE = 'is' | 'are' | 'was' | 'were'
(* DO = 'does' | 'do' | 'did' *)
<COMMA> = ','
(* <OTHER> = 'as' | 'its' | 'by' *)
END = '.' "
Now consider the sentence: "Exposure to oral didanosine is
_significantly increased_ when coadministered with tenofovir disoproxil
fumarate."
My very first tag is this, which is perfect:
[:PHRASE
 [:DDIPK
  [:PK "Exposure"]
  "to"
  [:DRUG [:ROUTE "oral"] "didanosine"]
  [:EFF
   [:BE "is"]
   [:SIGN [:ADV "significantly"]]
   [:MECH "increased"]]]]
But now consider the same sentence with the word order slightly changed:
"Exposure to oral didanosine is _increased significantly_ when
coadministered with tenofovir disoproxil fumarate."
[:PHRASE
 [:DDIPK
  [:PK "Exposure"]
  "to"
  [:DRUG [:ROUTE "oral"] "didanosine"]
  [:EFF [:MECH "increased"]]]]
[:PHRASE
 [:TOKEN [:SIGN [:ADV "significantly"]]]
Shouldn't the EFF rule have caught the [:SIGN [:ADV "significantly"]]
tag? Why did it start a new PHRASE? The same thing happens with XFOLD.
If the 'x-fold' comes before the verb (2-fold increases) it shows up in the
DDIPK tag; otherwise (increases 2-fold) it appears after it in a new
PHRASE tag. I'm pretty sure the rule covers both cases, and in fact it
reaches the EFF rule, but it never matches the *(ADV | XFOLD)?* part. I am
presuming this is something quite simple...
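
One way to check whether this is an ambiguity in the grammar rather than a bug is insta/parses, which returns every parse instead of only the first one found. The sketch below uses a cut-down, hypothetical grammar (not the full one above) and assumes instaparse is on the classpath. If the trailing ADV? is free to match nothing, the adverb can just as well be picked up as a separate WORD, so more than one tree should come back.

```clojure
(require '[instaparse.core :as insta])

;; Hypothetical reduction of the EFF rule: a MECH followed by an optional ADV,
;; then any number of plain words. Not the full grammar from this message.
(def mini
  (insta/parser
    "S = EFF (SPACE WORD)*
     EFF = MECH SPACE ADV?
     MECH = #'[a-z]+e(s|d)'
     ADV = #'[a-z]+ly'
     WORD = #'\\w+'
     <SPACE> = <#'\\s*'>"))

;; insta/parses returns a lazy seq of all parses, not just the first.
(def all-parses (insta/parses mini "increased significantly"))

;; Print each candidate tree so the competing readings can be compared.
(doseq [p all-parses]
  (println p))
```

If several trees are printed, the parser is free to return any of them, and stating a preference with ordered choice (/), as suggested for TOKEN in the earlier reply, is one possible way to say which reading should win.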
As always, thanks in advance,
Jim
On 19/11/13 00:38, Mark Engelberg wrote:
Seems like there are (at least) two issues here.
1. You have a preference in mind that is not expressed by the
grammar. The parse that was output is a valid parse that fits all
the rules of the grammar. If you want the parser to prefer DRUGPK and
EFF interpretations over other interpretations, you need to specify
that, for example:
TOKEN = DRUGPK / EFF / (NUM | DRUG | PK | MECH | SIGN | ENCLOSED) /
WORD
2. Your rule for space is "<SPACE> = #'\\s+'", i.e., one or more
spaces. But the way your other rules utilize the SPACE rule, this
causes a problem. For example, you define DRUGPK as ending with SPACE
(and that ending SPACE is part of the DRUGPK token), but your S rule
also says that tokens (including DRUGPK) must be /followed/ by a
SPACE. So the DRUGPK rule will never be satisfied, because it is
including the ending whitespace as part of the token, and then there's
no whitespace following the token as required by the S rule. As
another example, your EFF rule begins "BE? SPACE SIGN? SPACE MECH" and
if the optional BE and SIGN aren't present, it's looking for two
mandatory spaces in a row.
I suggest changing your rule to "<SPACE> = #'\\s*'", i.e., zero or
more spaces. Or if you don't actually care about seeing the spaces in
your parse output, you can change it to "<SPACE> = <#'\\s*'>".
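
The second point can be reproduced with a stripped-down version of the EFF rule. This is only a sketch: eff-parser is a hypothetical helper, the two parsers it builds differ solely in the SPACE rule, and instaparse is assumed to be on the classpath.

```clojure
(require '[instaparse.core :as insta])

;; Hypothetical helper: same rule shape as "BE? SPACE SIGN? SPACE MECH",
;; parameterized only by the regex used for SPACE.
(defn eff-parser [space-regex]
  (insta/parser
    (str "EFF = BE? SPACE SIGN? SPACE MECH
          BE = 'is' | 'are' | 'was' | 'were'
          SIGN = #'[a-z]+ly'
          MECH = #'[a-z]+e(s|d)'
          <SPACE> = <#'" space-regex "'>")))

(def strict  (eff-parser "\\s+"))   ; one or more whitespace characters
(def lenient (eff-parser "\\s*"))   ; zero or more whitespace characters

;; With SIGN absent, the strict grammar needs two separate whitespace
;; matches between "is" and "increased", and the first one uses up the
;; only space there is.
(println (insta/failure? (strict "is increased")))                 ; expected: true
(println (insta/failure? (lenient "is increased")))                ; expected: false
(println (insta/failure? (lenient "is significantly increased")))  ; expected: false
```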
If you make both those changes, you'll get:
=> (parsePK "Exposure to didanosine is increased when coadministered
with tenofovir disoproxil fumarate [Table 5 and see Clinical
Pharmacokinetics (12.3, Tables 9 and 10)].")
[:S [:TOKEN [:DRUGPK [:PK "Exposure"] "to" [:DRUG "didanosine"] [:EFF
"is" [:MECH "increased"]]]] [:TOKEN "when"] [:TOKEN [:EFF [:MECH
"coadministered"]]] [:TOKEN "with"] [:TOKEN [:DRUG "tenofovir"]]
[:TOKEN "disoproxil"] [:TOKEN "fumarate"] [:TOKEN [:ENCLOSED "[Table 5
and see Clinical Pharmacokinetics (12.3, Tables 9 and 10)]"]] [:END "."]]
which I think is what you want.
If you have follow-up questions, I recommend posting to the instaparse
google group:
https://groups.google.com/forum/#!forum/instaparse
--Mark
P.S. I've been experimenting with a feature to make it easier to
express grammars where you find yourself inserting an optional
whitespace rule everywhere, documented here under:
https://github.com/Engelberg/instaparse/blob/master/docs/ExperimentalFeatures.md#auto-whitespace
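
A minimal sketch of what that looks like, assuming an instaparse version that supports the :auto-whitespace option. The tiny grammar is hypothetical and is not the PK grammar from this thread.

```clojure
(require '[instaparse.core :as insta])

;; No SPACE rule appears anywhere in the grammar; :auto-whitespace :standard
;; asks instaparse to allow optional whitespace between tokens.
(def words-and-numbers
  (insta/parser
    "SENTENCE = TOKEN+
     <TOKEN> = WORD | NUM
     WORD = #'[a-zA-Z]+'
     NUM = #'[0-9]+'"
    :auto-whitespace :standard))

;; Print the resulting tree for a whitespace-separated input.
(println (words-and-numbers "Exposure to didanosine 400 mg"))
```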
On Mon, Nov 18, 2013 at 5:47 AM, Jim - FooBar(); <jimpil1...@gmail.com> wrote:
Hi all,
I'm having a small problem composing smaller matches in
instaparse. Here is what I'm trying...just observe the bold bits:
(def parsePK
(insta/parser
"S = TOKEN (SPACE TOKEN PUNCT?)* END
TOKEN = (NUM | DRUG | PK | DRUGPK | MECH | SIGN | EFF |
ENCLOSED) / WORD
<WORD> = #'\\w+' | PUNCT
<PUNCT> = #'\\p{Punct}'
ENCLOSED = PAREN | SQBR
<PAREN> = #'\\[.*\\]'
<SQBR> = #'\\(.*\\)'
NUM = #'[0-9]+'
ADV = #'[a-z]+ly'
<SPACE> = #'\\s+'
DRUG = #'(?i)didanosine|quinidine|tenofovir'
PK = #'(?i)exposure|bioavailability|lower?[\\s|\\-]?clearance'
*DRUGPK = PK SPACE TO SPACE DRUG SPACE EFF? SPACE *
MECH = #'[a-z]+e(s|d)'
*EFF = BE? SPACE SIGN? SPACE MECH | BE? SPACE MECH SPACE ADV? *
SIGN = ADV | NEG
NEG = 'not'
<TO> = 'to' | 'of'
<BE> = 'is' | 'are' | 'was' | 'were'
END = '.' " ))
Running the parser returns the following. It seems that the 2
bigger composite rules DRUGPK & EFF are not recognised at all.
Only the smaller pieces are actually shown. I would expect [:TOKEN
[:DRUGPK "Exposure to didanosine is increased"]] and [:TOKEN
[:EFF "is increased"]] entries.
(pprint
(parsePK "Exposure to didanosine is increased when coadministered
with tenofovir disoproxil fumarate [Table 5 and see Clinical
Pharmacokinetics (12.3, Tables 9 and 10)]."))
[:S
[:TOKEN [:PK "Exposure"]]
" "
[:TOKEN "to"]
" "
[:TOKEN [:DRUG "didanosine"]]
" "
[:TOKEN "is"]
" "
[:TOKEN [:MECH "increased"]]
" "
[:TOKEN "when"]
" "
[:TOKEN [:MECH "coadministered"]]
" "
[:TOKEN "with"]
" "
[:TOKEN [:DRUG "tenofovir"]]
","
" "
[:TOKEN "disoproxil"]
" "
[:TOKEN "fumarate"]
[:END "."]]
Am I thinking about it the wrong way? Can anyone shed some light?
Many thanks in advance,
Jim
--
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
<mailto:clojure@googlegroups.com>
Note that posts from new members are moderated - please be patient
with your first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
<mailto:clojure%2bunsubscr...@googlegroups.com>
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To unsubscribe from this group and stop receiving emails from it,
send an email to clojure+unsubscr...@googlegroups.com
<mailto:clojure%2bunsubscr...@googlegroups.com>.
For more options, visit https://groups.google.com/groups/opt_out.