Hi Peter,

I would guess that you are seeing things like "SOFT" because your new dictionary 
includes a vocabulary that was not included in sno_rx_16ab.
I don't remember if OMIM (which has the 'SOFT' synonym) was included in 
sno_rx_16ab.  Probably not, OMIM is a more -specialized- vocabulary for 
genetics.

The term is only in the OMIM (and MTH) vocabularies in the 2016AB UMLS release.
   
https://uts.nlm.nih.gov/metathesaurus.html#C3542022;0;1;CUI;2016AB;WORD;CUI;*;  

The term is in SNOMED in UMLS 2020AA, but only with the expanded full-text 
synonym.  The concept still has the abbreviation from OMIM.
 
https://uts.nlm.nih.gov/metathesaurus.html#SHORT%20STATURE,%20ONYCHODYSPLASIA,%20FACIAL%20DYSMORPHISM,%20AND%20HYPOTRICHOSIS;0;1;TERM;2020AA;WORD;TERM;*;

As for finding terms in adjectives, the default parts of speech (pos) that are 
excluded from term lookup are:
VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,IN,LS,MD,PDT,POS,PP,PP$,PRP,PRP$,RP,TO,WDT,WP,WPS,WRB

You can see what these are here: 
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

You can override this list.  In your piper file, set the variable 
"exclusionTags":

// Default excluded parts of speech, plus various forms of adjective.
set exclusionTags="ADJ,JJ,JJR,JJS,VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,IN,LS,MD,PDT,POS,PP,PP$,PRP,PRP$,RP,TO,WDT,WP,WPS,WRB"

//  Annotate concepts based upon default algorithms.
add DefaultJCasTermAnnotator


You'll notice that I threw in 'ADJ' for good measure.  It should not break 
anything.  

I have modified this list many times for various projects.  In one I allow 
verbs for lookup.  For those notes the value of the added true positives 
outweighed the increased false positives.  In another I actually empty the 
entire list to allow everything (set exclusionTags="").  I did this because 
there is a lot of structured text in lists and tables, but the pos tagger is 
trying to resolve prose text.  The pos assigned on the structured text is all 
over the place, and terms are missed left and right.
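
To make the effect of the three settings concrete, here is a rough sketch in 
Python.  It is NOT the cTAKES implementation.  It assumes that exclusionTags 
simply decides whether a token may trigger a dictionary lookup, and the 
function names and the pos tags on the sample sentence are made up for 
illustration.

```python
# Sketch only: imitates the idea behind exclusionTags, not the real cTAKES code.
# The default list is copied from the message above.
DEFAULT_EXCLUSION_TAGS = (
    "VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,IN,LS,MD,PDT,POS,PP,PP$,"
    "PRP,PRP$,RP,TO,WDT,WP,WPS,WRB"
)


def parse_tags(value):
    """Split a comma-separated tag list into a set; "" excludes nothing."""
    return {tag.strip() for tag in value.split(",") if tag.strip()}


def lookup_candidates(tagged_tokens, exclusion_tags):
    """Return the words whose pos tag is not excluded (hypothetical helper)."""
    excluded = parse_tags(exclusion_tags)
    return [word for word, pos in tagged_tokens if pos not in excluded]


# Assumed tagging of "The soft tissue was swollen"; a real tagger may differ.
tokens = [
    ("The", "DT"),
    ("soft", "JJ"),
    ("tissue", "NN"),
    ("was", "VBD"),
    ("swollen", "VBN"),
]

# Default list: the adjective "soft" is still a lookup candidate.
default_result = lookup_candidates(tokens, DEFAULT_EXCLUSION_TAGS)

# Default list plus adjective tags: "soft" is no longer looked up.
adjective_result = lookup_candidates(
    tokens, "ADJ,JJ,JJR,JJS," + DEFAULT_EXCLUSION_TAGS
)

# Empty list: every token is a candidate, whatever the tagger said.
everything_result = lookup_candidates(tokens, "")
```

With the assumed tags, the default list leaves "soft" and "tissue" as 
candidates, the extended list leaves only "tissue", and the empty list leaves 
all five words.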

So ... last but definitely not least, case-sensitivity.
I started working on this a while ago, but right now it sits unfinished.

There is an additional table in the dictionary database, in which all synonyms 
are upper-case.
This second table is created from synonyms that exist in the UMLS as all 
upper-case.
The first, "classic" table is created using ONLY synonyms from the UMLS that 
are lower and/or mixed case.

When the annotator engine iterates over the text, it checks one table (classic) 
or the other (caps) depending upon the case of the text in the note.
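
A rough sketch of that idea in Python, with the caveat that the real work is 
unfinished and none of these names exist in cTAKES.  The two dicts stand in 
for the two database tables, the classic entry uses a placeholder CUI, and 
storing classic synonyms in lower case is my assumption for the example.

```python
# Sketch only: two in-memory dicts stand in for the two dictionary tables.
# Table names, function names and the placeholder CUI are hypothetical.

# "Classic" table: synonyms that are lower and/or mixed case in the UMLS,
# assumed here to be stored in lower case.
CLASSIC_TABLE = {
    "wound": "PLACEHOLDER_CUI",
}

# "Caps" table: synonyms that exist in the UMLS as all upper-case.
CAPS_TABLE = {
    "SOFT": "C3542022",
}


def choose_table(text):
    """Pick the caps table for all-upper-case note text, else the classic one."""
    return CAPS_TABLE if text.isupper() else CLASSIC_TABLE


def lookup(text):
    """Return the CUI for the text from the table matching its case, or None."""
    table = choose_table(text)
    key = text if table is CAPS_TABLE else text.lower()
    return table.get(key)


upper_hit = lookup("SOFT")    # all caps in the note: caps table is checked
lower_miss = lookup("soft")   # lower case in the note: classic table only
mixed_hit = lookup("Wound")   # mixed case: classic table, lower-cased key
```

In this sketch "SOFT" written in capitals resolves to the syndrome, while 
"soft" in "soft tissue" finds nothing, which is the behavior you were 
expecting.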

It sounds like minor work, but it requires a new engine, new dictionary, and 
new dictionary creator.  None of this is difficult, but it requires time.

Anyway, I hope that some of this helps.

Sean


________________________________________
From: Peter Abramowitsch <pabramowit...@gmail.com>
Sent: Saturday, August 1, 2020 11:35 PM
To: dev@ctakes.apache.org
Subject: Re: With custom dictionary - over-eager resolution of acronyms 
[EXTERNAL]

Hi Jeff thanks for your suggestions,

I spent some time in the script file and sure enough,  my 2020 UMLS
extraction actually has these two entries:

INSERT INTO CUI_TERMS VALUES(3542022,0,1,'soft','soft')
INSERT INTO PREFTERM VALUES(3542022,'SHORT STATURE, ONYCHODYSPLASIA, FACIAL
DYSMORPHISM, AND HYPOTRICHOSIS SYNDROME')

It's unbelievable.  The UMLS entry has got to be wrong, or I'm missing
something that says it only applies (as an acronym) if it's capitalized.

In sno_rx  there is neither a CUI 3542022 nor the definition of "soft" as a
solitary word, nor even a mention of ONYCHODYSPLASIA or HYPOTRICHOSIS

In any case, I would have thought that ctakes would only create an event
mention from a term tagged as NN or in an NP slot, not an ADJ as in "soft tissue"

Anyway  Thanks!  Now I will keep poking around.


Peter

On Sat, Aug 1, 2020 at 5:06 PM Jeffrey Miller <jeff...@gmail.com> wrote:

> Sorry, I meant suggest to search for 'soft' in the dictionary file not
> 'short'
>
> grep -i ,\'soft\', *.script
>
> On Sat, Aug 1, 2020 at 7:47 PM Jeffrey Miller <jeff...@gmail.com> wrote:
>
> > Hi Peter,
> >
> > To my knowledge, there isn't any drastic difference in the behavior of
> > the dictionary gui creator and the way the sno_rx dictionary was created.
> > I originally thought there was, but I realized the difference was that I
> > had not installed all of UMLS to my machine (just the vocabularies I was
> > interested in) and I was missing synonyms. The first thing I would check,
> > are you able to find a matching entry in the .script file for your ctakes
> > dictionary when you do this:
> >
> > grep -i ,\'short\', *.script
> >
> > That would confirm whether or not you have a term in your dictionary made
> > up only of 'short' and whether it mapped to the CUI equal to "SHORT
> > STATURE, ONYCHODYSPLASIA, FACIAL DYSMORPHISM, AND HYPOTRICHOSIS SYNDROME".
> > If it's not in there, something else is going on. You could do the same
> > for 'bed'.
> >
> > If not, another thing I might check is that I noticed you are using
> > the OverlapJCasTermAnnotator in your prior e-mail. I don't have much
> > experience with it, and I don't think it should cause this behavior, but
> > I wonder if that could be making the difference (as compared
> > to DefaultJCasTermAnnotator).
> >
> > Jeff
> >
> > On Sat, Aug 1, 2020 at 5:27 PM Peter Abramowitsch <pabramowit...@gmail.com>
> > wrote:
> >
> >>
> >> Hi All
> >>
> >> Having created a new dictionary from the 2020AA UMLS and added Genes and
> >> Receptors to the dictionary-creator's default selections, I have a
> >> curious problem where cTakes now assigns the most bizarre acronyms to
> >> ordinary words used in POS contexts where it shouldn't find
> >> <XXX>Mentions.
> >>
> >> Here are two examples:
> >>
> >> 1.   soft (in "soft tissue...")
> >> becomes   "SHORT STATURE, ONYCHODYSPLASIA, FACIAL DYSMORPHISM, AND
> >> HYPOTRICHOSIS SYNDROME",
> >>
> >> 2.   bed (in "The wound bed was...")
> >> becomes  "BORNHOLM EYE DISEASE"
> >>
> >> I have not changed the TermConsumer type in the descriptor XML.
> >>
> >> Are the DictionaryCreator's defaults, the equivalent to the default
> >> sno_rx that's delivered with the app?
> >>
> >> Attached is the vocab subsets list I used
> >>
> >>
> >> Peter
> >>
> >>
> >>
>
