dictionary lookup config for best F1 measure [was RE: cTakes Annotation Comparison

2015-01-09 Thread Masanz, James J.
Sean (or others), 

Of the various configuration options described below, which values/choices 
would you recommend for best F1 measure for something like the shared clef 2013 
task?
https://sites.google.com/site/shareclefehealth/

I'm looking for something that doesn't have to be the best speed-wise, but that 
is the recommended for optimizing F1 measure.

Regards,
James 

-Original Message-
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] 
Sent: Friday, December 19, 2014 11:55 AM
To: dev@ctakes.apache.org; kim.eb...@imatsolutions.com
Subject: RE: cTakes Annotation Comparison

Well, I guess that it is time for me to speak up …

I must say that I’m happy that people are showing interest in the fast lookup.  
I am also happy (sort of) that some concerns are being raised – and that there 
is now community participation in my little toy.  I  have some concerns about 
what people are reporting.  This does not coincide with what I have seen at 
all.  Yesterday I started (without knowing this thread existed) testing a 
bare-minimum pipeline for CUI extraction.  It is just the stripped-down 
Aggregate with only: segment, tokens, sentences, POS, and the fast lookup.  The 
people at Children’s wanted to know how fast we could get.  1,196 notes in 
under 90 seconds on my laptop with over 210,000 annotations, which is 175/note. 
 After reading the thread I decided to run the fast lookup with several 
configurations.  I also ran the default for 10.5 hours.  I am comparing the 
annotations from each system against the human annotations that we have, and I 
will let everybody know what I find – for better or worse.

The fast lookup does not (out-of-box) do the exact same thing as the default.  
Some things can be configured to make it more closely approximate the default 
dictionary.

1.   Set the minimum annotation span length to 2 (default is 3).  This is 
in desc/[ae]/UmlsLookupAnnotator.xml : line #78.  The annotator should then 
pick up text like “CT” and improve recall, but it will hurt precision.

2.   Set the Lookup Window to LookupWindowAnnotation.  This is in 
desc/[ae]/UmlsLookupAnnotator.xml: lines #65 & #93.   The LookupWindowAnnotator 
will need to be added to the aggregate pipeline 
AggregatePlaintextFastUMLSProcesor.xml  lines #50 & #172.  This will narrow the 
lookup window and may increase precision, but (in my experience) reduces recall.

3.   Allow the rough identification of overlapping spans.  The default 
dictionary will often identify text like “metastatic colorectal carcinoma” when 
that text does not actually exist anywhere in UMLS.  It basically ignores 
“colorectal” and gives the whole span the CUI for “metastatic carcinoma”.  In 
this case that is arguably a good thing; in many others, arguably not.  There 
is a class ... lookup2.ae.OverlapJCasTermAnnotator.java that will do the same 
thing.  You can create a new desc/[ae]/*Annotator.xml or just change the 
annotator implementation in desc/[ae]/UmlsLookupAnnotator.xml line #25.  I will 
check in a new desc xml (sorry; thought I had) because there are 2 parameters 
unique to OverlapJCasTermAnnotator.

4.   You can play with the OverlapJCasTermAnnotator parameters 
“consecutiveSkips” and “totalTokenSkips”.  These control just how lenient you 
want the overlap tagging to be.

5.   Create a new dictionary database.  There is a (bit messy) 
DictionaryTool in sandbox that will let you dump whatever you do or do not want 
from UMLS into a database.  It will also help you clean up or –select- stored 
entries as well.  There is a lot of garbage in the default dictionary database: 
repeated terms with caps/no caps (“Cancer”,”cancer”), text with metadata 
(“cancer [finding]”) and text that just clutters (“PhenX: entry for cancer”, 
“1”, “2”).  The fast lookup database should have most of the Snomed and RxNorm 
terms (and synonyms) of interest, but you could always make a new database that 
is much more inclusive.

The main key to the speed of the fast dictionary lookup is actually … the key.  
It is the way that the database is indexed and the lookup by “rare” word 
instead of “first” word.  Everything else can be changed around it and it 
should still be a faster version.
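
The rare-word idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the cTAKES implementation: the helper names (build_rare_word_index, lookup) and the toy term list are hypothetical, tokens are lowercased and split on whitespace, and "rare" simply means the token that occurs least often across the dictionary terms.

```python
from collections import Counter, defaultdict

def build_rare_word_index(terms):
    """Index each term under its least frequent token (hypothetical helper)."""
    tokenized = [tuple(t.lower().split()) for t in terms]
    counts = Counter(tok for toks in tokenized for tok in toks)
    index = defaultdict(list)
    for toks in tokenized:
        # Position of the rarest token; ties go to the earliest position.
        rare_pos = min(range(len(toks)), key=lambda i: counts[toks[i]])
        index[toks[rare_pos]].append((toks, rare_pos))
    return index

def lookup(index, text):
    """Return (start_token, end_token, term) for exact contiguous matches."""
    tokens = text.lower().split()
    hits = []
    for i, tok in enumerate(tokens):
        # Only tokens that are the rare word of some term trigger a comparison.
        for toks, rare_pos in index.get(tok, ()):
            start = i - rare_pos
            end = start + len(toks)
            if start >= 0 and end <= len(tokens) and tuple(tokens[start:end]) == toks:
                hits.append((start, end, " ".join(toks)))
    return hits

# Toy dictionary, made up for the example.
terms = ["colon cancer", "lung cancer", "breast cancer", "cancer of the cecum"]
index = build_rare_word_index(terms)
hits = lookup(index, "history of colon cancer and cancer of the cecum")
print(hits)
```

In this toy dictionary the common word "cancer" is never an index key, so seeing it in the text triggers no candidate comparisons at all. An index keyed on the first word would instead check "cancer of the cecum" at every occurrence of "cancer".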

As for the false positives like “Today”, that will always be a problem until we 
have disambiguation.  The lookup is basically a glorified grep.

Sean

From: Chen, Pei [mailto:pei.c...@childrens.harvard.edu]
Sent: Friday, December 19, 2014 10:43 AM
To: dev@ctakes.apache.org; kim.eb...@imatsolutions.com
Subject: RE: cTakes Annotation Comparison

Also check out stats that Sean ran before releasing the new component on:
http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-dictionary-lookup-fast/doc/DictionaryLookupStats.docx
From the evaluation and experience, the new lookup algorithm should be a huge 
improvement in terms of both speed and accuracy.
This is very different from what Bruce mentioned…  I’m sure Sean will chime in 
here.
(The old dictionary lookup

RE: dictionary lookup config for best F1 measure [was RE: cTakes Annotation Comparison

2015-01-09 Thread Finan, Sean
Hi James,
Great question.  In truth, you may need to run a few times to find out.  Doing 
that with a full pipeline would be tedious, but there is a descriptor in 
clinical-pipeline named CuisOnlyPlaintextUMLSProcessor.xml that will only 
obtain Umls cuis.  It runs ~50,000 notes per hour on my laptop as-is, so I 
suggest that you test with that ae.  It has lvg commented out by default (for 
speed).  Adding lvg will increase the runtime, but it also will (as you know) 
find a few additional terms.   You can try a few configurations without it and 
then the best option with it.  If you want to test the default dictionary 
lookup then you can certainly swap the referenced lookup xmls.
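
For comparing runs, a small scoring helper is enough. The Python sketch below assumes each annotation is reduced to a (begin, end, CUI) tuple and that only exact matches count as true positives; the official share/clef scoring may differ (for example with relaxed span matching), and the offsets and CUI strings here are made-up placeholders.

```python
def precision_recall_f1(gold, predicted):
    """Score exact (begin, end, cui) matches between two annotation sets."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # true positives: identical span and CUI
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Placeholder annotations, not real data.
gold = {(0, 12, "CUI_A"), (20, 22, "CUI_B")}
system = {(0, 12, "CUI_A"), (6, 12, "CUI_C"), (30, 35, "CUI_D")}

p, r, f1 = precision_recall_f1(gold, system)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Running the same helper over the output of each configuration makes the precision/recall trade-offs discussed in this thread directly comparable.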

Changes to the fast dictionary configuration are made in two places:
1.  The main descriptor ...-fast/desc/analysis_engine/UmlsLookupAnnotator.xml
2.  The resource (dictionary) configuration file 
resources/.../fast/cTakesHsql.xml

A few suggestions, in order of impact:
1.  I am guessing that the annotations in clef are human annotated with 
longest-length spans only.  In other words, "colon cancer" instead of "colon 
cancer" and "cancer".  To best approximate this style of annotation, edit 
cTakesHsql.xml and change the selected term consumer implementation.  By 
default it is DefaultTermConsumer (go figure), but you will want to use the 
commented-out PrecisionTermConsumer.  As the comment above it in cTakesHsql.xml 
indicates, "DefaultTermConsumer will persist all spans.  PrecisionTermConsumer 
will only persist only the longest overlapping span of any semantic group."  
Doing this should increase precision, and depending upon how "good" the 
annotations are it should not greatly change recall.

2. Just for kicks, try using SemanticCleanupTermConsumer.  It may slightly 
increase precision, but it also may decrease recall.  Hopefully it doesn't do 
much at all (PrecisionTermConsumer and proper semantic typing in the dictionary 
should suffice without this term consumer).

3. Especially for task 2 (acronyms & abbreviations), you should try a run with 
minimumSpan in UmlsLookupAnnotator.xml set to 2.   This changes 
the minimum allowable span of a term.  The default is 3 to increase precision 
on acronyms & abbreviations, but decreasing to 2 may improve recall on the 
same.   The dictionary is not built with anything below 2 characters.

4.  On that note (character length), if task 1 does not include acronyms & 
abbreviations, then you can try increasing the minimum span length above 3 and 
see if there is a good increase in precision without a significant decrease in 
recall.

5.  Try a few runs with overlapping spans in addition to exact matches.  To do 
this use the OverlapJCasTermAnnotator instead of the DefaultJCasTermAnnotator 
annotator implementation.  DefaultJCasTermAnnotator is specified in 
UmlsLookupAnnotator.xml, but I will check in a descriptor for overlap matching. 
 There are additional parameters for that option, but I'll email them after I 
check in.

6.  By default the new lookup uses Sentence as the lookup window.  I did this 
for three reasons: 1. Not all terms are within Noun Phrases, 2. Some Noun 
Phrases overlapped, causing repeated lookups (in my 3.0 candidate trials), and 
3. Not all cTakes Noun Phrases are accurate.  Because the lookup is fast, using 
a full Sentence for lookup doesn't seem to hurt much.  However, you can always 
switch it back to see if precision is increased enough to warrant the decrease 
in recall.  This is changed in UmlsLookupAnnotator.xml.
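
The longest-span behavior described in suggestion 1 can be pictured with a short Python sketch. It is one possible reading of "only the longest overlapping span of any semantic group", not the PrecisionTermConsumer code: it drops an annotation only when a strictly longer annotation of the same semantic group fully contains it. The function name, tuple layout, group labels and CUI strings are all hypothetical.

```python
def keep_longest_spans(annotations):
    """Filter (begin, end, semantic_group, cui) tuples down to the longest spans.

    An annotation is dropped when a strictly longer annotation of the same
    semantic group fully covers it.
    """
    kept = []
    # Visit longest spans first so shorter ones are compared against them.
    for ann in sorted(annotations, key=lambda a: (-(a[1] - a[0]), a[0])):
        begin, end, group, _ = ann
        covered = any(
            k[2] == group
            and k[0] <= begin
            and end <= k[1]
            and (k[1] - k[0]) > (end - begin)
            for k in kept
        )
        if not covered:
            kept.append(ann)
    return sorted(kept)

# "colon cancer" with the nested hits "cancer" and "colon" (placeholder CUIs).
found = [
    (0, 12, "DISORDER", "CUI_A"),  # colon cancer
    (6, 12, "DISORDER", "CUI_B"),  # cancer
    (0, 5, "ANATOMY", "CUI_C"),    # colon
]
result = keep_longest_spans(found)
print(result)
```

Here "cancer" is dropped because "colon cancer" covers it within the same group, while "colon" survives because it belongs to a different group. Two annotations on the identical span are both kept, since neither is strictly longer.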

I have run my own tests with the various setups, but I don't want to adversely 
influence what you run just in case the trends with the share/clef annotations 
differ.

Sean

RE: dictionary lookup config for best F1 measure [was RE: cTakes Annotation Comparison : Span Overlap addendum

2015-01-09 Thread Finan, Sean
Hi James,

I've checked in a descriptor for the UmlsOverlapLookupAnnotator in fast/desc/ . 
 I also checked in a modification for the CuisOnlyPlaintextUMLSProcessor.xml 
with the Overlap annotator commented out as an option.

As an example of its difference from the Default, I ran the example colon 
cancer document from thyme and it finds the following:
"blood with stool" > C1321898: blood in stool
"polyps, all adenomatous" > C0206677: adenomatous polyps
"lesions in his liver" > C0577053: lesion of liver
"PAST MEDICAL/SURGICAL HISTORY" > C0262926: medical history , C0455458: past 
medical history
"MEDICAL/SURGICAL HISTORY" > C0262926: medical history
"tonsils and adenoids" > C0580788: tonsil and adenoid structure ; this is also 
found without overlap, but overlap finds it a second time
"torn left Achilles tendon" > C0263970: rupture of Achilles tendon
"ankle scar on left" > C0230448: structure of left ankle *
"prostate, no masses palpable" > C0577252: prostate palpable
"cancer of the cecum" > C0153437: malignant neoplasm of cecum  ; this is also 
found without overlap, but overlap finds it a second time
"complications of anesthesia" > C0392008: complication of anesthesia  ; this is 
also found without overlap, but overlap finds it a second time

* One important item is that the overlap annotator understands discontiguous 
spans.  There is, in fact, a ...lookup2.textspan.MultiTextSpan class.  So, for 
items such as "ankle scar on left" the annotator is actually annotating only 
"ankle ... left" but it has to be stored in the cas as one big happy albeit 
underspecified span.

I think that I mentioned in the previous email that the Overlap annotator has a 
couple of extra parameters.  They are called "totalTokenSkips" and 
"consecutiveTokenSkips".  The names are pretty self-explanatory: the algorithm 
will allow up to a maximum total number of tokens to be skipped, consecutive or 
not, as long as no single run of consecutive skipped tokens is longer than a 
second maximum.  For instance, total=4 and consecutive=2 (the defaults) will 
match "this kinda sorta should maybe hopefully match" with "this should match". 
 This is pretty lenient, but seems to work in my tests.  "this kinda-sorta 
should ..." will not match ... though maybe '-' should be a special case.  Let 
me know what you think.
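
The two skip limits can be illustrated with a small Python sketch. It is a simplification under stated assumptions and not the OverlapJCasTermAnnotator algorithm: the match is anchored on the first term token rather than a rare word, skips are taken greedily, the function and argument names are hypothetical, and the hyphen in "kinda-sorta" is assumed to be tokenized as its own token.

```python
def overlap_match(term_tokens, text_tokens, start, total_skips=4, consecutive_skips=2):
    """Return the exclusive end index if the term matches at start, else None.

    Text tokens between consecutive term tokens may be skipped, as long as no
    single run exceeds consecutive_skips and the sum stays within total_skips.
    """
    if not term_tokens or start >= len(text_tokens):
        return None
    if text_tokens[start] != term_tokens[0]:
        return None
    pos = start + 1
    used_total = 0
    for wanted in term_tokens[1:]:
        run = 0  # skips since the last matched term token
        while pos < len(text_tokens) and text_tokens[pos] != wanted:
            run += 1
            used_total += 1
            if run > consecutive_skips or used_total > total_skips:
                return None
            pos += 1
        if pos >= len(text_tokens):
            return None  # ran out of text before finding the token
        pos += 1
    return pos

term = "this should match".split()
lenient = overlap_match(term, "this kinda sorta should maybe hopefully match".split(), 0)
hyphen = overlap_match(term, "this kinda - sorta should match".split(), 0)
print(lenient, hyphen)
```

With the defaults the first sentence matches using two runs of two skips each (four in total). The hyphenated variant fails because "kinda", "-" and "sorta" form a run of three consecutive skips.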

Enjoy,
Sean

