Hi James,
I've checked in a descriptor for the UmlsOverlapLookupAnnotator in fast/desc/.
I also checked in a modification for the CuisOnlyPlaintextUMLSProcessor.xml
with the Overlap annotator commented out as an option:
<delegateAnalysisEngine key="DictionaryLookupAnnotatorDB">
  <!-- UmlsLookupAnnotator only finds exact span matches -->
  <import location="../../../ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsLookupAnnotator.xml"/>
  <!-- UmlsOverlapLookupAnnotator finds exact span matches and overlapping span matches -->
  <!--<import location="../../../ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsOverlapLookupAnnotator.xml"/>-->
</delegateAnalysisEngine>
As an example of how it differs from the default, I ran the example colon
cancer document from THYME; it finds the following:
"blood with stool" > C1321898: blood in stool
"polyps, all adenomatous" > C0206677: adenomatous polyps
"lesions in his liver" > C0577053: lesion of liver
"PAST MEDICAL/SURGICAL HISTORY" > C0262926: medical history , C0455458: past
medical history
"MEDICAL/SURGICAL HISTORY" > C0262926: medical history
"tonsils and adenoids" > C0580788: tonsil and adenoid structure ; this is also
found without overlap, but overlap finds it a second time
"torn left Achilles tendon" > C0263970: rupture of Achilles tendon
"ankle scar on left" > C0230448: structure of left ankle *
"prostate, no masses palpable" > C0577252: prostate palpable
"cancer of the cecum" > C0153437: malignant neoplasm of cecum ; this is also
found without overlap, but overlap finds it a second time
"complications of anesthesia" > C0392008: complication of anesthesia ; this is
also found without overlap, but overlap finds it a second time
* One important item is that the overlap annotator understands discontiguous
spans. There is, in fact, a ...lookup2.textspan.MultiTextSpan class. So, for
items such as "ankle scar on left" the annotator is actually annotating only
"ankle ... left", but it has to be stored in the CAS as one big happy albeit
underspecified span.
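For illustration only (this is not the cTAKES code; the class and method names below are made up), collapsing the matched pieces of a discontiguous span into the single covering span that gets stored:

```java
import java.util.Arrays;
import java.util.List;

public class CoveringSpanSketch {

   // Collapse matched pieces (begin/end offset pairs) into one covering span:
   // smallest begin to largest end. Everything in between is included, which
   // is why the stored span is "underspecified".
   public static int[] coveringSpan(List<int[]> pieces) {
      int begin = Integer.MAX_VALUE;
      int end = Integer.MIN_VALUE;
      for (int[] piece : pieces) {
         begin = Math.min(begin, piece[0]);
         end = Math.max(end, piece[1]);
      }
      return new int[]{begin, end};
   }

   public static void main(String[] args) {
      // "ankle scar on left": matched pieces are "ankle" (0-5) and "left" (14-18)
      List<int[]> pieces = List.of(new int[]{0, 5}, new int[]{14, 18});
      System.out.println(Arrays.toString(coveringSpan(pieces)));   // [0, 18]
   }
}
```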
I think that I mentioned in the previous email that the Overlap annotator has a
couple of extra parameters. They are called "totalTokenSkips" and
"consecutiveTokenSkips". The names are pretty self-explanatory: the algorithm
will allow a maximum total number of tokens to be skipped, as long as no more
than a certain number of them are skipped consecutively. For instance, total=4
and consecutive=2 (the defaults) will match "this kinda sorta should maybe
hopefully match" with "this should match".
This is pretty lenient, but seems to work in my tests. "this kinda-sorta
should ..." will not match ... though maybe '-' should be a special case. Let
me know what you think.
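To make the two limits concrete, here is a minimal sketch of the matching rule as described above. This is not the actual annotator code, just an illustration; it assumes the text window lines up with the term's first token, and any extra window token counts as a skip:

```java
import java.util.Arrays;
import java.util.List;

public class SkipMatchSketch {

   // Returns true if the term's tokens appear in order within the text window,
   // skipping at most totalSkips tokens overall and at most consecutiveSkips
   // tokens in a row.
   public static boolean matches(List<String> window, List<String> term,
                                 int totalSkips, int consecutiveSkips) {
      int t = 0;        // index of the next term token to match
      int skipped = 0;  // total tokens skipped so far
      int run = 0;      // length of the current consecutive skip run
      for (String token : window) {
         if (t < term.size() && token.equals(term.get(t))) {
            t++;
            run = 0;
         } else {
            skipped++;
            run++;
            if (skipped > totalSkips || run > consecutiveSkips) {
               return false;
            }
         }
      }
      return t == term.size();
   }

   public static void main(String[] args) {
      List<String> window = Arrays.asList(
            "this", "kinda", "sorta", "should", "maybe", "hopefully", "match");
      List<String> term = Arrays.asList("this", "should", "match");
      // total=4, consecutive=2 (the defaults): two 2-token skip runs are allowed
      System.out.println(matches(window, term, 4, 2));   // true
      // consecutive=1 forbids the back-to-back skips
      System.out.println(matches(window, term, 4, 1));   // false
   }
}
```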
Enjoy,
Sean
-----Original Message-----
From: Masanz, James J. [mailto:[email protected]]
Sent: Friday, January 09, 2015 3:57 PM
To: '[email protected]'
Subject: dictionary lookup config for best F1 measure [was RE: cTakes
Annotation Comparison]
Sean (or others),
Of the various configuration options described below, which values/choices
would you recommend for best F1 measure for something like the shared clef 2013
task?
https://sites.google.com/site/shareclefehealth/
I'm looking for something that doesn't have to be the best speed-wise, but that
is recommended for optimizing F1 measure.
Regards,
James
-----Original Message-----
From: Finan, Sean [mailto:[email protected]]
Sent: Friday, December 19, 2014 11:55 AM
To: [email protected]; [email protected]
Subject: RE: cTakes Annotation Comparison
Well, I guess that it is time for me to speak up …
I must say that I’m happy that people are showing interest in the fast lookup.
I am also happy (sort of) that some concerns are being raised – and that there
is now community participation in my little toy. I have some concerns about
what people are reporting. This does not coincide with what I have seen at
all. Yesterday I started (without knowing this thread existed) testing a
bare-minimum pipeline for CUI extraction. It is just the stripped-down
Aggregate with only: segment, tokens, sentences, POS, and the fast lookup. The
people at Children’s wanted to know how fast we could get. 1,196 notes in
under 90 seconds on my laptop with over 210,000 annotations, which is 175/note.
After reading the thread I decided to run the fast lookup with several
configurations. I also ran the default for 10.5 hours. I am comparing the
annotations from each system against the human annotations that we have, and I
will let everybody know what I find – for better or worse.
The fast lookup does not (out-of-box) do the exact same thing as the default.
Some things can be configured to make it more closely approximate the default
dictionary.
1. Set the minimum annotation span length to 2 (default is 3). This is
in desc/[ae]/UmlsLookupAnnotator.xml : line #78. The annotator should then
pick up text like “CT” and improve recall, but it will hurt precision.
2. Set the Lookup Window to LookupWindowAnnotation. This is in
desc/[ae]/UmlsLookupAnnotator.xml: lines #65 & #93. The LookupWindowAnnotator
will need to be added to the aggregate pipeline
AggregatePlaintextFastUMLSProcessor.xml lines #50 & #172. This will narrow the
lookup window and may increase precision, but (in my experience) reduces recall.
3. Allow the -rough- identification of overlapping spans. The default
dictionary will often identify text like “metastatic colorectal carcinoma” when
that text actually does not exist anywhere in UMLS. It basically ignores
“colorectal” and gives the whole span the CUI for “metastatic carcinoma”. In
this case it is arguably a good thing; in many others, arguably not so much.
There is a class ...lookup2.ae.OverlapJCasTermAnnotator.java that will
do the same thing. You can create a new desc/[ae]/*Annotator.xml or just
change the <annotatorImplementationName> in desc/[ae]/UmlsLookupAnnotator.xml
line #25. I will check in a new desc xml (sorry; thought I had) because there
are 2 parameters unique to OverlapJCasTermAnnotator.
4. You can play with the OverlapJCasTermAnnotator parameters
“consecutiveSkips” and “totalTokenSkips”. These control just how lenient you
want the overlap tagging to be.
5. Create a new dictionary database. There is a (bit messy)
DictionaryTool in sandbox that will let you dump whatever you do or do not want
from UMLS into a database. It will also help you clean up or -select- stored
entries as well. There is a lot of garbage in the default dictionary database:
repeated terms with caps/no caps (“Cancer”,”cancer”), text with metadata
(“cancer [finding]”) and text that just clutters (“PhenX: entry for cancer”,
“1”, “2”). The fast lookup database should have most of the Snomed and RxNorm
terms (and synonyms) of interest, but you could always make a new database that
is much more inclusive.
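As a reference for item 4, this is roughly what overriding the two overlap parameters looks like inside a UIMA descriptor's configurationParameterSettings block. This is a hypothetical sketch: the parameter names are taken from this thread, and the exact names and defaults in the checked-in descriptor may differ.

```
<configurationParameterSettings>
   <!-- Hypothetical sketch; names are from this thread, the checked-in
        descriptor may differ. -->
   <nameValuePair>
      <name>totalTokenSkips</name>
      <value><integer>4</integer></value>
   </nameValuePair>
   <nameValuePair>
      <name>consecutiveSkips</name>
      <value><integer>2</integer></value>
   </nameValuePair>
</configurationParameterSettings>
```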
The main key to the speed of the fast dictionary lookup is actually … the key.
It is the way that the database is indexed and the lookup by “rare” word
instead of “first” word. Everything else can be changed around it and it
should still be a faster version.
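As an illustration of the rare-word idea (the term, frequencies, and names below are made up; this is not the actual dictionary code): each multi-word term is indexed under its rarest token, so a common token like "of" never triggers a lookup, while a rare token like "cecum" pulls in only the handful of terms that contain it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RareWordIndexSketch {

   // Pick the token of the term with the lowest corpus frequency.
   public static String rarestToken(String term, Map<String, Integer> freq) {
      String rarest = null;
      int best = Integer.MAX_VALUE;
      for (String token : term.split("\\s+")) {
         int count = freq.getOrDefault(token, 0);
         if (count < best) {
            best = count;
            rarest = token;
         }
      }
      return rarest;
   }

   public static void main(String[] args) {
      // Hypothetical corpus frequencies: "of" is very common, "cecum" is rare.
      Map<String, Integer> freq = Map.of("malignant", 900, "neoplasm", 1200,
                                         "of", 500000, "cecum", 40);
      // Index terms by rare word: a sentence token only triggers a lookup if
      // some term has it as its rarest token.
      Map<String, List<String>> index = new HashMap<>();
      String term = "malignant neoplasm of cecum";
      index.computeIfAbsent(rarestToken(term, freq), k -> new ArrayList<>())
           .add(term);
      System.out.println(index);   // {cecum=[malignant neoplasm of cecum]}
   }
}
```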
As for the false positives like “Today”, that will always be a problem until we
have disambiguation. The lookup is basically a glorified grep.
Sean
From: Chen, Pei [mailto:[email protected]]
Sent: Friday, December 19, 2014 10:43 AM
To: [email protected]; [email protected]
Subject: RE: cTakes Annotation Comparison
Also check out stats that Sean ran before releasing the new component on:
http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-dictionary-lookup-fast/doc/DictionaryLookupStats.docx
From the evaluation and experience, the new lookup algorithm should be a huge
improvement in terms of both speed and accuracy.
This is very different from what Bruce mentioned… I’m sure Sean will chime in
here.
(The old dictionary lookup is essentially obsolete now, plagued with
bugs/issues as you mentioned.)
--Pei
From: Kim Ebert [mailto:[email protected]]
Sent: Friday, December 19, 2014 10:25 AM
To: [email protected]
Subject: Re: cTakes Annotation Comparison
Guergana,
I'm curious about the number of records in your gold standard sets, and whether
your gold standard set was run through a long-running cTAKES process. I know at
some point we fixed a bug in the old dictionary lookup that caused the
permutations to become corrupted over time. Typically this isn't seen in the
first few records, but over time, as patterns are used, the permutations become
corrupted. This caused documents that were fed through cTAKES more than once to
have fewer codes returned than the first time.
For example, if a permutation of 4,2,3,1 was found, the permutation would be
corrupted to be 1,2,3,4. It would no longer be possible to detect permutations
of 4,2,3,1 until cTAKES was restarted. We got the fix in after the cTAKES 3.2.0
release (https://issues.apache.org/jira/browse/CTAKES-310). Depending upon the
corpus size, I could see the permutation engine eventually only have a single
permutation of 1,2,3,4.
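The failure mode is easy to reproduce in miniature. The sketch below shows that kind of bug in general, not the actual pre-fix cTAKES code: a stored permutation gets sorted in place during lookup instead of sorting a copy, so the 4,2,3,1 pattern is silently lost for every later document.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PermutationBugSketch {

   // Buggy lookup step: sorts the caller's (shared, stored) list in place.
   public static void buggyLookup(List<Integer> sharedPermutation) {
      Collections.sort(sharedPermutation);
   }

   // Fixed lookup step: works on a defensive copy, leaving the stored
   // permutation intact.
   public static void fixedLookup(List<Integer> sharedPermutation) {
      List<Integer> copy = new ArrayList<>(sharedPermutation);
      Collections.sort(copy);
   }

   public static void main(String[] args) {
      List<Integer> stored = new ArrayList<>(List.of(4, 2, 3, 1));
      buggyLookup(stored);
      System.out.println(stored);   // [1, 2, 3, 4]: the 4,2,3,1 pattern is gone

      List<Integer> stored2 = new ArrayList<>(List.of(4, 2, 3, 1));
      fixedLookup(stored2);
      System.out.println(stored2);  // [4, 2, 3, 1]: still intact
   }
}
```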
Typically though, this isn't very easily detected in the first 100 or so
documents.
We discovered this issue when we made cTAKES have consistent output of codes in
our system.
IMAT Solutions <http://imatsolutions.com>
Kim Ebert
Software Engineer
Office: 801.669.7342
[email protected]
On 12/19/2014 07:05 AM, Savova, Guergana wrote:
We are doing a similar kind of evaluation and will report the results.
Before we released the Fast lookup, we did a systematic evaluation across three
gold standard sets. We did not see the trend that Bruce reported below. The P,
R and F1 results from the old dictionary look up and the fast one were similar.
Thank you everyone!
--Guergana
-----Original Message-----
From: David Kincaid [mailto:[email protected]]
Sent: Friday, December 19, 2014 9:02 AM
To: [email protected]
Subject: Re: cTakes Annotation Comparison
Thanks for this, Bruce! Very interesting work. It confirms what I've seen in
the small, non-systematic tests I've done so far. Did you happen to capture
the number of false positives yet (annotations made by cTAKES that are not in
the human adjudicated standard)? I've seen a lot of dictionary hits that are
not actually entity mentions, but I haven't had a chance to do a systematic
analysis (we're working on our annotated gold standard now). One great example
is the antibiotic "Today". Every time the word today appears in any text it is
annotated as a medication mention when it almost never is being used in that
sense.
These results by themselves are quite disappointing to me. Both the
UMLSProcessor and especially the FastUMLSProcessor seem to have pretty poor
recall. It seems like the trade-off for more speed is a ten-fold (or more)
decrease in entity recognition.
Thanks again for sharing your results with us. I think they are very useful to
the project.
- Dave
On Thu, Dec 18, 2014 at 5:06 PM, Bruce Tietjen <[email protected]> wrote:
Actually, we are working on a similar tool to compare it to the human
adjudicated standard for the set we tested against. I didn't mention
it before because the tool isn't complete yet, but initial results for
the set (excluding those marked as "CUI-less") were as follows:
Human adjudicated annotations: 4591 (excluding CUI-less)
Annotations found matching the human adjudicated standard
UMLSProcessor 2245
FastUMLSProcessor 215
IMAT Solutions <http://imatsolutions.com>
Bruce Tietjen
Senior Software Engineer
Mobile: 801.634.1547
[email protected]
On Thu, Dec 18, 2014 at 3:37 PM, Chen, Pei <[email protected]> wrote:
Bruce,
Thanks for this-- very useful.
Perhaps Sean Finan can comment more,
but it's also probably worth it to compare to an adjudicated human
annotated gold standard.
--Pei
-----Original Message-----
From: Bruce Tietjen [mailto:[email protected]]
Sent: Thursday, December 18, 2014 1:45 PM
To: [email protected]
Subject: cTakes Annotation Comparison
With the recent release of cTakes 3.2.1, we were very interested in
checking for any differences in annotations between using the
AggregatePlaintextUMLSProcessor pipeline and the
AggregatePlaintextFastUMLSProcessor pipeline within this release of
cTakes with its associated set of UMLS resources.
We chose to use the SHARE 14-a-b Training data that consists of 199
documents (Discharge 61, ECG 54, Echo 42 and Radiology 42) as the
basis for the comparison.
We decided to share a summary of the results with the development
community.
Documents Processed: 199
Processing Time:
UMLSProcessor 2,439 seconds
FastUMLSProcessor 1,837 seconds
Total Annotations Reported:
UMLSProcessor 20,365 annotations
FastUMLSProcessor 8,284 annotations
Annotation Comparisons:
Annotations common to both sets: 3,940
Annotations reported only by the UMLSProcessor: 16,425
Annotations reported only by the FastUMLSProcessor: 4,344
If anyone is interested, the following was our test procedure:
We used the UIMA CPE to process the document set twice, once using
the AggregatePlaintextUMLSProcessor pipeline and once using the
AggregatePlaintextFastUMLSProcessor pipeline. We used the
WriteCAStoFile CAS consumer to write the results to output files.
We used a tool we recently developed to analyze and compare the
annotations generated by the two pipelines. The tool compares the
two outputs for each file and reports any differences in the
annotations (MedicationMention, SignSymptomMention,
ProcedureMention, AnatomicalSiteMention, and
DiseaseDisorderMention) between the two output sets. The tool reports
the number of 'matches' and 'misses' between each annotation set. A
'match' is defined as the presence of an identified source text
interval with its associated CUI appearing in both annotation sets. A
'miss' is defined as the presence of an identified source text
interval and its associated CUI in one annotation set, but no matching
identified source text interval and CUI in the other. The tool also
reports the total number of
annotations (source text intervals with associated CUIs) reported in
each annotation set. The compare tool is in our GitHub repository at
https://github.com/perfectsearch/cTAKES-compare