Hi Bruce,

> Correction -- So far, I did steps 1 and 2 of Sean's email.

No problem. Aside from recreating the database, those two steps have the greatest impact. But before you change anything else, please do some manual spot checks. I have never seen a case where the lookup would be so horribly inaccurate.
Thanks

-----Original Message-----
From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com]
Sent: Friday, December 19, 2014 3:29 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

Correction -- So far, I did steps 1 and 2 of Sean's email.

Bruce Tietjen
Senior Software Engineer, IMAT Solutions <http://imatsolutions.com>
Mobile: 801.634.1547
bruce.tiet...@imatsolutions.com

On Fri, Dec 19, 2014 at 1:22 PM, Bruce Tietjen <bruce.tiet...@perfectsearchcorp.com> wrote:

Sean,

I tried the configuration changes you mentioned in your earlier email. The results are as follows:

Total annotations found: 12,161 (default configuration found 8,284)

Counting exact span matches, this run matched only 211 (the default configuration matched 215).

Counting overlapping spans, this run matched only 220 (the default configuration matched 224).

Bruce

On Fri, Dec 19, 2014 at 12:16 PM, Chen, Pei <pei.c...@childrens.harvard.edu> wrote:

Kim,

Maintenance is the deciding factor in forging ahead, not bugs/issues. They are two components that do the same thing with the same goal. (As Sean mentioned, one should be able to configure the new code base to replicate the old algorithm if required -- it's just a simpler and cleaner code base. If this is not the case, or if there are issues, we should fix it and move forward.)

We can keep the old component around for as long as needed, but it's likely going to have limited support...

--Pei

From: Kim Ebert [mailto:kim.eb...@imatsolutions.com]
Sent: Friday, December 19, 2014 1:47 PM
To: Chen, Pei; dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

Pei,

I don't think bugs/issues should be part of determining whether one algorithm or the other is superior. Obviously the bugs are worth mentioning, but if the fast lookup method has worse precision and recall but better performance than the slower, more accurate first-word lookup algorithm, then time should be invested in fixing those bugs and resolving those weird issues.

Now, I'm not saying which one is superior in this case, as the data will end up speaking for itself one way or the other; but as of right now, I'm not convinced that the old dictionary lookup is obsolete, and I'm not sure the community is convinced yet either.

Kim Ebert
Software Engineer, IMAT Solutions <http://imatsolutions.com>
Office: 801.669.7342
kim.eb...@imatsolutions.com

On 12/19/2014 08:39 AM, Chen, Pei wrote:

Also check out the stats that Sean ran before releasing the new component:
http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-dictionary-lookup-fast/doc/DictionaryLookupStats.docx

From the evaluation and our experience, the new lookup algorithm should be a huge improvement in terms of both speed and accuracy. This is very different from what Bruce mentioned... I'm sure Sean will chime in here. (The old dictionary lookup is essentially obsolete now -- plagued with the bugs/issues you mentioned.)
--Pei

From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
Sent: Friday, December 19, 2014 10:25 AM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

Guergana,

I'm curious about the number of records in your gold standard sets, and whether your gold standard set was run through a long-running cTAKES process. I know at some point we fixed a bug in the old dictionary lookup that caused the permutations to become corrupted over time. Typically this isn't seen in the first few records, but over time, as patterns are used, the permutations become corrupted. This caused documents that were fed through cTAKES more than once to have fewer codes returned than the first time.

For example, if a permutation of 4,2,3,1 was found, the permutation would be corrupted to 1,2,3,4, and it would no longer be possible to detect permutations of 4,2,3,1 until cTAKES was restarted (a minimal sketch of this failure mode follows below). We got the fix in after the cTAKES 3.2.0 release: https://issues.apache.org/jira/browse/CTAKES-310. Depending upon the corpus size, I could see the permutation engine eventually having only a single permutation, 1,2,3,4. Typically, though, this isn't easily detected in the first 100 or so documents.

We discovered this issue when we made cTAKES produce consistent output of codes in our system.

Kim Ebert
Software Engineer, IMAT Solutions <http://imatsolutions.com>
Office: 801.669.7342
kim.eb...@imatsolutions.com

On 12/19/2014 07:05 AM, Savova, Guergana wrote:

We are doing a similar kind of evaluation and will report the results.

Before we released the fast lookup, we did a systematic evaluation across three gold standard sets. We did not see the trend that Bruce reported below. The P, R and F1 results from the old dictionary lookup and the fast one were similar.

Thank you everyone!
--Guergana

-----Original Message-----
From: David Kincaid [mailto:kincaid.d...@gmail.com]
Sent: Friday, December 19, 2014 9:02 AM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

Thanks for this, Bruce! Very interesting work. It confirms what I've seen in the small, non-systematic tests I've done. Did you happen to capture the number of false positives yet (annotations made by cTAKES that are not in the human-adjudicated standard)? I've seen a lot of dictionary hits that are not actually entity mentions, but I haven't had a chance to do a systematic analysis (we're working on our annotated gold standard now). One great example is the antibiotic "Today": every time the word "today" appears in any text, it is annotated as a medication mention, when it is almost never used in that sense.

These results by themselves are quite disappointing to me. Both the UMLSProcessor and especially the FastUMLSProcessor seem to have pretty poor recall. It seems like the trade-off for more speed is a ten-fold (or more) decrease in entity recognition.

Thanks again for sharing your results with us. I think they are very useful to the project.

- Dave
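As a side note on the permutation corruption Kim describes above (CTAKES-310): the following is a minimal, hypothetical Java sketch of that failure mode, in which a cached permutation shared across documents is sorted in place, so the ordering 4,2,3,1 is lost after its first use. The class and method names are invented for illustration; this is not the actual cTAKES lookup code, and the real patch may differ from the "safe" variant shown.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/**
 * Illustration only (hypothetical names, not the real cTAKES classes):
 * sorting a shared, cached permutation in place makes a long-running
 * pipeline lose orderings such as [4, 2, 3, 1] after the first document.
 */
public class PermutationCorruptionDemo {

    // One cached token ordering for a multi-word term, shared across documents.
    private static final List<Integer> CACHED_PERMUTATION =
            new ArrayList<>(Arrays.asList(4, 2, 3, 1));

    /** Order-sensitive match against the cached permutation. */
    static boolean matchesCached(List<Integer> observedOrder) {
        return CACHED_PERMUTATION.equals(observedOrder);
    }

    /** Buggy "canonicalization": sorts the cached list itself, mutating shared state. */
    static void buggyCanonicalize() {
        Collections.sort(CACHED_PERMUTATION);
    }

    /** Safe variant: sorts a defensive copy, leaving the cache untouched. */
    static List<Integer> safeCanonicalize() {
        List<Integer> copy = new ArrayList<>(CACHED_PERMUTATION);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<Integer> observed = Arrays.asList(4, 2, 3, 1);

        System.out.println(matchesCached(observed)); // true: first document matches
        safeCanonicalize();                          // cache is still [4, 2, 3, 1]
        System.out.println(matchesCached(observed)); // still true
        buggyCanonicalize();                         // cache is now [1, 2, 3, 4]
        System.out.println(matchesCached(observed)); // false: later documents miss this ordering
    }
}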
On Thu, Dec 18, 2014 at 5:06 PM, Bruce Tietjen <bruce.tiet...@perfectsearchcorp.com> wrote:

Actually, we are working on a similar tool to compare against the human-adjudicated standard for the set we tested. I didn't mention it before because the tool isn't complete yet, but initial results for the set (excluding annotations marked "CUI-less") were as follows:

Human-adjudicated annotations: 4591 (excluding CUI-less)

Annotations found matching the human-adjudicated standard:
  UMLSProcessor        2245
  FastUMLSProcessor     215

Bruce Tietjen
Senior Software Engineer, IMAT Solutions <http://imatsolutions.com>
Mobile: 801.634.1547
bruce.tiet...@imatsolutions.com

On Thu, Dec 18, 2014 at 3:37 PM, Chen, Pei <pei.c...@childrens.harvard.edu> wrote:

Bruce,

Thanks for this -- very useful. Perhaps Sean Finan can comment more, but it's also probably worth it to compare to an adjudicated, human-annotated gold standard.

--Pei

-----Original Message-----
From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com]
Sent: Thursday, December 18, 2014 1:45 PM
To: dev@ctakes.apache.org
Subject: cTakes Annotation Comparison

With the recent release of cTakes 3.2.1, we were very interested in checking for any differences in annotations between the AggregatePlaintextUMLSProcessor pipeline and the AggregatePlaintextFastUMLSProcessor pipeline within this release of cTakes, with its associated set of UMLS resources.

We chose to use the SHARE 14-a-b Training data, which consists of 199 documents (Discharge 61, ECG 54, Echo 42 and Radiology 42), as the basis for the comparison. We decided to share a summary of the results with the development community.

Documents processed: 199

Processing time:
  UMLSProcessor        2,439 seconds
  FastUMLSProcessor    1,837 seconds

Total annotations reported:
  UMLSProcessor        20,365 annotations
  FastUMLSProcessor     8,284 annotations

Annotation comparisons:
  Annotations common to both sets:                      3,940
  Annotations reported only by the UMLSProcessor:      16,425
  Annotations reported only by the FastUMLSProcessor:   4,344

If anyone is interested, the following was our test procedure:

We used the UIMA CPE to process the document set twice, once using the AggregatePlaintextUMLSProcessor pipeline and once using the AggregatePlaintextFastUMLSProcessor pipeline. We used the WriteCAStoFile CAS consumer to write the results to output files.

We used a tool we recently developed to analyze and compare the annotations generated by the two pipelines. The tool compares the two outputs for each file and reports any differences in the annotations (MedicationMention, SignSymptomMention, ProcedureMention, AnatomicalSiteMention, and DiseaseDisorderMention) between the two output sets. The tool reports the number of 'matches' and 'misses' between each annotation set.
A 'match' is defined as the presence of an identified source text interval, with its associated CUI, appearing in both annotation sets. A 'miss' is defined as the presence of an identified source text interval and its associated CUI in one annotation set, but no matching identified source text interval and CUI in the other. The tool also reports the total number of annotations (source text intervals with associated CUIs) reported in each annotation set. The compare tool is in our GitHub repository at https://github.com/perfectsearch/cTAKES-compare
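To make the match/miss definition above concrete, here is a rough Java sketch of comparing two annotation sets on (source-text interval, CUI) pairs, covering both the exact-span and the overlapping-span criteria Bruce mentions in his later message. The Span record, the example spans and CUIs, and the countMatches helper are invented for illustration; this is not the code in the cTAKES-compare repository.

import java.util.List;

/**
 * Sketch of the match/miss rule described above: an annotation matches when
 * the same CUI appears in both sets on the exact same interval, or (in the
 * looser mode) on any overlapping interval.
 */
public class AnnotationMatcher {

    /** Minimal stand-in for an identified annotation: a span plus its CUI. */
    record Span(int begin, int end, String cui) { }

    static boolean exactMatch(Span a, Span b) {
        return a.begin() == b.begin() && a.end() == b.end() && a.cui().equals(b.cui());
    }

    static boolean overlapMatch(Span a, Span b) {
        // Half-open intervals [begin, end) overlap when each starts before the other ends.
        return a.begin() < b.end() && b.begin() < a.end() && a.cui().equals(b.cui());
    }

    /** Counts annotations in 'system' that have a counterpart in 'reference'. */
    static long countMatches(List<Span> system, List<Span> reference, boolean exact) {
        return system.stream()
                .filter(s -> reference.stream()
                        .anyMatch(r -> exact ? exactMatch(s, r) : overlapMatch(s, r)))
                .count();
    }

    public static void main(String[] args) {
        // Made-up spans and CUIs purely for demonstration.
        List<Span> gold = List.of(new Span(10, 18, "C0020538"));
        List<Span> sys  = List.of(new Span(10, 18, "C0020538"), new Span(40, 45, "C0004057"));

        long matches = countMatches(sys, gold, true);
        System.out.println("matches=" + matches + ", recall=" + (double) matches / gold.size());
    }
}

Against a gold standard, recall then falls out as matches divided by the gold count, e.g. 2245/4591 versus 215/4591 for the two pipelines in the figures Bruce reports above.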