Re: cTakes Annotation Comparison

2014-12-19 Thread John Green
Wow, great work. Thank you for sharing. 


John Green
—
Sent from Mailbox

On Thu, Dec 18, 2014 at 6:08 PM, Bruce Tietjen wrote:

> Actually, we are working on a similar tool to compare it to the human
> adjudicated standard for the set we tested against.  I didn't mention it
> before because the tool isn't complete yet, but initial results for the set
> (excluding those marked as "CUI-less") were as follows:
> Human adjudicated annotations: 4591 (excluding CUI-less)
> Annotations found matching the human adjudicated standard
> UMLSProcessor  2245
> FastUMLSProcessor   215
> IMAT Solutions
> Bruce Tietjen
> Senior Software Engineer
> Mobile: 801.634.1547
> bruce.tiet...@imatsolutions.com
> On Thu, Dec 18, 2014 at 3:37 PM, Chen, Pei wrote:
>>
>> Bruce,
>> Thanks for this-- very useful.
>> Perhaps Sean Finan can comment more-
>> but it's also probably worth it to compare to an adjudicated human
>> annotated gold standard.
>>
>> --Pei
>>
>> -----Original Message-----
>> From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com]
>> Sent: Thursday, December 18, 2014 1:45 PM
>> To: dev@ctakes.apache.org
>> Subject: cTakes Annotation Comparison
>>
>> With the recent release of cTakes 3.2.1, we were very interested in
>> checking for any differences in annotations between using the
>> AggregatePlaintextUMLSProcessor pipeline and the
>> AggregatePlaintextFastUMLSProcessor pipeline within this release of cTakes
>> with its associated set of UMLS resources.
>>
>> We chose to use the SHARE 14-a-b Training data that consists of 199
>> documents (Discharge 61, ECG 54, Echo 42, and Radiology 42) as the basis
>> for the comparison.
>>
>> We decided to share a summary of the results with the development
>> community.
>>
>> Documents Processed: 199
>>
>> Processing Time:
>> UMLSProcessor       2,439 seconds
>> FastUMLSProcessor   1,837 seconds
>>
>> Total Annotations Reported:
>> UMLSProcessor       20,365 annotations
>> FastUMLSProcessor    8,284 annotations
>>
>>
>> Annotation Comparisons:
>> Annotations common to both sets:                     3,940
>> Annotations reported only by the UMLSProcessor:     16,425
>> Annotations reported only by the FastUMLSProcessor:  4,344
>>
>>
>> If anyone is interested, the following was our test procedure:
>>
>> We used the UIMA CPE to process the document set twice, once using the
>> AggregatePlaintextUMLSProcessor pipeline and once using the
>> AggregatePlaintextFastUMLSProcessor pipeline. We used the WriteCAStoFile
>> CAS consumer to write the results to output files.
>>
>> We used a tool we recently developed to analyze and compare the
>> annotations generated by the two pipelines. The tool compares the two
>> outputs for each file and reports any differences in the annotations
>> (MedicationMention, SignSymptomMention, ProcedureMention,
>> AnatomicalSiteMention, and
>> DiseaseDisorderMention) between the two output sets. The tool reports the
>> number of 'matches' and 'misses' between each annotation set. A 'match' is
>> defined as the presence of an identified source text interval with its
>> associated CUI appearing in both annotation sets. A 'miss' is defined as
>> the presence of an identified source text interval and its associated CUI
>> in one annotation set, but no matching identified source text interval and
>> CUI in the other. The tool also reports the total number of annotations
>> (source text intervals with associated CUIs) reported in each annotation
>> set. The compare tool is in our GitHub repository at
>> https://github.com/perfectsearch/cTAKES-compare
>>
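A minimal Java sketch of the match/miss counting described above, assuming each annotation is reduced to a (begin, end, CUI) triple; the class, names, and CUIs below are illustrative placeholders, not the actual cTAKES-compare code:

import java.util.*;

// Illustrative sketch of the 'match'/'miss' counting described above.
// Each annotation is reduced to a (begin, end, CUI) triple; the CUIs and
// spans are arbitrary placeholders.
public class AnnotationCompareSketch {

    record Ann(int begin, int end, String cui) {}

    public static void main(String[] args) {
        Set<Ann> pipelineA = new HashSet<>(List.of(
                new Ann(10, 15, "C0001111"), new Ann(42, 47, "C0002222")));
        Set<Ann> pipelineB = new HashSet<>(List.of(
                new Ann(10, 15, "C0001111")));

        // A 'match' is an identical source text interval plus CUI
        // appearing in both annotation sets.
        Set<Ann> matches = new HashSet<>(pipelineA);
        matches.retainAll(pipelineB);

        // A 'miss' is a span+CUI present in one set with no counterpart
        // in the other.
        int missesA = pipelineA.size() - matches.size();
        int missesB = pipelineB.size() - matches.size();

        System.out.printf("matches=%d missesA=%d missesB=%d%n",
                matches.size(), missesA, missesB);
    }
}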

Re: cTakes Annotation Comparison

2014-12-19 Thread David Kincaid
Thanks for this, Bruce! Very interesting work. It confirms what I've seen
in the small, non-systematic tests I've done. Did you happen to
capture the number of false positives yet (annotations made by cTAKES that
are not in the human-adjudicated standard)? I've seen a lot of dictionary
hits that are not actually entity mentions, but I haven't had a chance to
do a systematic analysis (we're working on our annotated gold standard
now). One great example is the antibiotic "Today". Every time the word
today appears in any text it is annotated as a medication mention, though
it is almost never used in that sense.

These results by themselves are quite disappointing to me. Both the
UMLSProcessor and especially the FastUMLSProcessor seem to have pretty poor
recall. It seems like the trade-off for more speed is a ten-fold (or more)
decrease in entity recognition.

Thanks again for sharing your results with us. I think they are very useful
to the project.

- Dave



RE: cTakes Annotation Comparison

2014-12-19 Thread Savova, Guergana
We are doing a similar kind of evaluation and will report the results.

Before we released the Fast lookup, we did a systematic evaluation across three
gold standard sets. We did not see the trend that Bruce reported. The P,
R and F1 results from the old dictionary lookup and the fast one were similar.

Thank you everyone!
--Guergana


Re: cTakes Annotation Comparison

2014-12-19 Thread David Kincaid
Thanks, Guergana. I'll share our results as well once we're done.


RE: intro video and ctakes youtube : Youtube Apache cTakes Channel Direct Link

2014-12-19 Thread John Green
Great article. I'm not a fan of the email solution, simply because of size
problems. Given how low the rate of new video uploads is likely to be, a
shared Dropbox folder may be the best option for our case. Maybe someone
central to the project could volunteer as the point of contact and play relay
between Dropbox and YouTube?


JG
—
Sent from Mailbox

On Wed, Dec 17, 2014 at 3:11 PM, Finan, Sean wrote:

> Hmmm, well this is a kicker:
> http://www.ampercent.com/upload-videos-youtube-channel-without-knowing-username-password/9374/
> -----Original Message-----
> From: John Green [mailto:john.travis.gr...@gmail.com] 
> Sent: Wednesday, December 17, 2014 2:08 PM
> To: dev@ctakes.apache.org
> Subject: Re: intro video and ctakes youtube : Youtube Apache cTakes Channel Direct Link
> Isn't this to upload to my account? What about to the channel?
> On Tue, Dec 16, 2014 at 12:16 PM, Finan, Sean < 
> sean.fi...@childrens.harvard.edu> wrote:
>>
>> Hi John,
>>
>> Look for an "Upload" button in the upper-left corner next to a blue 
>> "Sign in" button.
>>
>> Sean
>>
>> -----Original Message-----
>> From: John Green [mailto:john.travis.gr...@gmail.com]
>> Sent: Tuesday, December 16, 2014 11:12 AM
>> To: dev@ctakes.apache.org
>> Subject: Re: intro video and ctakes youtube : Youtube Apache cTakes 
>> Channel Direct Link
>>
>> That is, how do we upload videos *to the channel*.
>>
>> > On Tue, Dec 16, 2014 at 11:09 AM, John Green wrote:
>> >
>> > How do we upload videos we wish to contribute? I don't have any
>> > experience with YouTube other than as a watcher.
>> >
>> > JG
>> >
>> > On Mon, Dec 15, 2014 at 11:43 AM, Finan, Sean < 
>> > sean.fi...@childrens.harvard.edu> wrote:
>> >>
>> >> Hmmm, I can't find it in a search.  However, here is a direct link:
>> >>
>> >> https://www.youtube.com/channel/UC8hQoOKz3v4PNEf6cqSkjbQ
>> >>
>> >> Maybe it needs a few videos to register in the search engine ?
>> >>
>> >> Sean
>> >>
>> >> -----Original Message-----
>> >> From: Pei Chen [mailto:chen...@apache.org]
>> >> Sent: Monday, December 15, 2014 11:32 AM
>> >> To: dev@ctakes.apache.org
>> >> Subject: Re: intro video and ctakes youtube
>> >>
>> >> John,
>> >> I presume you mean this thread:
>> >>
>> >> http://mail-archives.apache.org/mod_mbox/ctakes-dev/201408.mbox/%3C393252f14c42f946952f1ed75d316cad39158...@chexmbx4a.chboston.org%3E
>> >>
>> >> Strange, I couldn't find it anymore either... The placeholder
>> >> could have been auto-deleted because it was empty?  I think it's
>> >> worth it if you're willing to create and add to it again...
>> >>
>> >> ---Pei
>> >>
>> >> > On Fri, Dec 12, 2014 at 11:46 PM, John Green wrote:
>> >> >
>> >> > I was going to post some basic how-to videos that help with the
>> >> > learning curve I've walked over the last year and a half. I went
>> >> > looking for the cTakes YouTube channel mentioned a while back and I
>> >> > did not find it...
>> >> >
>> >> > Anyone know where it went?
>> >> >
>> >> > Best,
>> >> > JG
>> >> >
>> >>
>> >
>>

Re: cTakes Annotation Comparison

2014-12-19 Thread Kim Ebert
Guergana,

I'm curious about the number of records in your gold standard sets, and
whether your gold standard set was run through a long-running cTAKES
process. I know at some point we fixed a bug in the old dictionary
lookup that caused the permutations to become corrupted over time.
Typically this isn't seen in the first few records, but over time, as
patterns are used, the permutations would become corrupted. This caused
documents that were fed through cTAKES more than once to have fewer
codes returned than the first time.

For example, if a permutation of 4,2,3,1 was found, the permutation
would be corrupted to 1,2,3,4. It would no longer be possible to
detect permutations of 4,2,3,1 until cTAKES was restarted. We got the
fix in after the cTAKES 3.2.0 release:
https://issues.apache.org/jira/browse/CTAKES-310
Depending upon the corpus size, I could see the permutation engine
eventually having only a single permutation of 1,2,3,4.

Typically though, this isn't easily detected in the first 100 or so
documents.

We discovered this issue when we made cTAKES produce consistent output
of codes in our system.

IMAT Solutions 
Kim Ebert
Software Engineer
Office: 801.669.7342
kim.eb...@imatsolutions.com 

RE: cTakes Annotation Comparison

2014-12-19 Thread Chen, Pei
Also check out stats that Sean ran before releasing the new component on:
http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-dictionary-lookup-fast/doc/DictionaryLookupStats.docx
From the evaluation and our experience, the new lookup algorithm should be a
huge improvement in terms of both speed and accuracy.
This is very different from what Bruce mentioned… I'm sure Sean will chime in
here.
(The old dictionary lookup is essentially obsolete now, plagued with
bugs/issues as you mentioned.)
--Pei


Re: cTakes Annotation Comparison

2014-12-19 Thread Miller, Timothy
Thanks Kim,
This sounds interesting, though I don't totally understand it. Are you saying
that extraction performance for a given note depends on the note's position in
the processing queue? If so, that's pretty bad! If you (or anyone else who
understands this issue) has a concrete example, I think that might help me
understand what the problem is/was.

Even though, as Pei mentioned, we are going to try moving the community to the 
faster dictionary, I would like to understand better just to help myself avoid 
issues of this type going forward (and verify the new dictionary doesn't use 
similar logic).

Also, when we finish annotating the sample notes, might we use that as a point 
of comparison for the two dictionaries? That would get around the issue that 
not everyone has access to the datasets we used for validation and others are 
likely not able to share theirs either. And maybe we can replicate the notes if 
we want to simulate the scenario Kim is talking about with thousands or more 
notes.

Tim



RE: cTakes Annotation Comparison

2014-12-19 Thread Finan, Sean
Well, I guess that it is time for me to speak up …

I must say that I’m happy that people are showing interest in the fast lookup.  
I am also happy (sort of) that some concerns are being raised – and that there 
is now community participation in my little toy.  I  have some concerns about 
what people are reporting.  This does not coincide with what I have seen at 
all.  Yesterday I started (without knowing this thread existed) testing a 
bare-minimum pipeline for CUI extraction.  It is just the stripped-down 
Aggregate with only: segment, tokens, sentences, POS, and the fast lookup.  The 
people at Children’s wanted to know how fast we could get.  1,196 notes in 
under 90 seconds on my laptop with over 210,000 annotations, which is 175/note. 
 After reading the thread I decided to run the fast lookup with several 
configurations.  I also ran the default for 10.5 hours.  I am comparing the 
annotations from each system against the human annotations that we have, and I 
will let everybody know what I find – for better or worse.

The fast lookup does not (out of the box) do the exact same thing as the
default. Some things can be configured to make it more closely approximate the
default dictionary (a configuration sketch follows the list below):

1. Set the minimum annotation span length to 2 (default is 3). This is
in desc/[ae]/UmlsLookupAnnotator.xml, line #78. The annotator should then
pick up text like "CT" and improve recall, but it will hurt precision.

2. Set the Lookup Window to LookupWindowAnnotation. This is in
desc/[ae]/UmlsLookupAnnotator.xml, lines #65 & #93. The LookupWindowAnnotator
will need to be added to the aggregate pipeline
AggregatePlaintextFastUMLSProcessor.xml, lines #50 & #172. This will narrow the
lookup window and may increase precision, but (in my experience) reduces
recall.

3. Allow the rough identification of overlapping spans. The default
dictionary will often identify text like "metastatic colorectal carcinoma"
when that text actually does not exist anywhere in UMLS. It basically ignores
"colorectal" and gives the whole span the CUI for "metastatic carcinoma". In
this case it is arguably a good thing. In many others it is arguably not so
much. There is a Class ... lookup2.ae.OverlapJCasTermAnnotator.java that will
do the same thing. You can create a new desc/[ae]/*Annotator.xml or just
change the annotator implementation in desc/[ae]/UmlsLookupAnnotator.xml,
line #25. I will check in a new desc xml (sorry; thought I had) because there
are 2 parameters unique to OverlapJCasTermAnnotator.

4. You can play with the OverlapJCasTermAnnotator parameters
"consecutiveSkips" and "totalTokenSkips". These control just how lenient you
want the overlap tagging to be.

5. Create a new dictionary database. There is a (bit messy)
DictionaryTool in the sandbox that will let you dump whatever you do or do not
want from UMLS into a database. It will also help you clean up or select
stored entries as well. There is a lot of garbage in the default dictionary
database: repeated terms with caps/no caps ("Cancer", "cancer"), text with
metadata ("cancer [finding]") and text that just clutters ("PhenX: entry for
cancer", "1", "2"). The fast lookup database should have most of the Snomed
and RxNorm terms (and synonyms) of interest, but you could always make a new
database that is much more inclusive.
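As a concrete illustration of items 3 and 4, a hedged uimaFIT-style sketch: the class comes from the lookup2.ae package named in item 3, the two parameter names come from item 4, but the skip values are placeholders, and whether your cTAKES version accepts these parameters this way should be verified against the desc XML:

import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.ctakes.dictionary.lookup2.ae.OverlapJCasTermAnnotator;

// Hedged sketch only: swaps the overlap annotator into a pipeline and sets
// its two skip parameters. The values 1 and 2 are placeholders, not
// recommendations from this thread.
public class OverlapLookupSketch {
    public static AnalysisEngineDescription overlapLookup() throws Exception {
        return AnalysisEngineFactory.createEngineDescription(
                OverlapJCasTermAnnotator.class,
                "consecutiveSkips", 1,  // adjacent tokens that may be skipped
                "totalTokenSkips", 2);  // total tokens skipped per term
    }
}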

The main key to the speed of the fast dictionary lookup is actually … the key.  
It is the way that the database is indexed and the lookup by “rare” word 
instead of “first” word.  Everything else can be changed around it and it 
should still be a faster version.
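To illustrate the idea (our sketch, not the actual cTAKES code): index each multi-word term under its least frequent token, so candidate terms only need to be retrieved and verified when that rare token appears in the text. The terms and frequencies below are made up:

import java.util.*;

// Illustrative sketch of "rare word" indexing: each term is stored under its
// least frequent token. Terms and corpus frequencies are invented.
public class RareWordIndexSketch {

    static final Map<String, Integer> TOKEN_FREQ = Map.of(
            "carcinoma", 500, "metastatic", 900, "renal", 150, "cell", 800);

    static final Map<String, List<String>> INDEX = new HashMap<>();

    static void addTerm(String term) {
        // Pick the token with the lowest corpus frequency as the index key.
        String rarest = Arrays.stream(term.split(" "))
                .min(Comparator.comparingInt(
                        (String t) -> TOKEN_FREQ.getOrDefault(t, 0)))
                .orElseThrow();
        INDEX.computeIfAbsent(rarest, k -> new ArrayList<>()).add(term);
    }

    public static void main(String[] args) {
        addTerm("metastatic carcinoma");
        addTerm("renal cell carcinoma");
        // Candidates are retrieved only when their rare key token is seen.
        System.out.println(INDEX.get("carcinoma")); // [metastatic carcinoma]
        System.out.println(INDEX.get("renal"));     // [renal cell carcinoma]
    }
}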

As for the false positives like “Today”, that will always be a problem until we 
have disambiguation.  The lookup is basically a glorified grep.

Sean


Re: cTakes Annotation Comparison

2014-12-19 Thread Bruce Tietjen
Our analysis against the human-adjudicated gold standard from this SHARE
corpus uses a simple check to see if the cTakes output included the
annotation specified by the gold standard. The initial results I reported
were for exact matches of CUI and text span. Only exact matches were
counted.

It looks like if we also count as matches cTakes annotations with a
matching CUI and a text span that overlaps the gold standard text span, then
the matches increase to 224 matching annotations for the FastUMLS pipeline
and 2319 for the old pipeline.
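For concreteness, a small sketch of that relaxed criterion as we understand it (same CUI plus any span overlap); the names, CUI, and the half-open interval convention are our assumptions:

import java.util.*;

// Sketch of the relaxed matching described above: same CUI plus any overlap
// between the system span and the gold span. Half-open [begin, end) intervals
// are assumed; names and the CUI are illustrative.
public class OverlapMatchSketch {

    record Span(int begin, int end, String cui) {}

    static boolean overlapMatch(Span system, Span gold) {
        return system.cui().equals(gold.cui())
                && system.begin() < gold.end()
                && gold.begin() < system.end();
    }

    public static void main(String[] args) {
        Span gold = new Span(100, 112, "C0001111");
        Span exact = new Span(100, 112, "C0001111");
        Span partial = new Span(95, 105, "C0001111");
        System.out.println(overlapMatch(exact, gold));    // true
        System.out.println(overlapMatch(partial, gold));  // true: spans overlap
    }
}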

The question was also asked about annotations in the cTakes output that
were not in the human adjudicated gold standard. The answer is yes, there
were a lot of additional annotations made by cTakes that don't appear to be
in the gold standard. We haven't analyzed that yet, but it looks like the
gold standard we are using may only have Disease_Disorder annotations.



IMAT Solutions
Bruce Tietjen
Senior Software Engineer
Mobile: 801.634.1547
bruce.tiet...@imatsolutions.com


Re: cTakes Annotation Comparison

2014-12-19 Thread Kim Ebert
Hi Tim,

Here is an untested example, but it should show the concept.

Document 1:

Sarah had Induced Abortion Illegally.

Document 2:

John had a previous history of Abuse Health Service.

The following CUIs would be the matches if everything went well.

Illegally Induced Abortion, C000804
Health Service Abuse, C000864

When running in cTAKES with the bug, the following would happen.

Document 1 would be processed, and different permutations would be
tested to find a match against the document. For example, the
permutation 1, 2, 3 would be tried (Induced Abortion Illegally). For the
sake of this discussion, we will say that it returned nothing. All the
permutations of 1, 2, and 3 would be tried. A match for the permutation
3, 1, 2 (Illegally Induced Abortion) would be found. When the bug is
present, the list of permutations, (1, 2, 3), (1, 3, 2), (2, 1, 3), ...
(3, 1, 2), would be changed to (1, 2, 3), (1, 3, 2), (2, 1, 3), ...
(1, 2, 3). The permutation was sorted in place to find the starting and
ending span of the match. As you can see, the permutation (1, 2, 3) now
exists twice, and (3, 1, 2) no longer exists.

Document 2 would be processed, and the permutation (3, 1, 2) would never
be tried, so Abuse Health Service would never be found, but (1, 2, 3)
Health Service Abuse would be attempted twice.

With the small number of documents being processed, I doubt this skewed
the tests that were run between the two sets. I was curious, though,
whether this could have been a factor, as the more documents and CUIs
that were processed, the more permutations would be sorted to (1, 2, 3).
Note that permutations of different lengths also occur.

Using the permutations, cTAKES can find additional CUIs that may not be
discovered using exact matching techniques, so to have this degrade
over time was something that we wanted to fix.

This was a contrived example and would not really match in cTAKES, due
to the first word being different; but I think it adequately displays
the concept and the bug that was caused by the permutations being
sorted.
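To make the mechanism explicit, a minimal Java sketch of the bug (our illustration, not the actual cTAKES code), assuming the old lookup kept a shared cached list of permutations and sorted a permutation in place to recover a span's boundaries:

import java.util.*;

// Minimal sketch of the permutation-corruption bug (CTAKES-310). Illustrative
// only: the cached permutation itself was sorted to find span boundaries,
// silently rewriting the shared cache for all later documents.
public class PermutationBugSketch {

    // Shared across all documents for the lifetime of the process.
    static final List<List<Integer>> PERMUTATIONS = new ArrayList<>(List.of(
            new ArrayList<>(List.of(1, 2, 3)),
            new ArrayList<>(List.of(3, 1, 2))));  // ... the rest omitted

    // Buggy: sorts the cached permutation itself, turning (3,1,2) into
    // (1,2,3) for every later document.
    static int[] spanBuggy(List<Integer> permutation) {
        Collections.sort(permutation);  // mutates the shared cache!
        return new int[]{permutation.get(0),
                permutation.get(permutation.size() - 1)};
    }

    // Fixed: sort a copy and leave the cached permutation intact.
    static int[] spanFixed(List<Integer> permutation) {
        List<Integer> copy = new ArrayList<>(permutation);
        Collections.sort(copy);
        return new int[]{copy.get(0), copy.get(copy.size() - 1)};
    }

    public static void main(String[] args) {
        spanBuggy(PERMUTATIONS.get(1));
        // (3,1,2) is gone: "Abuse Health Service" can never match again.
        System.out.println(PERMUTATIONS);  // [[1, 2, 3], [1, 2, 3]]
    }
}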

Please let me know if you have any questions about my example.

Thanks,

IMAT Solutions 
Kim Ebert
Software Engineer
Office: 801.669.7342
kim.eb...@imatsolutions.com 

RE: cTakes Annotation Comparison

2014-12-19 Thread Savova, Guergana
Several thoughts:
1. The ShARe corpus annotates only mentions of type Diseases/Disorders and only
Anatomical Sites associated with a Disease/Disorder. This is by design. cTAKES
annotates all mentions of types Diseases/Disorders, Signs/Symptoms, Procedures,
Medications and Anatomical Sites. Therefore you will get MANY more annotations
with cTAKES. Eventually the ShARe corpus will be expanded to the other types.

2. Keeping (1) in mind, you can approximately estimate the precision/recall/F1
of cTAKES on the ShARe corpus if you output only mentions of type
Disease/Disorder (a small sketch of the computation follows this list).

3. Could you send us the list of files you use from ShARe to test? We have the
corpus and would like to run against it as well.
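For reference, the standard P/R/F1 computation from match counts. The true positive and gold totals below reuse Bruce's exact-match figures purely for illustration (they were not restricted to Disease/Disorder), and the false positive count is a placeholder, since it was not reported in this thread:

// Standard precision/recall/F1 from match counts, once both the gold
// standard and the cTAKES output are restricted to the same mention types.
public class PrfSketch {
    public static void main(String[] args) {
        double tp = 2245;       // Bruce's exact span+CUI match figure (illustration)
        double fp = 1000;       // placeholder: system annotations with no gold counterpart
        double fn = 4591 - tp;  // gold annotations missed (4591 total, excluding CUI-less)

        double precision = tp / (tp + fp);
        double recall = tp / (tp + fn);
        double f1 = 2 * precision * recall / (precision + recall);
        System.out.printf("P=%.3f R=%.3f F1=%.3f%n", precision, recall, f1);
    }
}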

Hope this makes sense...
--Guergana


RE: cTakes Annotation Comparison

2014-12-19 Thread Finan, Sean
One quick mention:

The cTakes dictionaries are built with UMLS 2011AB. If the human annotations
were not done using the same UMLS version, then there WILL be differences in
CUI and Semantic Group. I don't have time to go into the details, examples,
etc.; just be aware that every 6 months CUIs are added, removed, deprecated,
and moved from one TUI to another.

Sean

-Original Message-
From: Savova, Guergana [mailto:guergana.sav...@childrens.harvard.edu] 
Sent: Friday, December 19, 2014 1:28 PM
To: dev@ctakes.apache.org
Subject: RE: cTakes Annotation Comparison

Several thoughts:
1. The ShARE corpus annotates only mentions of type Diseases/Disorders and only 
Anatomical Sites associated with a Disease/Disorder. This is by design. cTAKES 
annotates all mentions of types Diseases/Disorders, Signs/Symptoms, Procedures, 
Medications and Anatomical Sites. Therefore you will get MANY more annotations 
with cTAKES. Eventually the ShARe corpus will be expanded to the other types.

2. Keeping (1) in mind, you can approximately estimate the precision/recall/f1 
of cTAKES on the ShARe corpus if you output only mentions of type 
Disease/Disorder. 

3. Could you send us the list of files you use from ShARe to test? We have the 
corpus and would like to run against as well.

Hope this makes sense...
--Guergana
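
For point 2 above, the estimate is the usual precision/recall/F1 arithmetic
over the Disease/Disorder subset. A minimal sketch with placeholder counts (not
numbers from this thread):

    // Minimal sketch: P/R/F1 from match counts restricted to
    // DiseaseDisorderMention. The three counts below are placeholders.
    public class PrfEstimate {
        public static void main(String[] args) {
            int goldCount = 1000;    // gold Disease/Disorder annotations
            int systemCount = 1200;  // cTAKES Disease/Disorder annotations
            int matched = 800;       // exact CUI + span matches between the two

            double precision = (double) matched / systemCount;
            double recall = (double) matched / goldCount;
            double f1 = 2 * precision * recall / (precision + recall);
            System.out.printf("P=%.3f R=%.3f F1=%.3f%n", precision, recall, f1);
        }
    }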

-Original Message-
From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] 
Sent: Friday, December 19, 2014 1:16 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

Our analysis against the human adjudicated gold standard from this SHARE corpus 
uses a simple check to see whether the cTakes output included the annotation 
specified by the gold standard. The initial results I reported were for exact 
matches of CUI and text span.  Only exact matches were counted.

If we also count as matches cTakes annotations with a matching CUI and a text 
span that overlaps the gold standard text span, the matches increase to 224 
matching annotations for the FastUMLS pipeline and 2,319 for the old pipeline.
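
Stated precisely, the two criteria look like this; the types and method names
below are hypothetical, and the actual comparison tool may be implemented
differently:

    // Hypothetical annotation type: a CUI plus a begin/end character span.
    record Annotation(String cui, int begin, int end) {}

    class MatchCriteria {
        // Exact match: same CUI and an identical text span.
        static boolean exact(Annotation gold, Annotation sys) {
            return gold.cui().equals(sys.cui())
                    && gold.begin() == sys.begin()
                    && gold.end() == sys.end();
        }

        // Relaxed match: same CUI and spans overlapping by at least one character.
        static boolean overlapping(Annotation gold, Annotation sys) {
            return gold.cui().equals(sys.cui())
                    && sys.begin() < gold.end()
                    && gold.begin() < sys.end();
        }
    }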

The question was also asked about annotations in the cTakes output that were 
not in the human adjudicated gold standard. The answer is yes, there were a lot 
of additional annotations made by cTakes that don't appear to be in the gold 
standard. We haven't analyzed that yet, but it looks like the gold standard we 
are using may only have Disease_Disorder annotations.



 [image: IMAT Solutions]   Bruce Tietjen Senior 
Software Engineer
[image: Mobile:] 801.634.1547
bruce.tiet...@imatsolutions.com

On Fri, Dec 19, 2014 at 9:54 AM, Miller, Timothy < 
timothy.mil...@childrens.harvard.edu> wrote:
>
> Thanks Kim,
> This sounds interesting though I don't totally understand it. Are you 
> saying that extraction performance for a given note depends on which 
> order the note was in the processing queue? If so that's pretty bad! 
> If you (or anyone else who understands this issue) has a concrete 
> example I think that might help me understand what the problem is/was.
>
> Even though, as Pei mentioned, we are going to try moving the 
> community to the faster dictionary, I would like to understand better 
> just to help myself avoid issues of this type going forward (and 
> verify the new dictionary doesn't use similar logic).
>
> Also, when we finish annotating the sample notes, might we use that as 
> a point of comparison for the two dictionaries? That would get around 
> the issue that not everyone has access to the datasets we used for 
> validation and others are likely not able to share theirs either. And 
> maybe we can replicate the notes if we want to simulate the scenario 
> Kim is talking about with thousands or more notes.
>
> Tim
>
>
> On 12/19/2014 10:24 AM, Kim Ebert wrote:
> Guergana,
>
> I'm curious about the number of records that are in your gold standard 
> sets, or if your gold standard set was run through a long running cTAKES 
> process.
> I know at some point we fixed a bug in the old dictionary lookup that 
> caused the permutations to become corrupted over time. Typically this 
> isn't seen in the first few records, but over time as patterns are 
> used the permutations would become corrupted. This caused documents 
> that were fed through cTAKES more than once to have fewer codes 
> returned than the first time.
>
> For example, if a permutation of 4,2,3,1 was found, the permutation 
> would be corrupted to be 1,2,3,4. It would no longer be possible to 
> detect permutations of 4,2,3,1 until cTAKES was restarted. We got the 
> fix in after the cTAKES 3.2.0 release. 
> https://issues.apache.org/jira/browse/CTAKES-310
> Depending upon the corpus size, I could see the permutation engine 
> eventually only have a single permutation of 1,2,3,4.
>
> Typically though, this isn't very easily detected in the first 100 or 
> so documents.
>
> We discovered this issue when we made cTAKES have consistent output of 
> codes in our system.

Re: cTakes Annotation Comparison

2014-12-19 Thread Kim Ebert
Sean,

I don't think that would be an issue since both the rare word lookup and
the first word lookup are using UMLS 2011AB. Or is the rare word lookup
using a different dictionary?

I would expect roughly similar results between the two when it comes to
differences between UMLS versions.
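
For anyone new to the terminology: "first word" vs "rare word" lookup refers to
which token of a multi-word term keys the dictionary index. A toy sketch with
invented frequencies, not cTAKES code:

    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.Map;

    // Toy sketch of the two indexing strategies. A term like
    // "acute myocardial infarction" is keyed by its first token in the old
    // lookup and by its rarest token (lowest corpus frequency) in the fast
    // lookup, which shrinks the candidate set per token hit. Frequencies here
    // are invented.
    public class IndexingSketch {
        static final Map<String, Integer> FREQ = Map.of(
                "acute", 9000, "myocardial", 40, "infarction", 55);

        static String firstWordKey(String term) {
            return term.split(" ")[0];
        }

        static String rareWordKey(String term) {
            return Arrays.stream(term.split(" "))
                    .min(Comparator.comparingInt(t -> FREQ.getOrDefault(t, 0)))
                    .orElseThrow();
        }

        public static void main(String[] args) {
            String term = "acute myocardial infarction";
            System.out.println(firstWordKey(term)); // acute
            System.out.println(rareWordKey(term));  // myocardial
        }
    }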

IMAT Solutions 
Kim Ebert
Software Engineer
Office: 801.669.7342
kim.eb...@imatsolutions.com 
On 12/19/2014 11:31 AM, Finan, Sean wrote:
> One quick mention:
>
> The cTakes dictionaries are built with UMLS 2011AB.  If the human annotations 
> were not done using the same UMLS version, then there WILL be differences in 
> CUI and Semantic group.  I don't have time to go into the details and 
> examples; just be aware that every six months CUIs are added, removed, 
> deprecated, and moved from one TUI to another.
>
> Sean
>
> -Original Message-
> From: Savova, Guergana [mailto:guergana.sav...@childrens.harvard.edu] 
> Sent: Friday, December 19, 2014 1:28 PM
> To: dev@ctakes.apache.org
> Subject: RE: cTakes Annotation Comparison
>
> Several thoughts:
> 1. The ShARe corpus annotates only mentions of type Diseases/Disorders and 
> only Anatomical Sites associated with a Disease/Disorder. This is by design. 
> cTAKES annotates all mentions of types Diseases/Disorders, Signs/Symptoms, 
> Procedures, Medications and Anatomical Sites. Therefore you will get MANY 
> more annotations with cTAKES. Eventually the ShARe corpus will be expanded to 
> the other types.
>
> 2. Keeping (1) in mind, you can approximately estimate the 
> precision/recall/f1 of cTAKES on the ShARe corpus if you output only mentions 
> of type Disease/Disorder. 
>
> 3. Could you send us the list of files you use from ShARe to test? We have 
> the corpus and would like to run against as well.
>
> Hope this makes sense...
> --Guergana
>
> -Original Message-
> From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] 
> Sent: Friday, December 19, 2014 1:16 PM
> To: dev@ctakes.apache.org
> Subject: Re: cTakes Annotation Comparison
>
> Our analysis against the human adjudicated gold standard from this SHARE 
> corpus uses a simple check to see whether the cTakes output included the 
> annotation specified by the gold standard. The initial results I reported 
> were for exact matches of CUI and text span.  Only exact matches were counted.
>
> If we also count as matches cTakes annotations with a matching CUI and a 
> text span that overlaps the gold standard text span, the matches increase 
> to 224 matching annotations for the FastUMLS pipeline and 2,319 for the 
> old pipeline.
>
> The question was also asked about annotations in the cTakes output that were 
> not in the human adjudicated gold standard. The answer is yes, there were a 
> lot of additional annotations made by cTakes that don't appear to be in the 
> gold standard. We haven't analyzed that yet, but it looks like the gold 
> standard we are using may only have Disease_Disorder annotations.
>
>
>
>  [image: IMAT Solutions]   Bruce Tietjen Senior 
> Software Engineer
> [image: Mobile:] 801.634.1547
> bruce.tiet...@imatsolutions.com
>
> On Fri, Dec 19, 2014 at 9:54 AM, Miller, Timothy < 
> timothy.mil...@childrens.harvard.edu> wrote:
>> Thanks Kim,
>> This sounds interesting though I don't totally understand it. Are you 
>> saying that extraction performance for a given note depends on which 
>> order the note was in the processing queue? If so that's pretty bad! 
>> If you (or anyone else who understands this issue) has a concrete 
>> example I think that might help me understand what the problem is/was.
>>
>> Even though, as Pei mentioned, we are going to try moving the 
>> community to the faster dictionary, I would like to understand better 
>> just to help myself avoid issues of this type going forward (and 
>> verify the new dictionary doesn't use similar logic).
>>
>> Also, when we finish annotating the sample notes, might we use that as 
>> a point of comparison for the two dictionaries? That would get around 
>> the issue that not everyone has access to the datasets we used for 
>> validation and others are likely not able to share theirs either. And 
>> maybe we can replicate the notes if we want to simulate the scenario 
>> Kim is talking about with thousands or more notes.
>>
>> Tim
>>
>>
>> On 12/19/2014 10:24 AM, Kim Ebert wrote:
>> Guergana,
>>
>> I'm curious about the number of records that are in your gold standard 
>> sets, or if your gold standard set was run through a long running cTAKES 
>> process.
>> I know at some point we fixed a bug in the old dictionary lookup that 
>> caused the permutations to become corrupted over time. Typically this 
>> isn't seen in the first few records, but over time as patterns are 
>> used the permutations would become corrupted. This caused documents 
>> that were fed through cTAKES more than once to have fewer codes 
>> returned than the first time.

RE: cTakes Annotation Comparison

2014-12-19 Thread Finan, Sean
I’m bringing it up in case the human annotations were done using a different 
version.

From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
Sent: Friday, December 19, 2014 1:40 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

Sean,

I don't think that would be an issue since both the rare word lookup and the 
first word lookup are using UMLS 2011AB. Or is the rare word lookup using a 
different dictionary?

I would expect roughly similar results between the two when it comes to 
differences between UMLS versions.

[IMAT Solutions]
Kim Ebert
Software Engineer
[Office:]801.669.7342
kim.eb...@imatsolutions.com
On 12/19/2014 11:31 AM, Finan, Sean wrote:

One quick mention:



The cTakes dictionaries are built with UMLS 2011AB.  If the human annotations 
were not done using the same UMLS version, then there WILL be differences in CUI 
and Semantic group.  I don't have time to go into the details and examples; just 
be aware that every six months CUIs are added, removed, deprecated, and moved 
from one TUI to another.



Sean



-Original Message-

From: Savova, Guergana [mailto:guergana.sav...@childrens.harvard.edu]

Sent: Friday, December 19, 2014 1:28 PM

To: dev@ctakes.apache.org

Subject: RE: cTakes Annotation Comparison



Several thoughts:

1. The ShARe corpus annotates only mentions of type Diseases/Disorders and only 
Anatomical Sites associated with a Disease/Disorder. This is by design. cTAKES 
annotates all mentions of types Diseases/Disorders, Signs/Symptoms, Procedures, 
Medications and Anatomical Sites. Therefore you will get MANY more annotations 
with cTAKES. Eventually the ShARe corpus will be expanded to the other types.



2. Keeping (1) in mind, you can approximately estimate the precision/recall/f1 
of cTAKES on the ShARe corpus if you output only mentions of type 
Disease/Disorder.



3. Could you send us the list of files you use from ShARe to test? We have the 
corpus and would like to run against as well.



Hope this makes sense...

--Guergana



-Original Message-

From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com]

Sent: Friday, December 19, 2014 1:16 PM

To: dev@ctakes.apache.org

Subject: Re: cTakes Annotation Comparison



Our analysis against the human adjudicated gold standard from this SHARE corpus 
uses a simple check to see whether the cTakes output included the annotation 
specified by the gold standard. The initial results I reported were for exact 
matches of CUI and text span.  Only exact matches were counted.


If we also count as matches cTakes annotations with a matching CUI and a text 
span that overlaps the gold standard text span, the matches increase to 224 
matching annotations for the FastUMLS pipeline and 2,319 for the old pipeline.



The question was also asked about annotations in the cTakes output that were 
not in the human adjudicated gold standard. The answer is yes, there were a lot 
of additional annotations made by cTakes that don't appear to be in the gold 
standard. We haven't analyzed that yet, but it looks like the gold standard we 
are using may only have Disease_Disorder annotations.







 [image: IMAT Solutions]   
Bruce Tietjen Senior Software Engineer

[image: Mobile:] 801.634.1547

bruce.tiet...@imatsolutions.com



On Fri, Dec 19, 2014 at 9:54 AM, Miller, Timothy < 
timothy.mil...@childrens.harvard.edu>
 wrote:



Thanks Kim,

This sounds interesting though I don't totally understand it. Are you

saying that extraction performance for a given note depends on which

order the note was in the processing queue? If so that's pretty bad!

If you (or anyone else who understands this issue) has a concrete

example I think that might help me understand what the problem is/was.



Even though, as Pei mentioned, we are going to try moving the

community to the faster dictionary, I would like to understand better

just to help myself avoid issues of this type going forward (and

verify the new dictionary doesn't use similar logic).



Also, when we finish annotating the sample notes, might we use that as

a point of comparison for the two dictionaries? That would get around

the issue that not everyone has access to the datasets we used for

validation and others are likely not able to share theirs either. And

maybe we can replicate the notes if we want to simulate the scenario

Kim is talking about with thousands or more notes.



Tim





On 12/19/2014 10:24 AM, Kim Ebert wrote:

Guergana,



I'm curious about the number of records that are in your gold standard 

sets, or if your gold standard set was run through a long running cTAKES 
process.

I know at some point we fixed a bug in the old dictionary lookup that caused 
the permutations to become corrupted over time.

Re: cTakes Annotation Comparison

2014-12-19 Thread Kim Ebert
Pei,

I don't think bugs/issues should be part of determining if one algorithm
vs the other is superior. Obviously, it is worth mentioning the bugs,
but if the fast lookup method has worse precision and recall but better
performance, vs the slower but more accurate first word lookup
algorithm, then time should be invested in fixing those bugs and
resolving those weird issues.

Now I'm not saying which one is superior in this case, as the data will
end up speaking for itself one way or the other; but as of right now,
I'm not yet convinced that the old dictionary lookup is obsolete,
and I'm not sure the community is convinced yet either.


IMAT Solutions 
Kim Ebert
Software Engineer
Office: 801.669.7342
kim.eb...@imatsolutions.com 
On 12/19/2014 08:39 AM, Chen, Pei wrote:
>
> Also check out stats that Sean ran before releasing the new component on:
>
> http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-dictionary-lookup-fast/doc/DictionaryLookupStats.docx
>
> From the evaluation and experience, the new lookup algorithm should be
> a huge improvement in terms of both speed and accuracy.
>
> This is very different than what Bruce mentioned…  I’m sure Sean will
> chime here.
>
> (The old dictionary lookup is essentially obsolete now- plagued with
> bugs/issues as you mentioned.)
>
> --Pei
>
>  
>
> *From:*Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
> *Sent:* Friday, December 19, 2014 10:25 AM
> *To:* dev@ctakes.apache.org
> *Subject:* Re: cTakes Annotation Comparison
>
>  
>
> Guergana,
>
> I'm curious about the number of records that are in your gold standard
> sets, or if your gold standard set was run through a long running
> cTAKES process. I know at some point we fixed a bug in the old
> dictionary lookup that caused the permutations to become corrupted
> over time. Typically this isn't seen in the first few records, but
> over time as patterns are used the permutations would become
> corrupted. This caused documents that were fed through cTAKES more
> than once to have fewer codes returned than the first time.
>
> For example, if a permutation of 4,2,3,1 was found, the permutation
> would be corrupted to be 1,2,3,4. It would no longer be possible to
> detect permutations of 4,2,3,1 until cTAKES was restarted. We got the
> fix in after the cTAKES 3.2.0 release.
> https://issues.apache.org/jira/browse/CTAKES-310 Depending upon the
> corpus size, I could see the permutation engine eventually only have a
> single permutation of 1,2,3,4.
>
> Typically though, this isn't very easily detected in the first 100 or
> so documents.
>
> We discovered this issue when we made cTAKES have consistent output of
> codes in our system.
>
>  
>
> IMAT Solutions 
>
> *Kim Ebert*
> Software Engineer
> Office:801.669.7342
> kim.eb...@imatsolutions.com 
>
> On 12/19/2014 07:05 AM, Savova, Guergana wrote:
>
> We are doing a similar kind of evaluation and will report the results.
>
>  
>
> Before we released the Fast lookup, we did a systematic evaluation across 
> three gold standard sets. We did not see the trend that Bruce reported below. 
> The P, R and F1 results from the old dictionary look up and the fast one were 
> similar.
>
>  
>
> Thank you everyone!
>
> --Guergana
>
>  
>
> -Original Message-
>
> From: David Kincaid [mailto:kincaid.d...@gmail.com] 
>
> Sent: Friday, December 19, 2014 9:02 AM
>
> To: dev@ctakes.apache.org 
>
> Subject: Re: cTakes Annotation Comparison
>
>  
>
> Thanks for this, Bruce! Very interesting work. It confirms what I've seen 
> in my small tests that I've done in a non-systematic way. Did you happen to 
> capture the number of false positives yet (annotations made by cTAKES that 
> are not in the human adjudicated standard)? I've seen a lot of dictionary 
> hits that are not actually entity mentions, but I haven't had a chance to do 
> a systematic analysis (we're working on our annotated gold standard now). One 
> great example is the antibiotic "Today". Every time the word today appears in 
> any text it is annotated as a medication mention when it is almost never 
> being used in that sense.
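>
> A cheap mitigation for hits like "Today" would be a post-filter over mention 
> covered text; purely illustrative, not a cTAKES feature:
>
>     import java.util.Set;
>
>     // Illustrative post-filter: suppress dictionary hits whose covered text
>     // is a known false-positive trigger. The stoplist and method names are
>     // hypothetical, not part of cTAKES.
>     public final class MentionStoplist {
>         private static final Set<String> STOP = Set.of("today", "tomorrow");
>
>         private MentionStoplist() {}
>
>         public static boolean keep(String coveredText) {
>             return !STOP.contains(coveredText.toLowerCase());
>         }
>     }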
>
>  
>
> These results by themselves are quite disappointing to me. Both the 
> UMLSProcessor and especially the FastUMLSProcessor seem to have pretty poor 
> recall. It seems like the trade-off for more speed is a ten-fold (or more) 
> decrease in entity recognition.
>
>  
>
> Thanks again for sharing your results with us. I think they are very 
> useful to the project.
>
>  
>
> - Dave
>
>  
>
> On Thu, Dec 18, 2014 at 5:06 PM, Bruce Tietjen < 
> bruce.tiet...@perfectsearchcorp.com 
> > wrote:
>
>  
>
> Actually, we are working on a similar tool to compare it to the human 
>
>

Re: cTakes Annotation Comparison

2014-12-19 Thread Bruce Tietjen
Rather than spam the mailing list with the list of filenames for the files
in the set we used, I would be happy to send it to anyone interested
privately.


 [image: IMAT Solutions] 
 Bruce Tietjen
Senior Software Engineer
[image: Mobile:] 801.634.1547
bruce.tiet...@imatsolutions.com

On Fri, Dec 19, 2014 at 11:47 AM, Kim Ebert 
wrote:
>
>  Pei,
>
> I don't think bugs/issues should be part of determining if one algorithm
> vs the other is superior. Obviously, it is worth mentioning the bugs, but
> if the fast lookup method has worse precision and recall but better
> performance, vs the slower but more accurate first word lookup algorithm,
> then time should be invested in fixing those bugs and resolving those weird
> issues.
>
> Now I'm not saying which one is superior in this case, as the data will
> end up speaking for itself one way or the other; but as of right now, I'm
> not yet convinced that the old dictionary lookup is obsolete, and I'm
> not sure the community is convinced yet either.
>
>
>  [image: IMAT Solutions] 
>  Kim Ebert
> Software Engineer
> [image: Office:] 801.669.7342
> kim.eb...@imatsolutions.com 
>  On 12/19/2014 08:39 AM, Chen, Pei wrote:
>
>  Also check out stats that Sean ran before releasing the new component on:
>
>
> http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-dictionary-lookup-fast/doc/DictionaryLookupStats.docx
>
> From the evaluation and experience, the new lookup algorithm should be a
> huge improvement in terms of both speed and accuracy.
>
> This is very different than what Bruce mentioned…  I’m sure Sean will
> chime here.
>
> (The old dictionary lookup is essentially obsolete now- plagued with
> bugs/issues as you mentioned.)
>
> --Pei
>
>
>
> *From:* Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com
> ]
> *Sent:* Friday, December 19, 2014 10:25 AM
> *To:* dev@ctakes.apache.org
> *Subject:* Re: cTakes Annotation Comparison
>
>
>
> Guergana,
>
> I'm curious about the number of records that are in your gold standard sets,
> or if your gold standard set was run through a long running cTAKES process.
> I know at some point we fixed a bug in the old dictionary lookup that
> caused the permutations to become corrupted over time. Typically this isn't
> seen in the first few records, but over time as patterns are used the
> permutations would become corrupted. This caused documents that were fed
> through cTAKES more than once to have fewer codes returned than the first
> time.
>
> For example, if a permutation of 4,2,3,1 was found, the permutation would
> be corrupted to be 1,2,3,4. It would no longer be possible to detect
> permutations of 4,2,3,1 until cTAKES was restarted. We got the fix in after
> the cTAKES 3.2.0 release. https://issues.apache.org/jira/browse/CTAKES-310
> Depending upon the corpus size, I could see the permutation engine
> eventually only have a single permutation of 1,2,3,4.
>
> Typically though, this isn't very easily detected in the first 100 or so
> documents.
>
> We discovered this issue when we made cTAKES have consistent output of
> codes in our system.
>
>
>
> [image: IMAT Solutions] 
>
> *Kim Ebert*
> Software Engineer
> [image: Office:]801.669.7342
> kim.eb...@imatsolutions.com 
>
> On 12/19/2014 07:05 AM, Savova, Guergana wrote:
>
> We are doing a similar kind of evaluation and will report the results.
>
>
>
> Before we released the Fast lookup, we did a systematic evaluation across 
> three gold standard sets. We did not see the trend that Bruce reported below. 
> The P, R and F1 results from the old dictionary look up and the fast one were 
> similar.
>
>
>
> Thank you everyone!
>
> --Guergana
>
>
>
> -Original Message-
>
> From: David Kincaid [mailto:kincaid.d...@gmail.com ]
>
> Sent: Friday, December 19, 2014 9:02 AM
>
> To: dev@ctakes.apache.org
>
> Subject: Re: cTakes Annotation Comparison
>
>
>
> Thanks for this, Bruce! Very interesting work. It confirms what I've seen in 
> my small tests that I've done in a non-systematic way. Did you happen to 
> capture the number of false positives yet (annotations made by cTAKES that 
> are not in the human adjudicated standard)? I've seen a lot of dictionary 
> hits that are not actually entity mentions, but I haven't had a chance to do 
> a systematic analysis (we're working on our annotated gold standard now). One 
> great example is the antibiotic "Today". Every time the word today appears in 
> any text it is annotated as a medication mention when it is almost never 
> being used in that sense.
>
>
>
> These results by themselves are quite disappointing to me. Both the 
> UMLSProcessor and especially the FastUMLSProcessor seem to have pretty poor 
> recall. It seems like the trade-off for more speed is a ten-fold (or more) 
> decrease in entity recognition.
>
>
>
> Thanks again for sharing your results with us. I think they are very useful 
> to the project.
>
>
>
> - Dave
>
>
>
> On Thu, Dec 18, 2014 at 5:06 PM, Bruce Tietjen wrote:

RE: cTakes Annotation Comparison

2014-12-19 Thread Finan, Sean
Our human annotators on Share used 2012AB.  I mention it because when I have 
done manual spot-checks between human and system annotations I had 
head-scratchers that ended up being differences in the UMLS version.  I first 
noticed these discrepancies before I had started working on the fast lookup 
(that is to say: when working with the default lookup).

From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
Sent: Friday, December 19, 2014 1:40 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

Sean,

I don't think that would be an issue since both the rare word lookup and the 
first word lookup are using UMLS 2011AB. Or is the rare word lookup using a 
different dictionary?

I would expect roughly similar results between the two when it comes to 
differences between UMLS versions.

[IMAT Solutions]
Kim Ebert
Software Engineer
[Office:]801.669.7342
kim.eb...@imatsolutions.com
On 12/19/2014 11:31 AM, Finan, Sean wrote:

One quick mention:



The cTakes dictionaries are built with UMLS 2011AB.  If the human annotations 
were not done using the same UMLS version, then there WILL be differences in CUI 
and Semantic group.  I don't have time to go into the details and examples; just 
be aware that every six months CUIs are added, removed, deprecated, and moved 
from one TUI to another.



Sean



-Original Message-

From: Savova, Guergana [mailto:guergana.sav...@childrens.harvard.edu]

Sent: Friday, December 19, 2014 1:28 PM

To: dev@ctakes.apache.org

Subject: RE: cTakes Annotation Comparison



Several thoughts:

1. The ShARe corpus annotates only mentions of type Diseases/Disorders and only 
Anatomical Sites associated with a Disease/Disorder. This is by design. cTAKES 
annotates all mentions of types Diseases/Disorders, Signs/Symptoms, Procedures, 
Medications and Anatomical Sites. Therefore you will get MANY more annotations 
with cTAKES. Eventually the ShARe corpus will be expanded to the other types.



2. Keeping (1) in mind, you can approximately estimate the precision/recall/f1 
of cTAKES on the ShARe corpus if you output only mentions of type 
Disease/Disorder.



3. Could you send us the list of files you use from ShARe to test? We have the 
corpus and would like to run against as well.



Hope this makes sense...

--Guergana



-Original Message-

From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com]

Sent: Friday, December 19, 2014 1:16 PM

To: dev@ctakes.apache.org

Subject: Re: cTakes Annotation Comparison



Our analysis against the human adjudicated gold standard from this SHARE corpus 
uses a simple check to see whether the cTakes output included the annotation 
specified by the gold standard. The initial results I reported were for exact 
matches of CUI and text span.  Only exact matches were counted.


If we also count as matches cTakes annotations with a matching CUI and a text 
span that overlaps the gold standard text span, the matches increase to 224 
matching annotations for the FastUMLS pipeline and 2,319 for the old pipeline.



The question was also asked about annotations in the cTakes output that were 
not in the human adjudicated gold standard. The answer is yes, there were a lot 
of additional annotations made by cTakes that don't appear to be in the gold 
standard. We haven't analyzed that yet, but it looks like the gold standard we 
are using may only have Disease_Disorder annotations.







 [image: IMAT Solutions]   
Bruce Tietjen Senior Software Engineer

[image: Mobile:] 801.634.1547

bruce.tiet...@imatsolutions.com



On Fri, Dec 19, 2014 at 9:54 AM, Miller, Timothy < 
timothy.mil...@childrens.harvard.edu>
 wrote:



Thanks Kim,

This sounds interesting though I don't totally understand it. Are you

saying that extraction performance for a given note depends on which

order the note was in the processing queue? If so that's pretty bad!

If you (or anyone else who understands this issue) has a concrete

example I think that might help me understand what the problem is/was.



Even though, as Pei mentioned, we are going to try moving the

community to the faster dictionary, I would like to understand better

just to help myself avoid issues of this type going forward (and

verify the new dictionary doesn't use similar logic).



Also, when we finish annotating the sample notes, might we use that as

a point of comparison for the two dictionaries? That would get around

the issue that not everyone has access to the datasets we used for

validation and others are likely not able to share theirs either. And

maybe we can replicate the notes if we want to simulate the scenario

Kim is talking about with thousands or more notes.



Tim





On 12/19/2014 10:24 AM, Kim Ebert wrote:

RE: cTakes Annotation Comparison

2014-12-19 Thread Chen, Pei
Kim,
Maintenance, not bugs/issues, is the deciding factor in forging ahead.
They are 2 components that do the same thing with the same goal (as Sean 
mentioned, one should be able to configure the new code base to replicate the 
old algorithm if required; it’s just a simpler and cleaner code base.  If this 
is not the case or if there are issues, we should fix it and move forward.).
We can keep the old component around for as long as needed, but it’s likely 
going to have limited support…
--Pei

From: Kim Ebert [mailto:kim.eb...@imatsolutions.com]
Sent: Friday, December 19, 2014 1:47 PM
To: Chen, Pei; dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

Pei,

I don't think bugs/issues should be part of determining if one algorithm vs the 
other is superior. Obviously, it is worth mentioning the bugs, but if the fast 
lookup method has worse precision and recall but better performance, vs the 
slower but more accurate first word lookup algorithm, then time should be 
invested in fixing those bugs and resolving those weird issues.

Now I'm not saying which one is superior in this case, as the data will end up 
speaking for itself one way or the other; but as of right now, I'm not yet 
convinced that the old dictionary lookup is obsolete, and I'm not sure 
the community is convinced yet either.

[IMAT Solutions]
Kim Ebert
Software Engineer
[Office:]801.669.7342
kim.eb...@imatsolutions.com
On 12/19/2014 08:39 AM, Chen, Pei wrote:
Also check out stats that Sean ran before releasing the new component on:
http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-dictionary-lookup-fast/doc/DictionaryLookupStats.docx
From the evaluation and experience, the new lookup algorithm should be a huge 
improvement in terms of both speed and accuracy.
This is very different than what Bruce mentioned…  I’m sure Sean will chime 
here.
(The old dictionary lookup is essentially obsolete now- plagued with 
bugs/issues as you mentioned.)
--Pei

From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
Sent: Friday, December 19, 2014 10:25 AM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

Guergana,

I'm curious about the number of records that are in your gold standard sets, or if 
your gold standard set was run through a long running cTAKES process. I know at 
some point we fixed a bug in the old dictionary lookup that caused the 
permutations to become corrupted over time. Typically this isn't seen in the 
first few records, but over time as patterns are used the permutations would 
become corrupted. This caused documents that were fed through cTAKES more than 
once to have fewer codes returned than the first time.

For example, if a permutation of 4,2,3,1 was found, the permutation would be 
corrupted to be 1,2,3,4. It would no longer be possible to detect permutations 
of 4,2,3,1 until cTAKES was restarted. We got the fix in after the cTAKES 3.2.0 
release. https://issues.apache.org/jira/browse/CTAKES-310 Depending upon the 
corpus size, I could see the permutation engine eventually only have a single 
permutation of 1,2,3,4.

Typically though, this isn't very easily detected in the first 100 or so 
documents.

We discovered this issue when we made cTAKES have consistent output of codes in 
our system.

[IMAT Solutions]
Kim Ebert
Software Engineer
[Office:]801.669.7342
kim.eb...@imatsolutions.com
On 12/19/2014 07:05 AM, Savova, Guergana wrote:

We are doing a similar kind of evaluation and will report the results.



Before we released the Fast lookup, we did a systematic evaluation across three 
gold standard sets. We did not see the trend that Bruce reported below. The P, 
R and F1 results from the old dictionary look up and the fast one were similar.



Thank you everyone!

--Guergana



-Original Message-

From: David Kincaid [mailto:kincaid.d...@gmail.com]

Sent: Friday, December 19, 2014 9:02 AM

To: dev@ctakes.apache.org

Subject: Re: cTakes Annotation Comparison



Thanks for this, Bruce! Very interesting work. It confirms what I've seen in my 
small tests that I've done in a non-systematic way. Did you happen to capture 
the number of false positives yet (annotations made by cTAKES that are not in 
the human adjudicated standard)? I've seen a lot of dictionary hits that are 
not actually entity mentions, but I haven't had a chance to do a systematic 
analysis (we're working on our annotated gold standard now). One great example 
is the antibiotic "Today". Every time the word today appears in any text it is 
annotated as a medication mention when it is almost never being used in that 
sense.



These results by themselves are quite disappointing to me. Both the 
UMLSProcessor and especially the FastUMLSProcessor seem to have pretty poor 
recall. It seems like the trade-off for more speed is a ten-fold (or more) 
decrease in entity recognition.

Re: cTakes Annotation Comparison

2014-12-19 Thread Bruce Tietjen
Sean,

I tried the configuration changes you mentioned in your earlier email.

The results are as follows:

Total Annotations found: 12,161 (default configuration found 8,284)

If counting exact span matches, this run only matched 211 (default
configuration matched 215).

If counting overlapping spans, this run only matched 220 (default
configuration matched 224)

Bruce



 [image: IMAT Solutions] 
 Bruce Tietjen
Senior Software Engineer
[image: Mobile:] 801.634.1547
bruce.tiet...@imatsolutions.com

On Fri, Dec 19, 2014 at 12:16 PM, Chen, Pei 
wrote:
>
>  Kim,
>
> Maintenance, not bugs/issues, is the deciding factor in forging ahead.
>
> They are 2 components that do the same thing with the same goal (as Sean
> mentioned, one should be able to configure the new code base to replicate
> the old algorithm if required; it’s just a simpler and cleaner code base.
> If this is not the case or if there are issues, we should fix it and move
> forward.).
>
> We can keep the old component around for as long as needed, but it’s
> likely going to have limited support…
>
> --Pei
>
>
>
> *From:* Kim Ebert [mailto:kim.eb...@imatsolutions.com]
> *Sent:* Friday, December 19, 2014 1:47 PM
> *To:* Chen, Pei; dev@ctakes.apache.org
>
> *Subject:* Re: cTakes Annotation Comparison
>
>
>
> Pei,
>
> I don't think bugs/issues should be part of determining if one algorithm
> vs the other is superior. Obviously, it is worth mentioning the bugs, but
> if the fast lookup method has worse precision and recall but better
> performance, vs the slower but more accurate first word lookup algorithm,
> then time should be invested in fixing those bugs and resolving those weird
> issues.
>
> Now I'm not saying which one is superior in this case, as the data will
> end up speaking for itself one way or the other; but as of right now, I'm
> not yet convinced that the old dictionary lookup is obsolete, and I'm
> not sure the community is convinced yet either.
>
>
>
> [image: IMAT Solutions] 
>
> *Kim Ebert*
> Software Engineer
> [image: Office:]801.669.7342
> kim.eb...@imatsolutions.com 
>
> On 12/19/2014 08:39 AM, Chen, Pei wrote:
>
> Also check out stats that Sean ran before releasing the new component on:
>
>
> http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-dictionary-lookup-fast/doc/DictionaryLookupStats.docx
>
> From the evaluation and experience, the new lookup algorithm should be a
> huge improvement in terms of both speed and accuracy.
>
> This is very different than what Bruce mentioned…  I’m sure Sean will
> chime here.
>
> (The old dictionary lookup is essentially obsolete now- plagued with
> bugs/issues as you mentioned.)
>
> --Pei
>
>
>
> *From:* Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com
> ]
> *Sent:* Friday, December 19, 2014 10:25 AM
> *To:* dev@ctakes.apache.org
> *Subject:* Re: cTakes Annotation Comparison
>
>
>
> Guergana,
>
> I'm curious about the number of records that are in your gold standard sets,
> or if your gold standard set was run through a long running cTAKES process.
> I know at some point we fixed a bug in the old dictionary lookup that
> caused the permutations to become corrupted over time. Typically this isn't
> seen in the first few records, but over time as patterns are used the
> permutations would become corrupted. This caused documents that were fed
> through cTAKES more than once to have fewer codes returned than the first
> time.
>
> For example, if a permutation of 4,2,3,1 was found, the permutation would
> be corrupted to be 1,2,3,4. It would no longer be possible to detect
> permutations of 4,2,3,1 until cTAKES was restarted. We got the fix in after
> the cTAKES 3.2.0 release. https://issues.apache.org/jira/browse/CTAKES-310
> Depending upon the corpus size, I could see the permutation engine
> eventually only have a single permutation of 1,2,3,4.
>
> Typically though, this isn't very easily detected in the first 100 or so
> documents.
>
> We discovered this issue when we made cTAKES have consistent output of
> codes in our system.
>
>
>
> [image: IMAT Solutions] 
>
> *Kim Ebert*
> Software Engineer
> [image: Office:]801.669.7342
> kim.eb...@imatsolutions.com 
>
> On 12/19/2014 07:05 AM, Savova, Guergana wrote:
>
> We are doing a similar kind of evaluation and will report the results.
>
>
>
> Before we released the Fast lookup, we did a systematic evaluation across 
> three gold standard sets. We did not see the trend that Bruce reported below. 
> The P, R and F1 results from the old dictionary look up and the fast one were 
> similar.
>
>
>
> Thank you everyone!
>
> --Guergana
>
>
>
> -Original Message-
>
> From: David Kincaid [mailto:kincaid.d...@gmail.com ]
>
> Sent: Friday, December 19, 2014 9:02 AM
>
> To: dev@ctakes.apache.org
>
> Subject: Re: cTakes Annotation Comparison
>
>
>
> Thanks for this, Bruce! Very interesting work. It confirms what I've seen in 
> my small tests that I've done in a non-systematic way. Did you happen to 
> capture the number of false positives yet (annotations made by cTAKES that 
> are not in the human adjudicated standard)?

Re: cTakes Annotation Comparison

2014-12-19 Thread Bruce Tietjen
Correction -- So far, I did steps 1 and 2 of Sean's email.


 [image: IMAT Solutions] 
 Bruce Tietjen
Senior Software Engineer
[image: Mobile:] 801.634.1547
bruce.tiet...@imatsolutions.com

On Fri, Dec 19, 2014 at 1:22 PM, Bruce Tietjen <
bruce.tiet...@perfectsearchcorp.com> wrote:
>
> Sean,
>
> I tried the configuration changes you mentioned in your earlier email.
>
> The results are as follows:
>
> Total Annotations found: 12,161 (default configuration found 8,284)
>
> If counting exact span matches, this run only matched 211 (default
> configuration matched 215).
>
> If counting overlapping spans, this run only matched 220 (default
> configuration matched 224)
>
> Bruce
>
>
>
>  [image: IMAT Solutions] 
>  Bruce Tietjen
> Senior Software Engineer
> [image: Mobile:] 801.634.1547
> bruce.tiet...@imatsolutions.com
>
> On Fri, Dec 19, 2014 at 12:16 PM, Chen, Pei <
> pei.c...@childrens.harvard.edu> wrote:
>>
>>  Kim,
>>
>> Maintenance, not bugs/issues, is the deciding factor in forging ahead.
>>
>> They are 2 components that do the same thing with the same goal (as Sean
>> mentioned, one should be able to configure the new code base to replicate
>> the old algorithm if required; it’s just a simpler and cleaner code base.
>> If this is not the case or if there are issues, we should fix it and move
>> forward.).
>>
>> We can keep the old component around for as long as needed, but it’s
>> likely going to have limited support…
>>
>> --Pei
>>
>>
>>
>> *From:* Kim Ebert [mailto:kim.eb...@imatsolutions.com]
>> *Sent:* Friday, December 19, 2014 1:47 PM
>> *To:* Chen, Pei; dev@ctakes.apache.org
>>
>> *Subject:* Re: cTakes Annotation Comparison
>>
>>
>>
>> Pei,
>>
>> I don't think bugs/issues should be part of determining if one algorithm
>> vs the other is superior. Obviously, it is worth mentioning the bugs, but
>> if the fast lookup method has worse precision and recall but better
>> performance, vs the slower but more accurate first word lookup algorithm,
>> then time should be invested in fixing those bugs and resolving those weird
>> issues.
>>
>> Now I'm not saying which one is superior in this case, as the data will
>> end up speaking for itself one way or the other; but as of right now, I'm
>> not yet convinced that the old dictionary lookup is obsolete, and I'm
>> not sure the community is convinced yet either.
>>
>>
>>
>> [image: IMAT Solutions] 
>>
>> *Kim Ebert*
>> Software Engineer
>> [image: Office:]801.669.7342
>> kim.eb...@imatsolutions.com 
>>
>> On 12/19/2014 08:39 AM, Chen, Pei wrote:
>>
>> Also check out stats that Sean ran before releasing the new component on:
>>
>>
>> http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-dictionary-lookup-fast/doc/DictionaryLookupStats.docx
>>
>> From the evaluation and experience, the new lookup algorithm should be a
>> huge improvement in terms of both speed and accuracy.
>>
>> This is very different than what Bruce mentioned…  I’m sure Sean will
>> chime here.
>>
>> (The old dictionary lookup is essentially obsolete now- plagued with
>> bugs/issues as you mentioned.)
>>
>> --Pei
>>
>>
>>
>> *From:* Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com
>> ]
>> *Sent:* Friday, December 19, 2014 10:25 AM
>> *To:* dev@ctakes.apache.org
>> *Subject:* Re: cTakes Annotation Comparison
>>
>>
>>
>> Guergana,
>>
>> I'm curious about the number of records that are in your gold standard sets,
>> or if your gold standard set was run through a long running cTAKES process.
>> I know at some point we fixed a bug in the old dictionary lookup that
>> caused the permutations to become corrupted over time. Typically this isn't
>> seen in the first few records, but over time as patterns are used the
>> permutations would become corrupted. This caused documents that were fed
>> through cTAKES more than once to have fewer codes returned than the first
>> time.
>>
>> For example, if a permutation of 4,2,3,1 was found, the permutation would
>> be corrupted to be 1,2,3,4. It would no longer be possible to detect
>> permutations of 4,2,3,1 until cTAKES was restarted. We got the fix in after
>> the cTAKES 3.2.0 release.
>> https://issues.apache.org/jira/browse/CTAKES-310 Depending upon the
>> corpus size, I could see the permutation engine eventually only have a
>> single permutation of 1,2,3,4.
>>
>> Typically though, this isn't very easily detected in the first 100 or so
>> documents.
>>
>> We discovered this issue when we made cTAKES have consistent output of
>> codes in our system.
>>
>>
>>
>> [image: IMAT Solutions] 
>>
>> *Kim Ebert*
>> Software Engineer
>> [image: Office:]801.669.7342
>> kim.eb...@imatsolutions.com 
>>
>> On 12/19/2014 07:05 AM, Savova, Guergana wrote:
>>
>> We are doing a similar kind of evaluation and will report the results.
>>
>>
>>
>> Before we released the Fast lookup, we did a systematic evaluation across 
>> three gold standard sets. We did not see the trend that Bruce reported below.

RE: cTakes Annotation Comparison

2014-12-19 Thread Finan, Sean
Hi Bruce,

I'm not sure how there would be fewer matches with the overlap processor.  
There should be all of the matches from the non-overlap processor plus those 
from the overlap.  Decreasing from 215 to 211 is strange.  Have you done any 
manual spot checks on this?  It is really bizarre that you'd only have two 
matches per document (100 docs?).  
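
Sean's observation is an invariant the comparison tool could assert: every
exact match is by definition also an overlapping match, so the overlap count
can never be lower. A sketch, reusing the hypothetical Annotation/MatchCriteria
types from earlier in the thread:

    // Sanity-check sketch: an exact match is by definition also an overlapping
    // match, so for any gold/system pair the overlap count can never be lower
    // than the exact count. Reuses the hypothetical Annotation/MatchCriteria
    // sketch from earlier in the thread.
    class SupersetCheck {
        static void check(Annotation gold, Annotation sys) {
            if (MatchCriteria.exact(gold, sys)
                    && !MatchCriteria.overlapping(gold, sys)) {
                throw new AssertionError("exact match not counted as overlap");
            }
        }

        public static void main(String[] args) {
            check(new Annotation("C0011849", 10, 18),
                  new Annotation("C0011849", 10, 18)); // exact -> also overlaps
        }
    }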

Thanks,
Sean

-Original Message-
From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] 
Sent: Friday, December 19, 2014 3:23 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

Sean,

I tried the configuration changes you mentioned in your earlier email.

The results are as follows:

Total Annotations found: 12,161 (default configuration found 8,284)

If counting exact span matches, this run only matched 211 (default 
configuration matched 215).

If counting overlapping spans, this run only matched 220 (default configuration 
matched 224)

Bruce



 [image: IMAT Solutions]   Bruce Tietjen Senior 
Software Engineer
[image: Mobile:] 801.634.1547
bruce.tiet...@imatsolutions.com

On Fri, Dec 19, 2014 at 12:16 PM, Chen, Pei 
wrote:
>
>  Kim,
>
> Maintenance, not bugs/issues, is the deciding factor in forging ahead.
>
> They are 2 components that do the same thing with the same goal (as 
> Sean mentioned, one should be able to configure the new code base to 
> replicate the old algorithm if required; it’s just a simpler and 
> cleaner code base.  If this is not the case or if there are issues, we 
> should fix it and move forward.).
>
> We can keep the old component around for as long as needed, but it’s 
> likely going to have limited support…
>
> --Pei
>
>
>
> *From:* Kim Ebert [mailto:kim.eb...@imatsolutions.com]
> *Sent:* Friday, December 19, 2014 1:47 PM
> *To:* Chen, Pei; dev@ctakes.apache.org
>
> *Subject:* Re: cTakes Annotation Comparison
>
>
>
> Pei,
>
> I don't think bugs/issues should be part of determining if one 
> algorithm vs the other is superior. Obviously, it is worth mentioning 
> the bugs, but if the fast lookup method has worse precision and recall 
> but better performance, vs the slower but more accurate first word 
> lookup algorithm, then time should be invested in fixing those bugs 
> and resolving those weird issues.
>
> Now I'm not saying which one is superior in this case, as the data 
> will end up speaking for itself one way or the other; but as of right 
> now, I'm not yet convinced that the old dictionary lookup is obsolete, 
> and I'm not sure the community is convinced yet either.
>
>
>
> [image: IMAT Solutions] 
>
> *Kim Ebert*
> Software Engineer
> [image: Office:]801.669.7342
> kim.eb...@imatsolutions.com 
>
> On 12/19/2014 08:39 AM, Chen, Pei wrote:
>
> Also check out stats that Sean ran before releasing the new component on:
>
>
> http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-dictionary-lookup-fast/doc/DictionaryLookupStats.docx
>
> From the evaluation and experience, the new lookup algorithm should be 
> a huge improvement in terms of both speed and accuracy.
>
> This is very different than what Bruce mentioned…  I’m sure Sean will 
> chime here.
>
> (The old dictionary lookup is essentially obsolete now- plagued with 
> bugs/issues as you mentioned.)
>
> --Pei
>
>
>
> *From:* Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com
> ]
> *Sent:* Friday, December 19, 2014 10:25 AM
> *To:* dev@ctakes.apache.org
> *Subject:* Re: cTakes Annotation Comparison
>
>
>
> Guergana,
>
> I'm curious about the number of records that are in your gold standard 
> sets, or if your gold standard set was run through a long running cTAKES 
> process.
> I know at some point we fixed a bug in the old dictionary lookup that 
> caused the permutations to become corrupted over time. Typically this 
> isn't seen in the first few records, but over time as patterns are 
> used the permutations would become corrupted. This caused documents 
> that were fed through cTAKES more than once to have fewer codes 
> returned than the first time.
>
> For example, if a permutation of 4,2,3,1 was found, the permutation 
> would be corrupted to be 1,2,3,4. It would no longer be possible to 
> detect permutations of 4,2,3,1 until cTAKES was restarted. We got the 
> fix in after the cTAKES 3.2.0 release. 
> https://issues.apache.org/jira/browse/CTAKES-310
> Depending upon the corpus size, I could see the permutation engine 
> eventually only have a single permutation of 1,2,3,4.
>
> Typically though, this isn't very easily detected in the first 100 or 
> so documents.
>
> We discovered this issue when we made cTAKES have consistent output of 
> codes in our system.
>
>
>
> [image: IMAT Solutions] 
>
> *Kim Ebert*
> Software Engineer
> [image: Office:]801.669.7342
> kim.eb...@imatsolutions.com 
>
> On 12/19/2014 07:05 AM, Savova, Guergana wrote:
>
> We are doing a similar kind of evaluation and will report the results.
>
>
>
> Before we released the Fast lookup, we did a systematic evaluation across 
> three gold standard sets.

Re: cTakes Annotation Comparison

2014-12-19 Thread Bruce Tietjen
My original results were using a newly downloaded cTakes 3.2.1 with the
separately downloaded resources copied in. There were no changes to any of
the configuration files.

As far as this last run, I modified the UMLSLookupAnnotator.xml and
AggregatePlaintextFastUMLSProcessor.xml.  I've attached the modified ones I
used (but they may not get through the mailing list).



 [image: IMAT Solutions] 
 Bruce Tietjen
Senior Software Engineer
[image: Mobile:] 801.634.1547
bruce.tiet...@imatsolutions.com

On Fri, Dec 19, 2014 at 1:27 PM, Finan, Sean <
sean.fi...@childrens.harvard.edu> wrote:
>
> Hi Bruce,
>
> I'm not sure how there would be fewer matches with the overlap processor.
> There should be all of the matches from the non-overlap processor plus
> those from the overlap.  Decreasing from 215 to 211 is strange.  Have you
> done any manual spot checks on this?  It is really bizarre that you'd only
> have two matches per document (100 docs?).
>
> Thanks,
> Sean
>
> -Original Message-
> From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com]
> Sent: Friday, December 19, 2014 3:23 PM
> To: dev@ctakes.apache.org
> Subject: Re: cTakes Annotation Comparison
>
> Sean,
>
> I tried the configuration changes you mentioned in your earlier email.
>
> The results are as follows:
>
> Total Annotations found: 12,161 (default configuration found 8,284)
>
> If counting exact span matches, this run only matched 211 (default
> configuration matched 215).
>
> If counting overlapping spans, this run only matched 220 (default
> configuration matched 224)
>
> Bruce
>
>
>
>  [image: IMAT Solutions]   Bruce Tietjen Senior
> Software Engineer
> [image: Mobile:] 801.634.1547
> bruce.tiet...@imatsolutions.com
>
> On Fri, Dec 19, 2014 at 12:16 PM, Chen, Pei <
> pei.c...@childrens.harvard.edu>
> wrote:
> >
> >  Kim,
> >
> > Maintenance, not bugs/issues, is the deciding factor in forging ahead.
> >
> > They are 2 components that do the same thing with the same goal (as
> > Sean mentioned, one should be able to configure the new code base to
> > replicate the old algorithm if required; it’s just a simpler and
> > cleaner code base.  If this is not the case or if there are issues, we
> > should fix it and move forward.).
> >
> > We can keep the old component around for as long as needed, but it’s
> > likely going to have limited support…
> >
> > --Pei
> >
> >
> >
> > *From:* Kim Ebert [mailto:kim.eb...@imatsolutions.com]
> > *Sent:* Friday, December 19, 2014 1:47 PM
> > *To:* Chen, Pei; dev@ctakes.apache.org
> >
> > *Subject:* Re: cTakes Annotation Comparison
> >
> >
> >
> > Pei,
> >
> > I don't think bugs/issues should be part of determining if one
> > algorithm vs the other is superior. Obviously, it is worth mentioning
> > the bugs, but if the fast lookup method has worse precision and recall
> > but better performance, vs the slower but more accurate first word
> > lookup algorithm, then time should be invested in fixing those bugs
> > and resolving those weird issues.
> >
> > Now I'm not saying which one is superior in this case, as the data
> > will end up speaking for itself one way or the other; but as of right
> > now, I'm not yet convinced that the old dictionary lookup is obsolete,
> > and I'm not sure the community is convinced yet either.
> >
> >
> >
> > [image: IMAT Solutions] 
> >
> > *Kim Ebert*
> > Software Engineer
> > [image: Office:]801.669.7342
> > kim.eb...@imatsolutions.com 
> >
> > On 12/19/2014 08:39 AM, Chen, Pei wrote:
> >
> > Also check out stats that Sean ran before releasing the new component on:
> >
> >
> > http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-dictionary-lookup-fast/doc/DictionaryLookupStats.docx
> >
> > From the evaluation and experience, the new lookup algorithm should be
> > a huge improvement in terms of both speed and accuracy.
> >
> > This is very different than what Bruce mentioned…  I’m sure Sean will
> > chime here.
> >
> > (The old dictionary lookup is essentially obsolete now- plagued with
> > bugs/issues as you mentioned.)
> >
> > --Pei
> >
> >
> >
> > *From:* Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com
> > ]
> > *Sent:* Friday, December 19, 2014 10:25 AM
> > *To:* dev@ctakes.apache.org
> > *Subject:* Re: cTakes Annotation Comparison
> >
> >
> >
> > Guergana,
> >
> > I'm curious about the number of records that are in your gold standard
> > sets, or if your gold standard set was run through a long running cTAKES
> process.
> > I know at some point we fixed a bug in the old dictionary lookup that
> > caused the permutations to become corrupted over time. Typically this
> > isn't seen in the first few records, but over time as patterns are
> > used the permutations would become corrupted. This caused documents
> > that were fed through cTAKES more than once to have fewer codes
> > returned than the first time.
> >
> > For example, if a permutation of 4,2,3,1 was found, the permutation
> > would be corrupted to be 1,2,3,4.

RE: cTakes Annotation Comparison

2014-12-19 Thread Finan, Sean
Hi Bruce,
> Correction -- So far, I did steps 1 and 2 of Sean's email.

No problem.  Aside from recreating the database, those two steps have the 
greatest impact.  But before you change anything else, please do some manual 
spot checks.  I have never seen a case where the lookup would be so horribly 
inaccurate.

Thanks

-Original Message-
From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com] 
Sent: Friday, December 19, 2014 3:29 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

Correction -- So far, I did steps 1 and 2 of Sean's email.


 [image: IMAT Solutions]   Bruce Tietjen Senior 
Software Engineer
[image: Mobile:] 801.634.1547
bruce.tiet...@imatsolutions.com

On Fri, Dec 19, 2014 at 1:22 PM, Bruce Tietjen < 
bruce.tiet...@perfectsearchcorp.com> wrote:
>
> Sean,
>
> I tried the configuration changes you mentioned in your earlier email.
>
> The results are as follows:
>
> Total Annotations found: 12,161 (default configuration found 8,284)
>
> If counting exact span matches, this run only matched 211 (default 
> configuration matched 215).
>
> If counting overlapping spans, this run only matched 220 (default 
> configuration matched 224)
>
> Bruce
>
>
>
>  [image: IMAT Solutions]   Bruce Tietjen 
> Senior Software Engineer
> [image: Mobile:] 801.634.1547
> bruce.tiet...@imatsolutions.com
>
> On Fri, Dec 19, 2014 at 12:16 PM, Chen, Pei < 
> pei.c...@childrens.harvard.edu> wrote:
>>
>>  Kim,
>>
>> Maintenance, not bugs/issues, is the deciding factor in forging ahead.
>>
>> They are 2 components that do the same thing with the same goal (as 
>> Sean mentioned, one should be able to configure the new code base to 
>> replicate the old algorithm if required; it’s just a simpler and 
>> cleaner code base.  If this is not the case or if there are issues, 
>> we should fix it and move forward.).
>>
>> We can keep the old component around for as long as needed, but it’s 
>> likely going to have limited support…
>>
>> --Pei
>>
>>
>>
>> *From:* Kim Ebert [mailto:kim.eb...@imatsolutions.com]
>> *Sent:* Friday, December 19, 2014 1:47 PM
>> *To:* Chen, Pei; dev@ctakes.apache.org
>>
>> *Subject:* Re: cTakes Annotation Comparison
>>
>>
>>
>> Pei,
>>
>> I don't think bugs/issues should be part of determining if one 
>> algorithm vs the other is superior. Obviously, it is worth mentioning 
>> the bugs, but if the fast lookup method has worse precision and 
>> recall but better performance, vs the slower but more accurate first 
>> word lookup algorithm, then time should be invested in fixing those 
>> bugs and resolving those weird issues.
>>
>> Now I'm not saying which one is superior in this case, as the data 
>> will end up speaking for itself one way or the other; but as of right 
>> now, I'm not yet convinced that the old dictionary lookup is obsolete, 
>> and I'm not sure the community is convinced yet either.
>>
>>
>>
>> [image: IMAT Solutions] 
>>
>> *Kim Ebert*
>> Software Engineer
>> [image: Office:]801.669.7342
>> kim.eb...@imatsolutions.com 
>>
>> On 12/19/2014 08:39 AM, Chen, Pei wrote:
>>
>> Also check out stats that Sean ran before releasing the new component on:
>>
>>
>> http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-dictionary-lookup-fast/doc/DictionaryLookupStats.docx
>>
>> From the evaluation and experience, the new lookup algorithm should 
>> be a huge improvement in terms of both speed and accuracy.
>>
>> This is very different than what Bruce mentioned…  I’m sure Sean will 
>> chime here.
>>
>> (The old dictionary lookup is essentially obsolete now- plagued with 
>> bugs/issues as you mentioned.)
>>
>> --Pei
>>
>>
>>
>> *From:* Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com
>> ]
>> *Sent:* Friday, December 19, 2014 10:25 AM
>> *To:* dev@ctakes.apache.org
>> *Subject:* Re: cTakes Annotation Comparison
>>
>>
>>
>> Guergana,
>>
>> I'm curious about the number of records that are in your gold standard 
>> sets, or if your gold standard set was run through a long running cTAKES 
>> process.
>> I know at some point we fixed a bug in the old dictionary lookup that
>> caused the permutations to become corrupted over time. Typically this
>> isn't seen in the first few records, but over time, as patterns are
>> used, the permutations become corrupted. This caused documents that
>> were fed through cTAKES more than once to have fewer codes returned
>> than the first time.
>>
>> For example, if a permutation of 4,2,3,1 was found, the permutation
>> would be corrupted to 1,2,3,4. It would no longer be possible to
>> detect permutations of 4,2,3,1 until cTAKES was restarted. We got the
>> fix in after the cTAKES 3.2.0 release:
>> https://issues.apache.org/jira/browse/CTAKES-310
>> Depending upon the corpus size, I could see the permutation engine
>> eventually having only a single permutation of 1,2,3,4.
>>
>> Typically though, this isn't very easily detected in the first 100 or
>> so documents.
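A minimal sketch of the failure mode Kim describes, with hypothetical names
(the real fix is in CTAKES-310): the old lookup effectively kept one shared
permutation table for the whole process, and a lookup sorted a permutation in
place, so 4,2,3,1 silently became 1,2,3,4 for the rest of the process lifetime.

import java.util.*;

public class PermutationBugSketch {

    // One permutation table shared across all lookups (a hypothetical
    // simplification of the old dictionary lookup's state).
    static final List<List<Integer>> PERMUTATIONS = new ArrayList<>();
    static {
        PERMUTATIONS.add(new ArrayList<>(Arrays.asList(4, 2, 3, 1)));
    }

    // BUG: sorting the shared list in place destroys the permutation
    // for every later document.
    static void buggyLookup(List<Integer> permutation) {
        Collections.sort(permutation);
    }

    // FIX (in the spirit of CTAKES-310): sort a private copy instead.
    static void fixedLookup(List<Integer> permutation) {
        List<Integer> copy = new ArrayList<>(permutation);
        Collections.sort(copy);
    }

    public static void main(String[] args) {
        buggyLookup(PERMUTATIONS.get(0));
        // Prints [1, 2, 3, 4]: the 4,2,3,1 ordering can no longer be
        // matched until the process is restarted.
        System.out.println(PERMUTATIONS.get(0));
    }
}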

RE: cTakes Annotation Comparison

2014-12-19 Thread Finan, Sean
Sorry, I meant “Do some spot checks on the validity”.  In other words, when 
your script reports that a cui and/or span is missing, manually look at the 
data and see if it really is.  Just open up one .xmi in the CVD and see what it 
looks like.

Thanks,
Sean
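(For anyone wanting to do the same spot check: the CAS Visual Debugger ships
with UIMA. A hedged sketch of launching it against a cTAKES install -- the
CTAKES_HOME variable and classpath wildcard are assumptions here, not an
official launcher script:

java -cp "$CTAKES_HOME/lib/*:$CTAKES_HOME/resources" org.apache.uima.tools.cvd.CVD

then load your type system and the .xmi CAS from the File menu.)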

From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com]
Sent: Friday, December 19, 2014 3:37 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

My original results were from a newly downloaded cTakes 3.2.1 with the
separately downloaded resources copied in. There were no changes to any of the
configuration files.
For this last run, I modified the UMLSLookupAnnotator.xml and
AggregatePlaintextFastUMLSProcessor.xml.  I've attached the modified ones I
used (but they may not get through the mailing list).



Bruce Tietjen
Senior Software Engineer
Mobile: 801.634.1547
bruce.tiet...@imatsolutions.com

On Fri, Dec 19, 2014 at 1:27 PM, Finan, Sean <
sean.fi...@childrens.harvard.edu> wrote:
Hi Bruce,

I'm not sure how there would be fewer matches with the overlap processor.  
There should be all of the matches from the non-overlap processor plus those 
from the overlap.  Decreasing from 215 to 211 is strange.  Have you done any 
manual spot checks on this?  It is really bizarre that you'd only have two 
matches per document (100 docs?).

Thanks,
Sean


Re: cTakes Annotation Comparison

2014-12-19 Thread Bruce Tietjen
I'll do that -- there is always a possibility of bugs in the analysis tool.



Bruce Tietjen
Senior Software Engineer
Mobile: 801.634.1547
bruce.tiet...@imatsolutions.com

On Fri, Dec 19, 2014 at 1:39 PM, Finan, Sean <
sean.fi...@childrens.harvard.edu> wrote:
>
> Sorry, I meant “Do some spot checks on the validity”.  In other words,
> when your script reports that a cui and/or span is missing, manually look
> at the data and see if it really is.  Just open up one .xmi in the CVD and
> see what it looks like.
>
> Thanks,
> Sean

Re: cTakes Annotation Comparison

2014-12-19 Thread Bruce Tietjen
My apologies to Sean and everyone,

I am happy to report that I found a bug in our analysis tools that was
dropping the last entry of every FSArray it read.

With the bug fixed, the results look MUCH better.

UMLSProcessor found 31,598 annotations
FastUMLSProcessor found 30,716 annotations

There were 23,522 annotations that were exact matches between the two.

When comparing with the gold standard annotations (4591 annotations):

UMLSProcessor found 2,632 matches (2,735 including overlaps)
FastUMLSProcessor found 2,795 matches (2,842 including overlaps)
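
For anyone comparing notes, a minimal sketch of the kind of off-by-one that
drops the last FSArray entry (hypothetical names; the actual tool is in the
cTAKES-compare repo):

import org.apache.uima.cas.FeatureStructure;
import org.apache.uima.jcas.cas.FSArray;

public class FsArrayWalkSketch {

    // BUG: '< size() - 1' silently skips the final element of the array.
    static void buggyWalk(FSArray concepts) {
        for (int i = 0; i < concepts.size() - 1; i++) {
            FeatureStructure fs = concepts.get(i);
            // ... compare the concept's CUI and span here ...
        }
    }

    // FIX: iterate over the whole array.
    static void fixedWalk(FSArray concepts) {
        for (int i = 0; i < concepts.size(); i++) {
            FeatureStructure fs = concepts.get(i);
            // ... compare the concept's CUI and span here ...
        }
    }
}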






Bruce Tietjen
Senior Software Engineer
Mobile: 801.634.1547
bruce.tiet...@imatsolutions.com

On Fri, Dec 19, 2014 at 1:49 PM, Bruce Tietjen <
bruce.tiet...@perfectsearchcorp.com> wrote:
>
> I'll do that -- there is always a possibility of bugs in the analysis
> tool.

Re: cTakes Annotation Comparison

2014-12-19 Thread Pei Chen
ah! Excellent news... that's much more in line with our experience and
evaluation results.

On Fri, Dec 19, 2014 at 5:04 PM, Bruce Tietjen <
bruce.tiet...@perfectsearchcorp.com> wrote:

> My apologies to Sean and everyone,
>
> I am happy to report that I found a bug in our analysis tools that was
> dropping the last entry of every FSArray it read.
>
> With the bug fixed, the results look MUCH better.
>
> UMLSProcessor found 31,598 annotations
> FastUMLSProcessor found 30,716 annotations
>
> There were 23,522 annotations that were exact matches between the two.
>
> When comparing with the gold standard annotations (4591 annotations):
>
> UMLSProcessor found 2,632 matches (2,735 including overlaps)
> FastUMLSProcessor found 2,795 matches (2,842 including overlaps)

RE: cTakes Annotation Comparison --- (^:

2014-12-19 Thread Finan, Sean
Apologies accepted.  I'm really glad that you found the problem.

So what you are saying is (just to be very, very clear to everybody reading
this thread):

>FastUMLSProcessor found 2,795 matches (2,842 including overlaps)
While
> UMLSProcessor found 2,632 matches (2,735 including overlaps)

--- So recall is BETTER in the fast lookup

And...
>FastUMLSProcessor found 30,716 annotations
While
>UMLSProcessor found 31,598 annotations

--- So precision is also looking BETTER in the fast lookup
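
Spelling that out with the thread's numbers (a rough check, not a formal
evaluation: recall here is exact matches over the 4,591-annotation gold set,
and the precision comparison uses matches over total annotations as a proxy):

Recall:           UMLSProcessor      2,632 / 4,591  ≈ 57.3%
                  FastUMLSProcessor  2,795 / 4,591  ≈ 60.9%
Precision proxy:  UMLSProcessor      2,632 / 31,598 ≈  8.3%
                  FastUMLSProcessor  2,795 / 30,716 ≈  9.1%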

Now maybe there will be a little more buy-in for the fast lookup.

Cheers,
Sean


-Original Message-
From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com]
Sent: Friday, December 19, 2014 5:05 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison

My apologies to Sean and everyone,

I am happy to report that I found a bug in our analysis tools that was
dropping the last entry of every FSArray it read.

Re: cTakes Annotation Comparison

2014-12-19 Thread Kim Ebert
Bruce,

I think we all feel a lot better now, and the tool should be helpful moving
forward.

I've updated the git repo with the fix in case anyone is interested.

Kim Ebert
Software Engineer
Office: 801.669.7342
kim.eb...@imatsolutions.com
On 12/19/2014 03:04 PM, Bruce Tietjen wrote:
> My apologies to Sean and everyone,
>
> I am happy to report that I found a bug in our analysis tools that was
> dropping the last entry of every FSArray it read.
>
> With the bug fixed, the results look MUCH better.

Re: cTakes Annotation Comparison

2014-12-19 Thread Bruce Tietjen
When I only include SignSymptomMention and DiseaseDisorderMention in the
analysis (which excludes annotation types not present in the gold standard),
the matched annotations remain the same, while the total annotations found in
those categories come out as follows:

Total Annotations found:
FastUMLSProcessing: 12,811
UMLSProcessing:     46,571
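
Taking those counts at face value together with the exact-match totals above,
a rough category-restricted precision proxy (gold matches over annotations in
the two gold categories) would be:

FastUMLSProcessing: 2,795 / 12,811 ≈ 21.8%
UMLSProcessing:     2,632 / 46,571 ≈  5.7%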


Bruce Tietjen
Senior Software Engineer
Mobile: 801.634.1547
bruce.tiet...@imatsolutions.com

On Fri, Dec 19, 2014 at 3:04 PM, Bruce Tietjen <
bruce.tiet...@perfectsearchcorp.com> wrote:
>
> My apologies to Sean and everyone,
>
> I am happy to report that I found a bug in our analysis tools that was
> dropping the last entry of every FSArray it read.
>
> With the bug fixed, the results look MUCH better.
>
> UMLSProcessor found 31,598 annotations
> FastUMLSProcessor found 30,716 annotations
>
> There were 23,522 annotations that were exact matches between the two.
>
> When comparing with the gold standard annotations (4591 annotations):
>
> UMLSProcessor found 2,632 matches (2,735 including overlaps)
> FastUMLSProcessor found 2,795 matches (2,842 including overlaps)