Re: cTakes output predictability

Kim Ebert Tue, 07 Oct 2014 09:04:47 -0700

I think it would be helpful actually, as digging deeper into the issue
has highlighted to me a few places in the code that actually cause
inconsistent results to be returned when running the same document
through multiple times. I think having the code base be predictable will
make it easier to debug.


Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 09:58 AM, Masanz, James J. wrote:
> FWIW, I agree with Sean that comparing should be a post-processing step and 
> trying to get UIMA internal IDs to match on subsequent runs is not worth 
> opening the code for.
>
> -----Original Message-----
> From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com] 
> Sent: Tuesday, October 07, 2014 10:56 AM
> To: dev@ctakes.apache.org
> Subject: Re: cTakes output predictability
>
> I think we may really prefer the first method. Since it doesn't appear
> that there are any consequences with moving forward with changing the
> code, we would really like to move forward with this approach.
>
> Kim Ebert
> 1.801.669.7342
> Perfect Search Corp
> http://www.perfectsearchcorp.com/
>
> On 10/07/2014 09:35 AM, britt fitch wrote:
>> The option Sean mentioned of writing your own custom consumer (without
>> the UIMA id that is causing your issues) should meet these needs I
>> believe. 
>>
>> Britt Fitch
>> Wired Informatics
>> 265 Franklin St Ste 1702
>> Boston, MA 02110
>> http://wiredinformatics.com
>> britt.fi...@wiredinformatics.com
>>
>> On Oct 7, 2014, at 11:29 AM, Kim Ebert
>> <kim.eb...@perfectsearchcorp.com
>> <mailto:kim.eb...@perfectsearchcorp.com>> wrote:
>>
>>> Hi Sean,
>>>
>>> Well of course that makes plenty of sense. When testing different cTakes
>>> configurations you would expect different output. In our testing we've
>>> found several cases where running with the same configuration outputs
>>> different data under different moons. Having consistent results helps us
>>> know if we've made improvements to our quality or not. Having output
>>> that is in a predictable order makes checking to see if there are
>>> differences much cheaper when you are dealing with larger data sets.
>>>
>>> Kim Ebert
>>> 1.801.669.7342
>>> Perfect Search Corp
>>> http://www.perfectsearchcorp.com/
>>>
>>> On 10/07/2014 08:50 AM, Finan, Sean wrote:
>>>> Hi Kim,
>>>>
>>>> One might want to compare the Sentence detector that uses end of line
>>>> characters as sentence splitters with one that does not.  Such a
>>>> change in sentence splitting would not only affect the sentence type
>>>> discoveries but also practically every type that follows.
>>>>
>>>> Another might want to compare a note with "skin cancer" vs. one in
>>>> which you replace "skin cancer" with "melanoma" just to see what the
>>>> CUI differences might be.  There are changes in two words vs. one,
>>>> 11 characters vs. 8, a removed adjective(?), and of course changes
>>>> in CUIs.
>>>>
>>>> Of course, if you are just running notes on a new moon and then
>>>> again on a full moon ...
>>>>
>>>> Sean
>>>>
>>>> -----Original Message-----
>>>> From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
>>>> Sent: Tuesday, October 07, 2014 10:41 AM
>>>> To: dev@ctakes.apache.org
>>>> Subject: Re: cTakes output predictability
>>>>
>>>> Sean,
>>>>
>>>> "...being different because of a possibly intentional difference."
>>>>
>>>> I would like you to elaborate a bit on what would be
>>>> intentionally different between the processing of the same document
>>>> multiple times. It would help my understanding of cTakes.
>>>>
>>>> Thanks,
>>>>
>>>> Kim Ebert
>>>> 1.801.669.7342
>>>> Perfect Search Corp
>>>> http://www.perfectsearchcorp.com/
>>>>
>>>> On 10/07/2014 07:30 AM, Finan, Sean wrote:
>>>>> Steve Bethard wrote:
>>>>>> I spent some time writing a script for diff-ing CASes
>>>>> I urge anyone interested in comparing cTakes CASes / output to use
>>>>> this type of approach.  Comparison of program output is a
>>>>> post-process task and, unless absolutely necessary, code to juggle
>>>>> data and metadata belongs there.  Forcing every module, past,
>>>>> present and future, to abide by fixed orderings, enumerations
>>>>> etc. is not as simple a task as one might initially think -
>>>>> especially if third-party libraries are involved.  I won't get into
>>>>> problems associated with why one is comparing output (swapped
>>>>> module?) and IDs, orders etc. being different because of a possibly
>>>>> intentional difference.
>>>>>
>>>>> In addition to or instead of creating a post-processing script, one
>>>>> could write a new "cas-consumer" that writes output in a desired
>>>>> format - but this should not require changes to engines.
>>>>>
>>>>> "If it ain't broke, don't fix it"
>>>>>
>>>>> Sean
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Steven Bethard [mailto:steven.beth...@gmail.com]
>>>>> Sent: Monday, October 06, 2014 11:23 PM
>>>>> To: dev@ctakes.apache.org
>>>>> Subject: Re: cTakes output predictability
>>>>>
>>>>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen
>>>>> <bruce.tiet...@perfectsearchcorp.com> wrote:
>>>>>> Since I started working with cTakes some time ago, I have found it
>>>>>> difficult to compare the output between subsequent runs on the same
>>>>>> files because annotations are often assigned different IDs, are
>>>>>> listed in different order, etc.
>>>>> At one point, I spent some time writing a script for diff-ing CASes
>>>>> that intended to address some of these kinds of issues. It's still
>>>>> here in cTAKES:
>>>>>
>>>>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis
>>>>> /CompareFeatureStructures.java
>>>>>
>>>>> You might see if you could use or adapt that to your needs.
>>>>>
>>>>> Steve
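
To illustrate the post-processing comparison suggested in this thread, here is a minimal sketch in Python. It is not the CompareFeatureStructures.java script Steve mentions and it does not use any UIMA API. It assumes each run's annotations have already been exported to simple records; the `Annotation` record, its fields, and the type names in the usage example are hypothetical and chosen only for illustration. The idea is to drop the run-specific ID, sort what remains, and compare the two runs as multisets.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class Annotation(NamedTuple):
    """Hypothetical flattened view of one annotation exported from a run."""
    uid: int         # run-specific internal ID, ignored when comparing
    type_name: str   # annotation type name
    begin: int       # start offset in the document text
    end: int         # end offset in the document text
    text: str        # covered text


Key = Tuple[int, int, str, str]


def normalize(annotations: Iterable[Annotation]) -> List[Key]:
    """Drop the run-specific ID and return the annotations in a stable order."""
    return sorted((a.begin, a.end, a.type_name, a.text) for a in annotations)


def diff_runs(run_a: Iterable[Annotation],
              run_b: Iterable[Annotation]) -> Tuple[List[Key], List[Key]]:
    """Return (only_in_a, only_in_b), counting duplicate annotations."""
    count_a = Counter(normalize(run_a))
    count_b = Counter(normalize(run_b))
    only_in_a = sorted((count_a - count_b).elements())
    only_in_b = sorted((count_b - count_a).elements())
    return only_in_a, only_in_b


if __name__ == "__main__":
    # Same annotations, different IDs and different listing order.
    run1 = [
        Annotation(7, "Sentence", 0, 12, "Skin cancer."),
        Annotation(3, "DiseaseDisorderMention", 0, 11, "Skin cancer"),
    ]
    run2 = [
        Annotation(41, "DiseaseDisorderMention", 0, 11, "Skin cancer"),
        Annotation(2, "Sentence", 0, 12, "Skin cancer."),
    ]
    print(diff_runs(run1, run2))
```

Because the ID is excluded from the comparison key and both sides are sorted, two runs that differ only in ID assignment or listing order compare as equal, while a genuinely added or missing annotation shows up in one of the two returned lists.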
