Re: cTakes output predictability

Kim Ebert Tue, 07 Oct 2014 10:09:38 -0700

Hi Sean,

Yes, I mean actual type values not matching.


Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 10:46 AM, Finan, Sean wrote:
> Hi Kim,
>
>> It concerns me a bit that making the code return consistent results would be 
>> so concerning. 
> Could you please clarify what you mean by "consistent results"?  Do you mean 
> ordering and IDs or are you talking about actual type values not matching?
>
>> This should be the default mode of operation.
> Depending upon what you meant above, I may agree or disagree.
>
>> Since it doesn't appear that there are any consequences with moving forward 
>> with changing the code
> Why do you say this?  
>
> I think that there may be more required changes than you realize.  Every 
> insertion into the CAS must be of ordered data.  This means that, for 
> instance, named entities discovered by dictionary will need to be inserted in 
> some predictable order, such as by alphabetized cui per every alphabetized 
> tui (and other code) per ordered text span.  You will need to check and 
> recheck every point at which the CAS is modified by every module.  Right now 
> there are at least three or four places in two cTakes dictionary modules 
> where a change would be required - and that doesn't include YTEX lookup.
>
> If you really feel strongly about this and are going to change cTakes code, 
> then I suggest (at the risk of sounding like a complete jerk) that you also 
> consider the following:
> 1.  Don't check anything into trunk until all is well with your changes and
>     tests
>         Just in case you abandon the effort
> 2.  Write unit tests for every change
>         True, Map to LinkedMap shouldn't break anything, but they are good to
>         have, and may prevent others in the future from switching back to a
>         non-linked map or any unordered collection (set not list, etc.).  It
>         also makes a better place for explanation in Javadoc than inlines
>         above the code.
> 3.  Run memory requirement tests before all of your changes and then again
>     after your changes
>         I'm actually curious about how much memory might be eaten with
>         linkages everywhere
> 4.  Run performance (speed) tests before and after
>         On a large corpus to ensure that garbage collection is involved
> 5.  Do the above with every combination possible in current workflows: every
>     combination of available sentence detector, pos tagger, smoking status
>     detector, dictionary lookup, cas consumer, etc.
>         As soon as somebody says "all output is consistently ordered between
>         runs" it had better be so for every possible workflow
> 6.  Write system tests to ensure ordered/predicted outputs with each
>     combination
>         Otherwise somebody may break it
> 7.  Document the what, how, and why for future development
>         Otherwise somebody won't know to stick to the new rules
> 8.  Assist anybody as needed that in the future breaks one of these unit or
>     system tests with a fix or new feature
>         By mandating such a rule you are assuming responsibility for it
> 9.  Assist anybody as needed that in the future adds a new module or workflow
>     to cTakes to abide by the ordering requirement
>         By mandating such a rule you are assuming responsibility for it
> 10.  Assist anybody as needed that in the future adds a new module or
>     workflow to add system tests to ensure maintenance of the ordering
>     requirement
>         By mandating such a rule you are assuming responsibility for it
>
>
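As a rough illustration of the ordering Sean describes (ordered text span, then TUI, then CUI), here is a minimal Java sketch. The Hit class, the OrderedHits helper, and the TUI/CUI strings are hypothetical placeholders for illustration only; they are not cTAKES types or real codes. The idea is simply that a module would sort whatever it is about to add to the CAS before adding it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a dictionary hit; NOT a cTAKES class.
final class Hit {
    final int begin;
    final int end;
    final String tui;
    final String cui;

    Hit(int begin, int end, String tui, String cui) {
        this.begin = begin;
        this.end = end;
        this.tui = tui;
        this.cui = cui;
    }

    @Override
    public String toString() {
        return begin + "-" + end + "|" + tui + "|" + cui;
    }
}

// Hypothetical helper: puts hits into one fixed order before insertion.
final class OrderedHits {
    // Span first (begin, then end), then TUI, then CUI, both alphabetical.
    static final Comparator<Hit> ORDER = Comparator
            .comparingInt((Hit h) -> h.begin)
            .thenComparingInt(h -> h.end)
            .thenComparing(h -> h.tui)
            .thenComparing(h -> h.cui);

    // Returns a sorted copy; the caller's collection is left untouched.
    static List<Hit> sorted(Collection<Hit> hits) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(ORDER);
        return copy;
    }

    public static void main(String[] args) {
        // Placeholder values, made up for the example.
        List<Hit> hits = Arrays.asList(
                new Hit(10, 21, "T002", "C0000003"),
                new Hit(0, 4, "T001", "C0000001"),
                new Hit(10, 21, "T002", "C0000002"));
        for (Hit h : sorted(hits)) {
            System.out.println(h);
        }
    }
}
```

Sorting a copy at the point of insertion keeps the change local to one place per module, but as noted above it would still have to be repeated, and tested, everywhere the CAS is modified.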
> -----Original Message-----
> From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com] 
> Sent: Tuesday, October 07, 2014 11:57 AM
> To: dev@ctakes.apache.org
> Subject: Re: cTakes output predictability
>
> I think we may really prefer the first method. Since it doesn't appear that 
> there are any consequences with moving forward with changing the code, we 
> would really like to move forward with this approach.
>
> Kim Ebert
> 1.801.669.7342
> Perfect Search Corp
> http://www.perfectsearchcorp.com/
>
> On 10/07/2014 09:35 AM, britt fitch wrote:
>> The option Sean mentioned of writing your own custom consumer (without 
>> the UIMA id that is causing your issues) should meet these needs I 
>> believe.
>>
>> Britt Fitch
>> Wired Informatics
>> 265 Franklin St Ste 1702
>> Boston, MA 02110
>> http://wiredinformatics.com
>> britt.fi...@wiredinformatics.com
>>
>> On Oct 7, 2014, at 11:29 AM, Kim Ebert 
>> <kim.eb...@perfectsearchcorp.com 
>> <mailto:kim.eb...@perfectsearchcorp.com>> wrote:
>>
>>> Hi Sean,
>>>
>>> Well of course that makes plenty of sense. Testing different cTakes 
>>> configurations you would expect different output. In our testing 
>>> we've found several cases where running with the same configuration 
>>> outputs different data under different moons. Having consistent 
>>> results helps us know if we've made improvements to our quality or 
>>> not. Having output that is in a predictable order makes checking to 
>>> see if there are differences much cheaper when you are dealing with larger 
>>> data sets.
>>>
>>> Kim Ebert
>>> 1.801.669.7342
>>> Perfect Search Corp
>>> http://www.perfectsearchcorp.com/
>>>
>>> On 10/07/2014 08:50 AM, Finan, Sean wrote:
>>>> Hi Kim,
>>>>
>>>> One might want to compare the Sentence detector that uses end of line 
>>>> characters as sentence splitters with one that does not.  Such a 
>>>> change in sentence splitting would not only affect the sentence type 
>>>> discoveries but also practically every type that follows.
>>>>
>>>> Another might want to compare a note with "skin cancer" vs. one in 
>>>> which you replace "skin cancer" with "melanoma" just to see what the 
>>>> CUI differences might be.  There are changes in two words vs. one,
>>>> 11 characters vs. 8, a removed adjective(?), and of course changes 
>>>> in CUIs.
>>>>
>>>> Of course, if you are just running notes on a new moon and then 
>>>> again on a full moon ...
>>>>
>>>> Sean
>>>>
>>>> -----Original Message-----
>>>> From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
>>>> Sent: Tuesday, October 07, 2014 10:41 AM
>>>> To: dev@ctakes.apache.org
>>>> Subject: Re: cTakes output predictability
>>>>
>>>> Sean,
>>>>
>>>> "...being different because of a possibly intentional difference."
>>>>
>>>> I would like you to elaborate a bit on what would be 
>>>> intentionally different between the processing of the same document 
>>>> multiple times. It would help my understanding of cTakes.
>>>>
>>>> Thanks,
>>>>
>>>> Kim Ebert
>>>> 1.801.669.7342
>>>> Perfect Search Corp
>>>> http://www.perfectsearchcorp.com/
>>>>
>>>> On 10/07/2014 07:30 AM, Finan, Sean wrote:
>>>>> Steve Bethard wrote:
>>>>>> I spent some time writing a script for diff-ing CASes
>>>>> I urge anyone interested in comparing cTakes CASes / output to use 
>>>>> this type of approach.  Comparison of program output is a 
>>>>> post-process task, and unless absolutely necessary, code to juggle 
>>>>> data and metadata belongs there.  Attempting to force every module 
>>>>> past, present and future to abide by fixed orderings, enumerations 
>>>>> etc. is not as simple a task as one might initially think - 
>>>>> especially if third-party libraries are involved.  I won't get into 
>>>>> problems associated with why one is comparing output (swapped
>>>>> module?) and IDs, orders etc. being different because of a possibly 
>>>>> intentional difference.
>>>>>
>>>>> In addition to or instead of creating a post-processing script, one 
>>>>> could write a new "cas-consumer" that writes output in a desired 
>>>>> format - but this should not require changes to engines.
>>>>>
>>>>> "If it ain't broke, don't fix it"
>>>>>
>>>>> Sean
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Steven Bethard [mailto:steven.beth...@gmail.com]
>>>>> Sent: Monday, October 06, 2014 11:23 PM
>>>>> To: dev@ctakes.apache.org
>>>>> Subject: Re: cTakes output predictability
>>>>>
>>>>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen 
>>>>> <bruce.tiet...@perfectsearchcorp.com> wrote:
>>>>>> Since I started working with cTakes some time ago, I have found it 
>>>>>> difficult to compare the output between subsequent runs on the 
>>>>>> same files because annotations are often assigned different IDs, 
>>>>>> are listed in different order, etc.
>>>>> At one point, I spent some time writing a script for diff-ing CASes 
>>>>> that was intended to address some of these kinds of issues. It's still 
>>>>> here in cTAKES:
>>>>>
>>>>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis/CompareFeatureStructures.java
>>>>>
>>>>> You might see if you could use or adapt that to your needs.
>>>>>
>>>>> Steve
>
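The post-processing approach suggested in the thread (compare outputs after the run, ignoring IDs and ordering) could look roughly like the sketch below. It is not the CompareFeatureStructures script mentioned above, and the RunDiff class and the "type|begin|end|value" key format are assumptions made up for this example. Each annotation is reduced to a key that leaves out the UIMA id, and the two runs are then compared as multisets so that ordering never matters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for comparing two runs; NOT part of cTAKES.
final class RunDiff {
    // Returns the keys of `left` that have no matching key in `right`.
    // Both lists are treated as multisets, so list order is irrelevant.
    static List<String> onlyIn(List<String> left, List<String> right) {
        Map<String, Integer> counts = new HashMap<>();
        for (String key : right) {
            counts.merge(key, 1, Integer::sum);
        }
        List<String> missing = new ArrayList<>();
        for (String key : left) {
            Integer n = counts.get(key);
            if (n == null || n == 0) {
                missing.add(key);          // nothing left to pair it with
            } else {
                counts.put(key, n - 1);    // consume one matching occurrence
            }
        }
        Collections.sort(missing);         // stable, readable report order
        return missing;
    }

    // True when both runs contain exactly the same keys, in any order.
    static boolean sameContent(List<String> a, List<String> b) {
        return onlyIn(a, b).isEmpty() && onlyIn(b, a).isEmpty();
    }

    public static void main(String[] args) {
        // Assumed key format "type|begin|end|value"; values are placeholders.
        List<String> run1 = Arrays.asList("Sentence|0|22|", "Mention|5|16|C0000001");
        List<String> run2 = Arrays.asList("Mention|5|16|C0000001", "Sentence|0|22|");
        System.out.println(sameContent(run1, run2));
    }
}
```

Because the ids are never part of the key, two runs that differ only in id assignment or listing order compare as equal, while a genuine difference in type, span, or value shows up in onlyIn.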
