I think it would be helpful actually, since digging deeper into the issue has shown me a few places in the code that cause inconsistent results when the same document is run through multiple times. I think making the code base predictable will make it easier to debug.

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 09:58 AM, Masanz, James J. wrote:
> FWIW, I agree with Sean that comparing should be a post-processing step and
> trying to get UIMA internal IDs to match on subsequent runs is not worth
> opening the code for.
>
> -----Original Message-----
> From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
> Sent: Tuesday, October 07, 2014 10:56 AM
> To: dev@ctakes.apache.org
> Subject: Re: cTakes output predictability
>
> I think we may really prefer the first method. Since it doesn't appear
> that there are any consequences with moving forward with changing the
> code, we would really like to move forward with this approach.
>
> Kim Ebert
> 1.801.669.7342
> Perfect Search Corp
> http://www.perfectsearchcorp.com/
>
> On 10/07/2014 09:35 AM, britt fitch wrote:
>> The option Sean mentioned of writing your own custom consumer (without
>> the UIMA id that is causing your issues) should meet these needs I
>> believe.
>>
>> Britt Fitch
>> Wired Informatics
>> 265 Franklin St Ste 1702
>> Boston, MA 02110
>> http://wiredinformatics.com
>> britt.fi...@wiredinformatics.com
>>
>> On Oct 7, 2014, at 11:29 AM, Kim Ebert
>> <kim.eb...@perfectsearchcorp.com
>> <mailto:kim.eb...@perfectsearchcorp.com>> wrote:
>>
>>> Hi Sean,
>>>
>>> Well of course that makes plenty of sense. Testing different cTakes
>>> configurations you would expect different output. In our testing we've
>>> found several cases where running with the same configuration outputs
>>> different data under different moons. Having consistent results helps us
>>> know if we've made improvements to our quality or not. Having output
>>> that is in a predictable order makes checking to see if there are
>>> differences much cheaper when you are dealing with larger data sets.
>>>
>>> Kim Ebert
>>> 1.801.669.7342
>>> Perfect Search Corp
>>> http://www.perfectsearchcorp.com/
>>>
>>> On 10/07/2014 08:50 AM, Finan, Sean wrote:
>>>> Hi Kim,
>>>>
>>>> One might want to compare the Sentence detector that uses end of line
>>>> characters as sentence splitters with one that does not. Such a
>>>> change in sentence splitting would not only affect the sentence type
>>>> discoveries but also practically every type that follows.
>>>>
>>>> Another might want to compare a note with "skin cancer" vs. one in
>>>> which you replace "skin cancer" with "melanoma" just to see what the
>>>> CUI differences might be. There are changes in two words vs. one,
>>>> 11 characters vs. 8, a removed adjective(?), and of course changes
>>>> in CUIs.
>>>>
>>>> Of course, if you are just running notes on a new moon and then
>>>> again on a full moon ...
>>>>
>>>> Sean
>>>>
>>>> -----Original Message-----
>>>> From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
>>>> Sent: Tuesday, October 07, 2014 10:41 AM
>>>> To: dev@ctakes.apache.org
>>>> Subject: Re: cTakes output predictability
>>>>
>>>> Sean,
>>>>
>>>> "...being different because of a possibly intentional difference."
>>>>
>>>> I would like you to elaborate a bit on what would be
>>>> intentionally different between the processing of the same document
>>>> multiple times. It would help my understanding of cTakes.
>>>>
>>>> Thanks,
>>>>
>>>> Kim Ebert
>>>> 1.801.669.7342
>>>> Perfect Search Corp
>>>> http://www.perfectsearchcorp.com/
>>>>
>>>> On 10/07/2014 07:30 AM, Finan, Sean wrote:
>>>>> Steve Bethard wrote:
>>>>>> I spent some time writing a script for diff-ing CASes
>>>>> I urge anyone interested in comparing cTakes CASes / output to use
>>>>> this type of approach. Comparison of program output is a
>>>>> post-process task, and unless absolutely necessary code to juggle
>>>>> data and metadata belongs there. Attempts to force every module
>>>>> past, present and future to abide by fixed orderings, enumerations
>>>>> etc. is not as simple a task as one might initially think -
>>>>> especially if third-party libraries are involved. I won't get into
>>>>> problems associated with why one is comparing output (swapped
>>>>> module?) and IDs, orders etc. being different because of a possibly
>>>>> intentional difference.
>>>>>
>>>>> In addition to or instead of creating a post-processing script, one
>>>>> could write a new "cas-consumer" that writes output in a desired
>>>>> format - but this should not require changes to engines.
>>>>>
>>>>> "If it ain't broke, don't fix it"
>>>>>
>>>>> Sean
>>>>>
>>>>> -----Original Message-----
>>>>> From: Steven Bethard [mailto:steven.beth...@gmail.com]
>>>>> Sent: Monday, October 06, 2014 11:23 PM
>>>>> To: dev@ctakes.apache.org
>>>>> Subject: Re: cTakes output predictability
>>>>>
>>>>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen
>>>>> <bruce.tiet...@perfectsearchcorp.com> wrote:
>>>>>> Since I started working with cTakes some time ago, I have found it
>>>>>> difficult to compare the output between subsequent runs on the same
>>>>>> files because annotations are often assigned different IDs, are
>>>>>> listed in different order, etc.
>>>>> At one point, I spent some time writing a script for diff-ing CASes
>>>>> that intended to address some of these kinds of issues. It's still
>>>>> here in cTAKES:
>>>>>
>>>>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis/CompareFeatureStructures.java
>>>>>
>>>>> You might see if you could use or adapt that to your needs.
>>>>>
>>>>> Steve
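
For anyone following the thread, here is a rough sketch of the post-processing idea discussed above: drop the run-specific IDs, sort what is left, and then compare. It is only an illustration under stated assumptions. The `Ann` class, `normalize`, and `onlyIn` are hypothetical names invented for this example; they are not UIMA or cTAKES types, and this is not how CompareFeatureStructures.java works. It assumes the annotations have already been flattened into (id, type, begin, end, text) records by whatever consumer or script is in use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: compare two runs while ignoring IDs and output order.
class NormalizedCasDiff {

    // Hypothetical flattened view of one annotation (not a UIMA/cTAKES class).
    static final class Ann {
        final int id;        // run-specific identifier, ignored when comparing
        final String type;
        final int begin;
        final int end;
        final String text;

        Ann(int id, String type, int begin, int end, String text) {
            this.id = id;
            this.type = type;
            this.begin = begin;
            this.end = end;
            this.text = text;
        }

        // Comparison key built from everything except the id.
        String key() {
            return type + "\t" + begin + "\t" + end + "\t" + text;
        }
    }

    // Drop ids and sort the keys (plain string order) so the result does not
    // depend on the order in which annotations were written.
    static List<String> normalize(List<Ann> anns) {
        return anns.stream().map(Ann::key).sorted().collect(Collectors.toList());
    }

    // Keys present in a but not in b, respecting duplicates.
    static List<String> onlyIn(List<String> a, List<String> b) {
        List<String> rest = new ArrayList<>(a);
        for (String s : b) {
            rest.remove(s); // removes one matching occurrence, if any
        }
        return rest;
    }

    public static void main(String[] args) {
        // Same content, different ids and order, as in two runs on one note.
        List<Ann> run1 = Arrays.asList(
                new Ann(12, "Sentence", 0, 11, "skin cancer"),
                new Ann(13, "Token", 0, 4, "skin"));
        List<Ann> run2 = Arrays.asList(
                new Ann(4, "Token", 0, 4, "skin"),
                new Ann(9, "Sentence", 0, 11, "skin cancer"));

        List<String> a = normalize(run1);
        List<String> b = normalize(run2);
        System.out.println("only in run 1: " + onlyIn(a, b));
        System.out.println("only in run 2: " + onlyIn(b, a));
    }
}
```

With this kind of normalization, differences in IDs or ordering alone do not show up, while a real change (a different span, type, or covered text) appears in one of the two "only in" lists.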