Hi Bruce,

Could you send over the record that you are seeing this on?

Thanks,

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 11:20 AM, Bruce Tietjen wrote:
> I did not intend to step on anyone's toes.
>
> One of the reasons I proposed the changes was to try to make it extremely
> obvious when there are significant differences in output from the cTakes
> pipeline when running the same document again, and once identified, make it
> easier to identify the source of the difference.
>
> Because of the huge number of differences between the output using the
> FileWriterCasConsumer.xml, first detecting that there are significant
> differences and then identifying them for a large set of documents is a
> daunting task.
>
> The following is an example of some significant differences that I have
> detected between two subsequent runs on the same document using the current
> release of cTakes. (There are actually quite a few documents that exhibit
> this kind of behavior. This is only one example.)
>
>
> Snippet from first run:
>
> <org.apache.ctakes.typesystem.type.textspan.LookupWindowAnnotation
>     _indexed="1" _id="9869" _ref_sofa="3" begin="3039" end="3047"/>
> <org.apache.ctakes.typesystem.type.textsem.MedicationMention
>     _indexed="1" _id="9895" _ref_sofa="3" begin="2075" end="2081" id="95"
>     _ref_ontologyConceptArr="9891" typeID="1" segmentID="SIMPLE_SEGMENT"
>     discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
>     conditional="false" generic="true" subject="patient" historyOf="0"/>
> <org.apache.ctakes.typesystem.type.textsem.MedicationMention
>     _indexed="1" _id="9937" _ref_sofa="3" begin="2312" end="2322" id="110"
>     _ref_ontologyConceptArr="9934" typeID="1" segmentID="SIMPLE_SEGMENT"
>     discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
>     conditional="false" generic="false" subject="patient" historyOf="0"/>
> <org.apache.ctakes.typesystem.type.textsem.DiseaseDisorderMention
>     _indexed="1" _id="9979" _ref_sofa="3" begin="0" end="4" id="0"
>     _ref_ontologyConceptArr="9976" typeID="2" segmentID="SIMPLE_SEGMENT"
>     discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="0"
>     conditional="false" generic="false" subject="patient" historyOf="0"/>
>
>
> Snippet from subsequent run:
>
> <org.apache.ctakes.typesystem.type.textsem.ProcedureMention
>     _indexed="1" _id="15773" _ref_sofa="3" begin="2929" end="2933" id="125"
>     _ref_ontologyConceptArr="15770" typeID="5" segmentID="SIMPLE_SEGMENT"
>     discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="0"
>     conditional="false" generic="false" subject="patient" historyOf="0"/>
> <org.apache.ctakes.typesystem.type.textsem.MedicationMention
>     _indexed="1" _id="15928" _ref_sofa="3" begin="2075" end="2081" id="95"
>     _ref_ontologyConceptArr="15924" typeID="1" segmentID="SIMPLE_SEGMENT"
>     discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
>     conditional="false" generic="true" subject="patient" historyOf="0"/>
> <org.apache.ctakes.typesystem.type.syntax.ConllDependencyNode
>     _indexed="1" _id="15958" _ref_sofa="3" begin="0" end="5" id="0"/>
>
>
> Note that in the first instance, there were two MedicationMentions, but in
> the second, there is only one.
>
> Yes, everyone could write their own custom compare code, but wouldn't it be
> more valuable to the community to make that task easier?
>
> Thanks,
>
> Bruce Tietjen
>
>
>
> IMAT Solutions <http://imatsolutions.com>
> Bruce Tietjen
> Senior Software Engineer
> Mobile: 801.634.1547
> [email protected]
>
> On Tue, Oct 7, 2014 at 11:01 AM, Kim Ebert <[email protected]> wrote:
>
>> Hi Sean,
>>
>> No, you're not a jerk. These are things worth considering, and I
>> understand your concerns with touching various points of the codebase.
>>
>> I'll talk with our group over here and see where we want to go. We are
>> really interested in cTakes behaving well, so we are usually pretty
>> careful in testing our changes before committing anything.
>>
>> Thanks,
>>
>> Kim Ebert
>> 1.801.669.7342
>> Perfect Search Corp
>> http://www.perfectsearchcorp.com/
>>
>> On 10/07/2014 10:46 AM, Finan, Sean wrote:
>>> Hi Kim,
>>>
>>>> It concerns me a bit that making the code return consistent results
>>>> would be so concerning.
>>> Could you please clarify what you mean by "consistent results"? Do you
>>> mean ordering and IDs or are you talking about actual type values not
>>> matching?
>>>> This should be the default mode of operation.
>>> Depending upon what you meant above, I may agree or disagree.
>>>
>>>> Since it doesn't appear that there are any consequences with moving
>>>> forward with changing the code
>>> Why do you say this?
>>>
>>> I think that there may be more required changes than you realize. Every
>>> insertion into the CAS must be of ordered data. This means that, for
>>> instance, named entities discovered by dictionary will need to be inserted
>>> in some predictable order, such as by alphabetized cui per every
>>> alphabetized tui (and other code) per ordered text span. You will need to
>>> check and recheck every point at which the CAS is modified by every
>>> module. Right now there are at least three or four places in two cTakes
>>> dictionary modules where a change would be required - and that doesn't
>>> include YTEX lookup.
>>>
>>> If you really feel strongly about this and are going to change cTakes
>>> code, then I suggest (at the risk of sounding like a complete jerk) that
>>> you also consider the following:
>>> 1. Don't check anything into trunk until all is well with your changes
>>>    and tests
>>>    Just in case you abandon the effort
>>> 2. Write unit tests for every change
>>>    True, Map to LinkedMap shouldn't break anything, but they are good to
>>>    have, and may prevent others in the future from switching back to a
>>>    non-linked map or any unordered collection (set not list, etc.). It
>>>    also makes a better place for explanation in Javadoc than inlines
>>>    above the code.
>>> 3. Run memory requirement tests before all of your changes and then
>>>    again after your changes
>>>    I'm actually curious about how much memory might be eaten with
>>>    linkages everywhere
>>> 4. Run performance (speed) tests before and after
>>>    On a large corpus to ensure that garbage collection is involved
>>> 5. Do the above with every combination possible in current workflows:
>>>    every combination of available sentence detector, pos tagger, smoking
>>>    status detector, dictionary lookup, cas consumer, etc.
>>>    As soon as somebody says "all output is consistently ordered between
>>>    runs" it had better be so for every possible workflow
>>> 6. Write system tests to ensure ordered/predicted outputs with each
>>>    combination
>>>    Otherwise somebody may break it
>>> 7. Document the what, how, and why for future development
>>>    Otherwise somebody won't know to stick to the new rules
>>> 8. Assist anybody as needed that in the future breaks one of these unit
>>>    or system tests with a fix or new feature
>>>    By mandating such a rule you are assuming responsibility for it
>>> 9. Assist anybody as needed that in the future adds a new module or
>>>    workflow to cTakes to abide by the ordering requirement
>>>    By mandating such a rule you are assuming responsibility for it
>>> 10. Assist anybody as needed that in the future adds a new module or
>>>    workflow to add system tests to ensure maintenance of the ordering
>>>    requirement
>>>    By mandating such a rule you are assuming responsibility for it
>>>
>>>
>>> -----Original Message-----
>>> From: Kim Ebert [mailto:[email protected]]
>>> Sent: Tuesday, October 07, 2014 11:57 AM
>>> To: [email protected]
>>> Subject: Re: cTakes output predictability
>>>
>>> I think we may really prefer the first method. Since it doesn't appear
>>> that there are any consequences with moving forward with changing the
>>> code, we would really like to move forward with this approach.
>>>
>>> Kim Ebert
>>> 1.801.669.7342
>>> Perfect Search Corp
>>> http://www.perfectsearchcorp.com/
>>>
>>> On 10/07/2014 09:35 AM, britt fitch wrote:
>>>> The option Sean mentioned of writing your own custom consumer (without
>>>> the UIMA id that is causing your issues) should meet these needs I
>>>> believe.
>>>>
>>>>
>>>>
>>>> Britt Fitch
>>>> Wired Informatics
>>>> 265 Franklin St Ste 1702
>>>> Boston, MA 02110
>>>> http://wiredinformatics.com
>>>> [email protected]
>>>>
>>>> On Oct 7, 2014, at 11:29 AM, Kim Ebert <[email protected]> wrote:
>>>>
>>>>> Hi Sean,
>>>>>
>>>>> Well of course that makes plenty of sense. Testing different cTakes
>>>>> configurations you would expect different output. In our testing
>>>>> we've found several cases where running with the same configuration
>>>>> outputs different data under different moons. Having consistent
>>>>> results helps us know if we've made improvements to our quality or
>>>>> not. Having output that is in a predictable order makes checking to
>>>>> see if there are differences much cheaper when you are dealing with
>>>>> larger data sets.
>>>>>
>>>>> Kim Ebert
>>>>> 1.801.669.7342
>>>>> Perfect Search Corp
>>>>> http://www.perfectsearchcorp.com/
>>>>>
>>>>> On 10/07/2014 08:50 AM, Finan, Sean wrote:
>>>>>> Hi Kim,
>>>>>>
>>>>>> One might want to compare the Sentence detector that uses end of line
>>>>>> characters as sentence splitters with one that does not. Such a
>>>>>> change in sentence splitting would not only affect the sentence type
>>>>>> discoveries but also practically every type that follows.
>>>>>>
>>>>>> Another might want to compare a note with "skin cancer" vs. one in
>>>>>> which you replace "skin cancer" with "melanoma" just to see what the
>>>>>> CUI differences might be. There are changes in two words vs. one,
>>>>>> 11 characters vs. 8, a removed adjective(?), and of course changes
>>>>>> in CUIs.
>>>>>>
>>>>>> Of course, if you are just running notes on a new moon and then
>>>>>> again on a full moon ...
>>>>>>
>>>>>> Sean
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Kim Ebert [mailto:[email protected]]
>>>>>> Sent: Tuesday, October 07, 2014 10:41 AM
>>>>>> To: [email protected]
>>>>>> Subject: Re: cTakes output predictability
>>>>>>
>>>>>> Sean,
>>>>>>
>>>>>> "...being different because of a possibly intentional difference."
>>>>>>
>>>>>> I would like you to elaborate a bit on what would be
>>>>>> intentionally different between the processing of the same document
>>>>>> multiple times. It would help my understanding of cTakes.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Kim Ebert
>>>>>> 1.801.669.7342
>>>>>> Perfect Search Corp
>>>>>> http://www.perfectsearchcorp.com/
>>>>>>
>>>>>> On 10/07/2014 07:30 AM, Finan, Sean wrote:
>>>>>>> Steve Bethard wrote:
>>>>>>>> I spent some time writing a script for diff-ing CASes
>>>>>>> I urge anyone interested in comparing cTakes CASes / output to use
>>>>>>> this type of approach. Comparison of program output is a
>>>>>>> post-process task, and, unless absolutely necessary, code to juggle
>>>>>>> data and metadata belongs there. Attempting to force every module
>>>>>>> past, present and future to abide by fixed orderings, enumerations
>>>>>>> etc. is not as simple a task as one might initially think -
>>>>>>> especially if third-party libraries are involved. I won't get into
>>>>>>> problems associated with why one is comparing output (swapped
>>>>>>> module?) and IDs, orders etc. being different because of a possibly
>>>>>>> intentional difference.
>>>>>>>
>>>>>>> In addition to or instead of creating a post-processing script, one
>>>>>>> could write a new "cas-consumer" that writes output in a desired
>>>>>>> format - but this should not require changes to engines.
>>>>>>>
>>>>>>> "If it ain't broke, don't fix it"
>>>>>>>
>>>>>>> Sean
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Steven Bethard [mailto:[email protected]]
>>>>>>> Sent: Monday, October 06, 2014 11:23 PM
>>>>>>> To: [email protected]
>>>>>>> Subject: Re: cTakes output predictability
>>>>>>>
>>>>>>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen
>>>>>>> <[email protected]> wrote:
>>>>>>>> Since I started working with cTakes some time ago, I have found it
>>>>>>>> difficult to compare the output between subsequent runs on the
>>>>>>>> same files because annotations are often assigned different IDs,
>>>>>>>> are listed in different order, etc.
>>>>>>> At one point, I spent some time writing a script for diff-ing CASes
>>>>>>> that intended to address some of these kinds of issues. It's still
>>>>>>> here in cTAKES:
>>>>>>>
>>>>>>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis/CompareFeatureStructures.java
>>>>>>>
>>>>>>> You might see if you could use or adapt that to your needs.
>>>>>>>
>>>>>>> Steve
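
For readers who want to try the post-processing comparison discussed in this thread, here is a minimal sketch in Python. It is not the CompareFeatureStructures.java script Steve mentions and makes no claim about how that script works. It assumes the output is a well-formed XML file whose annotations are child elements of a single root, and that the attributes that vary between runs are `_id` and anything starting with `_ref_` (as in Bruce's snippets). The function names `is_volatile`, `annotation_keys` and `diff_runs` are made up for this illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def is_volatile(name):
    # Assumption based on the snippets in this thread: _id and the _ref_*
    # pointers are run-specific, so they are ignored when comparing.
    return name == "_id" or name.startswith("_ref_")


def annotation_keys(xml_text):
    """Return a multiset of (tag, stable attributes), one entry per element below the root."""
    root = ET.fromstring(xml_text)
    keys = Counter()
    for elem in root.iter():
        if elem is root:
            continue
        stable = tuple(sorted(
            (k, v) for k, v in elem.attrib.items() if not is_volatile(k)
        ))
        keys[(elem.tag, stable)] += 1
    return keys


def diff_runs(first_xml, second_xml):
    """Return (only_in_first, only_in_second), ignoring element order and volatile attributes."""
    first = annotation_keys(first_xml)
    second = annotation_keys(second_xml)
    only_first = sorted((first - second).elements())
    only_second = sorted((second - first).elements())
    return only_first, only_second
```

Because the comparison works on multisets, reordered annotations and renumbered IDs produce no differences, while a MedicationMention that is present in one run and missing in the other would be reported. Note that dropping every `_ref_*` attribute also hides differences in what an annotation points to (for example its ontology concepts); a fuller comparison would need to resolve those references rather than ignore them.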
