Hi Sean,

Yes, I mean actual type values not matching.

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 10:46 AM, Finan, Sean wrote:
> Hi Kim,
>
>> It concerns me a bit that making the code return consistent results would be
>> so concerning.
> Could you please clarify what you mean by "consistent results"? Do you mean
> ordering and IDs or are you talking about actual type values not matching?
>
>> This should be the default mode of operation.
> Depending upon what you meant above, I may agree or disagree.
>
>> Since it doesn't appear that there are any consequences with moving forward
>> with changing the code
> Why do you say this?
>
> I think that there may be more required changes than you realize. Every
> insertion into the CAS must be of ordered data. This means that, for
> instance, named entities discovered by dictionary will need to be inserted in
> some predictable order, such as by alphabetized cui per every alphabetized
> tui (and other code) per ordered text span. You will need to check and
> recheck every point at which the CAS is modified by every module. Right now
> there are at least three or four places in two cTakes dictionary modules
> where a change would be required - and that doesn't include YTEX lookup.
>
> If you really feel strongly about this and are going to change cTakes code,
> then I suggest (at the risk of sounding like a complete jerk) that you also
> consider the following:
> 1. Don't check anything into trunk until all is well with your changes and
> tests
>    Just in case you abandon the effort
> 2. Write unit tests for every change
>    True, Map to LinkedMap shouldn't break anything, but they are good to have,
>    and may prevent others in the future from switching back to a non-linked map
>    or any unordered collection (set not list, etc.). It also makes a better
>    place for explanation in Javadoc than inlines above the code.
> 3. Run memory requirement tests before all of your changes and then again
> after your changes
>    I'm actually curious about how much memory might be eaten with linkages
>    everywhere
> 4. Run performance (speed) tests before and after
>    On a large corpus to ensure that garbage collection is involved
> 5. Do the above with every combination possible in current workflows: every
> combination of available sentence detector, pos tagger, smoking status
> detector, dictionary lookup, cas consumer, etc.
>    As soon as somebody says "all output is consistently ordered between runs" it
>    had better be so for every possible workflow
> 6. Write system tests to ensure ordered/predicted outputs with each
> combination
>    Otherwise somebody may break it
> 7. Document the what, how, and why for future development
>    Otherwise somebody won't know to stick to the new rules
> 8. Assist anybody as needed that in the future breaks one of these unit or
> system tests with a fix or new feature
>    By mandating such a rule you are assuming responsibility for it
> 9. Assist anybody as needed that in the future adds a new module or workflow
> to cTakes to abide by the ordering requirement
>    By mandating such a rule you are assuming responsibility for it
> 10. Assist anybody as needed that in the future adds a new module or
> workflow to add system tests to ensure maintenance of the ordering requirement
>    By mandating such a rule you are assuming responsibility for it
>
>
> -----Original Message-----
> From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
> Sent: Tuesday, October 07, 2014 11:57 AM
> To: dev@ctakes.apache.org
> Subject: Re: cTakes output predictability
>
> I think we may really prefer the first method. Since it doesn't appear that
> there are any consequences with moving forward with changing the code, we
> would really like to move forward with this approach.
>
> Kim Ebert
> 1.801.669.7342
> Perfect Search Corp
> http://www.perfectsearchcorp.com/
>
> On 10/07/2014 09:35 AM, britt fitch wrote:
>> The option Sean mentioned of writing your own custom consumer (without
>> the UIMA id that is causing your issues) should meet these needs I
>> believe.
>>
>> Britt Fitch
>> Wired Informatics
>> 265 Franklin St Ste 1702
>> Boston, MA 02110
>> http://wiredinformatics.com
>> britt.fi...@wiredinformatics.com
>>
>> On Oct 7, 2014, at 11:29 AM, Kim Ebert
>> <kim.eb...@perfectsearchcorp.com
>> <mailto:kim.eb...@perfectsearchcorp.com>> wrote:
>>
>>> Hi Sean,
>>>
>>> Well of course that makes plenty of sense. When testing different cTakes
>>> configurations you would expect different output. In our testing
>>> we've found several cases where running with the same configuration
>>> outputs different data under different moons. Having consistent
>>> results helps us know if we've made improvements to our quality or
>>> not. Having output that is in a predictable order makes checking to
>>> see if there are differences much cheaper when you are dealing with larger
>>> data sets.
>>>
>>> Kim Ebert
>>> 1.801.669.7342
>>> Perfect Search Corp
>>> http://www.perfectsearchcorp.com/
>>>
>>> On 10/07/2014 08:50 AM, Finan, Sean wrote:
>>>> Hi Kim,
>>>>
>>>> One might want to compare the Sentence detector that uses end of line
>>>> characters as sentence splitters with one that does not. Such a
>>>> change in sentence splitting would not only affect the sentence type
>>>> discoveries but also practically every type that follows.
>>>>
>>>> Another might want to compare a note with "skin cancer" vs. one in
>>>> which you replace "skin cancer" with "melanoma" just to see what the
>>>> CUI differences might be. There are changes in two words vs. one,
>>>> 11 characters vs. 8, a removed adjective(?), and of course changes
>>>> in CUIs.
>>>>
>>>> Of course, if you are just running notes on a new moon and then
>>>> again on a full moon ...
>>>>
>>>> Sean
>>>>
>>>> -----Original Message-----
>>>> From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
>>>> Sent: Tuesday, October 07, 2014 10:41 AM
>>>> To: dev@ctakes.apache.org
>>>> Subject: Re: cTakes output predictability
>>>>
>>>> Sean,
>>>>
>>>> "...being different because of a possibly intentional difference."
>>>>
>>>> I would like you to elaborate a bit on what would be
>>>> intentionally different between the processing of the same document
>>>> multiple times. It would help my understanding of cTakes.
>>>>
>>>> Thanks,
>>>>
>>>> Kim Ebert
>>>> 1.801.669.7342
>>>> Perfect Search Corp
>>>> http://www.perfectsearchcorp.com/
>>>>
>>>> On 10/07/2014 07:30 AM, Finan, Sean wrote:
>>>>> Steve Bethard wrote:
>>>>>> I spent some time writing a script for diff-ing CASes
>>>>> I urge anyone interested in comparing cTakes CASes / output to use
>>>>> this type of approach. Comparison of program output is a
>>>>> post-process task, and unless absolutely necessary code to juggle
>>>>> data and metadata belongs there. Attempting to force every module
>>>>> past, present and future to abide by fixed orderings, enumerations
>>>>> etc. is not as simple a task as one might initially think -
>>>>> especially if third-party libraries are involved. I won't get into
>>>>> problems associated with why one is comparing output (swapped
>>>>> module?) and IDs, orders etc. being different because of a possibly
>>>>> intentional difference.
>>>>>
>>>>> In addition to or instead of creating a post-processing script, one
>>>>> could write a new "cas-consumer" that writes output in a desired
>>>>> format - but this should not require changes to engines.
>>>>>
>>>>> "If it ain't broke, don't fix it"
>>>>>
>>>>> Sean
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Steven Bethard [mailto:steven.beth...@gmail.com]
>>>>> Sent: Monday, October 06, 2014 11:23 PM
>>>>> To: dev@ctakes.apache.org
>>>>> Subject: Re: cTakes output predictability
>>>>>
>>>>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen
>>>>> <bruce.tiet...@perfectsearchcorp.com> wrote:
>>>>>> Since I started working with cTakes some time ago, I have found it
>>>>>> difficult to compare the output between subsequent runs on the
>>>>>> same files because annotations are often assigned different IDs,
>>>>>> are listed in different order, etc.
>>>>> At one point, I spent some time writing a script for diff-ing CASes
>>>>> that intended to address some of these kinds of issues. It's still
>>>>> here in cTAKES:
>>>>>
>>>>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis/CompareFeatureStructures.java
>>>>>
>>>>> You might see if you could use or adapt that to your needs.
>>>>>
>>>>> Steve
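
The post-processing comparison discussed in this thread can be sketched in plain Java. This is an illustration only, not cTAKES or UIMA code: the `Mention` class, its fields, `canonicalize`, and the sample TUI/CUI strings are all hypothetical placeholders. It sorts by text span, then TUI, then CUI (the ordering Sean describes) and leaves IDs out of the comparison, so two runs that differ only in IDs and insertion order should compare equal.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CanonicalCompare {

    // Hypothetical stand-in for a named-entity annotation; not a cTAKES type.
    public static final class Mention {
        final int id;      // run-specific identifier, ignored when comparing
        final int begin;   // start offset of the text span
        final int end;     // end offset of the text span
        final String tui;
        final String cui;

        public Mention(int id, int begin, int end, String tui, String cui) {
            this.id = id;
            this.begin = begin;
            this.end = end;
            this.tui = tui;
            this.cui = cui;
        }
    }

    // Span first, then TUI, then CUI. The id is deliberately not a sort key.
    static final Comparator<Mention> CANONICAL_ORDER =
            Comparator.comparingInt((Mention m) -> m.begin)
                    .thenComparingInt(m -> m.end)
                    .thenComparing(m -> m.tui)
                    .thenComparing(m -> m.cui);

    // Returns one line per mention in canonical order, with the id omitted.
    public static List<String> canonicalize(List<Mention> mentions) {
        List<Mention> sorted = new ArrayList<>(mentions);
        sorted.sort(CANONICAL_ORDER);
        List<String> lines = new ArrayList<>();
        for (Mention m : sorted) {
            lines.add(m.begin + "-" + m.end + "\t" + m.tui + "\t" + m.cui);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Same content, different ids and insertion order (placeholder codes).
        List<Mention> runA = List.of(
                new Mention(17, 0, 11, "T999", "C1111111"),
                new Mention(18, 20, 28, "T999", "C2222222"));
        List<Mention> runB = List.of(
                new Mention(42, 20, 28, "T999", "C2222222"),
                new Mention(41, 0, 11, "T999", "C1111111"));
        System.out.println(canonicalize(runA).equals(canonicalize(runB))); // prints true
    }
}
```

Writing the canonical lines of each run to a file and running an ordinary text diff would then surface only differences in spans or codes, without requiring any change to the engines.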