+1 Tim's suggestion

On Jul 2, 2013, at 10:13 AM, "Masanz, James J." <masanz.ja...@mayo.edu> wrote:
> I agree with Tim's diagnosis and treatment plan.
>
> -----Original Message-----
> From: dev-return-1714-Masanz.James=mayo....@ctakes.apache.org
> [mailto:dev-return-1714-Masanz.James=mayo....@ctakes.apache.org] On Behalf Of
> Chen, Pei
> Sent: Friday, June 28, 2013 9:00 AM
> To: dev@ctakes.apache.org
> Subject: RE: Next cTAKES release (3.1)?
>
> I completely agree with making cTAKES easier to use. It is exciting to hear
> the different use cases here and to understand which areas need improvement
> (including some we haven't thought about earlier).
> I think Tim's suggestions and the 3 concrete actionable items make a lot of
> sense. Hopefully they will attract new users, adopters, and perhaps more
> committers.
>
>> i) Make the typesystem forefront in documentation -- generate javadocs and
>> have as a link on the ctakes frontpage/sidebar
>> ii) Similar to the way that we are aiming to have tests in every module, also
>> have clearly labeled examples in every module that set up a pipeline, run on
>> sample notes (could be the same sample notes from the tests), and do
>> something with the results.
>> iii) Follow Giri's recommendation to have example training data for people
>> who want to take the next step and train their own models
>
> I think Java developers are accustomed to including a library as a
> dependency/jar, having an API to pass input to, and getting the results back
> via POJOs, so the examples could initially shield the complexity of wiring a
> pipeline together, etc.
> If we can improve the APIs and how they integrate with other apps, we
> can add any GUI/CLI tools on top of this afterwards.
>
> --Pei
>
>> -----Original Message-----
>> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
>> Sent: Friday, June 28, 2013 8:00 AM
>> To: dev@ctakes.apache.org
>> Subject: Re: Next cTAKES release (3.1)?
>>
>> Very interesting discussion.
>> I think Giri is right about giving example training data in the format
>> that our training code can read. While our ultimate goal would be to build
>> and release models that are completely domain-independent, in the real
>> world it is almost always better to use some domain-specific data, and we
>> should think more about how to facilitate that.
>>
>> As for making it easier to get started, it is not totally clear to me what
>> this means or how to do it, so it might be useful to get specific.
>> I think our biggest hurdle is
>>
>> 1) Prerequisite of understanding UIMA/UIMAFit
>>
>> Since UIMAFit is officially becoming part of UIMA, that will be easier, and
>> hopefully people will just learn the easier (in my opinion) UIMAFit way
>> rather than the standard UIMA way of doing things. Is there something we can
>> be doing to make understanding UIMA easier? Or do we just need to say
>> upfront that this is a prerequisite and hope that people don't give up due
>> to this thing that is out of our control?
>>
>> Another hurdle is:
>>
>> 2) cTAKES is a multi-purpose developer-aimed tool
>>
>> So it's not just a matter of hiding complexity -- at some point people have
>> to understand their problem, understand cTAKES' capabilities, and start
>> coding. Pei's GUI will help for some common use cases but will not remove
>> the requirement that someone at the organization knows cTAKES.
>> I think one part of this problem is the fact that the typesystem is not well
>> documented. A developer needs to know what the output is (objects from
>> the typesystem), how to get them (which modules/pipelines), and what
>> information is in them.
>> So maybe on this end my recommendation would be:
>> i) Make the typesystem forefront in documentation -- generate javadocs and
>> have as a link on the ctakes frontpage/sidebar
>> ii) Similar to the way that we are aiming to have tests in every module, also
>> have clearly labeled examples in every module that set up a pipeline, run on
>> sample notes (could be the same sample notes from the tests), and do
>> something with the results.
>> iii) Follow Giri's recommendation to have example training data for people
>> who want to take the next step and train their own models
>>
>> This is quite a bit of developer overhead, so it's worth asking whether you
>> agree with my "diagnosis" and "treatment" or whether you think there are
>> different problems/solutions that should be higher priority.
>>
>> Tim
>>
>> On 06/27/2013 10:59 PM, Girivaraprasad Nambari wrote:
>>> Hi Vijay and Andy,
>>>
>>> Thanks for sharing those examples.
>>>
>>> "Trouble is, privacy requires that these examples be made up by hand"
>>>
>>> I agree with this statement; it is a very valid concern.
>>>
>>> In the "getting started" examples, I think we should have just a couple of
>>> entries (5-10 small entries), not more than that (with an explicit
>>> statement like "ONLY EXAMPLE, NOT GOOD FOR REAL USAGE"). I understand
>>> handcrafting these may not be easy because we are not medical domain
>>> experts, but I feel it is worth the time, because it brings in more of a
>>> user community.
>>>
>>> Thank you,
>>> Giri
>>>
>>> On Thu, Jun 27, 2013 at 10:25 PM, Andy McMurry <mcmurry.a...@gmail.com> wrote:
>>>
>>>> GREAT !
>>>>
>>>> The i2b2 data, though, isn't publicly distributable; you still need to
>>>> request access to it since it is "semi private".
>>>>
>>>> On Jun 27, 2013, at 9:52 PM, vijay garla <vnga...@gmail.com> wrote:
>>>>
>>>>> We released code on using cTAKES to annotate clinical text and SVMs
>>>>> that use the annotations to classify clinical text from the CMC 2007
>>>>> and I2B2 2008 challenges:
>>>>>
>>>>> We did the CMC 2007 with cTAKES 2.5:
>>>>> https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08#Reproducing_results_on_CMC_2007_challenge
>>>>> <https://code.google.com/p/ytex/downloads/list>
>>>>>
>>>>> And the i2b2 2008 with the version of cTAKES distributed with the
>>>>> first version of ARC:
>>>>> https://code.google.com/p/ytex/wiki/FeatEng_V05#i2b2_2008
>>>>>
>>>>> These are both publicly available datasets, and represent real-world
>>>>> problems (in general I believe when publishing a paper the code
>>>>> should be reproducible and made publicly available, but that's a
>>>>> different issue).
>>>>>
>>>>> When we get around to upgrading YTEX to cTAKES 3.1, we would like to
>>>>> upgrade these samples as well.
>>>>>
>>>>> Best,
>>>>>
>>>>> VJ
>>>>>
>>>>> On Thu, Jun 27, 2013 at 8:32 PM, Andy McMurry <mcmurry.a...@gmail.com> wrote:
>>>>>
>>>>>> +1 suggestion for documenting many examples of "getting started"
>>>>>> +NLP datasets.
>>>>>>
>>>>>> I have at least one we can use that was created by our lead
>>>>>> Pathologist
>>>>>>
>>>>>> https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cases/train/traincase.xml
>>>>>>
>>>>>> We should provide at least one sample for each domain.
>>>>>> Trouble is, privacy requires that these examples be made up by hand
>>>>>> and not copy-pasted from EMR systems.
>>>>>>
>>>>>> --Andy
>>>>>>
>>>>>> On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari <girinamb...@gmail.com> wrote:
>>>>>>
>>>>>>> +1 for this observation Andy!
>>>>>>>
>>>>>>> Lowering the time to get started will motivate users to write blogs
>>>>>>> about features, how-tos, etc., which reduces the core team's
>>>>>>> documentation workload.
>>>>>>>
>>>>>>> I have been trying to write a small "how to write a standalone
>>>>>>> client for ctakes" from my experience (I saw at least 4 users
>>>>>>> post similar questions in the last 2 months), but I am not getting
>>>>>>> enough time because ctakes depends on a lot of other frameworks
>>>>>>> (UimaFit, cleartk, UIMA Framework, etc.). Most of my spare time is
>>>>>>> spent juggling between these frameworks, posting to and browsing
>>>>>>> their forums, and relating observations to the ctakes code. I think
>>>>>>> we need to have some high-level documentation about these (with
>>>>>>> links to the corresponding forums).
>>>>>>>
>>>>>>> The above case is for developers (I think this will be the larger
>>>>>>> user base as ctakes progresses); for users I think the
>>>>>>> documentation is a lot better, though some improvements are needed.
>>>>>>>
>>>>>>> As a developer I found the lack of sample training data tough (I am
>>>>>>> still struggling in this area even though I browsed all the
>>>>>>> relevant code), even though the training classes are there. I
>>>>>>> understand that there are licensing issues with REAL data, but we
>>>>>>> could at least have some hand-made example sentences, which may not
>>>>>>> be real but would help developers understand the type/structure of
>>>>>>> input the TRAINING classes expect. This way people who browse the
>>>>>>> code can reverse engineer it and develop their own models. Sorry if
>>>>>>> you feel this is a novice issue, but I feel most developers will be
>>>>>>> novices when they adopt a system, and Machine Learning/NLP is an
>>>>>>> ocean. Some documentation in this area will save a lot of time for
>>>>>>> us.
>>>>>>>
>>>>>>> I hope there will be some activity in this area from the ctakes
>>>>>>> core team.
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Giri
>>>>>>>
>>>>>>> On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry <mcmurry.a...@gmail.com> wrote:
>>>>>>>
>>>>>>>> ctakes is at a point where we have a LOT of features but it is
>>>>>>>> still hard to get started.
>>>>>>>>
>>>>>>>> Judging from the mailing lists, a lot of how cTakes works is not
>>>>>>>> obvious and requires hand holding.
>>>>>>>> This is very typical in early FOSS projects.
>>>>>>>>
>>>>>>>> Lowering the time to get invested in ctakes gets more users AND
>>>>>>>> better bug reports, FAQ, etc.
>>>>>>>>
>>>>>>>> thoughts?
>>>>>>>> --Andy
>>>>>>>>
>>>>>>>> On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <pei.c...@childrens.harvard.edu> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> I just wanted to gauge the interest in creating the next release
>>>>>>>>> of cTAKES (3.1), which is currently marked for May in Jira.
>>>>>>>>> There have already been 22/53 issues [1] marked as fixed or closed.
>>>>>>>>> Plenty of bug fixes and new components including:
>>>>>>>>> - New CEM Instance Template population
>>>>>>>>> - New Dependency Parser/Semantic Role Labeler
>>>>>>>>> - New optional Clear POSTagger
>>>>>>>>> - New regression testing component
>>>>>>>>>
>>>>>>>>> Should we wait for the Temporal component?
>>>>>>>>>
>>>>>>>>> [1] https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES