Re: Next cTAKES release (3.1)?

Andy McMurry Tue, 02 Jul 2013 12:43:06 -0700

+1 to Tim's suggestion.

On Jul 2, 2013, at 10:13 AM, "Masanz, James J." <masanz.ja...@mayo.edu> wrote:


> I agree with Tim's diagnosis and treatment plan.
> 
> -----Original Message-----
> From: dev-return-1714-Masanz.James=mayo....@ctakes.apache.org 
> [mailto:dev-return-1714-Masanz.James=mayo....@ctakes.apache.org] On Behalf Of 
> Chen, Pei
> Sent: Friday, June 28, 2013 9:00 AM
> To: dev@ctakes.apache.org
> Subject: RE: Next cTAKES release (3.1)?
> 
> I completely agree with making cTAKES easier to use.  I think it is exciting
> to hear the different use cases here and to understand which areas need
> improvement (including some we hadn't thought about earlier).
> I think Tim's suggestions and the 3 concrete actionable items make a lot of
> sense.  Hopefully they will attract new users, adopters, and perhaps more
> committers.
> 
>> i) Make the typesystem forefront in documentation -- generate javadocs and
>> have as a link on the ctakes frontpage/sidebar
>> ii) Similar to the way that we are aiming to have tests in every module, also
>> have clearly labeled examples in every module that set up a pipeline, run on
>> sample notes (could be the same sample notes from the tests), and do
>> something with the results.
>> iii) Follow Giri's recommendation to have example training data for people
>> who want to take the next step and train their own models
> 
> I think Java developers are accustomed to including a library as a
> dependency/jar, calling an API to pass input, and getting the results back
> as POJOs, so the examples could initially shield users from the complexity
> of wiring a pipeline together, etc.
> If we can improve the APIs and how they integrate with other apps, we
> can add any GUI/CLI tools on top of this afterwards.
> 
> --Pei
> 
>> -----Original Message-----
>> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
>> Sent: Friday, June 28, 2013 8:00 AM
>> To: dev@ctakes.apache.org
>> Subject: Re: Next cTAKES release (3.1)?
>> 
>> Very interesting discussion. I think Giri is right about giving example
>> training data in the format that our training code can read. While our
>> ultimate goal would be to build and release models that are completely
>> domain-independent, in the real world it is almost always better to use some
>> domain-specific data and we should think more about how to facilitate that.
>> 
>> As for making it easier to get started, it is not totally clear to me what
>> this means or how to do it, so it might be useful to get specific. I think
>> our biggest hurdle is:
>> 
>> 1) Prerequisite of understanding UIMA/UIMAFit
>> 
>> Since UIMAFit is officially becoming part of UIMA, that will get easier, and
>> hopefully people will just learn the (in my opinion) easier UIMAFit way
>> rather than the standard UIMA way of doing things. Is there something we can
>> do to make understanding UIMA easier? Or do we just need to say up front
>> that this is a prerequisite and hope that people don't give up because of
>> something that is out of our control?
>> 
>> Another hurdle is:
>> 
>> 2) cTAKES is a multi-purpose developer-aimed tool
>> 
>> So it's not just a matter of hiding complexity -- at some point people have
>> to understand their problem, understand cTAKES' capabilities, and start
>> coding. Pei's GUI will help for some common use cases but will not remove
>> the requirement that someone at the organization knows cTAKES.
>> I think one part of this problem is the fact that the typesystem is not well
>> documented. A developer needs to know what the output is (objects from
>> the typesystem), how to get them (which modules/pipelines), and what
>> information is in them. So maybe on this end my recommendation would be:
>> i) Make the typesystem forefront in documentation -- generate javadocs and
>> have as a link on the ctakes frontpage/sidebar
>> ii) Similar to the way that we are aiming to have tests in every module, also
>> have clearly labeled examples in every module that set up a pipeline, run on
>> sample notes (could be the same sample notes from the tests), and do
>> something with the results.
>> iii) Follow Giri's recommendation to have example training data for people
>> who want to take the next step and train their own models
>> 
>> This is quite a bit of developer overhead, so it's worth asking whether you
>> agree with my "diagnosis" and "treatment" or whether you think there are
>> different problems/solutions that should be higher priority.
>> 
>> Tim
>> 
>> On 06/27/2013 10:59 PM, Girivaraprasad Nambari wrote:
>>> Hi Vijay and Andy,
>>> 
>>> Thanks for sharing those examples.
>>> 
>>> "Trouble is, privacy requires that these examples be made up by hand"
>>> 
>>> I agree with this statement; it is a very valid concern.
>>> 
>>> In the "getting started" examples, I think we should have just a few
>>> entries (5-10 small entries), no more than that, with an explicit
>>> statement like "EXAMPLE ONLY, NOT GOOD FOR REAL USAGE". I understand that
>>> handcrafting these may not be easy because we are not medical domain
>>> experts, but I feel it is worth the time because it brings in a larger
>>> user community.
>>> 
>>> Thank you,
>>> Giri
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Thu, Jun 27, 2013 at 10:25 PM, Andy McMurry <mcmurry.a...@gmail.com> wrote:
>>> 
>>>> GREAT !
>>>> 
>>>> The i2b2 data isn't publicly distributable, though; you still need to
>>>> request access to it, since it is "semi-private".
>>>> 
>>>> 
>>>> On Jun 27, 2013, at 9:52 PM, vijay garla <vnga...@gmail.com> wrote:
>>>> 
>>>>> We released code on using cTAKES to annotate clinical text and SVMs
>>>>> that use the annotations to classify clinical text from the CMC 2007
>>>>> and i2b2 2008 challenges:
>>>>> 
>>>>> We did the CMC 2007 with cTAKES 2.5:
>>>>> https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08#Reproducing_results_on_CMC_2007_challenge
>>>>> <https://code.google.com/p/ytex/downloads/list>
>>>>> 
>>>>> And the i2b2 2008 with the version of cTAKES distributed with the
>>>>> first version of ARC:
>>>>> https://code.google.com/p/ytex/wiki/FeatEng_V05#i2b2_2008
>>>>> 
>>>>> These are both publicly available datasets, and represent real-world
>>>>> problems (in general I believe when publishing a paper the code
>>>>> should be reproducible and made publicly available, but that's a
>>>>> different issue).
>>>>> 
>>>>> When we get around to upgrading YTEX to cTAKES 3.1, we would like to
>>>>> upgrade these samples as well.
>>>>> 
>>>>> Best,
>>>>> 
>>>>> VJ
>>>>> 
>>>>> 
>>>>> 
>>>>> On Thu, Jun 27, 2013 at 8:32 PM, Andy McMurry <mcmurry.a...@gmail.com> wrote:
>>>>> 
>>>>>> +1 suggestion for documenting many examples of "getting started"
>>>>>> + NLP datasets.
>>>>>> 
>>>>>> I have at least one we can use that was created by our lead
>>>>>> pathologist:
>>>>>> https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cases/train/traincase.xml
>>>>>> 
>>>>>> We should provide at least one sample for each domain.
>>>>>> Trouble is, privacy requires that these examples be made up by hand
>>>>>> and not copy-pasted from EMR systems.
>>>>>> 
>>>>>> --Andy
>>>>>> 
>>>>>> On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari <girinamb...@gmail.com> wrote:
>>>>>> 
>>>>>>> +1 for this observation Andy!
>>>>>>> 
>>>>>>> Lowering that time will motivate users to write blog posts about
>>>>>>> features, how-tos, etc., which reduces the core team's workload on
>>>>>>> documentation.
>>>>>>> 
>>>>>>> I have been trying to write a small "how to write a standalone client
>>>>>>> for cTAKES" guide from my own experience (I saw at least 4 users post
>>>>>>> similar questions in the last 2 months), but I am not finding enough
>>>>>>> time. Because cTAKES depends on a lot of other frameworks (UIMAFit,
>>>>>>> ClearTK, the UIMA framework, etc.), most of my spare time is spent
>>>>>>> juggling these frameworks, posting on and browsing their forums, and
>>>>>>> relating what I find back to the cTAKES code. I think we need some
>>>>>>> high-level documentation about these dependencies (with links to the
>>>>>>> corresponding forums).
>>>>>>> 
>>>>>>> The above case is for developers (I think they will make up more of
>>>>>>> the user base as cTAKES progresses). For users, I think the
>>>>>>> documentation is a lot better, though some improvements are still
>>>>>>> needed.
>>>>>>> 
>>>>>>> As a developer, I found the lack of sample training data tough (I am
>>>>>>> still struggling in this area even though I have browsed all the
>>>>>>> relevant code), even though the training classes are there. I
>>>>>>> understand that there are licensing issues with REAL data, but at
>>>>>>> least some handmade example sentences, which may not be real, would
>>>>>>> help developers understand the type/structure of input the TRAINING
>>>>>>> classes expect. This way, people who browse the code can reverse
>>>>>>> engineer it and develop their own models. Sorry if you feel this is a
>>>>>>> novice issue, but I think most developers are novices when they adopt
>>>>>>> a system, and machine learning/NLP is an ocean. Some documentation in
>>>>>>> this area will save us a lot of time.
>>>>>>> 
>>>>>>> I hope there will be some activity in this area from the cTAKES core team.
>>>>>>> 
>>>>>>> Thank you,
>>>>>>> Giri
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry <mcmurry.a...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> cTAKES is at a point where we have a LOT of features but it is
>>>>>>>> still hard to get started.
>>>>>>>> 
>>>>>>>> Judging from the mailing lists, a lot of how cTAKES works is not
>>>>>>>> obvious and requires hand-holding.
>>>>>>>> This is very typical in early FOSS projects.
>>>>>>>> 
>>>>>>>> Lowering the time to get invested in cTAKES gets more users AND
>>>>>>>> better bug reports, FAQ, etc.
>>>>>>>> 
>>>>>>>> thoughts?
>>>>>>>> --Andy
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <pei.c...@childrens.harvard.edu> wrote:
>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> I just wanted to gauge interest in creating the next release of
>>>>>>>>> cTAKES (3.1), which is currently marked for May in Jira.
>>>>>>>>> 22 of 53 issues [1] have already been marked as fixed or closed,
>>>>>>>>> with plenty of bug fixes and new components, including:
>>>>>>>>> - New CEM Instance Template population
>>>>>>>>> - New Dependency Parser/Semantic Role Labeler
>>>>>>>>> - New optional Clear POSTagger
>>>>>>>>> - New regression testing component
>>>>>>>>> 
>>>>>>>>> Should we wait for the Temporal component?
>>>>>>>>> 
>>>>>>>>> [1] https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES
>>>>>>>> 
>>>>>> 
>>>> 
>
