Assuming we have a representative training set, are there any objections if we 
default cTAKES to this SentenceAnnotator + Model?
For the upcoming release:
- Consolidate the existing sentence detector and the ytex sentence detector 
into this new one? 
- Keep a config parameter that can still force a hard break on newline 
chars.  That way, we won't have to maintain multiple sentence annotators, 
and it'll be less confusing for new users.
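
To make the second bullet concrete, here is a minimal sketch of how such a switch might behave. Everything in it is hypothetical: the function name, the `hard_break_on_newline` parameter, and the punctuation rule are illustrations, not the cTAKES API. In particular, the simple regex stands in for the trained model, which would instead decide probabilistically whether each newline ends a sentence.

```python
import re

# Hypothetical stand-in for the trained model: split after ., ! or ?
# when followed by whitespace. The real annotator would use a classifier.
_PUNCT_BREAK = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text, hard_break_on_newline=False):
    """Return a list of sentence strings.

    hard_break_on_newline=True:  every newline ends a sentence
                                 (the old hard-break behaviour).
    hard_break_on_newline=False: newlines are treated as ordinary
                                 whitespace, so a sentence may span lines.
    """
    # With the override on, cut at every newline before anything else.
    chunks = text.split("\n") if hard_break_on_newline else [text]

    sentences = []
    for chunk in chunks:
        for piece in _PUNCT_BREAK.split(chunk):
            # Collapse internal whitespace (including any newlines kept
            # in soft-break mode) and drop empty fragments.
            piece = " ".join(piece.split())
            if piece:
                sentences.append(piece)
    return sentences
```

With a note such as "Patient denies chest\npain. No fever.", the default mode keeps "Patient denies chest pain." together, while the override splits it at the newline, so one annotator could cover both note styles.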

--Pei 


> -----Original Message-----
> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
> Sent: Monday, September 29, 2014 2:47 PM
> To: dev@ctakes.apache.org
> Subject: Re: sentence detector model
> 
> That does sound like it would be useful since MIMIC does have both kinds of
> linebreak styles in different notes. If I did some annotations on such a
> dataset would it be re-distributable, say on the physionet website? I believe
> the ShARe project has a download site there (it is a layer of annotations on
> MIMIC). Another option would be you posting your raw data there and I
> could post offset-based annotations on a public repo like github.
> Tim
> 
> 
> On 09/29/2014 01:54 PM, Peter Szolovits wrote:
> > I have a set of about 27K documents from MIMIC (circa 2009) in which I
> > have replaced the weird PHI markers by synthesized pseudonymous data.
> > These have natural sentence breaks (typically in the middle of lines), normal
> > paragraph structure, bulleted lists, etc.  Assuming it goes to people who have
> > signed the MIMIC DUA, I could provide these if you are interested.  --Pete
> > Sz.
> >
> > On Sep 29, 2014, at 1:37 PM, Miller, Timothy
> > <timothy.mil...@childrens.harvard.edu> wrote:
> >
> >> Some of them are a bit artificial for this task, with notes being
> >> annotated as one sentence per line and offset punctuation. I think
> >> maybe the 2008 and 2009 data might have original formatting though,
> >> with newlines not always breaking sentences. That has certain
> >> advantages over raw MIMIC for training since the PHI isn't so weirdly
> >> formatted, but then again is not a mix of styles (that is, the styles
> >> of newline always terminates sentence vs. sometimes terminates
> >> sentence). I think it would still have to be paired with another dataset to
> >> be a representative sample.
> >> Tim
> >>
> >> On 09/29/2014 01:24 PM, vijay garla wrote:
> >>> Why not use the i2b2 corpora?
> >>>
> >>> On Monday, September 29, 2014, Dligach, Dmitriy <
> >>> dmitriy.dlig...@childrens.harvard.edu> wrote:
> >>>
> >>>> Maybe creating a made-up set of sentences would be an option? That
> >>>> way we could agree on the annotation of concrete cases. Although
> >>>> this would be more of a unit test than a corpus.
> >>>>
> >>>> Dima
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Sep 27, 2014, at 12:15, Miller, Timothy <
> >>>> timothy.mil...@childrens.harvard.edu> wrote:
> >>>>
> >>>>> I've just been using the opennlp command line cross validator on
> >>>>> the small dataset I annotated (along with some eyeballing). It would be
> >>>>> cool if there was a standard clinical resource available for this
> >>>>> task, but I hadn't considered it much because the data I annotated
> >>>>> pulls from multiple datasets and the process of arranging with
> >>>>> different institutions to make something like that available would
> >>>>> probably be a nightmare.
> >>>>> Tim
> >>>>>
> >>>>> Sent from my iPad. Sorry about the typos.
> >>>>>
> >>>>>> On Sep 27, 2014, at 12:16 PM, "Dligach, Dmitriy" <
> >>>>>> dmitriy.dlig...@childrens.harvard.edu> wrote:
> >>>>>> Tim, thanks for working on this!
> >>>>>>
> >>>>>> Question: do we have some formal way of evaluating the sentence
> >>>>>> detector? Maybe we should come up with some dev set that would
> >>>>>> include examples from MIMIC...
> >>>>>> Dima
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> On Sep 27, 2014, at 8:57, Miller, Timothy <
> >>>>>>> timothy.mil...@childrens.harvard.edu> wrote:
> >>>>>>> I have been working on the sentence detector newline issue,
> >>>>>>> training a model to probabilistically split sentences on newlines
> >>>>>>> rather than forcing sentence breaks. I have checked in a model to
> >>>>>>> the repo under ctakes-core-res. I also attached a patch to
> >>>>>>> ctakes-core to the jira issue:
> >>>>>>> https://issues.apache.org/jira/browse/CTAKES-41
> >>>>>>>
> >>>>>>> for people to test. The status of my testing is that it doesn't
> >>>>>>> seem to break on notes where ctakes worked well before (those where
> >>>>>>> newlines are always sentence breaks), and is a slight improvement
> >>>>>>> on notes where newlines may or may not be sentence breaks. Once the
> >>>>>>> change is checked in we can continue improving the model by adding
> >>>>>>> more data and features, but the first hurdle I'd like to get past
> >>>>>>> is making sure it runs well enough on the type of data that the old
> >>>>>>> model worked well on. Let me know if you have any questions.
> >>>>>>> Thanks
> >>>>>>> Tim
> >
