Assuming we have a representative training set, are there any objections if we default cTAKES to this SentenceAnnotator + Model?

For the upcoming release:
- Consolidate the existing sentence detector and the ytex sentence detector into this new one?
- Keep a config parameter so users can still override this and force a hard break on newline chars.

That way, we won't have to maintain multiple sentence annotators and it'll be less confusing for new users...
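To make the override idea concrete, here is a minimal sketch of the intended behaviour. It is not the cTAKES code: the names `split_sentences`, `hard_break_on_newline` and `newline_decider` are made up for illustration, the punctuation handling is deliberately naive, and the default decider is a simple heuristic standing in for the trained model.

```python
import re


def default_newline_decider(text, idx):
    """Hypothetical stand-in for the trained model: treat the newline at
    position idx as a sentence break only if the text before it already
    ends in terminal punctuation."""
    return text[:idx].rstrip().endswith((".", "?", "!"))


def split_sentences(text, hard_break_on_newline=False,
                    newline_decider=default_newline_decider):
    """Split text into sentences.

    hard_break_on_newline=True restores the old behaviour (every newline
    ends a sentence). Otherwise each newline is passed to newline_decider,
    which says whether it is a real sentence break.
    """
    sentences, start = [], 0
    # Candidate boundaries: newlines, or . ? ! followed by whitespace/end.
    for match in re.finditer(r"\n|[.?!](?=\s|$)", text):
        if match.group() == "\n":
            if not (hard_break_on_newline
                    or newline_decider(text, match.start())):
                continue  # newline is only a line wrap; keep going
            end = match.start()
        else:
            end = match.end()  # keep the punctuation with its sentence
        sentence = " ".join(text[start:end].split())
        if sentence:
            sentences.append(sentence)
        start = end
    tail = " ".join(text[start:].split())
    if tail:
        sentences.append(tail)
    return sentences
```

With the note `"Patient reports chest\npain since Monday.\nNo fever"`, the default call returns two sentences (the first newline is treated as a line wrap), while `hard_break_on_newline=True` returns three, one per line. Keeping both paths in one function is what would let a single annotator replace the existing ones.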
--Pei

> -----Original Message-----
> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
> Sent: Monday, September 29, 2014 2:47 PM
> To: dev@ctakes.apache.org
> Subject: Re: sentence detector model
>
> That does sound like it would be useful since MIMIC does have both kinds of
> linebreak styles in different notes. If I did some annotations on such a
> dataset would it be re-distributable, say on the physionet website? I believe
> the ShARe project has a download site there (it is a layer of annotations on
> MIMIC). Another option would be you posting your raw data there and I
> could post offset-based annotations on a public repo like github.
> Tim
>
>
> On 09/29/2014 01:54 PM, Peter Szolovits wrote:
> > I have a set of about 27K documents from MIMIC (circa 2009) in which I
> > have replaced the weird PHI markers by synthesized pseudonymous data.
> > These have natural sentence breaks (typically in the middle of lines), normal
> > paragraph structure, bulleted lists, etc. Assuming it goes to people who have
> > signed the MIMIC DUA, I could provide these if you are interested. --Pete Sz.
> >
> > On Sep 29, 2014, at 1:37 PM, Miller, Timothy
> > <timothy.mil...@childrens.harvard.edu> wrote:
> >
> >> Some of them are a bit artificial for this task, with notes being
> >> annotated as one sentence per line and offset punctuation. I think
> >> maybe the 2008 and 2009 data might have original formatting though,
> >> with newlines not always breaking sentences. That has certain
> >> advantages over raw MIMIC for training since the PHI isn't so weirdly
> >> formatted, but then again is not a mix of styles (that is, the styles
> >> of newline always terminates sentence vs. sometimes terminates
> >> sentence). I think it would still have to be paired with another dataset to
> >> be a representative sample.
> >> Tim
> >>
> >> On 09/29/2014 01:24 PM, vijay garla wrote:
> >>> Why not use the i2b2 corpora?
> >>>
> >>> On Monday, September 29, 2014, Dligach, Dmitriy <
> >>> dmitriy.dlig...@childrens.harvard.edu> wrote:
> >>>
> >>>> Maybe creating a made-up set of sentences would be an option? That
> >>>> way we could agree on the annotation of concrete cases. Although
> >>>> this would be more of a unit test than a corpus.
> >>>>
> >>>> Dima
> >>>>
> >>>> On Sep 27, 2014, at 12:15, Miller, Timothy <
> >>>> timothy.mil...@childrens.harvard.edu> wrote:
> >>>>
> >>>>> I've just been using the opennlp command line cross validator on
> >>>>> the small dataset i annotated (along with some eyeballing). It would be
> >>>>> cool if there was a standard clinical resource available for this
> >>>>> task, but I hadn't considered it much because the data I annotated
> >>>>> pulls from multiple datasets and the process of arranging with
> >>>>> different institutions to make something like that available would
> >>>>> probably be a nightmare.
> >>>>> Tim
> >>>>>
> >>>>> Sent from my iPad. Sorry about the typos.
> >>>>>
> >>>>>> On Sep 27, 2014, at 12:16 PM, "Dligach, Dmitriy" <
> >>>>>> dmitriy.dlig...@childrens.harvard.edu> wrote:
> >>>>>> Tim, thanks for working on this!
> >>>>>>
> >>>>>> Question: do we have some formal way of evaluating the sentence
> >>>>>> detector? Maybe we should come up with some dev set that would
> >>>>>> include examples from mimic...
> >>>>>> Dima
> >>>>>>
> >>>>>>> On Sep 27, 2014, at 8:57, Miller, Timothy <
> >>>>>>> timothy.mil...@childrens.harvard.edu> wrote:
> >>>>>>> I have been working on the sentence detector newline issue,
> >>>>>>> training a model to probabilistically split sentences on newlines
> >>>>>>> rather than forcing sentence breaks. I have checked in a model to
> >>>>>>> the repo under ctakes-core-res. I also attached a patch to
> >>>>>>> ctakes-core to the jira issue:
> >>>>>>> https://issues.apache.org/jira/browse/CTAKES-41
> >>>>>>>
> >>>>>>> for people to test. The status of my testing is that it doesn't
> >>>>>>> seem to break on notes where ctakes worked well before (those where
> >>>>>>> newlines are always sentence breaks), and is a slight improvement
> >>>>>>> on notes where newlines may or may not be sentence breaks. Once the
> >>>>>>> change is checked in we can continue improving the model by adding
> >>>>>>> more data and features, but the first hurdle I'd like to get past
> >>>>>>> is making sure it runs well enough on the type of data that the old
> >>>>>>> model worked well on. Let me know if you have any questions.
> >>>>>>> Thanks
> >>>>>>> Tim