How about this idea for the training/test set:
1) Start with a document with NO newlines; perhaps the entire document is a single paragraph.
2) Any sentence detector should then be able to parse it correctly.
3) Deterministically add newlines to the document: some after punctuation, some after a word, some after a sentence fragment.
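
Something like the sketch below is what I have in mind for step 3. It is only a rough sketch; the class and method names are placeholders, and whatever sentence detector we trust on newline-free text would supply the gold boundaries:

/**
 * Rough sketch of step 3 above (names are placeholders, not anything in
 * cTAKES): given a document with no newlines, deterministically replace
 * some spaces with newlines so that breaks land after punctuation, after
 * an arbitrary word, or inside a sentence fragment.
 */
public class NewlineInjector {

    // wordInterval should be >= 1; smaller values give noisier line breaks.
    public static String injectNewlines(String doc, int wordInterval) {
        StringBuilder out = new StringBuilder();
        int words = 0;
        for (int i = 0; i < doc.length(); i++) {
            char c = doc.charAt(i);
            if (c == ' ') {
                words++;
                char prev = (i > 0) ? doc.charAt(i - 1) : ' ';
                // Break after sentence-ending punctuation, and after every
                // Nth word so some breaks fall mid-sentence (after a word
                // or a fragment rather than at a sentence boundary).
                boolean afterPunct = (prev == '.' || prev == '?' || prev == '!');
                boolean afterNthWord = (words % wordInterval == 0);
                if (afterPunct || afterNthWord) {
                    out.append('\n');   // replace the space with a newline
                    continue;
                }
            }
            out.append(c);
        }
        return out.toString();
    }
}

Since each newline replaces a single space character, the gold sentence offsets from the newline-free version stay valid in the generated version, so the same annotations can score any line-break style.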
Jejo

On Sep 29, 2014, at 3:43 PM, Chen, Pei <[email protected]> wrote:

> Assuming we have a representative training set, are there any objections if we default cTAKES to this SentenceAnnotator + Model?
> For the upcoming release:
> - Consolidate the existing sentence detector and the YTEX sentence detector into this new one?
> - Allow a config parameter to still allow an override of a hard break on newline chars. That way, we won't have to maintain multiple sentence annotators and it'll be less confusing for new users...
>
> --Pei
>
>> -----Original Message-----
>> From: Miller, Timothy [mailto:[email protected]]
>> Sent: Monday, September 29, 2014 2:47 PM
>> To: [email protected]
>> Subject: Re: sentence detector model
>>
>> That does sound like it would be useful, since MIMIC does have both kinds of linebreak styles in different notes. If I did some annotations on such a dataset, would it be re-distributable, say on the PhysioNet website? I believe the ShARe project has a download site there (it is a layer of annotations on MIMIC). Another option would be you posting your raw data there and I could post offset-based annotations on a public repo like GitHub.
>> Tim
>>
>> On 09/29/2014 01:54 PM, Peter Szolovits wrote:
>>> I have a set of about 27K documents from MIMIC (circa 2009) in which I have replaced the weird PHI markers with synthesized pseudonymous data. These have natural sentence breaks (typically in the middle of lines), normal paragraph structure, bulleted lists, etc. Assuming it goes to people who have signed the MIMIC DUA, I could provide these if you are interested. --Pete Sz.
>>>
>>> On Sep 29, 2014, at 1:37 PM, Miller, Timothy <[email protected]> wrote:
>>>
>>>> Some of them are a bit artificial for this task, with notes being annotated as one sentence per line and offset punctuation. I think the 2008 and 2009 data might have original formatting, though, with newlines not always breaking sentences. That has certain advantages over raw MIMIC for training, since the PHI isn't so weirdly formatted, but then again it is not a mix of styles (that is, the style where a newline always terminates a sentence vs. the one where it sometimes does). I think it would still have to be paired with another dataset to be a representative sample.
>>>> Tim
>>>>
>>>> On 09/29/2014 01:24 PM, vijay garla wrote:
>>>>> Why not use the i2b2 corpora?
>>>>>
>>>>> On Monday, September 29, 2014, Dligach, Dmitriy <[email protected]> wrote:
>>>>>
>>>>>> Maybe creating a made-up set of sentences would be an option? That way we could agree on the annotation of concrete cases. Although this would be more of a unit test than a corpus.
>>>>>>
>>>>>> Dima
>>>>>>
>>>>>> On Sep 27, 2014, at 12:15, Miller, Timothy <[email protected]> wrote:
>>>>>>
>>>>>>> I've just been using the OpenNLP command-line cross validator on the small dataset I annotated (along with some eyeballing). It would be cool if there were a standard clinical resource available for this task, but I hadn't considered it much because the data I annotated pulls from multiple datasets, and the process of arranging with different institutions to make something like that available would probably be a nightmare.
>>>>>>> Tim
>>>>>>>
>>>>>>> Sent from my iPad. Sorry about the typos.
>>>>>>>
>>>>>>>> On Sep 27, 2014, at 12:16 PM, "Dligach, Dmitriy" <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Tim, thanks for working on this!
>>>>>>>>
>>>>>>>> Question: do we have some formal way of evaluating the sentence detector? Maybe we should come up with some dev set that would include examples from MIMIC...
>>>>>>>>
>>>>>>>> Dima
>>>>>>>>
>>>>>>>>> On Sep 27, 2014, at 8:57, Miller, Timothy <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> I have been working on the sentence detector newline issue, training a model to probabilistically split sentences on newlines rather than forcing sentence breaks. I have checked in a model to the repo under ctakes-core-res. I also attached a patch to ctakes-core to the JIRA issue:
>>>>>>>>>
>>>>>>>>> https://issues.apache.org/jira/browse/CTAKES-41
>>>>>>>>>
>>>>>>>>> for people to test. The status of my testing is that it doesn't seem to break on notes where cTAKES worked well before (those where newlines are always sentence breaks), and it is a slight improvement on notes where newlines may or may not be sentence breaks. Once the change is checked in, we can continue improving the model by adding more data and features, but the first hurdle I'd like to get past is making sure it runs well enough on the type of data that the old model worked well on. Let me know if you have any questions.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Tim
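
Just so I am sure I follow the above: below is roughly how I picture Tim's model fitting together with the newline override Pei suggested. The OpenNLP classes are the real ones; the class name, the flag, and the assumption that the model was trained with newline as a candidate end-of-sentence character are only my guesses at the wiring:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class NewlineAwareSentenceSplitter {

    private final SentenceDetectorME detector;
    private final boolean breakOnNewline;  // proposed config override; name is a placeholder

    public NewlineAwareSentenceSplitter(String modelPath, boolean breakOnNewline) throws IOException {
        try (InputStream in = new FileInputStream(modelPath)) {
            this.detector = new SentenceDetectorME(new SentenceModel(in));
        }
        this.breakOnNewline = breakOnNewline;
    }

    public String[] split(String text) {
        if (breakOnNewline) {
            // Old behavior: every newline is a hard sentence break.
            return text.split("\\r?\\n");
        }
        // New behavior: let the trained model decide which candidate
        // positions end a sentence, assuming the model was trained with
        // newline as a candidate end-of-sentence character.
        return detector.sentDetect(text);
    }
}

If that roughly matches the CTAKES-41 patch, the flag would cover the old hard-break behavior without keeping a second annotator around.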
