Maybe creating a made-up set of sentences would be an option? That way we could agree on the annotation of concrete cases. Although this would be more of a unit test than a corpus.
Dima

On Sep 27, 2014, at 12:15, Miller, Timothy <timothy.mil...@childrens.harvard.edu> wrote:

> I've just been using the opennlp command line cross validator on the small
> dataset i annotated (along with some eyeballing). It would be cool if there
> was a standard clinical resource available for this task, but I hadn't
> considered it much because the data I annotated pulls from multiple datasets
> and the process of arranging with different institutions to make something
> like that available would probably be a nightmare.
> Tim
>
> Sent from my iPad. Sorry about the typos.
>
>> On Sep 27, 2014, at 12:16 PM, "Dligach, Dmitriy"
>> <dmitriy.dlig...@childrens.harvard.edu> wrote:
>>
>> Tim, thanks for working on this!
>>
>> Question: do we have some formal way of evaluating the sentence detector?
>> Maybe we should come up with some dev set that would include examples from
>> mimic...
>>
>> Dima
>>
>>> On Sep 27, 2014, at 8:57, Miller, Timothy
>>> <timothy.mil...@childrens.harvard.edu> wrote:
>>>
>>> I have been working on the sentence detector newline issue, training a
>>> model to probabilistically split sentences on newlines rather than forcing
>>> sentence breaks. I have checked in a model to the repo under
>>> ctakes-core-res. I also attached a patch to ctakes-core to the jira issue:
>>> https://issues.apache.org/jira/browse/CTAKES-41
>>>
>>> for people to test. The status of my testing is that it doesn't seem to
>>> break on notes where ctakes worked well before (those where newlines are
>>> always sentence breaks), and is a slight improvement on notes where
>>> newlines may or may not be sentence breaks. Once the change is checked in
>>> we can continue improving the model by adding more data and features, but
>>> the first hurdle I'd like to get past is making sure it runs well enough on
>>> the type of data that the old model worked well on. Let me know if you have
>>> any questions.
>>>
>>> Thanks
>>> Tim
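
One way to make the made-up sentence idea concrete is a tiny boundary-level scorer. The sketch below is illustrative only and is not part of cTAKES or OpenNLP: the function names `end_offsets` and `boundary_scores`, the example note text, and its gold annotation are all invented here. It scores a naive baseline that breaks on every newline against a hand-annotated gold split, using precision, recall and F1 over sentence-end character offsets.

```python
def end_offsets(text, sentences):
    """Return the end character offset of each sentence, located in order in text."""
    offsets, cursor = [], 0
    for sentence in sentences:
        start = text.index(sentence, cursor)  # raises ValueError if not found
        cursor = start + len(sentence)
        offsets.append(cursor)
    return offsets


def boundary_scores(gold, predicted):
    """Precision, recall and F1 over two collections of sentence-end offsets."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up note: the first newline is a real sentence break, the second is not.
text = "Patient denies chest pain\nShe reports mild\nshortness of breath."

# Hypothetical gold annotation agreed on by hand.
gold_sentences = [
    "Patient denies chest pain",
    "She reports mild\nshortness of breath.",
]

# Baseline that forces a sentence break at every newline.
baseline_sentences = text.split("\n")

precision, recall, f1 = boundary_scores(
    end_offsets(text, gold_sentences),
    end_offsets(text, baseline_sentences),
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

On this example the newline baseline finds both true boundaries but adds a spurious one after "mild", so recall is 1.0 and precision is 2/3. Swapping `baseline_sentences` for the output of any other sentence detector would score it against the same agreed cases.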