Maybe creating a made-up set of sentences would be an option? That way we could 
agree on the annotation of concrete cases, although this would be more of a 
unit test than a corpus.
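
Something like this, maybe (a sketch against the OpenNLP API; the strings, 
expected counts, and model file name are all made up, and it assumes the new 
model treats newlines as candidate boundaries when loaded as a plain 
SentenceModel):

  import java.io.FileInputStream;
  import java.io.InputStream;
  import opennlp.tools.sentdetect.SentenceDetectorME;
  import opennlp.tools.sentdetect.SentenceModel;

  // Hand-written newline cases: wrapped prose should stay one sentence,
  // list-style lines should split.
  public class NewlineSplitTest {
      public static void main(String[] args) throws Exception {
          try (InputStream in = new FileInputStream("sd-med-model.zip")) {
              SentenceDetectorME sd = new SentenceDetectorME(new SentenceModel(in));
              // Newline from line wrapping: expect no split.
              check(sd.sentDetect("Pt presents with chest\npain for 2 days."), 1);
              // Newline as a hard break: expect a split.
              check(sd.sentDetect("Meds:\nmetformin 500 mg PO BID"), 2);
          }
      }

      private static void check(String[] sents, int expected) {
          if (sents.length != expected) {
              throw new AssertionError("expected " + expected + ", got " + sents.length);
          }
      }
  }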

Dima

On Sep 27, 2014, at 12:15, Miller, Timothy 
<timothy.mil...@childrens.harvard.edu> wrote:

> I've just been using the OpenNLP command-line cross validator on the small 
> dataset I annotated (along with some eyeballing). It would be cool if there 
> were a standard clinical resource available for this task, but I hadn't 
> considered it much because the data I annotated pulls from multiple datasets, 
> and the process of arranging with different institutions to make something 
> like that available would probably be a nightmare.
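> 
> (For reference, the invocation is roughly the following; the file name and 
> fold count are placeholders, and the training data is one sentence per line:
> 
>   opennlp SentenceDetectorCrossValidator -lang en -data sentences.train -folds 10
> 
> It reports precision/recall/F averaged over the folds.)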
> Tim
> 
> Sent from my iPad. Sorry about the typos.
> 
>> On Sep 27, 2014, at 12:16 PM, "Dligach, Dmitriy" 
>> <dmitriy.dlig...@childrens.harvard.edu> wrote:
>> 
>> Tim, thanks for working on this!
>> 
>> Question: do we have a formal way of evaluating the sentence detector? 
>> Maybe we should come up with a dev set that includes examples from 
>> MIMIC...
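>> 
>> (E.g., given a held-out file with one sentence per line, something roughly 
>> like this with the stock OpenNLP CLI; file names are made up:
>> 
>>   opennlp SentenceDetectorEvaluator -model sd-med-model.zip -data dev.sent -encoding UTF-8
>> 
>> That would give us precision/recall/F on the dev set.)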
>> 
>> Dima
>> 
>>> On Sep 27, 2014, at 8:57, Miller, Timothy 
>>> <timothy.mil...@childrens.harvard.edu> wrote:
>>> 
>>> I have been working on the sentence detector newline issue, training a 
>>> model to probabilistically split sentences on newlines rather than forcing 
>>> sentence breaks. I have checked in a model to the repo under 
>>> ctakes-core-res, and I attached a patch against ctakes-core to the JIRA 
>>> issue for people to test:
>>> https://issues.apache.org/jira/browse/CTAKES-41
>>> 
>>> So far it doesn't seem to break on notes where cTAKES worked well before 
>>> (those where newlines are always sentence breaks), and it is a slight 
>>> improvement on notes where newlines may or may not be sentence breaks. 
>>> Once the change is checked in we can continue improving the model by 
>>> adding more data and features, but the first hurdle I'd like to get past 
>>> is making sure it runs well enough on the type of data the old model 
>>> worked well on. Let me know if you have any questions.
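>>> 
>>> (For the "adding more data" step, the stock OpenNLP trainer invocation is 
>>> roughly:
>>> 
>>>   opennlp SentenceDetectorTrainer -lang en -data train.sent -model sd-med-model.zip -encoding UTF-8
>>> 
>>> with one sentence per line in train.sent. File names here are placeholders, 
>>> and the newline behavior itself presumably also needs the ctakes-core 
>>> patch, not just the retrained model.)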
>>> 
>>> Thanks
>>> Tim
>> 
