I've just been using the OpenNLP command-line cross validator on the small 
dataset I annotated (along with some eyeballing). It would be nice to have a 
standard clinical resource for this task, but I hadn't considered it much 
because the data I annotated pulls from multiple datasets, and arranging with 
the different institutions to make something like that available would 
probably be a nightmare.
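
For reference, here is a minimal sketch of the kind of boundary-level scoring 
a shared dev set would allow. It assumes gold and predicted sentence ends are 
available as character offsets. The function name and the sample offsets are 
made up for illustration, and this is not a description of what the OpenNLP 
cross validator does internally.

```python
def boundary_prf(gold, predicted):
    """Return (precision, recall, F1) over sentence-end character offsets."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)  # offsets marked as breaks in both sets
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical offsets for a single note (illustrative only, not real data).
gold_offsets = {10, 25, 40, 58}
predicted_offsets = {10, 25, 47, 58, 70}

p, r, f = boundary_prf(gold_offsets, predicted_offsets)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Scoring only the offsets that fall on newlines would isolate the cases the 
new model is meant to handle.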
Tim

Sent from my iPad. Sorry about the typos.

> On Sep 27, 2014, at 12:16 PM, "Dligach, Dmitriy" 
> <dmitriy.dlig...@childrens.harvard.edu> wrote:
> 
> Tim, thanks for working on this!
> 
> Question: do we have some formal way of evaluating the sentence detector? 
> Maybe we should come up with some dev set that would include examples from 
> MIMIC...
> 
> Dima
> 
> 
> 
> 
>> On Sep 27, 2014, at 8:57, Miller, Timothy 
>> <timothy.mil...@childrens.harvard.edu> wrote:
>> 
>> I have been working on the sentence detector newline issue, training a model 
>> to probabilistically split sentences on newlines rather than forcing 
>> sentence breaks. I have checked in a model to the repo under 
>> ctakes-core-res. I also attached a patch for ctakes-core to the JIRA issue 
>> for people to test:
>> https://issues.apache.org/jira/browse/CTAKES-41
>> 
>> The status of my testing is that it doesn't seem to 
>> break on notes where cTAKES worked well before (those where newlines are 
>> always sentence breaks), and is a slight improvement on notes where newlines 
>> may or may not be sentence breaks. Once the change is checked in we can 
>> continue improving the model by adding more data and features, but the first 
>> hurdle I'd like to get past is making sure it runs well enough on the type 
>> of data that the old model worked well on. Let me know if you have any 
>> questions.
>> 
>> Thanks
>> Tim
> 
