Re: sentence detector model

2014-09-29 Thread Dligach, Dmitriy
Maybe creating a made-up set of sentences would be an option? That way we could agree on the annotation of concrete cases. Although this would be more of a unit test than a corpus. Dima On Sep 27, 2014, at 12:15, Miller, Timothy wrote: > I've just been using the opennlp command line cross

Re: sentence detector model

2014-09-29 Thread vijay garla
Why not use the i2b2 corpora? On Monday, September 29, 2014, Dligach, Dmitriy < dmitriy.dlig...@childrens.harvard.edu> wrote: > Maybe creating a made-up set of sentences would be an option? That way we > could agree on the annotation of concrete cases. Although this would be > more of a unit test

Re: sentence detector model

2014-09-29 Thread Miller, Timothy
Some of them are a bit artificial for this task, with notes being annotated as one sentence per line and offset punctuation. I think maybe the 2008 and 2009 data might have original formatting though, with newlines not always breaking sentences. That has certain advantages over raw MIMIC for traini

RE: sentence detector model

2014-09-29 Thread Savova, Guergana
How about pairing it with THYME and MiPACQ? Perhaps you are using them already... --Guergana -Original Message- From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] Sent: Monday, September 29, 2014 1:38 PM To: dev@ctakes.apache.org Subject: Re: sentence detector model Som

Re: sentence detector model

2014-09-29 Thread Peter Szolovits
I have a set of about 27K documents from MIMIC (circa 2009) in which I have replaced the weird PHI markers by synthesized pseudonymous data. These have natural sentence breaks (typically in the middle of lines), normal paragraph structure, bulleted lists, etc. Assuming it goes to people who ha

De-identified lab tests dataset

2014-09-29 Thread Ajay Jain
Hello All, I am working on a use case for lab tests data using cTAKES and my online search to find a test dataset has been futile. I would greatly appreciate it if someone could share such a dataset or point me in the right direction to go looking for one. Best, Ajay -- Founder & CEO Mobile Insigh

Re: De-identified lab tests dataset

2014-09-29 Thread Peter Szolovits
Ajay, I'm confused by your query. cTAKES is good at interpreting text, but most lab test results are reported in tabular form that is most appropriately searched by SQL queries. Sometimes lab results are also reported in narrative notes, but parsing those is often more a matter of deciphering

Re: sentence detector model

2014-09-29 Thread Karthik Sarma
That sounds like it would be perfect for this task On Monday, September 29, 2014, Peter Szolovits wrote: > I have a set of about 27K documents from MIMIC (circa 2009) in which I > have replaced the weird PHI markers by synthesized pseudonymous data. > These have natural sentence breaks (typicall

Re: sentence detector model

2014-09-29 Thread Miller, Timothy
That does sound like it would be useful since MIMIC does have both kinds of linebreak styles in different notes. If I did some annotations on such a dataset would it be re-distributable, say on the physionet website? I believe the ShARe project has a download site there (it is a layer of annotation

RE: De-identified lab tests dataset

2014-09-29 Thread Savova, Guergana
Ajay, cTAKES currently does not implement a method to discover labs from the text. The motivation is that you can get that easily from the structured part of the EMR (what Pete explained below). Hope this makes sense! --Guergana -Original Message- From: Peter Szolovits [mailto:p...@mit.e

RE: sentence detector model

2014-09-29 Thread Chen, Pei
Assuming we have a representative training set, are there any objections if we default cTAKES to this SentenceAnnotator + Model? For the upcoming release: - Consolidate the existing sentence detector and the ytex sentence detector into this new one? - Allow a config parameter to still allow an override of

Re: sentence detector model

2014-09-29 Thread Koola, Jejo David
How about this idea for the training/test set: 1) Start with a document with NO newlines. Perhaps just the entire document is a single paragraph. 2) Then, any sentence detector should be able to parse it correctly. 3) Then, deterministically add newlines to the document: some after punctuation;
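A minimal sketch of the newline-injection idea in the message above, in Python. The preview is cut off after "some after punctuation", so everything beyond that is an assumption: the function name `build_example`, the parameters `break_after_every` and `wrap_every`, and the choice to also break lines mid-sentence are hypothetical and not taken from the thread or from cTAKES. The sketch assumes each gold sentence uses single spaces between words. Because every separator is exactly one character, the gold sentence offsets are the same with or without the injected newlines.

```python
from typing import List, Tuple


def build_example(
    sentences: List[str],
    break_after_every: int = 2,  # hypothetical: newline after every 2nd sentence
    wrap_every: int = 7,         # hypothetical: newline after every 7th word
) -> Tuple[str, List[Tuple[int, int]]]:
    """Join gold sentences into one text and deterministically turn some
    separating spaces into newlines.

    Returns the text and the gold (start, end) character span of each sentence.
    """
    pieces: List[str] = []
    spans: List[Tuple[int, int]] = []
    pos = 0          # current character offset in the output text
    word_count = 0   # running word count across the whole document

    for i, sentence in enumerate(sentences):
        words = sentence.split(" ")
        start = pos
        for j, word in enumerate(words):
            pieces.append(word)
            pos += len(word)
            word_count += 1

            if j == len(words) - 1:
                # End of a gold sentence: record its span.
                spans.append((start, pos))
                if i == len(sentences) - 1:
                    break  # no separator after the final sentence
                # Newline after punctuation for every Nth sentence.
                sep = "\n" if (i + 1) % break_after_every == 0 else " "
            else:
                # Assumed extension: a newline in the middle of a sentence.
                sep = "\n" if word_count % wrap_every == 0 else " "

            pieces.append(sep)
            pos += 1

    return "".join(pieces), spans
```

Since the gold spans are known by construction, a detector's output on the newline-injected text can be compared span by span against the same detector's output on the newline-free text.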

Re: De-identified lab tests dataset

2014-09-29 Thread Ajay Jain
Sorry, I wasn't clear. I am working on a related project and trying to figure out if the code can be repurposed for a lab mention annotator for cTAKES. From what I have seen, test names from different institutions are not standardized, which makes it hard to standardize the resulting annotation.