Hi Tim,

> It would be preferable to me to put sentence breaks in between the sections, so
> the first two sentences would be:
>
> 1) PE: Lymphonodes...
> 2) Lungs: normal...

The punctuation is (always) after the logical break, being "Term: " for a
Term:Definition list. I think that the first three sentences should be

1) PE:
2) Lymphnodes: neck and ...
3) CV: regular and ...

Where the first line is an overarching Term: sentence (tree root), because each
Term:Definition line that follows is within the physical exam. Just an fyi.

Does that make sense? Haven't had my coffee ...

Sean

> -----Original Message-----
> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
> Sent: Saturday, August 02, 2014 7:44 AM
> To: dev@ctakes.apache.org
> Subject: RE: question about sentence segmentation
>
> I'm annotating some oncology notes from SHARP right now, and they are
> basically a nightmare for our current sentence segmentation model. Mainly
> because they eschew explicit markers between sentences. I thought I'd ping the
> list with some interesting examples just in case it stimulates ideas. But it
> seems to me that at some point we'll have to augment the opennlp module
> (preferable) or roll our own to handle cases like these.
>
> In this example a bunch of background is on one line with no punctuation
> between logical breaks:
> PE: Lymphnodes: neck and axilla without adenopathy Lungs: normal and clear to
> auscultation CV: regular rate and rhythm without murmur or gallop , S1, S2
> normal, no murmur, click, rub or gal*, chest is clear without rales or
> wheezing, no pedal edema, no JVD, no hepatosplenomegaly Breast: negative
> findings right/left breast with mild swelling, warmth, mild erythema, slightly
> tender, no seroma or hematoma Abdomen: Abdomen soft, non-tender.
>
> It would be preferable to me to put sentence breaks in between the sections, so
> the first two sentences would be:
>
> 1) PE: Lymphonodes...
> 2) Lungs: normal...
>
> but without any candidate characters to split the sentence I don't think it is
> possible.
>
> Another example that breaks our model in a different way (truncated):
> 1. Baseline labwork including tumor markers 2. Start DD AC on Friday 8/1 with
> RN chemo teach 3. S U parent study
>
> Our model will break on the period after the number, so we'd probably get:
> 1.
> Baseline labwork including tumor markers 2.
> Start DD.... 3.
> S U parent study
>
> So the number is going in exactly the wrong place. Here it would be preferable
> to get:
> 1.
> Baseline labwork...
> 2.
> Start DD...
> 3.
> S U parent study
>
> Anyways, just something to think about! The problem is much more complex in
> clinical data than in edited text, but I'm sure we all knew that already :)
>
> Tim
>
> ________________________________________
> From: Miller, Timothy [timothy.mil...@childrens.harvard.edu]
> Sent: Monday, July 28, 2014 2:38 PM
> To: dev@ctakes.apache.org
> Subject: Re: question about sentence segmentation
>
> Yes, you're right about that Britt. I've been doing some annotations side by
> side with a treebank viewer and think I have a pretty good handle on the actual
> rules.
>
> Basically, if a header or list identifier is followed by a period or a newline
> it is considered a sentence break and otherwise it is part of the sentence.
>
> e.g.
>
> 1. 20 mg flomax
>
> is two sentences, while:
>
> 1 - 20 mg flomax
>
> is one sentence.
>
> For headings:
>
> Allergies: Pt is allergic to aspirin.
>
> is one sentence, while:
>
> Allergies:
> Pt is allergic to aspirin.
>
> is two sentences.
>
> I'm planning to follow these guidelines.
>
> Tim
>
> On 07/28/2014 01:53 PM, britt fitch wrote:
>
> Thanks for the document, Tim. It seems to not be explicit about how to handle
> sentences occurring in lists.
>
> Are you still considering having the list number as outside of the sentence?
>
> Thanks
>
> Britt
>
> On Jul 25, 2014, at 7:09 AM, Miller, Timothy
> <timothy.mil...@childrens.harvard.edu> wrote:
>
> Checking with Guergana and other colleagues here the advice is to have the
> sentence segmenter follow the treebank guidelines for sentence segmentation:
> http://clear.colorado.edu/compsem/documents/treebank_guidelines.pdf
>
> They are a bit light on detail but fortunately we have some treebanked data so
> I will use that for the training data and hopefully that will illuminate the
> tricky cases.
>
> Tim
>
> ________________________________________
> From: Masanz, James J. [masanz.ja...@mayo.edu]
> Sent: Tuesday, July 15, 2014 4:39 PM
> To: 'dev@ctakes.apache.org'
> Subject: RE: question about sentence segmentation
>
> Sorry, I don't know if there was a reason.
>
> If you haven't checked with Guergana, you might want to ask her if she had a
> reason or if it was just the way it had been since that corpus was created.
>
> -----Original Message-----
> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
> Sent: Tuesday, July 15, 2014 3:34 PM
> To: dev@ctakes.apache.org
> Subject: Re: question about sentence segmentation
>
> Thanks James, I was hoping to hear from you. I'll probably go ahead and change
> the data to split sentences between the list header and list element.
>
> You don't happen to know if there is any principled reason for the original
> style or whether it was just an arbitrary convention? The only thing I can
> think of is it might be hard to learn when to separate when there is no period
> after the list header (as in your examples). I think it's worth empirically
> checking on that point, but there might be other reasons that I'm not thinking
> of.
>
> Thanks
> Tim
>
> On 07/15/2014 03:27 PM, Masanz, James J. wrote:
>
> I don't have an opinion about how it should work.
>
> But I can verify that the clinical notes from Mayo Clinic that were used in the
> initial cTAKES sentence detector model had the list markers included in the
> first sentence, so, for example, the following would be two sentences, with
> each line a separate sentence.
>
> #1 Dilated esophagus.
> #2 Adenocarcinoma
>
> -- James
>
> -----Original Message-----
> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
> Sent: Tuesday, July 15, 2014 6:04 AM
> To: dev@ctakes.apache.org
> Subject: RE: question about sentence segmentation
>
> > My preference is to treat the list row number as outside of the sentence of
> > interest. Or if it is necessary to be included in a sentence, have it be a
> > sentence on its own.
>
> I can get behind this, I think it makes the issue a bit cleaner, to either have
> the list header as non-sentential or its own sentence. As far as I can tell,
> this is not the current default behavior. At least in my runs the list header
> seems to get attached to the first following sentence, even in cases where it
> starts with a digit and a period ("3. Magnesium oxide 400 mg p.o. daily." is
> all one sentence). This behavior is probably strongly dependent on the
> annotations we give the sentence detector so as I'm prepping new training data
> I should have a default in mind.
>
> Does anyone have any objections to changing the sentence detector behavior to
> break list headers (things like "3." or "A " or "#5") as their own sentence?
>
> Tim
>
> ________________________________________
> From: Britt Fitch [britt.fi...@gmail.com]
> Sent: Monday, July 14, 2014 8:29 AM
> To: dev@ctakes.apache.org
> Subject: Re: question about sentence segmentation
>
> My preference is to treat the list row number as outside of the sentence of
> interest.
> Or if it is necessary to be included in a sentence, have it be a sentence on
> its own.
> That won't be as straightforward as splitting on a period in cases like "2.
> Magnesium oxide 400 mg p.o. daily."
> In cases where there is more than 1 written sentence like your example in the
> original email, I'd prefer those were each a sentence rather than making the
> entire list line a single sentence.
> My feeling is that each line without terminating punctuation would be a single
> sentence and would exclude the list number.
>
> As an aside, I have encountered several issues with numbered lists being
> interpreted differently depending on
> 1. what number is included at the start, for example: "2. Magnesium oxide 400
> mg p.o. daily." vs "12. Magnesium oxide 400 mg p.o. daily." (This appears to be
> a chunking issue where the line starting with "12. Magnesium" is identified as
> starting with chunks [O, O, B-NP, B-NP, I-NP, B-NP, B-ADVP, O] even though the
> parts of speech appear to be correct)
> 2. whether there is a period at the end of a list, for example: "4. CHF" vs
> "4. CHF." (This appears to be an issue with the chunker though, which produces
> [O,O] in the first case and [B-VP, B-NP, O] in the second.)
>
> Cheers,
>
> Britt
>
> On Mon, Jul 14, 2014 at 7:50 AM, Miller, Timothy
> <timothy.mil...@childrens.harvard.edu> wrote:
>
> Just curious about an edge case regarding headers/lists and wondering what
> people think the correct behavior and annotation are.
>
> In cases like this:
>
> #1 Dilated esophagus.
> #2 Adenocarcinoma
>
> my intuition is that each whole line is one sentence. But then there are cases
> where the number may be followed by multiple sentences on one line.
> 1. EGD as a complex procedure. If there is an abnormality, obtain biopsies.
>
> For this example my intuition is not as clear. Should there be a break after
> the "1." or should the first sentence be "1. EGD as a complex procedure."?
> Again, my intuition leans towards the latter but it seems a bit odd since the
> "1." kind of distributes over all the following sentences (i.e. it's like a
> paragraph descriptor.)
>
> Does the period after the 1 matter? The number of sentences after the list
> header? The fact that it's all on one line? Anything else?
>
> Tim
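
For reference, the rule from the July 28 message in this thread (a header or
list identifier followed by a period or a newline is a sentence break,
otherwise it stays part of the sentence) can be sketched roughly as below.
This is only an illustration under stated assumptions, not the cTAKES or
OpenNLP implementation: the names LIST_MARKER and segment and the regular
expression are hypothetical, and each input line is treated as at most one
marker plus one sentence, so splitting several sentences that share a line
is left to a real sentence detector.

```python
import re

# Hypothetical pattern (an assumption for this sketch): a list marker such as
# "1", "12", "#5" or "A" at the start of a line, immediately followed by a
# period and then whitespace or end of line.
LIST_MARKER = re.compile(r"^\s*(#?\d+|[A-Za-z])\.(?=\s|$)")


def segment(text):
    """Return a list of sentence strings for the given text.

    A newline always ends a sentence. A leading list marker that is followed
    by a period becomes a sentence of its own. Nothing else inside a line is
    split, so "p.o." or a second sentence on the same line is left alone.
    """
    sentences = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = LIST_MARKER.match(line)
        if match:
            # The marker with its period is its own sentence.
            sentences.append(match.group(0).strip())
            rest = line[match.end():].strip()
            if rest:
                sentences.append(rest)
        else:
            # No marker-plus-period: the whole line is one sentence.
            sentences.append(line)
    return sentences


if __name__ == "__main__":
    for example in ("1. 20 mg flomax",
                    "1 - 20 mg flomax",
                    "Allergies: Pt is allergic to aspirin.",
                    "Allergies:\nPt is allergic to aspirin."):
        print(repr(example), "->", segment(example))
```

With this sketch "1. 20 mg flomax" comes back as two sentences and
"1 - 20 mg flomax" as one, matching the examples given in that message. A
marker without a period, such as "#1 Dilated esophagus.", stays attached to
its line.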