I'm annotating some oncology notes from SHARP right now, and they are basically 
a nightmare for our current sentence segmentation model, mainly because they 
eschew explicit markers between sentences. I thought I'd ping the list with 
some interesting examples in case they stimulate ideas. It seems to me that at 
some point we'll have to either augment the opennlp module (preferable) or 
roll our own to handle cases like these.

In this example a bunch of background is on one line with no punctuation 
between logical breaks:
PE: Lymphnodes: neck and axilla without adenopathy Lungs: normal and clear to 
auscultation CV: regular rate and rhythm without murmur or gallop , S1, S2 
normal, no murmur, click, rub or gal*, chest is clear without rales or 
wheezing, no pedal edema, no JVD, no hepatosplenomegaly Breast: negative 
findings right/left breast with mild swelling, warmth, mild erythema, slightly 
tender, no seroma or hematoma Abdomen: Abdomen soft, non-tender.

It would be preferable to me to put sentence breaks in between the sections, so 
the first two sentences would be:

1) PE: Lymphnodes...
2) Lungs: normal...

but without any candidate characters to split the sentence I don't think it is 
possible.
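
One way to get candidate split points where there is no punctuation is to
treat the whitespace in front of a header-like token as a candidate. A rough
sketch in Python, assuming a header is a capitalized token ending in a colon
("Lungs:", "CV:") and that a header directly following another header
("PE: Lymphnodes:") stays attached to it. Both the pattern and the function
name are illustrative assumptions, not part of cTAKES or OpenNLP:

```python
import re

# Assumed header shape: capitalized token ending in a colon, e.g. "Lungs:".
# The lookbehind skips whitespace that directly follows another colon, so
# "PE: Lymphnodes:" is not broken apart.
HEADER_BREAK = re.compile(r"(?<=[^:\s])\s+(?=[A-Z][A-Za-z]*:(?:\s|$))")

def split_on_headers(text):
    """Split text at whitespace that precedes a header-like token."""
    return [part.strip() for part in HEADER_BREAK.split(text) if part.strip()]

note = ("PE: Lymphnodes: neck and axilla without adenopathy "
        "Lungs: normal and clear to auscultation "
        "CV: regular rate and rhythm "
        "Abdomen: Abdomen soft, non-tender.")
for sentence in split_on_headers(note):
    print(sentence)
```

A pattern like this would also fire on something like "Time: 3 pm" in running
text, so it is probably more useful as a candidate generator feeding a trained
model than as a hard rule.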

Another example that breaks our model in a different way (truncated):
1. Baseline labwork including tumor markers  2. Start DD AC on Friday 8/1 with 
RN chemo teach  3. S U parent study

Our model will break on the period after the number, so we'd probably get:
1.
Baseline labwork including tumor markers 2.
Start DD.... 3.
S U parent study

So the number is going in exactly the wrong place. Here it would be preferable 
to get:
1.
Baseline labwork...
2.
Start DD...
3.
S U parent study
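
A minimal sketch of the preferred output for this case, assuming list markers
are one or two digits followed by a period and set off by whitespace on both
sides. The regex and the function name are illustrative, not existing cTAKES
code:

```python
import re

# Assumed marker shape: 1-2 digits plus a period, with whitespace (or the
# start of the text) before it and whitespace after it. The lookbehind keeps
# digits inside a larger token (e.g. a date written "8/1.") from matching.
LIST_MARKER = re.compile(r"(?<!\S)(\d{1,2}\.)(?=\s)")

def split_inline_list(text):
    """Return list markers and list items as separate sentences."""
    # re.split keeps the captured markers in the result list.
    pieces = LIST_MARKER.split(text)
    return [p.strip() for p in pieces if p.strip()]

plan = ("1. Baseline labwork including tumor markers  "
        "2. Start DD AC on Friday 8/1 with RN chemo teach  "
        "3. S U parent study")
for sentence in split_inline_list(plan):
    print(sentence)
```

A sentence that simply ends in a number ("... increase to 20. Recheck ...")
would be split wrongly by this pattern, so it only makes sense on text already
known to be a list.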

Anyways, just something to think about! The problem is much more complex in 
clinical data than in edited text, but I'm sure we all knew that already :)

Tim


________________________________________
From: Miller, Timothy [timothy.mil...@childrens.harvard.edu]
Sent: Monday, July 28, 2014 2:38 PM
To: dev@ctakes.apache.org
Subject: Re: question about sentence segmentation

Yes, you're right about that, Britt. I've been doing some annotations side by 
side with a treebank viewer and think I have a pretty good handle on the actual 
rules.

Basically, if a header or list identifier is followed by a period or a newline, 
the identifier is its own sentence; otherwise it is part of the sentence that 
follows.

e.g.

1. 20 mg flomax

is two sentences, while:

1 - 20 mg flomax

is one sentence.

For headings:

Allergies: Pt is allergic to aspirin.

is one sentence, while:

Allergies:
Pt is allergic to aspirin.

is two sentences.
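
The rule above can be sketched roughly as follows. The marker shapes (digits
with an optional "#", a single capital letter, or words ending in a colon) are
an assumed reading of the examples in this thread, not something taken from
the treebank guidelines, and `split_marker` is a hypothetical helper:

```python
import re

# Assumed marker shapes: "1", "#2", "A" for lists; "Allergies:" for headers.
LIST_ID = r"#?\d+|[A-Z]"
HEADER = r"[A-Za-z][A-Za-z /]*:"
# Break when a list identifier carries a period, or when a newline directly
# follows the list identifier or header.
BREAK = re.compile(
    rf"\s*((?:{LIST_ID})\.(?=\s)|(?:{LIST_ID}|{HEADER})(?=[ \t]*\n))"
)

def split_marker(text):
    """Split a leading header/list identifier off as its own sentence."""
    m = BREAK.match(text)
    if m is None:
        return [text.strip()]          # marker (if any) stays in the sentence
    rest = text[m.end():].strip()
    return [m.group(1)] + ([rest] if rest else [])

for example in ["1. 20 mg flomax",
                "1 - 20 mg flomax",
                "Allergies: Pt is allergic to aspirin.",
                "Allergies:\nPt is allergic to aspirin."]:
    print(split_marker(example))
```

This only looks at the start of the text; splitting the remainder into further
sentences would still be left to the sentence detector.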

I'm planning to follow these guidelines.

Tim

On 07/28/2014 01:53 PM, britt fitch wrote:

Thanks for the document, Tim. It seems to not be explicit about how to handle 
sentences occurring in lists.

Are you still considering having the list number as outside of the sentence?

Thanks

Britt

On Jul 25, 2014, at 7:09 AM, Miller, Timothy 
<timothy.mil...@childrens.harvard.edu> wrote:



After checking with Guergana and other colleagues here, the advice is to have 
the sentence segmenter follow the treebank guidelines for sentence segmentation:
http://clear.colorado.edu/compsem/documents/treebank_guidelines.pdf

They are a bit light on detail but fortunately we have some treebanked data so 
I will use that for the training data and hopefully that will illuminate the 
tricky cases.

Tim

________________________________________
From: Masanz, James J. [masanz.ja...@mayo.edu]
Sent: Tuesday, July 15, 2014 4:39 PM
To: dev@ctakes.apache.org
Subject: RE: question about sentence segmentation

Sorry, I don't know if there was a reason.

If you haven't checked with Guergana, you might want to ask her if she had a 
reason or if it was just the way it had been since that corpus was created.

-----Original Message-----
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
Sent: Tuesday, July 15, 2014 3:34 PM
To: dev@ctakes.apache.org
Subject: Re: question about sentence segmentation

Thanks James, I was hoping to hear from you. I'll probably go ahead and
change the data to split sentences between the list header and list element.

You don't happen to know if there is any principled reason for the
original style or whether it was just an arbitrary convention? The only
thing I can think of is it might be hard to learn when to separate when
there is no period after the list header (as in your examples). I think
it's worth empirically checking on that point, but there might be other
reasons that I'm not thinking of.

Thanks
Tim

On 07/15/2014 03:27 PM, Masanz, James J. wrote:


I don't have an opinion about how it should work.

But I can verify that the clinical notes from Mayo Clinic that were used in the 
initial cTAKES sentence detector model had the list markers included in the 
first sentence, so, for example, the following would be two sentences, with 
each line a separate sentence.

#1 Dilated esophagus.
#2 Adenocarcinoma

-- James

-----Original Message-----
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
Sent: Tuesday, July 15, 2014 6:04 AM
To: dev@ctakes.apache.org
Subject: RE: question about sentence segmentation



My preference is to treat the list row number as outside of the sentence of 
interest. Or if it is necessary to be included in a sentence, have it be a 
sentence on its own.

I can get behind this. I think it makes the issue a bit cleaner to have the 
list header be either non-sentential or its own sentence. As far as I can tell, 
this is not the current default behavior. At least in my runs, the list header 
seems to get attached to the first following sentence, even in cases where it 
starts with a digit and a period ("3. Magnesium oxide 400 mg p.o. daily." is 
all one sentence).
This behavior is probably strongly dependent on the annotations we give the 
sentence detector, so as I'm prepping new training data I should have a default 
in mind.

Does anyone have any objections to changing the sentence detector behavior to 
break list headers (things like "3." or "A " or "#5") as their own sentence?
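
For illustration, a small sketch of what "list header as its own sentence"
could look like for the three shapes mentioned here ("3.", "A ", "#5"). The
pattern and the name `peel_list_header` are assumptions made for this example,
not the behavior of the current sentence detector:

```python
import re

# Assumed header shapes, taken from the three examples above:
# digits plus a period ("3."), "#" plus digits ("#5"), or a lone capital ("A").
LIST_HEADER = re.compile(r"\s*(\d+\.|#\d+|[A-Z])\s+(?=\S)")

def peel_list_header(line):
    """Return [header, rest] if the line starts with a list header."""
    m = LIST_HEADER.match(line)
    if m is None:
        return [line.strip()]
    return [m.group(1), line[m.end():].strip()]

print(peel_list_header("3. Magnesium oxide 400 mg p.o. daily."))
print(peel_list_header("#5 CHF"))
print(peel_list_header("A Continue aspirin"))
```

The lone-capital case is ambiguous: a line such as "A patient with CHF" would
also have its "A" peeled off, which is one reason a trained model with good
annotations is likely a better fit than a fixed pattern.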

Tim


________________________________________
From: Britt Fitch [britt.fi...@gmail.com]
Sent: Monday, July 14, 2014 8:29 AM
To: dev@ctakes.apache.org
Subject: Re: question about sentence segmentation

My preference is to treat the list row number as outside of the sentence of
interest.
Or if it is necessary to be included in a sentence, have it be a sentence
on its own.
That won't be as straightforward as splitting on a period in cases
like "2. Magnesium
oxide 400 mg p.o. daily."
In cases where there is more than one written sentence, like your example in
the original email, I'd prefer those were each a sentence rather than
making the entire list line a single sentence.
My feeling is that each line without terminating punctuation would be a
single sentence and would exclude the list number.

As an aside, I have encountered several issues with numbered lists being
interpreted differently depending on
1. what number is included at the start
for example: "2. Magnesium oxide 400 mg p.o. daily." vs "12. Magnesium
oxide 400 mg p.o. daily." (This appears to be a chunking issue where the
line starting with "12. Magnesium" is identified as starting with chunks [O,
O, B-NP, B-NP, I-NP, B-NP, B-ADVP, O] even though the parts of speech
appear to be correct)
2. whether there is a period at the end of a list item
for example: "4. CHF" vs "4. CHF." (This appears to be an issue with the
chunker, though, which produces [O,O] in the first case and [B-VP, B-NP, O]
in the second.)

Cheers,

Britt



On Mon, Jul 14, 2014 at 7:50 AM, Miller, Timothy
<timothy.mil...@childrens.harvard.edu> wrote:



Just curious about an edge case regarding headers/lists and wondering what
people think the correct behavior and annotation are.

In cases like this:

#1 Dilated esophagus.
#2 Adenocarcinoma

my intuition is that each whole line is one sentence. But then there are
cases where the number may be followed by multiple sentences on one line.
1. EGD as a complex procedure. If there is an abnormality, obtain biopsies.

For this example my intuition is not as clear. Should there be a break
after the "1." or should the first sentence be "1. EGD as a complex
procedure."? Again, my intuition leans towards the latter but it seems a
bit odd since the "1." kind of distributes over all the following sentences
(i.e. it's like a paragraph descriptor.)

Does the period after the 1 matter? The number of sentences after the list
header? The fact that it's all on one line? Anything else?

Tim









