Hi Abilash Mathew,

Thanks for providing the links - I am sure that they will be helpful to others 
researching sections.

I am vaguely familiar with your first link to SecTag.  As far as I know it also 
uses pattern recognition to identify section headers.  However, it also tries 
to validate (or extract) section types based upon the following text.  Perhaps 
somebody out there has a ctakes or uima ae that will use it, but I don't know 
of one.  If you create one please share it!  You can find some more SecTag info 
(and files) here: 
https://www.vumc.org/cpm/sectag-tagging-clinical-note-section-headers

I have never heard of that tokenization and crf paper or approach.  I don't 
have time to do more than skim through it, but it looks like they are using an 
ML-based recognizer.  It looks like their model was actually trained using 
SecTag synonyms as features.  I don't know if you could call this under-the-covers 
pattern recognition or not ...  Their comparison to full-text, window-unbounded 
dictionary lookup methods looks odd to me.  I don't know why anybody would do 
that.  When they do limit their lookup window to sentence only with dictionary 
lookup, lo and behold the precision shoots up to .82 and .85.  Again, I only 
skimmed the paper and I'm not trying to critique their method (it looks good), 
I am only stating that I find their comparison odd.   Their method is getting 
F1 .90 to .96 based upon configuration, which is excellent (again, my opinion 
only).  They used the public i2b2 2014 corpus, but I don't see any information 
about obtaining any code that they might have used.  It would be great to get 
that - in case you write the authors.

I am really running short on time here, so I only took a 10 second glance at 
the legal section id paper.  It looks like they are using pattern recognition, 
but instead of encoding it in a regex they use booleans like "leadingAsterisk" 
and "endsInPeriod".  This is like using regex ^\*.*\.$ (I think).  It is 
different from just one regex per section in that they have a two-step process, 
using the equivalent of the previous regex followed by the equivalent of a 
"contains" pattern.  So, get all ^\*.*\.$.  Then use regex lookahead (there 
might be a better way), something like (?=.*\bTable\b)(?=.*\bcontents\b) = 
Table_of_Contents.  All that being said, it would be very easy to make an ae 
that
1.  Uses various regex to identify section header candidate sentences.  They 
can be read from a bsv.
2.  Uses a bsv listing SectionTitle|word1 word2 word3|TRUE
3.  Reads the bsv and processes each candidate sentence to see whether it 
contains all words (column 2) and, if column 3 is TRUE, only those words, and 
then assigns a section title (column 1).
Without looking at the code, I think that the RegexSectionizer in ctakes core 
could easily be extended to do this in a couple of hours.  Then it would just 
be a matter of creating the list.
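
As a rough illustration of the three steps above, here is a minimal sketch in 
Python rather than as a real ctakes/uima ae.  Everything in it is an 
assumption made up for the example: the single candidate regex, the two bsv 
rows, and the function names.  It is not the RegexSectionizer code and not the 
paper's method.

```python
import re

# Step 1: regex for section header candidates (hypothetical; could be read from a bsv).
CANDIDATE_PATTERNS = [re.compile(r"^\*.*\.$")]

# Step 2: hypothetical bsv rows in the form SectionTitle|required words|exclusive flag.
RULES_BSV = """\
Table_of_Contents|table contents|FALSE
Definitions|definitions|TRUE
"""


def parse_rules(bsv_text):
    """Turn bsv text into (title, required_words, exclusive) tuples."""
    rules = []
    for line in bsv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        title, words, exclusive = line.split("|")
        rules.append((title, set(words.lower().split()),
                      exclusive.strip().upper() == "TRUE"))
    return rules


def is_candidate(sentence):
    """True if any candidate regex matches the sentence."""
    return any(p.search(sentence) for p in CANDIDATE_PATTERNS)


def section_title(sentence, rules):
    """Step 3: return the first title whose rule fits the candidate, else None."""
    if not is_candidate(sentence):
        return None
    words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    for title, required, exclusive in rules:
        if not required <= words:
            continue  # column 2: every listed word must be present
        if exclusive and words != required:
            continue  # column 3: no words beyond the listed ones
        return title
    return None


RULES = parse_rules(RULES_BSV)
print(section_title("* Table of Contents.", RULES))
print(section_title("* Definitions and terms.", RULES))
```

Comparing word sets avoids the lookahead regex entirely, which is one 
candidate for the "better way" mentioned above, but it ignores word order and 
repeated words.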

At this point I should probably admit that I do have a sectionizer in a private 
project that does something similar.  It first identifies candidate sentences 
via regex.  It reads in a list/table of section titles and synonyms for those 
titles and then attempts synonym matching.  The same thing could be done using 
a list/table extracted from SecTag for headers and synonyms.
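
To make the synonym idea concrete, here is a small Python sketch.  It is not 
the private sectionizer and it does not use real SecTag data: the header 
regex, the titles, and the synonym lists are all invented for illustration.

```python
import re

# Hypothetical table of section titles and their synonyms.
SYNONYMS = {
    "History of Present Illness": ["hpi", "history of present illness",
                                   "present illness"],
    "Medications": ["medications", "meds", "current medications"],
}

# Hypothetical candidate regex: a short line of letters ending in a colon.
HEADER = re.compile(r"^\s*([A-Za-z][A-Za-z /]*?)\s*:\s*$")


def normalize(text):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def build_lookup(table):
    """Invert the title -> synonyms table into synonym -> title."""
    lookup = {}
    for title, synonyms in table.items():
        for synonym in synonyms:
            lookup[normalize(synonym)] = title
    return lookup


def match_header(line, lookup):
    """Return the section title for a candidate header line, else None."""
    m = HEADER.match(line)
    if m is None:
        return None  # not a candidate at all
    return lookup.get(normalize(m.group(1)))


LOOKUP = build_lookup(SYNONYMS)
print(match_header("HPI:", LOOKUP))
print(match_header("Current  Medications :", LOOKUP))
```

A candidate line with no known synonym (for example "Allergies:" with this 
table) returns None, so the table is what decides coverage.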

Ok, back to my real job ...

Thanks again for the excellent links,
Sean




-----Original Message-----
From: abilash.mat...@cognizant.com [mailto:abilash.mat...@cognizant.com] 
Sent: Thursday, October 12, 2017 12:54 AM
To: dev@ctakes.apache.org
Subject: RE: segmentation [EXTERNAL]

Sean,

When I tried the pattern-based approach, we ran into issues correctly 
identifying the segments during testing. I was searching for a better 
solution and found a couple of articles that talk about NLP and statistical 
model based approaches. See the links below.  Let me know if you have any 
insight into these approaches.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3002123/
https://www.hindawi.com/journals/bmri/2015/873012/
http://ceur-ws.org/Vol-710/paper23.pdf


Thanks,
Abilash Mathew

-----Original Message-----
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Wednesday, October 11, 2017 8:54 PM
To: dev@ctakes.apache.org
Subject: RE: segmentation [EXTERNAL]

Hi Matthew,

Could you explain:
> is there a better annotator ..?

You could take a look at the BsvRegexSectionizer in ctakes-core.  It is a 
little more robust than the CDASegmentAnnotator, but they are both 
pattern-based.  One immediate advantage is that the BsvRegexSectionizer has a 
default list of section names and expressions in ctakes-core-res named 
DefaultSectionRegex.bsv that is much more populated than the ccda_sections.txt 
file.

Though it is an ultimate goal, I don't know of any section annotator that can 
be plugged in and immediately catch every section for every domain at every 
institution.  If you give me a little more to go on maybe I can be more helpful.

Sean

-----Original Message-----
From: abilash.mat...@cognizant.com [mailto:abilash.mat...@cognizant.com]
Sent: Wednesday, October 11, 2017 10:51 AM
To: dev@ctakes.apache.org
Subject: segmentation [EXTERNAL]

Hi All,

We are currently using CDASegmentAnnotator for segmenting the medical records. 
It is a pattern-based annotator.  I would like to know: is there a better 
annotator available for segmenting the documents?  It is very difficult to 
accommodate all the patterns in CDASegmentAnnotator as we are dealing with 
medical records from multiple service providers.

Thanks,
Abilash Mathew
This e-mail and any files transmitted with it are for the sole use of the 
intended recipient(s) and may contain confidential and privileged information. 
If you are not the intended recipient(s), please reply to the sender and 
destroy all copies of the original message. Any unauthorized review, use, 
disclosure, dissemination, forwarding, printing or copying of this email, 
and/or any action taken in reliance on the contents of this e-mail is strictly 
prohibited and may be unlawful. Where permitted by applicable law, this e-mail 
and other e-mail communications sent to and from Cognizant e-mail addresses may 
be monitored.
