Hi Abilash Mathew, Thanks for providing the links - I am sure that they will be helpful to others researching sections.
I am vaguely familiar with your first link, SecTag. As far as I know, it also uses pattern recognition to identify section headers. However, it also tries to validate (or extract) section types based upon the text that follows. Perhaps somebody out there has a cTAKES or UIMA AE that will use it, but I don't know of one. If you create one, please share it! You can find some more SecTag info (and files) here: https://www.vumc.org/cpm/sectag-tagging-clinical-note-section-headers

I have never heard of that tokenization and CRF paper or approach. I don't have time to do more than skim through it, but it looks like they are using an ML-based recognizer. It looks like their model was actually trained using SecTag synonyms as features. I don't know if you could call this under-the-covers pattern recognition or not ...

Their comparison to full-text, window-unbounded dictionary lookup methods looks odd to me. I don't know why anybody would do that. When they do limit the dictionary lookup window to a single sentence, lo and behold, the precision shoots up to .82 and .85. Again, I only skimmed the paper and I'm not trying to critique their method (it looks good); I am only stating that I find their comparison odd. Their method is getting F1 of .90 to .96 depending upon configuration, which is excellent (again, my opinion only). They used the public i2b2 2014 corpus, but I don't see any information about obtaining any code that they might have used. It would be great to get that, in case you write the authors.

I am really running short on time here, so I only took a 10-second glance at the legal section ID paper. It looks like they are using pattern recognition, but instead of encoding it in a regex they use booleans like "leadingAsterisk" and "endsInPeriod". This is like using the regex ^\*.*\.$ (I think).
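The equivalence between those two boolean features and the regex can be checked with a small sketch. This is Python rather than cTAKES code, and the function names are made up for illustration; only the feature names leadingAsterisk and endsInPeriod and the regex itself come from the message above. It assumes each candidate is a single line with any trailing newline already stripped.

```python
import re

# Regex from the message above: a leading asterisk, any text, a trailing period.
HEADER_PATTERN = re.compile(r"^\*.*\.$")


def boolean_features(line: str) -> dict:
    """Hypothetical feature extractor; the keys mirror the paper's boolean names."""
    return {
        "leadingAsterisk": line.startswith("*"),
        "endsInPeriod": line.endswith("."),
    }


def matches_by_features(line: str) -> bool:
    """True when both boolean features hold for the line."""
    features = boolean_features(line)
    return features["leadingAsterisk"] and features["endsInPeriod"]


def matches_by_regex(line: str) -> bool:
    """True when the single regex accepts the line."""
    return HEADER_PATTERN.match(line) is not None


# Both tests should agree on every single-line input.
for sample in ["* Table of contents.", "* Table of contents", "Table of contents.", "*."]:
    print(repr(sample), matches_by_features(sample), matches_by_regex(sample))
```

The boolean form is easier to extend with further features, while the regex form keeps the whole test in one expression that can be read from a file.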
Their approach differs from using just one regex per section in that it is a two-step process: the equivalent of the previous regex, followed by the equivalent of a second set of "contains" patterns. So, first get all lines matching ^\*.*\.$. Then, using regex lookahead (there might be a better way), something like (?=.*\bTable\b)(?=.*\bcontents\b) = Table_of_Contents.

All that being said, it would be very easy to make an AE that:
1. Uses various regex to identify section header candidate sentences. They can be read from a bsv.
2. Uses a bsv listing SectionTitle||word1 word2 word3|TRUE
3. Reads the bsv and processes each candidate sentence to see whether it contains all of the words (column 2) and, if required, contains those words exclusively (column 3), and then assigns a section title (column 1).

Without looking at the code, I think that the RegexSectionizer in ctakes core could easily be extended to do this in a couple of hours. Then it would just be a matter of creating the list.

At this point I should probably admit that I do have a sectionizer in a private project that does something similar. It first identifies candidate sentences via regex. It reads in a list/table of section titles and synonyms for those titles and then attempts synonym matching. The same thing could be done using a list/table extracted from SecTag for headers and synonyms.

Ok, back to my real job ...

Thanks again for the excellent links,
Sean

-----Original Message-----
From: abilash.mat...@cognizant.com [mailto:abilash.mat...@cognizant.com]
Sent: Thursday, October 12, 2017 12:54 AM
To: dev@ctakes.apache.org
Subject: RE: segmentation [EXTERNAL]

Sean,

When I tried a pattern-based approach, we ran into issues correctly identifying the segments during testing. I was searching for a better solution and found a couple of articles that talk about NLP and statistical-model-based approaches. See the links below. Let me know if you have any insight into these approaches.
https://urldefense.proofpoint.com/v2/url?u=https-3A__www.ncbi.nlm.nih.gov_pmc_articles_PMC3002123_&d=DwIFAg&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=siIm_idCCJKx_vKf-0UxVEi5frE1bqB3T2dVaYQeKSA&s=RYmYt2bV1ClcojTwpjhzKdGDZ4_69GTZuKEYORtDy04&e=
https://urldefense.proofpoint.com/v2/url?u=https-3A__www.hindawi.com_journals_bmri_2015_873012_&d=DwIFAg&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=siIm_idCCJKx_vKf-0UxVEi5frE1bqB3T2dVaYQeKSA&s=AriQEqNVgPqN97AWcHZcAwDxSwX_vuH_LyHtWAvsfNI&e=
https://urldefense.proofpoint.com/v2/url?u=http-3A__ceur-2Dws.org_Vol-2D710_paper23.pdf&d=DwIFAg&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=siIm_idCCJKx_vKf-0UxVEi5frE1bqB3T2dVaYQeKSA&s=MAdp_ACJwKPAjtFqhzzIeGFWZU-icnw3UO4HL8ITDo0&e=

Thanks,
Abilash Mathew

-----Original Message-----
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
Sent: Wednesday, October 11, 2017 8:54 PM
To: dev@ctakes.apache.org
Subject: RE: segmentation [EXTERNAL]

Hi Matthew,

Could you explain:
> is there a better annotator ..?

You could take a look at the BsvRegexSectionizer in ctakes-core. It is a little more robust than the CDASegmentAnnotator, but they are both pattern-based. One immediate advantage is that the BsvRegexSectionizer has a default list of section names and expressions in ctakes-core-res, named DefaultSectionRegex.bsv, that is much more populated than the ccda_sections.txt file.

Though it is an ultimate goal, I don't know of any section annotator that can be plugged in and immediately catch every section for every domain at every institution. If you give me a little more to go on, maybe I can be more helpful.
Sean

-----Original Message-----
From: abilash.mat...@cognizant.com [mailto:abilash.mat...@cognizant.com]
Sent: Wednesday, October 11, 2017 10:51 AM
To: dev@ctakes.apache.org
Subject: segmentation [EXTERNAL]

Hi All,

We are currently using CDASegmentAnnotator for segmenting medical records. It is a pattern-based annotator, and I would like to know whether there is a better annotator available for segmenting documents. It is very difficult to accommodate all the patterns in CDASegmentAnnotator, as we are dealing with medical records from multiple service providers.

Thanks,
Abilash Mathew

This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient(s), please reply to the sender and destroy all copies of the original message. Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this email, and/or any action taken in reliance on the contents of this e-mail is strictly prohibited and may be unlawful. Where permitted by applicable law, this e-mail and other e-mail communications sent to and from Cognizant e-mail addresses may be monitored.
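For anyone reading this thread later, the three-step AE outlined in the most recent reply (regex candidates, a bsv of title rules, then word matching) could be sketched roughly as follows. This is a minimal Python sketch, not cTAKES or UIMA code. The single "|" column delimiter, the rule rows, the candidate patterns, and every function name here are assumptions made for illustration; a real implementation would extend the sectionizer in ctakes-core and read both lists from files.

```python
import re
from typing import List, NamedTuple, Optional, Pattern


class TitleRule(NamedTuple):
    title: str          # column 1: section title to assign
    words: frozenset    # column 2: words the candidate must contain
    exclusive: bool     # column 3: candidate may contain only those words


def parse_rules(bsv_text: str) -> List[TitleRule]:
    """Parse bar-separated rows; assumes exactly three columns split on '|'."""
    rules = []
    for raw in bsv_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        title, words, exclusive = (col.strip() for col in line.split("|"))
        rules.append(
            TitleRule(title, frozenset(words.lower().split()), exclusive.upper() == "TRUE")
        )
    return rules


def tokenize(sentence: str) -> frozenset:
    """Lower-case alphanumeric words, ignoring punctuation such as '*' and ':'."""
    return frozenset(re.findall(r"[a-z0-9]+", sentence.lower()))


def assign_title(
    sentence: str, candidate_patterns: List[Pattern], rules: List[TitleRule]
) -> Optional[str]:
    # Step 1: only sentences matching some candidate regex are considered.
    if not any(p.search(sentence) for p in candidate_patterns):
        return None
    tokens = tokenize(sentence)
    # Steps 2 and 3: the first rule whose words are all present wins.
    for rule in rules:
        if not rule.words <= tokens:
            continue
        if rule.exclusive and tokens != rule.words:
            continue
        return rule.title
    return None


# Illustrative inputs only; neither list comes from cTAKES or SecTag.
CANDIDATE_PATTERNS = [
    re.compile(r"^\*.*\.$"),              # leading asterisk, trailing period
    re.compile(r"^[A-Z][A-Za-z ]*:$"),    # capitalized phrase ending in a colon
]
RULES = parse_rules(
    """
    Table_of_Contents|table contents|FALSE
    Medications|medications|TRUE
    """
)

for text in ["* Table of contents.", "Medications:", "Current Medications:"]:
    print(repr(text), "->", assign_title(text, CANDIDATE_PATTERNS, RULES))
```

Comparing word sets instead of chaining lookaheads keeps the "contains all" and "contains exclusively" checks explicit, at the cost of ignoring word order.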