Thanks, Chen, for your response. I read the source code of the CDA sectionizer and came to the same conclusion: it is highly specific to the data it works on.

Unfortunately, on my project I will not know what the data looks like until I start working with it. I expect the data to be very diverse, so handcrafting cda_section.txt for each dataset would be too expensive for me. I would like to have some sort of sectionizer that recognizes 95%-99% of sections, without mapping to LOINC/HL7 initially. I need that level of accuracy because I will use the section name further down the pipeline to narrow the search for conditions in each section. For example, if I find a 'Family history' section, I can restrict the search to SNOMED concepts related to family history and discard the others.

What I am trying to find out: what is the performance of the existing CDA sectionizer, in numbers? Has anybody been able to create a custom cda_section.txt file that works well across a diverse set of clinical notes? What size of datasets was the CDA sectionizer tested on?

I expect that the current implementation would not meet my goals on a wide range of clinical notes from different domains, since at some point it will very likely start producing regressions. But I would be glad if somebody proved my assumptions wrong.

I am also interested in what the process for improving the CDA sectionizer is. Right now there are no test cases for it, the dataset on which it was tested is unknown to me, and if I make a change that works for me, it is likely to break something for somebody else, which is bad. Does anybody have an idea how this could be handled?

Best regards,
Andrey

2017-02-22 22:25 GMT+06:00 Lin, Chen <chen....@childrens.harvard.edu>:

> Hi Andrey,
>
> The CDA sectionizer is a rule/RegEx based method for section header
> matching. It follows the consolidated CDA/HL7 standard for defining a
> section header template. The template format is:
> HL7 template id, LOINC Section Code, and a list of n header names (case
> insensitive, n can be as many as possible)
>
> For example, a history related section-header template can be defined as:
> history,1,brief history of physical illness,history of present
> illness,history of the present illness
>
> "history" is the entry id (named by yourself),
> "1" is the Section code (named by yourself),
> The rest are the permutation of history-section headers that appear in a
> dataset. Note it is very specific, if you only list "history of present
> illness", it will not find "history of [the] present illness" unless you
> list both.
>
> As you can see it's a strict template matching algorithm, so if you know
> your data, especially all the section headers, it can surely do the job. I
> have used CDA sectionizer for two projects. Those notes I processed were
> with standard section header format so the performance was acceptable.
>
> Hope it is helpful.
>
> Best,
> Chen
>
>
>
> On 2/22/17, 3:23 AM, "Andrey Kurdumov" <kant2...@googlemail.com> wrote:
>
> >Does anybody know what expected performance of the current CDA section
> >finder in cTakes?
> >
> >How it was created, since I don't see any test cases for it? Does
> >it was created on public or private dataset?
> >
> >Best regards,
> >Andrey Kurdyumov
> >
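
To make the behaviour described in the quoted message concrete, here is a minimal Python sketch of strict, case-insensitive template matching over lines of the form "id,code,header1,header2,...". It is not the cTAKES implementation: the function names (parse_template_line, build_lookup, match_header, coverage) are hypothetical, and stripping a trailing colon from a candidate header is my own assumption. The coverage function is one possible way to put a number on how many headers of a dataset a given template file recognizes.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def parse_template_line(line: str) -> Tuple[str, str, List[str]]:
    """Split 'id,code,header1,header2,...' into (id, code, [headers])."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) < 3:
        raise ValueError(f"expected id, code and at least one header: {line!r}")
    entry_id, code, *headers = fields
    # Header names are case insensitive, so store them lowercased.
    return entry_id, code, [h.lower() for h in headers if h]


def build_lookup(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map every listed header name to its (id, code) pair."""
    lookup: Dict[str, Tuple[str, str]] = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        entry_id, code, headers = parse_template_line(line)
        for header in headers:
            lookup[header] = (entry_id, code)
    return lookup


def match_header(lookup: Dict[str, Tuple[str, str]],
                 candidate: str) -> Optional[Tuple[str, str]]:
    """Exact match only; no fuzzy matching, no optional words."""
    # Assumption: a trailing colon is not part of the header name.
    key = candidate.strip().rstrip(":").strip().lower()
    return lookup.get(key)


def coverage(lookup: Dict[str, Tuple[str, str]],
             headers: List[str]) -> float:
    """Fraction of the given headers that the template recognizes."""
    if not headers:
        return 0.0
    hits = sum(1 for h in headers if match_header(lookup, h) is not None)
    return hits / len(headers)


if __name__ == "__main__":
    template = [
        "history,1,brief history of physical illness,history of present illness",
    ]
    lookup = build_lookup(template)
    print(match_header(lookup, "History of Present Illness:"))
    print(match_header(lookup, "history of the present illness"))
```

With only "history of present illness" listed, the variant containing "the" is not matched, which is exactly the strictness Chen describes. Running coverage over a fixed list of hand-labelled headers before and after each change to the template file would be one simple way to detect regressions.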