Hi Andrey,

I gave up on the CDA Sectionizer and created the RegexSectionizer in core. It 
is incredibly simple: it just takes in a list of regexes that fit section 
headers, and footers if they exist. If you can create a regex that fits most 
situations, like a blank line followed by some short all-caps string, then you 
are all set. I don't think I checked in unit tests, since for my project I am 
relying on regression tests.
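
To give a rough idea of the approach, here is a small sketch in Python. It is 
not the RegexSectionizer code or its configuration format; the pattern and the 
split_sections function are made up for illustration, assuming headers are 
short all-caps lines preceded by a blank line.

```python
import re

# Hypothetical header pattern (an assumption, not the one used in cTAKES):
# a blank line, then a short all-caps line, optionally ending in a colon.
HEADER = re.compile(r"(?:^|\n)[ \t]*\n(?P<name>[A-Z][A-Z /&-]{2,40}):?[ \t]*\n")


def split_sections(text):
    """Return (header, body) pairs; text before the first header gets None."""
    sections = []
    name, start = None, 0
    for match in HEADER.finditer(text):
        body = text[start:match.start()].strip()
        # Skip an empty preamble, but keep empty bodies of named sections.
        if body or name is not None:
            sections.append((name, body))
        name, start = match.group("name").strip(), match.end()
    sections.append((name, text[start:].strip()))
    return sections


note = (
    "Patient seen today.\n"
    "\n"
    "FAMILY HISTORY\n"
    "Mother had diabetes.\n"
    "\n"
    "MEDICATIONS:\n"
    "Aspirin 81 mg daily.\n"
)

for header, body in split_sections(note):
    print(header, "->", body)
```

A single pattern like this will miss mixed-case or inline headers, which is why 
a list of patterns is more practical than one.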

Have a look if you like.

Sean

-----Original Message-----
From: Andrey Kurdumov [mailto:kant2...@googlemail.com] 
Sent: Wednesday, February 22, 2017 12:40 PM
To: cTakes developers list
Subject: Re: Section finder performance characteristics

Thanks Chen for your response,

I read the source code of the CDA sectionizer and came to the same conclusion: 
it is highly specific to the data it is working on.

Unfortunately, on my project I will not know what the data looks like until 
I start working with it. I expect the data to be very diverse, so 
handcrafting cda_section.txt for each dataset would be too expensive for me.
I would like to have some sort of sectionizer that recognizes 95%-99% of 
sections, without mapping to LOINC/HL7 initially. I need that precision because 
I will use the section name further down the pipeline to narrow the search for 
conditions in each section. For example, if I find a 'Family history' section, 
I can restrict the search to SNOMED concepts related to family history and 
discard the others.

What I am trying to find out is: how does the existing CDA sectionizer perform, 
in numbers?
Has anybody been able to create a custom cda_section.txt file that works well 
across a diverse set of clinical notes?
What size of dataset was the CDA sectionizer tested on?
I expect that the current implementation would not meet my goals on a wide 
range of clinical notes from different domains, since at some point it will 
very likely start producing regressions. But I would like somebody to prove my 
assumptions wrong.

I am also interested in what the process is for improving the CDA sectionizer. 
Right now there are no test cases for it, the dataset it was tested on is 
unknown to me, and if I make a change that works for me, it will likely break 
something for somebody else, which is bad. Does anybody have an idea how this 
could be handled?

Best regards,
Andrey



2017-02-22 22:25 GMT+06:00 Lin, Chen <chen....@childrens.harvard.edu>:

> Hi Andrey,
>
> The CDA sectionizer is a rule/RegEx based method for section header 
> matching. It follows the consolidated CDA/HL7 standard for defining a 
> section header template. The template format is:
> HL7 template id, LOINC Section Code, and a list of n header names 
> (case insensitive, n can be as many as possible)
>
> For example, a history related section-header template can be defined as:
> history,1,brief history of physical illness,history of present 
> illness,history of the present illness
>
> "history" is the entry id (named by yourself), "1" is the section code 
> (named by yourself), and the rest are the variants of history-section 
> headers that appear in a dataset. Note that it is very specific: if you 
> only list "history of present illness", it will not find "history of 
> [the] present illness" unless you list both.
>
> As you can see, it's a strict template matching algorithm, so if you 
> know your data, especially all the section headers, it can surely do 
> the job. I have used the CDA sectionizer for two projects. The notes I 
> processed had a standard section header format, so the performance was 
> acceptable.
>
> Hope it is helpful.
>
> Best,
> Chen
>
>
>
> On 2/22/17, 3:23 AM, "Andrey Kurdumov" <kant2...@googlemail.com> wrote:
>
> >Does anybody know the expected performance of the current CDA 
> >section finder in cTakes?
> >
> >How was it created, given that I don't see any test cases for it? Was 
> >it created on a public or a private dataset?
> >
> >Best regards,
> >Andrey Kurdyumov
>
>
