[ https://issues.apache.org/jira/browse/CTAKES-155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Finan closed CTAKES-155.
-----------------------------
      Assignee: Sean Finan
    Resolution: Workaround

There are newer sectionizers that can be used instead of that old engine.

> SimpleSegmentWithTagsAnnotator assumes all section names are 5 characters
> -------------------------------------------------------------------------
>
>                 Key: CTAKES-155
>                 URL: https://issues.apache.org/jira/browse/CTAKES-155
>             Project: cTAKES
>          Issue Type: Bug
>          Components: ctakes-core
>    Affects Versions: 3.0-incubating
>            Reporter: Steven Bethard
>            Assignee: Sean Finan
>            Priority: Major
>             Fix For: future enhancement
>
>
> The code in SimpleSegmentWithTagsAnnotator is a bit hard to follow, but I
> believe it assumes all sections are 5 characters long here:
> {code:java}
> fileReader.read(sectIdArr, 0, 5);
> {code}
> As a result, when the section name is longer than that, some part of the
> section heading (e.g. for a 6 letter section name, the final "]") is left in
> the text of the next section. This results, for example, in the dependency
> parser choking:
> {code:java}
> Caused by: java.lang.NullPointerException
>     at clear.pos.PosEnLib.isNoun(PosEnLib.java:56)
>     at clear.morph.MorphEnAnalyzer.getException(MorphEnAnalyzer.java:273)
>     at clear.morph.MorphEnAnalyzer.getLemma(MorphEnAnalyzer.java:247)
> {code}
> I would fix this but:
> (1) There are no tests for SimpleSegmentWithTagsAnnotator and its
> documentation actually says "Creates a single segment annotation that spans
> the entire document" which is just untrue, so I'm not really sure what this
> annotator is intended to do.
> (2) Even if I make some assumptions about what it's intended to do, the code
> is written in an extremely brittle fashion, and I'm afraid to make changes to
> that.
> For what it's worth, here's what I think the annotator should really
> look like:
> {code:java}
> public static class SegmentsFromBracketedSectionTagsAnnotator extends JCasAnnotator_ImplBase {
>   private static Pattern SECTION_PATTERN = Pattern.compile(
>       "(\\[start section id=\"?(.*?)\"?\\]).*?(\\[end section id=\"?(.*?)\"?\\])",
>       Pattern.DOTALL);
>
>   @Override
>   public void process(JCas jCas) throws AnalysisEngineProcessException {
>     Matcher matcher = SECTION_PATTERN.matcher(jCas.getDocumentText());
>     while (matcher.find()) {
>       Segment segment = new Segment(jCas);
>       segment.setBegin(matcher.start() + matcher.group(1).length());
>       segment.setEnd(matcher.end() - matcher.group(3).length());
>       segment.setId(matcher.group(2));
>       segment.addToIndexes();
>     }
>   }
> }
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
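
The pattern proposed in the issue can be tried outside UIMA with a minimal stand-alone sketch like the one below. It is not cTAKES code: `SectionTagDemo`, `Span`, and `findSections` are hypothetical names, `Span` is only a plain-Java stand-in for the cTAKES `Segment` type, and the sample text is made up. The offset arithmetic mirrors the snippet in the issue, and the example uses one 5-character and one 7-character section id to show that the match does not depend on the id length.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of cTAKES.
class SectionTagDemo {

    // Pattern from the issue, with the mail-wrapped string literal rejoined.
    static final Pattern SECTION_PATTERN = Pattern.compile(
        "(\\[start section id=\"?(.*?)\"?\\]).*?(\\[end section id=\"?(.*?)\"?\\])",
        Pattern.DOTALL);

    // Stand-in for a Segment annotation: section id plus [begin, end) offsets.
    static final class Span {
        final String id;
        final int begin;
        final int end;

        Span(String id, int begin, int end) {
            this.id = id;
            this.begin = begin;
            this.end = end;
        }
    }

    // Returns one Span per bracketed section, covering only the text
    // between the start tag and the end tag.
    static List<Span> findSections(String text) {
        List<Span> spans = new ArrayList<>();
        Matcher matcher = SECTION_PATTERN.matcher(text);
        while (matcher.find()) {
            int begin = matcher.start() + matcher.group(1).length(); // skip start tag
            int end = matcher.end() - matcher.group(3).length();     // stop before end tag
            spans.add(new Span(matcher.group(2), begin, end));
        }
        return spans;
    }

    public static void main(String[] args) {
        // Made-up sample document with a 5-character and a 7-character id.
        String text =
            "[start section id=\"20101\"]Short id.[end section id=\"20101\"]\n"
            + "[start section id=\"HISTORY\"]Long id.[end section id=\"HISTORY\"]";
        for (Span span : findSections(text)) {
            System.out.println(span.id + " -> '" + text.substring(span.begin, span.end) + "'");
        }
        // prints:
        // 20101 -> 'Short id.'
        // HISTORY -> 'Long id.'
    }
}
```

Because the id is captured by the regular expression rather than read with a fixed width, no part of the tag (such as a trailing "]") ends up in the section text in this sketch.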