RE: BsvRegexSectionizer breaks my pipeline [EXTERNAL]

Finan, Sean Thu, 01 Mar 2018 13:22:22 -0800

Hi Sean,

It looks like you are not using the standard regex file:
> SectionsBsv=E:\ctakes\apache-ctakes-4.0.0\DefaultSectionRegex.bsv


Is it possible that there is a poorly-formed regex?
The sectionizer should time out any regex that takes longer than a few seconds 
to complete, but it is possible that something in the timeout isn't working.  
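
To illustrate the idea of a per-regex timeout, here is a minimal Java sketch. 
It is not the cTAKES implementation: the class RegexTimeoutSketch, the method 
findWithTimeout and the sample patterns are all hypothetical. It assumes that 
a runaway match can be stopped by interrupting the worker thread, which is 
why the text is wrapped in a CharSequence that checks the interrupt flag.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

// Hypothetical sketch, not cTAKES code: run one regex with a time limit.
public class RegexTimeoutSketch {

    // Wrapper that makes matching abort once the worker thread is interrupted.
    static final class InterruptibleText implements CharSequence {
        private final CharSequence text;

        InterruptibleText(CharSequence text) {
            this.text = text;
        }

        @Override
        public char charAt(int index) {
            if (Thread.currentThread().isInterrupted()) {
                throw new IllegalStateException("regex matching interrupted");
            }
            return text.charAt(index);
        }

        @Override
        public int length() {
            return text.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new InterruptibleText(text.subSequence(start, end));
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    // Returns the result of find(), or null if matching did not complete in time.
    static Boolean findWithTimeout(Pattern pattern, CharSequence text, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Callable<Boolean> task = () -> pattern.matcher(new InterruptibleText(text)).find();
        Future<Boolean> result = executor.submit(task);
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            result.cancel(true); // interrupt the worker so the match stops
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt
            return null;
        } catch (ExecutionException e) {
            return null; // the match itself failed, e.g. with a stack overflow
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        String note = "HISTORY: none\nIMPRESSION: normal chest";
        // Made-up section-header patterns, not taken from any real BSV file.
        // The second one nests quantifiers, a common cause of heavy backtracking.
        String[] regexes = { "^IMPRESSION:", "^(\\w+\\s?)+:$" };
        for (String regex : regexes) {
            Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
            long start = System.nanoTime();
            Boolean found = findWithTimeout(pattern, note, 2000);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(regex + " -> "
                    + (found == null ? "did not complete" : found)
                    + " in " + elapsedMs + " ms");
        }
    }
}
```

Running each regex from the custom .bsv against one slow document in a 
harness like this is one way to find out which expression is the culprit.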

As an aside, I don't see a sentence annotator.  A whole lot of downstream 
annotators depend upon sentences, so you should add one.

Sean

-----Original Message-----
From: Mullane, Sean *HS [mailto:sp...@hscmail.mcc.virginia.edu] 
Sent: Thursday, March 01, 2018 3:59 PM
To: dev@ctakes.apache.org
Subject: BsvRegexSectionizer breaks my pipeline [EXTERNAL]

I am finding that the addition of BsvRegexSectionizer to my pipeline (below) 
has slowed it basically to a halt. Without the sectionizer added, I get ~1000 
documents/minute. With that line added, I ran the pipeline for an hour and got 
no documents annotated. Can anyone suggest what's going wrong here and how to 
fix it?

FWIW I created a .xml descriptor file and tested the sectionizer and .bsv in 
the CVD and it worked as expected with only a small-moderate decrease in speed.

Thanks,
Sean

//---------------------------------------------------------------------------------------------------------------
// Description: Commands and parameters to create a default plaintext document
// processing pipeline with UMLS lookup. Used for back-annotation of existing
// documents. This takes the top x documents not already existing in the
// ytex.dbo.document table.
//  Database Reader
//  Read documents from a database.
reader org.apache.ctakes.ytex.uima.DBCollectionReader queryGetDocumentKeys="EXECUTE Ytex.Rptg.uspSrc_cTAKES_get_rad_notes_from_batch_backanno /*@pipelineCount*/ _pipelineCount_ ,/*@pipelineNumber*/ _pipelineNumber_", queryGetDocument="EXEC YTEX.Rptg.uspSrc_cTAKES_single_rad_note /*@note_id*/ :instance_id"
// using stored procedures for flexibility and to work around buggy regex in
// PiperFileReader.java

//  Regex Sectionizer -- added for experiment
//  Annotates Document Sections by detecting Section Headers using Regular
//  Expressions provided in a Bar-Separated-Value (BSV) File.
#   SectionsBsv  path to a BSV file containing a list of regular expressions
#   and corresponding section types.
add org.apache.ctakes.core.ae.BsvRegexSectionizer SectionsBsv=E:\ctakes\apache-ctakes-4.0.0\DefaultSectionRegex.bsv

// Load a simple token processing pipeline from another pipeline file
load DefaultTokenizerPipeline.piper

// Add non-core annotators
add ContextDependentTokenizerAnnotator
addDescription POSTagger

// Add Chunkers
load ChunkerSubPipe.piper

// Default fast dictionary lookup
//add DefaultJCasTermAnnotator
// optional: this may improve recall of low-level concepts
add OverlapJCasTermAnnotator

// Add Cleartk Entity Attribute annotators
load AttributeCleartkSubPipe.piper

// Optional: this may allow ctakes to do better with finding specific forms of
// generic terms without needing to add all permutations to dictionary
//load RelationSubPipe

//  XMI Writer 3
//  Writes XMI files with full representation of input text and all extracted
//  information.
add org.apache.ctakes.ytex.uima.annotators.DBConsumer analysisBatch="Radiology_test_DefaultFastPipeline7" storeDocText=false storeCAS=false typesToIgnore=org.apache.ctakes.typesystem.type.textspan.Sentence,org.apache.ctakes.typesystem.type.syntax.ContractionToken,org.apache.ctakes.typesystem.type.syntax.NewlineToken,org.apache.ctakes.typesystem.type.syntax.NumToken,org.apache.ctakes.typesystem.type.syntax.PunctuationToken,org.apache.ctakes.typesystem.type.syntax.SymbolToken,org.apache.ctakes.typesystem.type.syntax.NP,org.apache.ctakes.typesystem.type.syntax.VP,org.apache.ctakes.typesystem.type.textsem.RomanNumeralAnnotation,org.apache.ctakes.typesystem.type.textsem.PersonTitleAnnotation,org.apache.ctakes.typesystem.type.syntax.WordToken,org.apache.ctakes.typesystem.type.syntax.TreebankNode,org.apache.ctakes.typesystem.type.syntax.TopTreebankNode,org.apache.ctakes.typesystem.type.syntax.TerminalTreebankNode
