RE: BsvRegexSectionizer breaks my pipeline [EXTERNAL]

Mullane, Sean *HS Mon, 05 Mar 2018 15:19:38 -0800

Sean,

Thanks for looking into this. Somehow the line with that seems to have gotten 
lost. I rechecked and made sure I had exactly one segment annotator in the 
pipeline (and a sentence annotator) and it seems that it was able to complete. 
So that's good!


However, I am getting only null segmentID values for the 
anno_disease_disorder_mention table (and the other similar tables) in the 
output database. This may be a DBConsumer-specific issue, as I was able to see 
segmentID values in the CVD. Still, any suggestions on how to remedy this would 
be much appreciated. I have been stepping through the Eclipse debugger to try 
to see what's going on, but it's hard for me to make much sense of it, not being 
particularly familiar with Java.

Thanks,
Sean

-----Original Message-----
From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu] 
Sent: Thursday, March 01, 2018 4:22 PM
To: dev@ctakes.apache.org
Subject: RE: BsvRegexSectionizer breaks my pipeline [EXTERNAL]

Hi Sean,

It looks like you are not using the standard regex file:
> SectionsBsv=E:\ctakes\apache-ctakes-4.0.0\DefaultSectionRegex.bsv

Is it possible that there is a poorly-formed regex?
The sectionizer should time out any regex that takes longer than a few seconds 
to complete, but it is possible that something in the timeout isn't working.  
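
For readers unfamiliar with how a regex can be timed out in Java at all, here 
is a minimal sketch of one common technique. It is not the cTAKES 
implementation, which may work differently. The names RegexTimeoutSketch, 
InterruptibleText and findWithin are hypothetical and exist only for this 
illustration. The match runs on a worker thread against a CharSequence wrapper 
that checks the thread's interrupt flag on every character access, so 
cancelling the Future stops a runaway match.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

// Hypothetical illustration only; not taken from the cTAKES source.
final class RegexTimeoutSketch {

    /** Wraps text so that regex matching aborts once the thread is interrupted. */
    static final class InterruptibleText implements CharSequence {
        private final CharSequence text;

        InterruptibleText(CharSequence text) {
            this.text = text;
        }

        @Override
        public char charAt(int index) {
            // The regex engine reads its input through charAt, so this check
            // runs repeatedly during matching, including while backtracking.
            if (Thread.currentThread().isInterrupted()) {
                throw new IllegalStateException("regex matching interrupted");
            }
            return text.charAt(index);
        }

        @Override
        public int length() {
            return text.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new InterruptibleText(text.subSequence(start, end));
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Returns the result of find(), or null if it did not finish in time. */
    static Boolean findWithin(Pattern pattern, String text, long timeoutMillis)
            throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Callable<Boolean> task =
                    () -> pattern.matcher(new InterruptibleText(text)).find();
            Future<Boolean> result = executor.submit(task);
            try {
                return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // Interrupt the worker; its next charAt call then throws.
                result.cancel(true);
                return null;
            } catch (ExecutionException e) {
                throw new IllegalStateException(e.getCause());
            }
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Pattern header = Pattern.compile("^FINDINGS:", Pattern.MULTILINE);
        String note = "HISTORY: none\nFINDINGS: normal";
        System.out.println(findWithin(header, note, 2000));
    }
}
```

A null return here means "gave up", which a caller could log together with the 
offending pattern. Running each expression from a custom .bsv through a check 
like this against a few sample notes is one way to spot a poorly-formed regex 
before putting it into the pipeline.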

As an aside, I don't see a sentence annotator.  A whole lot of downstream 
annotators depend upon sentences, so you should add one.

Sean

-----Original Message-----
From: Mullane, Sean *HS [mailto:sp...@hscmail.mcc.virginia.edu] 
Sent: Thursday, March 01, 2018 3:59 PM
To: dev@ctakes.apache.org
Subject: BsvRegexSectionizer breaks my pipeline [EXTERNAL]

I am finding that the addition of BsvRegexSectionizer to my pipeline (below) 
has slowed it basically to a halt. Without the sectionizer added, I get ~1000 
documents/minute. With that line added, I ran the pipeline for an hour and got 
no documents annotated. Can anyone suggest what's going wrong here and how to 
fix it?

FWIW I created a .xml descriptor file and tested the sectionizer and .bsv in 
the CVD, and it worked as expected with only a small-to-moderate decrease in speed.

Thanks,
Sean

//---------------------------------------------------------------------------------------------------------------
// Description: Commands and parameters to create a default plaintext document
// processing pipeline with UMLS lookup. Used for back-annotation of existing
// documents. This takes the top x documents not already existing in the
// ytex.dbo.document table.
//  Database Reader
//  Read documents from a database.
reader org.apache.ctakes.ytex.uima.DBCollectionReader queryGetDocumentKeys="EXECUTE Ytex.Rptg.uspSrc_cTAKES_get_rad_notes_from_batch_backanno /*@pipelineCount*/ _pipelineCount_ ,/*@pipelineNumber*/ _pipelineNumber_", queryGetDocument="EXEC YTEX.Rptg.uspSrc_cTAKES_single_rad_note /*@note_id*/ :instance_id"
// using stored procedures for flexibility and to work around buggy regex in
// PiperFileReader.java

//  Regex Sectionizer -- added for experiment
//  Annotates Document Sections by detecting Section Headers using Regular
//  Expressions provided in a Bar-Separated-Value (BSV) File.
#   SectionsBsv  path to a BSV file containing a list of regular expressions
#   and corresponding section types.
add org.apache.ctakes.core.ae.BsvRegexSectionizer SectionsBsv=E:\ctakes\apache-ctakes-4.0.0\DefaultSectionRegex.bsv

// Load a simple token processing pipeline from another pipeline file
load DefaultTokenizerPipeline.piper

// Add non-core annotators
add ContextDependentTokenizerAnnotator
addDescription POSTagger

// Add Chunkers
load ChunkerSubPipe.piper

// Default fast dictionary lookup
//add DefaultJCasTermAnnotator
// optional: this may improve recall of low-level concepts
add OverlapJCasTermAnnotator

// Add Cleartk Entity Attribute annotators
load AttributeCleartkSubPipe.piper

// Optional: this may allow ctakes to do better with finding specific forms of
// generic terms without needing to add all permutations to dictionary
//load RelationSubPipe

//  XMI Writer 3
//  Writes XMI files with full representation of input text and all extracted
//  information.
add org.apache.ctakes.ytex.uima.annotators.DBConsumer analysisBatch="Radiology_test_DefaultFastPipeline7" storeDocText=false storeCAS=false typesToIgnore=org.apache.ctakes.typesystem.type.textspan.Sentence,org.apache.ctakes.typesystem.type.syntax.ContractionToken,org.apache.ctakes.typesystem.type.syntax.NewlineToken,org.apache.ctakes.typesystem.type.syntax.NumToken,org.apache.ctakes.typesystem.type.syntax.PunctuationToken,org.apache.ctakes.typesystem.type.syntax.SymbolToken,org.apache.ctakes.typesystem.type.syntax.NP,org.apache.ctakes.typesystem.type.syntax.VP,org.apache.ctakes.typesystem.type.textsem.RomanNumeralAnnotation,org.apache.ctakes.typesystem.type.textsem.PersonTitleAnnotation,org.apache.ctakes.typesystem.type.syntax.WordToken,org.apache.ctakes.typesystem.type.syntax.TreebankNode,org.apache.ctakes.typesystem.type.syntax.TopTreebankNode,org.apache.ctakes.typesystem.type.syntax.TerminalTreebankNode
