Hi Greg,
We regularly process documents that are over 5,000 pages (not lines). What
we've found is that many of the annotators in the standard distribution
run in O(n^2) time; the standard dependency parser is one example among many.

The good news is that you can get near-linear performance by converting these
classes to use TreeMaps, since each lookup drops to O(log n). We build the tree
maps once per document and cache them in ThreadLocal variables, which lets us
process multiple documents on separate threads simultaneously.

Hope this helps,
John

-----Original Message-----
From: Greg Silverman <g...@umn.edu> 
Sent: Tuesday, September 24, 2019 6:47 PM
To: dev@ctakes.apache.org
Subject: [EXTERNAL] Large files taking forever to process

Any suggestions on how to speed up processing of large clinical text notes
approaching 13K lines? This is a very old corpus culled from EPIC notes back in
2009. I thought about splitting the notes into smaller chunks, but then I would
have to reconcile the offsets when comparing system output against the manual
annotations that have already been done.
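
(To make the offset issue concrete, something like this rough, untested
sketch is what I had in mind: record each chunk's start offset and add it
back to every span before scoring.)

// Rough sketch (untested): carry each chunk's start offset so
// chunk-relative spans can be mapped back to the original note.
class Chunk {
    final String text;
    final int docBegin;  // offset of this chunk in the original note

    Chunk(String text, int docBegin) {
        this.text = text;
        this.docBegin = docBegin;
    }

    // Convert a chunk-relative span back to document coordinates.
    int[] toDocSpan(int begin, int end) {
        return new int[] { docBegin + begin, docBegin + end };
    }
}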

So far, I've tried different garbage collection options (these seemed to work
well with CLAMP on the same set of notes).

TIA!

Greg--

--
Greg M. Silverman
Senior Systems Developer
NLP/IE <https://healthinformatics.umn.edu/research/nlpie-group>
Department of Surgery
University of Minnesota
g...@umn.edu

 ›  evaluate-it.org  ‹
