Hi Greg,

We regularly process documents that are over 5,000 pages (not lines). What we've found is that many of the annotators in the standard distribution run in O(n^2) time; the standard dependency parser is one example among many.

The good news is that you can get near-linear performance by converting these classes to use TreeMaps. We build the tree maps once per document and cache them in ThreadLocal variables, which lets multiple threads process documents simultaneously.
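To make that concrete, here is a rough sketch of the pattern, not our actual code; the class and method names are just for illustration, and it assumes uimaFIT's JCasUtil is on the classpath:

import java.util.Map;
import java.util.TreeMap;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

// Illustrative helper: index annotations by begin offset once per document,
// cached per thread, so offset lookups are O(log n) instead of a linear scan.
public final class OffsetIndex {

    // One index per worker thread, so parallel pipelines never contend on it.
    private static final ThreadLocal<TreeMap<Integer, Annotation>> INDEX =
            ThreadLocal.withInitial(TreeMap::new);

    private OffsetIndex() { }

    // Build (or rebuild) the calling thread's index for the current document.
    // Note: this keeps one annotation per begin offset; real code would want
    // a TreeMap<Integer, List<Annotation>> where offsets can collide.
    public static void build(final JCas jcas) {
        final TreeMap<Integer, Annotation> index = INDEX.get();
        index.clear();
        for (Annotation a : JCasUtil.select(jcas, Annotation.class)) {
            index.put(a.getBegin(), a);
        }
    }

    // O(log n) lookup: the annotation starting at or before the given offset.
    public static Annotation atOrBefore(final int offset) {
        final Map.Entry<Integer, Annotation> entry = INDEX.get().floorEntry(offset);
        return entry == null ? null : entry.getValue();
    }
}

Clearing and rebuilding the map at the start of each document keeps the ThreadLocal from leaking annotations across documents.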
Hope this helps,
John

-----Original Message-----
From: Greg Silverman <g...@umn.edu>
Sent: Tuesday, September 24, 2019 6:47 PM
To: dev@ctakes.apache.org
Subject: [EXTERNAL] Large files taking forever to process

Any suggestions on how to speed up processing large clinical text notes approaching 13K lines? This is a very old corpus culled from EPIC notes back in 2009. I thought about splitting the notes into smaller chunks, but then I would have to deal with the offsets when comparing system output against the manual annotations that had already been done. As it is, I've tried different garbage collection options (this seemed to work well with CLAMP on the same set of notes).

TIA!

Greg

--
Greg M. Silverman
Senior Systems Developer
NLP/IE <https://healthinformatics.umn.edu/research/nlpie-group>
Department of Surgery
University of Minnesota
g...@umn.edu
› evaluate-it.org ‹