I'm working on a solution for running cTAKES on Amazon EMR with Apache Spark, so that I can run multiple cTAKES instances in parallel to process a large batch of clinical notes. The catch is that cTAKES requires CTAKES_HOME to be set on every node so it can locate its model files and other resources.
So I need to store cTAKES in a shared location that every node can point CTAKES_HOME at. In EMR the usual choice for shared storage is S3, but cTAKES appears to depend on a hierarchical file structure for loading its resources (model files, dictionary files, etc.), which S3's flat object store doesn't provide. My current workaround is to put cTAKES on EFS and mount it on every node. Is there a better way to integrate cTAKES with EMR? I know there are non-EMR approaches to parallelizing cTAKES, but those technologies may not be available to me. In particular, is there a good way to avoid EFS, such as storing cTAKES on S3 directly? Modifying cTAKES to load its resources from S3's flat namespace via the S3 API looks like a pretty big task.
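For context, the EFS-based approach looks roughly like this on the Spark side. This is a sketch, not my exact code: the mount point `/mnt/efs/apache-ctakes-4.0.0`, the piper file name, and the `process_partition` helper are all illustrative, and the `PiperFileRunner` flags (`-p`, `-i`, `-o`) are what I understand cTAKES 4.x to accept.

```python
import os
import subprocess

# Illustrative EFS mount point; each node's bootstrap action would mount EFS
# here and export CTAKES_HOME to match.
DEFAULT_CTAKES_HOME = "/mnt/efs/apache-ctakes-4.0.0"

def build_ctakes_command(piper_file, input_dir, output_dir, ctakes_home=None):
    """Build the command line for cTAKES 4.x's PiperFileRunner, which takes
    -p (piper file), -i (input directory), and -o (output directory)."""
    home = ctakes_home or os.environ.get("CTAKES_HOME", DEFAULT_CTAKES_HOME)
    return [
        "java",
        "-cp", os.path.join(home, "lib", "*"),  # cTAKES jars on the shared mount
        "org.apache.ctakes.core.pipeline.PiperFileRunner",
        "-p", piper_file,
        "-i", input_dir,
        "-o", output_dir,
    ]

def process_partition(note_dirs):
    """Worker invoked once per Spark partition (e.g. via rdd.mapPartitions),
    so each executor runs its own cTAKES process against the shared install."""
    home = os.environ["CTAKES_HOME"]  # set by the bootstrap action on every node
    for in_dir, out_dir in note_dirs:
        subprocess.run(
            # Piper file name is illustrative.
            build_ctakes_command("pipers/DefaultPipeline.piper", in_dir, out_dir),
            cwd=home,  # cTAKES resolves resources relative to its home directory
            check=True,
        )
        yield out_dir
```

The point is that every executor just reads the same CTAKES_HOME from its environment and sees the same hierarchical directory tree on EFS, which is exactly what I'd lose by moving to S3.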