Hi John,

I am not completely sure that I understand what you are asking, and I think
that this is more of an S3 question than a cTAKES question, but here are a
few comments:

> the cTAKES part of it relies on CTAKES_HOME being set
- Is this requirement on your side? I never bother to set CTAKES_HOME.

> So I need to store cTAKES in a shared location
- I am not sure why you need to do this when it is possible to spin up multiple
machines, each with its own local cTAKES "installation."
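
To illustrate what I mean (a minimal sketch, not what I actually ran back then):
assume the installation is kept as objects under one key prefix and is copied to
local disk when a node starts. The dict below only stands in for a bucket, and
`materialize`, the "ctakes/" prefix and the file names are all made up for the
example; none of them come from cTAKES or from an AWS library.

```python
import os
import tempfile

def materialize(objects, prefix, dest_dir):
    """Write every object whose key starts with prefix under dest_dir,
    turning the '/' separators in the key into real directories."""
    written = []
    for key, data in sorted(objects.items()):
        if not key.startswith(prefix):
            continue
        relative = key[len(prefix):]
        if not relative or relative.endswith("/"):
            continue  # skip empty "folder marker" keys
        local_path = os.path.join(dest_dir, *relative.split("/"))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        with open(local_path, "wb") as fh:
            fh.write(data)
        written.append(local_path)
    return written

# Hypothetical bucket contents; a real one would be read through an S3 client.
bucket = {
    "ctakes/resources/models/example.bin": b"model bytes",
    "ctakes/resources/dictionary/example.script": b"dictionary bytes",
    "notes/note1.txt": b"not part of the installation",
}

local_root = tempfile.mkdtemp()
paths = materialize(bucket, "ctakes/", local_root)
```

Each node would then have an ordinary directory tree under `local_root`, so
anything that expects a hierarchical file layout sees one.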

> Usually, in EMR, you would use S3 for this 
- This seems to be quite a blanket statement.

> cTAKES relies on a hierarchical file structure
- ok ...

> such as storing cTAKES on S3 instead
- I have [essentially] done this.  If I remember correctly I didn't need to 
venture too far outside my comfort zone.

> altering cTAKES to work with a flat file structure using the S3
- I haven't touched it for many years, but as I recall the flat file structure
is essentially internal to S3, and files can still be referenced via a complete
"hierarchical path". It is just that the whole path, for example
"bob/likes/ice.cream", is the name of the object.

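A small sketch of the flat-name point above, in plain Python so it runs without
any AWS account: a flat list of keys can be browsed like directories by
filtering on a prefix and splitting on a delimiter. `list_dir` is a made-up
helper for illustration, not an S3 call, although S3's own listing takes a
prefix and a delimiter in a similar way.

```python
def list_dir(keys, prefix, delimiter="/"):
    """Mimic a delimiter-based listing over a flat set of keys.
    Returns (files, subdirs) found directly under prefix."""
    files, subdirs = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Something deeper exists: report only the first path segment.
            subdirs.add(prefix + head + delimiter)
        elif rest:
            files.append(key)
    return sorted(files), sorted(subdirs)

# The keys are just strings; the "directories" exist only by naming convention.
keys = [
    "bob/likes/ice.cream",
    "bob/likes/cake",
    "bob/dislikes/kale",
    "alice/readme",
]

files, subdirs = list_dir(keys, "bob/")
# files == [], subdirs == ['bob/dislikes/', 'bob/likes/']

files2, subdirs2 = list_dir(keys, "bob/likes/")
# files2 == ['bob/likes/cake', 'bob/likes/ice.cream'], subdirs2 == []
```
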
Again, I haven't needed to work with this for about 5 years, so what I did 
might be completely irrelevant.  I would hope that implementation is now 
simpler, examples more prevalent and documentation better than back in the day.

Sean

________________________________________
From: John Doe <lucanus...@gmail.com>
Sent: Sunday, July 25, 2021 3:28 PM
To: dev@ctakes.apache.org
Subject: Can you store cTAKES in an S3 bucket so you can use it with EMR for 
parallel processing? [EXTERNAL]

I'm working on a solution for running cTAKES in an Amazon EMR environment
with Apache Spark so I can run multiple instances of cTAKES in parallel for
processing a bunch of notes. However, the cTAKES part of it relies on
CTAKES_HOME being set on every machine for locating model files and such.
So I need to store cTAKES in a shared location so every node can set
CTAKES_HOME to that location. Usually, in EMR, you would use S3 for this
but it seems that cTAKES relies on a hierarchical file structure for
loading in files (model files, dictionary files, etc.). My current solution
uses EFS as an alternative. Is there a better alternative to this approach
to getting cTAKES integrated with EMR? I know there are alternative non-EMR
approaches to parallelizing cTAKES, but I may not have those technologies
available. I'm wondering if there is a good way around using EFS such as
storing cTAKES on S3 instead, but it seems like altering cTAKES to work
with a flat file structure using the S3 API may be a pretty big task.