Hi John,

I am not completely sure that I understand what you are asking, and I think that this is more of an S3 question than a cTAKES question, but here are a couple of comments:

> the cTAKES part of it relies on CTAKES_HOME being set
- Is this requirement on your side? I never bother to set CTAKES_HOME.

> So I need to store cTAKES in a shared location
- I am not sure why you need to do this when it is possible to spin up multiple machines, each with its own cTAKES "installation."

> Usually, in EMR, you would use S3 for this
- This seems to be quite a blanket statement.

> cTAKES relies on a hierarchical file structure
- ok ...

> such as storing cTAKES on S3 instead
- I have [essentially] done this. If I remember correctly, I didn't need to venture too far outside my comfort zone.

> altering cTAKES to work with a flat file structure using the S3
- I haven't touched it for many years, but the flat file structure was essentially internal to S3, and files can still be referenced via a complete "hierarchical path" - it is just that the filename is "bob/likes/ice.cream".

Again, I haven't needed to work with this for about 5 years, so what I did might be completely irrelevant. I would hope that implementation is now simpler, examples more prevalent and documentation better than back in the day.

Sean

________________________________________
From: John Doe <lucanus...@gmail.com>
Sent: Sunday, July 25, 2021 3:28 PM
To: dev@ctakes.apache.org
Subject: Can you store cTAKES in an S3 bucket so you can use it with EMR for parallel processing? [EXTERNAL]

I'm working on a solution for running cTAKES in an Amazon EMR environment with Apache Spark so I can run multiple instances of cTAKES in parallel for processing a bunch of notes. However, the cTAKES part of it relies on CTAKES_HOME being set on every machine for locating model files and such. So I need to store cTAKES in a shared location so every node can set CTAKES_HOME to that location. Usually, in EMR, you would use S3 for this but it seems that cTAKES relies on a hierarchical file structure for loading in files (model files, dictionary files, etc.).

My current solution uses EFS as an alternative. Is there a better alternative to this approach to getting cTAKES integrated with EMR? I know there are alternative non-EMR approaches to parallelizing cTAKES, but I may not have those technologies available. I'm wondering if there is a good way around using EFS such as storing cTAKES on S3 instead, but it seems like altering cTAKES to work with a flat file structure using the S3 API may be a pretty big task.
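
As a rough illustration of the point in the reply above that S3's flat keys still carry a complete "hierarchical path" (the "bob/likes/ice.cream" example), here is a minimal Python sketch that maps object keys under a prefix onto a nested local directory that each node could then treat as its own cTAKES "installation." The function names, the `ctakes/` prefix, the `/opt/ctakes` destination and the sample keys are all hypothetical, and this is not necessarily what was done in the setup described in the reply. Listing and downloading the objects with an S3 client is deliberately left out.

```python
from pathlib import Path, PurePosixPath


def local_path_for_key(key: str, prefix: str, dest_root: str) -> Path:
    """Map a flat S3 object key to a nested path under dest_root.

    The key is just a string; the '/' characters inside it are what
    give the appearance of a directory tree.
    """
    if not key.startswith(prefix):
        raise ValueError(f"key {key!r} is not under prefix {prefix!r}")
    relative = key[len(prefix):].lstrip("/")
    parts = PurePosixPath(relative).parts
    # Reject empty names and any attempt to climb out of dest_root.
    if not parts or ".." in parts:
        raise ValueError(f"key {key!r} has no usable relative path")
    return Path(dest_root).joinpath(*parts)


def plan_local_layout(keys, prefix: str, dest_root: str) -> dict:
    """Return {key: local path} for every key that names a real object.

    Keys ending in '/' are zero-byte "folder" placeholders that some
    tools create; they are skipped.
    """
    return {
        key: local_path_for_key(key, prefix, dest_root)
        for key in keys
        if not key.endswith("/")
    }


if __name__ == "__main__":
    # Hypothetical keys, as an S3 listing under "ctakes/" might return them.
    sample_keys = [
        "ctakes/",
        "ctakes/bob/likes/ice.cream",
        "ctakes/resources/example/model.bin",
    ]
    for key, path in plan_local_layout(sample_keys, "ctakes/", "/opt/ctakes").items():
        print(f"{key} -> {path}")
```

Each node would create the parent directories, fetch every object to its planned path, and then point CTAKES_HOME (if it is needed at all) at the local root, so no shared file system is involved.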