Hello,

Thanks for the response. The reason we are using a shared location for ctakes is so that we have everything in one place. If we need to add our own components, dictionaries, etc., we can do it all in one spot. It also saves us from having to download ctakes onto every machine each time we start up a cluster.

I didn't know the regular Java file API would still work with S3, but I will have to give that a try. I am relying on CTAKES_HOME being set because ctakes is stored on EFS, so a node wouldn't be able to find it on its own local file system. I'm basically mounting the EFS folder holding ctakes onto each node and setting CTAKES_HOME to that mount point so ctakes can find all the files it needs.

For us anyway, S3 has come up as the primary means of storage for EMR, and I'm not sure whether EFS will be available, which is why I'm trying to see if I can do it on S3.
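A rough sketch of what the S3 route might look like, under stated assumptions: because an S3 object key such as `apps/ctakes/resources/dict/example.txt` already reads like a path, each node could copy everything under one prefix into a local directory tree and point CTAKES_HOME at that tree. The snippet below is only an illustration in Python and has not been run on EMR. The prefix, the example keys, and both function names are hypothetical, and the actual download step (for instance boto3's `download_file`) is deliberately left out so the code runs without AWS access.

```python
from pathlib import Path, PurePosixPath


def local_path_for_key(key: str, prefix: str, ctakes_home: Path) -> Path:
    """Map a flat S3 object key to a hierarchical path under ctakes_home."""
    # Drop the (hypothetical) prefix; what remains is a relative, slash-separated path.
    relative = PurePosixPath(key).relative_to(PurePosixPath(prefix))
    return ctakes_home.joinpath(*relative.parts)


def plan_local_layout(keys, prefix: str, ctakes_home: Path):
    """Create the parent directories for each key and return the target paths."""
    targets = []
    for key in keys:
        if key.endswith("/"):
            continue  # skip zero-byte "folder" marker objects
        target = local_path_for_key(key, prefix, ctakes_home)
        target.parent.mkdir(parents=True, exist_ok=True)
        # A real bootstrap step would download the object to `target` here.
        targets.append(target)
    return targets
```

Once the files are in place, each node would set CTAKES_HOME to the local directory passed as `ctakes_home`, in the same way it is set to the EFS mount point today.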
On Mon, Aug 2, 2021 at 11:41 AM Finan, Sean
<sean.fi...@childrens.harvard.edu> wrote:

> Hi John,
>
> I am not completely sure that I understand what you are asking, and I
> think that this is more of an s3 question than a ctakes question, but here
> are a couple of comments:
>
> > the cTAKES part of it relies on CTAKES_HOME being set
> - Is this requirement on your side? I never bother to set CTAKES_HOME.
>
> > So I need to store cTAKES in a shared location
> - I am not sure why you need to do this when it is possible to spin up
> multiple machines, each with its own ctakes "installation."
>
> > Usually, in EMR, you would use S3 for this
> - This seems to be quite a blanket statement
>
> > cTAKES relies on a hierarchical file structure
> - ok ...
>
> > such as storing cTAKES on S3 instead
> - I have [essentially] done this. If I remember correctly I didn't need
> to venture too far outside my comfort zone.
>
> > altering cTAKES to work with a flat file structure using the S3
> - I haven't touched it for many years, but the flat file structure was
> essentially internal to s3 and files can still be referenced via a complete
> "hierarchical path" - it is just that the filename is "bob/likes/ice.cream"
>
> Again, I haven't needed to work with this for about 5 years, so what I did
> might be completely irrelevant. I would hope that implementation is now
> simpler, examples more prevalent and documentation better than back in the
> day.
>
> Sean
>
> ________________________________________
> From: John Doe <lucanus...@gmail.com>
> Sent: Sunday, July 25, 2021 3:28 PM
> To: dev@ctakes.apache.org
> Subject: Can you store cTAKES in an S3 bucket so you can use it with EMR
> for parallel processing? [EXTERNAL]
>
> * External Email - Caution *
>
>
> I'm working on a solution for running cTAKES in an Amazon EMR environment
> with Apache Spark so I can run multiple instances of cTAKES in parallel for
> processing a bunch of notes. However, the cTAKES part of it relies on
> CTAKES_HOME being set on every machine for locating model files and such.
> So I need to store cTAKES in a shared location so every node can set
> CTAKES_HOME to that location. Usually, in EMR, you would use S3 for this
> but it seems that cTAKES relies on a hierarchical file structure for
> loading in files (model files, dictionary files, etc.). My current solution
> uses EFS as an alternative. Is there a better alternative to this approach
> to getting cTAKES integrated with EMR? I know there are alternative non-EMR
> approaches to parallelizing cTAKES, but I may not have those technologies
> available. I'm wondering if there is a good way around using EFS such as
> storing cTAKES on S3 instead, but it seems like altering cTAKES to work
> with a flat file structure using the S3 API may be a pretty big task.