[ https://issues.apache.org/jira/browse/CTAKES-374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Finan updated CTAKES-374:
------------------------------
    Priority: Minor  (was: Major)

> Scaleout of cTAKES pipeline
> ---------------------------
>
> Key: CTAKES-374
> URL: https://issues.apache.org/jira/browse/CTAKES-374
> Project: cTAKES
> Issue Type: New Feature
> Affects Versions: future enhancement
> Reporter: Selina Chu
> Priority: Minor
> Fix For: 3.2.1
>
>
> Currently, cTAKES can't be easily deployed in an asynchronous manner. UIMA
> components aren't serializable (and thus cTAKES' components as well). Would
> like to come up with better ways to allow cTAKES to be easily run in a
> distributed fashion.
> For example, for processing a long document (e.g. 10+ pages), cTAKES would
> take a long time to process.
> I would like to see a feature where we can partition the input to cTAKES, in
> a way that won't affect the cTAKES annotation performance, allowing us to
> process through a cluster running in distributed mode (e.g. Spark streaming
> cTAKES). And then recombine the results such that the word/phrase token
> positions will be sequentially ordered.
> We have a simple implementation of the ClinicalPipelineFactory with Spark
> Streaming. Currently our initial attempt in partitioning is by paragraphs.
> For example, we are doing something like:
> RDD.map(a_single_paragraph.process_in_ctakes())
> I also wanted to see if there are any better ways of doing this.
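
To make the partition-and-recombine idea in the description concrete, here is a minimal Python sketch. It is not the reporter's implementation and does not call cTAKES or Spark: `split_paragraphs`, `annotate_paragraph`, `process_partition`, and `annotate_document` are hypothetical names, and `annotate_paragraph` is a whitespace tokenizer standing in for a real cTAKES pipeline run. The point it illustrates is that each paragraph carries its starting offset in the original document, so the spans produced per paragraph can be shifted back to document coordinates and sorted after the parallel step.

```python
import re
from typing import List, Tuple

# (begin, end, covered_text), with offsets into whatever text was annotated
Span = Tuple[int, int, str]


def split_paragraphs(text: str) -> List[Tuple[int, str]]:
    """Split on blank lines, keeping each paragraph's offset in the document."""
    parts = []
    start = 0
    for sep in re.finditer(r"\n\s*\n", text):
        if sep.start() > start:
            parts.append((start, text[start:sep.start()]))
        start = sep.end()
    if start < len(text):
        parts.append((start, text[start:]))
    return parts


def annotate_paragraph(paragraph: str) -> List[Span]:
    """Hypothetical stand-in for a cTAKES run: whitespace tokens with
    offsets relative to the paragraph, not the document."""
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(r"\S+", paragraph)]


def process_partition(part: Tuple[int, str]) -> List[Span]:
    """Annotate one paragraph and shift its spans to document offsets."""
    offset, paragraph = part
    return [(begin + offset, end + offset, covered)
            for begin, end, covered in annotate_paragraph(paragraph)]


def annotate_document(text: str) -> List[Span]:
    """Partition, process each part independently, then recombine in order."""
    # The built-in map stands in for the distributed map step.
    results = map(process_partition, split_paragraphs(text))
    merged = [span for spans in results for span in spans]
    return sorted(merged)  # sequential order by begin offset


doc = "Patient has fever.\n\nNo cough noted."
for begin, end, covered in annotate_document(doc):
    print(begin, end, covered)
```

Under PySpark the same shape would presumably be `sc.parallelize(split_paragraphs(text)).flatMap(process_partition).collect()` followed by a sort. Whether splitting on paragraphs leaves annotation quality unchanged depends on the annotators in the pipeline; components that use context across paragraph boundaries may behave differently on partitioned input.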