[ 
https://issues.apache.org/jira/browse/CTAKES-374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Finan updated CTAKES-374:
------------------------------
    Priority: Minor  (was: Major)

> Scaleout of cTAKES pipeline
> ---------------------------
>
>                 Key: CTAKES-374
>                 URL: https://issues.apache.org/jira/browse/CTAKES-374
>             Project: cTAKES
>          Issue Type: New Feature
>    Affects Versions: future enhancement
>            Reporter: Selina Chu
>            Priority: Minor
>             Fix For: 3.2.1
>
>
> Currently, cTAKES cannot easily be deployed asynchronously. UIMA components
> are not serializable, and therefore neither are cTAKES components. I would
> like to find better ways to run cTAKES in a distributed fashion.
> For example, cTAKES takes a long time to process a long document (e.g. 10+
> pages).
> I would like to see a feature that partitions the input to cTAKES in a way
> that does not degrade cTAKES annotation performance, so that the partitions
> can be processed on a cluster running in distributed mode (e.g. cTAKES on
> Spark Streaming), and that then recombines the results so that the
> word/phrase token positions are sequentially ordered.
> We have a simple implementation of the ClinicalPipelineFactory with Spark
> Streaming. Our initial attempt partitions the input by paragraph.
> For example, we are doing something like:
> RDD.map(paragraph -> process_in_ctakes(paragraph))
> I would also like to know whether there are better ways of doing this.
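
The partition-and-recombine idea in the description can be sketched as follows. This is only an illustration under stated assumptions, not the reporter's implementation and not a cTAKES or Spark API: `split_paragraphs`, `process_chunk`, `annotate_document`, and `toy_annotator` are hypothetical names, and `toy_annotator` is a stand-in for a real pipeline call. The sketch assumes paragraphs are separated by blank lines and that no annotation spans a paragraph boundary. Each paragraph carries its start offset, so annotations produced in isolation can be shifted back into document coordinates and sorted.

```python
import re
from typing import Callable, List, NamedTuple


class Annotation(NamedTuple):
    begin: int  # character offset, inclusive
    end: int    # character offset, exclusive
    text: str   # covered text


class Chunk(NamedTuple):
    offset: int  # where this paragraph starts in the full document
    text: str


def split_paragraphs(document: str) -> List[Chunk]:
    """Split on blank lines, remembering each paragraph's start offset."""
    chunks = []
    pos = 0
    for sep in re.finditer(r"\n\s*\n", document):
        if sep.start() > pos:
            chunks.append(Chunk(pos, document[pos:sep.start()]))
        pos = sep.end()
    if pos < len(document):
        chunks.append(Chunk(pos, document[pos:]))
    return chunks


def toy_annotator(text: str) -> List[Annotation]:
    """Hypothetical stand-in for a pipeline call: one annotation per word."""
    return [Annotation(m.start(), m.end(), m.group())
            for m in re.finditer(r"\w+", text)]


def process_chunk(chunk: Chunk,
                  annotate: Callable[[str], List[Annotation]]) -> List[Annotation]:
    """Annotate one paragraph, then shift offsets into document coordinates."""
    return [Annotation(a.begin + chunk.offset, a.end + chunk.offset, a.text)
            for a in annotate(chunk.text)]


def annotate_document(document: str,
                      annotate: Callable[[str], List[Annotation]] = toy_annotator
                      ) -> List[Annotation]:
    chunks = split_paragraphs(document)
    # This map is the step a cluster would distribute (e.g. an RDD map).
    per_chunk = map(lambda c: process_chunk(c, annotate), chunks)
    merged = [a for annotations in per_chunk for a in annotations]
    # Recombine so token positions are sequentially ordered.
    return sorted(merged, key=lambda a: (a.begin, a.end))
```

Because the pipeline components are not serializable, a Spark version of the map step would probably need to construct the pipeline on each worker (for example once per partition via `mapPartitions`) rather than ship it from the driver. Annotators that rely on context from other paragraphs would not fit the independence assumption made here.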



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
