There is a point in LvgCmdApiResourceImpl where it changes the working directory so that LVG can find the config file. I have no idea how this is supposed to work on Spark, but I'd guess that using relative paths in your config is going to be a problem.
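To make the working-directory concern concrete: a relative path in a config file resolves against whatever the JVM's current directory happens to be at the moment it is used, not against the location of the config file itself. A small, Spark-free sketch (the path is hypothetical, just shaped like an LVG resource path):

```java
import java.io.File;

public class RelativePathDemo {
    public static void main(String[] args) {
        // Hypothetical relative path, like one an LVG config file might contain.
        File rel = new File("data/config/lvg.properties");

        // A relative File is resolved against the JVM's current working
        // directory (the "user.dir" system property), so the same config
        // entry points at different places depending on where the JVM runs.
        System.out.println("user.dir    = " + System.getProperty("user.dir"));
        System.out.println("resolves to = " + rel.getAbsolutePath());

        // On a Spark executor, user.dir is the executor's scratch directory,
        // so unless the resources were shipped into exactly that layout the
        // relative path points at nothing.
    }
}
```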
There is also a point in LvgCmdApiResourceImpl where it converts a URI to a File instance. You should check whether that is ending up with an hdfs URL, and if so whether it is doing the right thing. I would make sure that you have all the logging coming out of LvgCmdApiResourceImpl and check that the paths are correct. You could also look at my patch on https://issues.apache.org/jira/browse/CTAKES-501, which includes some additional logging in this area.

HTH,
Ewan.

On Tue, May 01, 2018 at 04:40:16PM +0000, Eskala, Nagakalyana wrote:
> A further update on the issue:
>
> We have extracted the lvg-related files in the exact folder structure, and
> are copying all the folders recursively into the Spark executor working
> directory using the addFile option. But the LvgAnnotator is not able to find
> the lvg.properties file on the classpath of the Spark executor, even though
> we have set it up using the spark.executor.extraClassPath configuration
> option.
>
> Code snippet:
> sc.addFile("hdfs:///ctakes_4.0.0/resources", true);
> sparkConf.set("spark.executor.extraClassPath", "./resources/");
> sparkConf.set("spark.driver.extraClassPath", "./resources/");
>
> From: Eskala, Nagakalyana
> Sent: Monday, April 30, 2018 8:50 PM
> To: 'dev@ctakes.apache.org' <dev@ctakes.apache.org>
> Subject: cTakes on Apache Spark - Error
>
> Background:
> We are trying to run the Apache cTAKES default clinical pipeline in a Spark
> Streaming application. We intend to parse all input text sent to a socket in
> Spark Streaming by executing a default clinical pipeline in the individual
> executors of a Spark application.
>
> Challenges:
> The cTAKES pipeline requires external resources to be available on the
> classpath. We have used JavaSparkContext.addFile to provide all the
> resources (dictionaries) recursively from HDFS to each individual executor
> working directory.
> Once addFile has copied the resources to each executor, we try to include
> them on the classpath of each executor using the configuration:
>
> sc.addFile("hdfs:///ctakes_4.0.0/resources", true);
> sparkConf.set("spark.executor.extraClassPath", "./resources/");
> sparkConf.set("spark.driver.extraClassPath", "./resources/");
>
> Error:
> The error occurs in the LvgAnnotator class, which tries to access the
> lvg.properties file via a lookup. It is not able to locate the file, and
> hence there is an error.
>
> 18/04/30 15:55:50 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, executor driver, partition 0, ANY, 4744 bytes)
> 18/04/30 15:55:50 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 1)
> 18/04/30 15:55:51 INFO ae.LvgAnnotator: URL==null
> 18/04/30 15:55:51 INFO ae.LvgAnnotator: Unable to find org/apache/ctakes/lvg/data/config/lvg.properties.
> 18/04/30 15:55:51 INFO ae.LvgAnnotator: Copying files and directories to under /tmp/
> 18/04/30 15:55:51 INFO ae.LvgAnnotator: Copying lvg-related file to /tmp/data/config/lvg.properties
> 18/04/30 15:55:51 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.NullPointerException
>     at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
>     at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
>     at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
>     at org.apache.commons.io.FileUtils.copyInputStreamToFile(FileUtils.java:1512)
>     at org.apache.ctakes.lvg.ae.LvgAnnotator.copyLvgFiles(LvgAnnotator.java:620)
>     at org.apache.ctakes.lvg.ae.LvgAnnotator.createAnnotatorDescription(LvgAnnotator.java:649)
>     at org.apache.ctakes.clinicalpipeline.ClinicalPipelineFactory.getTokenProcessingPipeline(ClinicalPipelineFactory.java:110)
>     at org.apache.ctakes.clinicalpipeline.ClinicalPipelineFactory.getDefaultPipeline(ClinicalPipelineFactory.java:68)
>
> Question:
> Ideally, since the resources folder has been recursively added to each
> executor node and the classpath has been set, the internal executor should
> be able to locate the properties and other resource files. However, that is
> not the case. Is there something we should be doing differently
> (configuration, classpath, etc.) so that the cTAKES pipeline can be run in a
> Spark executor with all the resources and the classpath set appropriately?
>
> Thanks for the help.
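For anyone hitting the same trace: both failure modes mentioned earlier in the thread can be reproduced in a few lines of plain Java, without Spark or cTAKES. java.io.File only understands file: URIs, and a classpath miss returns null rather than throwing (the "URL==null" log line), which is what ultimately feeds the NullPointerException inside IOUtils.copyLarge. A minimal sketch; the hdfs path is just the one from the snippet above:

```java
import java.io.File;
import java.io.InputStream;
import java.net.URI;

public class LvgLookupRepro {
    public static void main(String[] args) {
        // 1. File(URI) only accepts the "file" scheme; an hdfs:// URI throws
        //    IllegalArgumentException rather than "doing the right thing",
        //    which is worth ruling out in LvgCmdApiResourceImpl.
        try {
            new File(URI.create("hdfs:///ctakes_4.0.0/resources/lvg.properties"));
            System.out.println("hdfs URI accepted (unexpected)");
        } catch (IllegalArgumentException e) {
            System.out.println("hdfs URI rejected: " + e.getMessage());
        }

        // 2. When a resource is not on the classpath, getResourceAsStream
        //    returns null instead of throwing. Handing that null stream to
        //    FileUtils.copyInputStreamToFile produces exactly the NPE inside
        //    IOUtils.copyLarge shown in the stack trace above.
        InputStream in = LvgLookupRepro.class.getClassLoader()
                .getResourceAsStream("org/apache/ctakes/lvg/data/config/lvg.properties");
        System.out.println("classpath lookup: "
                + (in == null ? "null stream -> NPE downstream" : "found"));
    }
}
```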