There is a point in LvgCmdApiResourceImpl where it changes the working directory so that LVG can find the config file. I have no idea how this is supposed to work on Spark, but I'd guess that using relative paths in your config is going to be a problem.
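To make the working-directory concern concrete: a relative path in a config file resolves against whatever the JVM's current directory happens to be at the moment it is used, not against the location of the config file itself. A small, Spark-free sketch (the path is hypothetical, just shaped like an LVG resource path):

```java
import java.io.File;

public class RelativePathDemo {
    public static void main(String[] args) {
        // Hypothetical relative path, like one an LVG config file might contain.
        File rel = new File("data/config/lvg.properties");

        // A relative File is resolved against the JVM's current working
        // directory (the "user.dir" system property), so the same config
        // entry points at different places depending on where the JVM runs.
        System.out.println("user.dir    = " + System.getProperty("user.dir"));
        System.out.println("resolves to = " + rel.getAbsolutePath());

        // On a Spark executor, user.dir is the executor's scratch directory,
        // so unless the resources were shipped into exactly that layout the
        // relative path points at nothing.
    }
}
```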
There is also a point in LvgCmdApiResourceImpl where it converts a URI to a File instance. You should check whether that is ending up with an hdfs URL, and if so whether it is doing the right thing. I would make sure that you have all the logging coming out of LvgCmdApiResourceImpl and check that the paths are correct. You could also look at my patch on https://issues.apache.org/jira/browse/CTAKES-501, which includes some additional logging in this area.

HTH,
Ewan.

On Tue, May 01, 2018 at 04:40:16PM +0000, Eskala, Nagakalyana wrote:
> A further update on the issue:
>
> We have extracted the lvg-related files in the exact folder structure, and
> are copying all the folders recursively into the Spark executor working
> directory using the addFile option. But the LvgAnnotator is not able to find
> the lvg.properties file on the classpath of the Spark executor, even though
> we have set it up using the spark.executor.extraClassPath configuration
> option.
>
> Code snippet:
> sc.addFile("hdfs:///ctakes_4.0.0/resources", true);
> sparkConf.set("spark.executor.extraClassPath", "./resources/");
> sparkConf.set("spark.driver.extraClassPath", "./resources/");
>
> From: Eskala, Nagakalyana
> Sent: Monday, April 30, 2018 8:50 PM
> To: 'dev@ctakes.apache.org' <dev@ctakes.apache.org>
> Subject: cTakes on Apache Spark - Error
>
> Background:
> We are trying to run the Apache cTAKES default clinical pipeline in a Spark
> Streaming application. We intend to parse all input text sent to a socket in
> Spark Streaming by executing a default clinical pipeline in the individual
> executors of a Spark application.
>
> Challenges:
> The cTAKES pipeline requires external resources to be available on the
> classpath. We have used JavaSparkContext.addFile to provide all the
> resources (dictionaries) recursively from HDFS to each individual executor
> working directory.
> Once addFile has copied the resources to each executor, we try to include
> them on the classpath of each executor using the configuration:
>
> sc.addFile("hdfs:///ctakes_4.0.0/resources", true);
> sparkConf.set("spark.executor.extraClassPath", "./resources/");
> sparkConf.set("spark.driver.extraClassPath", "./resources/");
>
> Error:
> The error occurs in the LvgAnnotator class, which tries to access the
> lvg.properties file via a lookup. It is not able to locate the file, and
> hence there is an error.
>
> 18/04/30 15:55:50 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, executor driver, partition 0, ANY, 4744 bytes)
> 18/04/30 15:55:50 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 1)
> 18/04/30 15:55:51 INFO ae.LvgAnnotator: URL==null
> 18/04/30 15:55:51 INFO ae.LvgAnnotator: Unable to find org/apache/ctakes/lvg/data/config/lvg.properties.
> 18/04/30 15:55:51 INFO ae.LvgAnnotator: Copying files and directories to under /tmp/
> 18/04/30 15:55:51 INFO ae.LvgAnnotator: Copying lvg-related file to /tmp/data/config/lvg.properties
> 18/04/30 15:55:51 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.NullPointerException
>     at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
>     at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
>     at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
>     at org.apache.commons.io.FileUtils.copyInputStreamToFile(FileUtils.java:1512)
>     at org.apache.ctakes.lvg.ae.LvgAnnotator.copyLvgFiles(LvgAnnotator.java:620)
>     at org.apache.ctakes.lvg.ae.LvgAnnotator.createAnnotatorDescription(LvgAnnotator.java:649)
>     at org.apache.ctakes.clinicalpipeline.ClinicalPipelineFactory.getTokenProcessingPipeline(ClinicalPipelineFactory.java:110)
>     at org.apache.ctakes.clinicalpipeline.ClinicalPipelineFactory.getDefaultPipeline(ClinicalPipelineFactory.java:68)
>
> Question:
> Ideally, since the resources folder has been recursively added to each
> executor node and the classpath has been set, the internal executor should
> be able to locate the properties and other resource files. However, that is
> not the case. Is there something we should be doing differently
> (configuration, classpath, etc.) so that the cTAKES pipeline can be run in a
> Spark executor with all the resources and the classpath set appropriately?
>
> Thanks for the help.
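For anyone hitting the same trace: both failure modes mentioned earlier in the thread can be reproduced in a few lines of plain Java, without Spark or cTAKES. java.io.File only understands file: URIs, and a classpath miss returns null rather than throwing (the "URL==null" log line), which is what ultimately feeds the NullPointerException inside IOUtils.copyLarge. A minimal sketch; the hdfs path is just the one from the snippet above:

```java
import java.io.File;
import java.io.InputStream;
import java.net.URI;

public class LvgLookupRepro {
    public static void main(String[] args) {
        // 1. File(URI) only accepts the "file" scheme; an hdfs:// URI throws
        //    IllegalArgumentException rather than "doing the right thing",
        //    which is worth ruling out in LvgCmdApiResourceImpl.
        try {
            new File(URI.create("hdfs:///ctakes_4.0.0/resources/lvg.properties"));
            System.out.println("hdfs URI accepted (unexpected)");
        } catch (IllegalArgumentException e) {
            System.out.println("hdfs URI rejected: " + e.getMessage());
        }

        // 2. When a resource is not on the classpath, getResourceAsStream
        //    returns null instead of throwing. Handing that null stream to
        //    FileUtils.copyInputStreamToFile produces exactly the NPE inside
        //    IOUtils.copyLarge shown in the stack trace above.
        InputStream in = LvgLookupRepro.class.getClassLoader()
                .getResourceAsStream("org/apache/ctakes/lvg/data/config/lvg.properties");
        System.out.println("classpath lookup: "
                + (in == null ? "null stream -> NPE downstream" : "found"));
    }
}
```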