+1 for this issue. The documentation for spark-submit is misleading. Among many issues, the jar support is bad. HTTP URLs do not work, because Spark uses Hadoop's FileSystem class. You have to specify the jars twice to get things to work: once for the DriverWrapper to load your classes, and a second time on the SparkContext to distribute them to the workers.

I would like to see some response from the contributors on this issue.

Gino B.
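A minimal sketch of that double specification, against the Spark 1.0 API (the master URL, path, and class name here are illustrative): the jar is named on the spark-submit command line so the driver side can load the classes, and registered again via SparkConf.setJars so the context ships it to the workers.

    // Launch (jar named once, for the driver side):
    //   spark-submit --class me.MyApp --master spark://master:7077 /tmp/myjar.jar
    import org.apache.spark.{SparkConf, SparkContext}

    object MyApp {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("MyApp")
          // Named a second time so the context distributes it to the workers.
          .setJars(Seq("/tmp/myjar.jar"))
        val sc = new SparkContext(conf)
        // ... job logic ...
        sc.stop()
      }
    }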
> On Jun 16, 2014, at 1:49 PM, Luis Ángel Vicente Sánchez
> <langel.gro...@gmail.com> wrote:
>
> Did you manage to make it work? I'm facing similar problems, and this is a
> serious blocker issue. spark-submit seems kind of broken to me if you can't
> use it for spark-streaming.
>
> Regards,
>
> Luis
>
> 2014-06-11 1:48 GMT+01:00 lannyripple <lanny.rip...@gmail.com>:
>> I am using Spark 1.0.0 compiled with Hadoop 1.2.1.
>>
>> I have a toy spark-streaming-kafka program. It reads from a Kafka queue
>> and does
>>
>>   stream
>>     .map { case (k, v) => (v, 1) }
>>     .reduceByKey(_ + _)
>>     .print()
>>
>> using a 1-second interval on the stream.
>>
>> The docs say to make the Spark and Hadoop jars 'provided', but this
>> breaks for spark-streaming. Including spark-streaming (and
>> spark-streaming-kafka) as 'compile' to sweep them into our assembly gives
>> collisions on javax.* classes. To work around this I modified
>> $SPARK_HOME/bin/compute-classpath.sh to include spark-streaming,
>> spark-streaming-kafka, and zkclient. (Note that kafka is included as
>> 'compile' in my project and picked up in the assembly.)
>>
>> I have set up conf/spark-env.sh as needed. I have copied my assembly to
>> /tmp/myjar.jar on all Spark hosts and to my HDFS /tmp/jars directory. I
>> am running spark-submit from my Spark master, guided by the information
>> here: https://spark.apache.org/docs/latest/submitting-applications.html
>>
>> Well, at this point I was going to detail all the ways spark-submit fails
>> to follow its own documentation. If I do not invoke
>> sparkContext.setJars() then it just fails to find the driver class. This
>> is using various combinations of absolute path, file:, hdfs: (Warning:
>> Skip remote jar)??, and local: prefixes on the application-jar and --jars
>> arguments.
>>
>> If I invoke sparkContext.setJars() and include my assembly jar I get
>> further. At this point I get a failure from
>> kafka.consumer.ConsumerConnector not being found. I suspect this is
>> because spark-streaming-kafka needs the Kafka dependency but my assembly
>> jar is too late in the classpath.
>>
>> At this point I try setting spark.files.userClassPathFirst to 'true', but
>> this causes more things to blow up.
>>
>> I finally found something that works: setting the environment variable
>> SPARK_CLASSPATH=/tmp/myjar.jar. But silly me, this is deprecated, and I'm
>> helpfully informed to
>>
>> Please instead use:
>>  - ./spark-submit with --driver-class-path to augment the driver classpath
>>  - spark.executor.extraClassPath to augment the executor classpath
>>
>> which, when put into a file and introduced with --properties-file, does
>> not work. (Also tried spark.files.userClassPathFirst here.) These fail
>> with the kafka.consumer.ConsumerConnector error.
>>
>> At a guess, what's going on is that using SPARK_CLASSPATH I have my
>> assembly jar in the classpath at SparkSubmit invocation
>>
>> Spark Command: java -cp
>> /tmp/myjar.jar::/opt/spark/conf:/opt/spark/lib/spark-assembly-1.0.0-hadoop1.2.1.jar:/opt/spark/lib/spark-streaming_2.10-1.0.0.jar:/opt/spark/lib/spark-streaming-kafka_2.10-1.0.0.jar:/opt/spark/lib/zkclient-0.4.jar
>> -XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m
>> org.apache.spark.deploy.SparkSubmit --class me.KafkaStreamingWC
>> /tmp/myjar.jar
>>
>> but using --properties-file, the assembly is not available for
>> SparkSubmit.
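For concreteness, the replacement that the deprecation warning suggests amounts to a properties file along these lines (a sketch; the file name is made up, and spark.driver.extraClassPath is the properties-file counterpart of --driver-class-path in Spark 1.0):

    # my-app.conf, passed as: spark-submit --properties-file my-app.conf ...
    spark.driver.extraClassPath    /tmp/myjar.jar
    spark.executor.extraClassPath  /tmp/myjar.jar

As the report above notes, though, in 1.0.0 this did not behave like SPARK_CLASSPATH: the jar did not end up on the classpath of the SparkSubmit JVM itself.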
>>
>> I think the root cause is either spark-submit not handling the
>> spark-streaming libraries so they can be 'provided', or the inclusion of
>> org.eclipse.jetty.orbit in the streaming libraries, which causes
>>
>> [error] (*:assembly) deduplicate: different file contents found in the
>> following:
>> [error] /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.transaction/orbits/javax.transaction-1.1.1.v201105210645.jar:META-INF/ECLIPSEF.RSA
>> [error] /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.servlet/orbits/javax.servlet-3.0.0.v201112011016.jar:META-INF/ECLIPSEF.RSA
>> [error] /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.mail.glassfish/orbits/javax.mail.glassfish-1.4.1.v201005082020.jar:META-INF/ECLIPSEF.RSA
>> [error] /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.activation/orbits/javax.activation-1.1.0.v201105071233.jar:META-INF/ECLIPSEF.RSA
>>
>> I've tried applying mergeStrategy in assembly for my assembly.sbt, but
>> then I get
>>
>> Invalid signature file digest for Manifest main attributes
>>
>> If anyone knows the magic to get this working, a reply would be greatly
>> appreciated.
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-kafka-SPARK-CLASSPATH-tp7356.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
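For what it's worth, the "Invalid signature file digest" error typically means the fat jar still contains META-INF signature files (.SF/.DSA/.RSA, such as the ECLIPSEF.RSA entries above) whose digests no longer match the merged contents. A common workaround is a merge strategy that discards them; a sketch in the sbt-assembly 0.11.x syntax of the time (newer plugin versions spell the key assemblyMergeStrategy and use :=):

    // assembly.sbt
    import sbtassembly.Plugin._
    import AssemblyKeys._

    mergeStrategy in assembly <<= (mergeStrategy in assembly) { old =>
      {
        // Jar signature files: merging invalidates their digests, which is
        // what triggers "Invalid signature file digest for Manifest main
        // attributes". They are safe to drop from an unsigned fat jar.
        case PathList("META-INF", xs @ _*)
            if xs.lastOption.exists { f =>
              val l = f.toLowerCase
              l.endsWith(".sf") || l.endsWith(".dsa") || l.endsWith(".rsa")
            } =>
          MergeStrategy.discard
        // Everything else falls through to the previous strategy.
        case x => old(x)
      }
    }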