Sorry to flog this dead horse, but this is something every Python user is going to run into, as we *cannot* build the dependencies into our app. There is no way to do that with a Python script.

As I see it, this is not a third-party integration. The package missing its dependencies is built by the Spark team, I believe? org.apache.spark:spark-streaming-kafka_2.10:1.3.1 is the problem package; if I remove the Cassandra package I still run into the same error.

Is there really no way to add these dependencies to the Kafka package? I tried to add them in a number of ways, but the maze of pom.xml files makes that difficult for those not familiar with Java.

Thanks again. I really don't want other Python users to run into the same brick wall I did, as I nearly gave up on Spark over what turns out to be a relatively simple thing.

On Tue, May 12, 2015, 1:11 AM Sean Owen <[email protected]> wrote:

> The question is really whether all the third-party integrations should
> be built into Spark's main assembly. I think reasonable people could
> disagree, but I think the current state (not built in) is reasonable.
> It means you have to bring the integration with you.
>
> That is, no, third-party queue integrations aren't built in out of the box.
>
> the way you got it to work is one way, but not the preferred way:
> build this into your app and your packaging tool would have resolved
> the dependencies.
>
> I agree with resolving this as basically working-as-intended.
>
> On Tue, May 12, 2015 at 3:19 AM, Lee McFadden <[email protected]> wrote:
> > I opened a ticket on this (without posting here first - bad etiquette,
> > apologies) which was closed as 'fixed'.
> >
> > https://issues.apache.org/jira/browse/SPARK-7538
> >
> > I don't believe that because I have my script running means this is fixed, I
> > think it is still an issue.
> >
> > I downloaded the spark source, ran `mvn -DskipTests clean package`, then
> > simply launched my python script (which shouldn't be introducing additional
> > *java* dependencies itself?).
> >
> > Doesn't this mean these dependencies are missing from the spark build, since
> > I didn't modify any files within the distribution and my application itself
> > can't be introducing java dependency clashes?
> >
> >
> > On Mon, May 11, 2015, 4:34 PM Lee McFadden <[email protected]> wrote:
> >>
> >> Ted, many thanks. I'm not used to Java dependencies so this was a real
> >> head-scratcher for me.
> >>
> >> Downloading the two metrics packages from the maven repository
> >> (metrics-core, metrics-annotation) and supplying it on the spark-submit
> >> command line worked.
> >>
> >> My final spark-submit for a python project using Kafka as an input source:
> >>
> >> /home/ubuntu/spark/spark-1.3.1/bin/spark-submit \
> >>   --packages TargetHolding/pyspark-cassandra:0.1.4,org.apache.spark:spark-streaming-kafka_2.10:1.3.1 \
> >>   --jars /home/ubuntu/jars/metrics-core-2.2.0.jar,/home/ubuntu/jars/metrics-annotation-2.2.0.jar \
> >>   --conf spark.cassandra.connection.host=10.10.103.172,10.10.102.160,10.10.101.79 \
> >>   --master spark://127.0.0.1:7077 \
> >>   affected_hosts.py
> >>
> >> Now we're seeing data from the stream. Thanks again!
> >>
> >> On Mon, May 11, 2015 at 2:43 PM Sean Owen <[email protected]> wrote:
> >>>
> >>> Ah yes, the Kafka + streaming code isn't in the assembly, is it? you'd
> >>> have to provide it and all its dependencies with your app. You could
> >>> also build this into your own app jar. Tools like Maven will add in
> >>> the transitive dependencies.
> >>>
> >>> On Mon, May 11, 2015 at 10:04 PM, Lee McFadden <[email protected]> wrote:
> >>> > Thanks Ted,
> >>> >
> >>> > The issue is that I'm using packages (see spark-submit definition) and
> >>> > I do not know how to add com.yammer.metrics:metrics-core to my classpath
> >>> > so Spark can see it.
> >>> >
> >>> > Should metrics-core not be part of the
> >>> > org.apache.spark:spark-streaming-kafka_2.10:1.3.1 package so it can
> >>> > work correctly?
> >>> >
> >>> > If not, any clues as to how I can add metrics-core to my project
> >>> > (bearing in mind that I'm using Python, not a JVM language) would be
> >>> > much appreciated.
> >>> >
> >>> > Thanks, and apologies for my newbness with Java/Scala.
> >>> >
> >>> > On Mon, May 11, 2015 at 1:42 PM Ted Yu <[email protected]> wrote:
> >>> >>
> >>> >> com.yammer.metrics.core.Gauge is in metrics-core jar
> >>> >> e.g., in master branch:
> >>> >> [INFO] |  \- org.apache.kafka:kafka_2.10:jar:0.8.1.1:compile
> >>> >> [INFO] |     +- com.yammer.metrics:metrics-core:jar:2.2.0:compile
> >>> >>
> >>> >> Please make sure metrics-core jar is on the classpath.
> >>> >>
> >>> >> On Mon, May 11, 2015 at 1:32 PM, Lee McFadden <[email protected]> wrote:
> >>> >>>
> >>> >>> Hi,
> >>> >>>
> >>> >>> We've been having some issues getting spark streaming running
> >>> >>> correctly using a Kafka stream, and we've been going around in
> >>> >>> circles trying to resolve this dependency.
> >>> >>>
> >>> >>> Details of our environment and the error below, if anyone can help
> >>> >>> resolve this it would be much appreciated.
> >>> >>>
> >>> >>> Submit command line:
> >>> >>>
> >>> >>> /home/ubuntu/spark/spark-1.3.1/bin/spark-submit \
> >>> >>>   --packages TargetHolding/pyspark-cassandra:0.1.4,org.apache.spark:spark-streaming-kafka_2.10:1.3.1 \
> >>> >>>   --conf spark.cassandra.connection.host=10.10.103.172,10.10.102.160,10.10.101.79 \
> >>> >>>   --master spark://127.0.0.1:7077 \
> >>> >>>   affected_hosts.py
> >>> >>>
> >>> >>> When we run the streaming job everything starts just fine, then we
> >>> >>> see the following in the logs:
> >>> >>>
> >>> >>> 15/05/11 19:50:46 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 70, ip-10-10-102-53.us-west-2.compute.internal):
> >>> >>> java.lang.NoClassDefFoundError: com/yammer/metrics/core/Gauge
> >>> >>>     at kafka.consumer.ZookeeperConsumerConnector.createFetcher(ZookeeperConsumerConnector.scala:151)
> >>> >>>     at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:115)
> >>> >>>     at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:128)
> >>> >>>     at kafka.consumer.Consumer$.create(ConsumerConnector.scala:89)
> >>> >>>     at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
> >>> >>>     at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
> >>> >>>     at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
> >>> >>>     at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:298)
> >>> >>>     at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:290)
> >>> >>>     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
> >>> >>>     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
> >>> >>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> >>> >>>     at org.apache.spark.scheduler.Task.run(Task.scala:64)
> >>> >>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> >>> >>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> >>> >>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> >>> >>>     at java.lang.Thread.run(Thread.java:745)
> >>> >>> Caused by: java.lang.ClassNotFoundException: com.yammer.metrics.core.Gauge
> >>> >>>     at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
> >>> >>>     at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
> >>> >>>     at java.security.AccessController.doPrivileged(Native Method)
> >>> >>>     at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
> >>> >>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> >>> >>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> >>> >>>     ... 17 more
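
For other Python users who hit the same wall: the workaround quoted above (download metrics-core and metrics-annotation, then pass them with --jars) can be sketched in a few lines of Python. This is only a rough sketch under stated assumptions, not an official recipe. The helper names are hypothetical, the base URL assumes Maven Central's standard repository layout, and the `com.yammer.metrics` group for metrics-annotation is assumed to match the one Ted's dependency tree shows for metrics-core.

```python
# Hypothetical helpers: map Maven coordinates to download URLs and to the
# comma-separated value that spark-submit expects after --jars.

# Assumption: Maven Central's standard layout (group path/artifact/version/file).
MAVEN_CENTRAL = "https://repo1.maven.org/maven2"

# metrics-core's coordinates come from the dependency tree quoted in the thread;
# the group id for metrics-annotation is an assumption.
METRICS_JARS = [
    "com.yammer.metrics:metrics-core:2.2.0",
    "com.yammer.metrics:metrics-annotation:2.2.0",
]


def jar_name(coordinate):
    """Return 'artifact-version.jar' for a 'group:artifact:version' string."""
    _, artifact, version = coordinate.split(":")
    return "{}-{}.jar".format(artifact, version)


def jar_url(coordinate, repo=MAVEN_CENTRAL):
    """Return the URL where the jar would live in a standard Maven repository."""
    group, artifact, version = coordinate.split(":")
    return "{}/{}/{}/{}/{}".format(
        repo, group.replace(".", "/"), artifact, version, jar_name(coordinate)
    )


def jars_argument(coordinates, jar_dir):
    """Build the comma-separated --jars value for jars saved under jar_dir."""
    base = jar_dir.rstrip("/")
    return ",".join("{}/{}".format(base, jar_name(c)) for c in coordinates)


if __name__ == "__main__":
    for coordinate in METRICS_JARS:
        print(jar_url(coordinate))
    print("--jars " + jars_argument(METRICS_JARS, "/home/ubuntu/jars"))
```

Once the jars are downloaded from the printed URLs into /home/ubuntu/jars, the last printed line is the --jars argument used in the spark-submit command quoted above.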
