The question is really whether all the third-party integrations should be built into Spark's main assembly. I think reasonable people could disagree, but I think the current state (not built in) is reasonable. It means you have to bring the integration with you.
That is, no, third-party queue integrations aren't built in out of the box. the way you got it to work is one way, but not the preferred way: build this into your app and your packaging tool would have resolved the dependencies. I agree with resolving this as basically working-as-intended. On Tue, May 12, 2015 at 3:19 AM, Lee McFadden <[email protected]> wrote: > I opened a ticket on this (without posting here first - bad etiquette, > apologies) which was closed as 'fixed'. > > https://issues.apache.org/jira/browse/SPARK-7538 > > I don't believe that because I have my script running means this is fixed, I > think it is still an issue. > > I downloaded the spark source, ran `mvn -DskipTests clean package `, then > simply launched my python script (which shouldn't be introducing additional > *java* dependencies itself?). > > Doesn't this mean these dependencies are missing from the spark build, since > I didn't modify any files within the distribution and my application itself > can't be introducing java dependency clashes? > > > On Mon, May 11, 2015, 4:34 PM Lee McFadden <[email protected]> wrote: >> >> Ted, many thanks. I'm not used to Java dependencies so this was a real >> head-scratcher for me. >> >> Downloading the two metrics packages from the maven repository >> (metrics-core, metrics-annotation) and supplying it on the spark-submit >> command line worked. >> >> My final spark-submit for a python project using Kafka as an input source: >> >> /home/ubuntu/spark/spark-1.3.1/bin/spark-submit \ >> --packages >> TargetHolding/pyspark-cassandra:0.1.4,org.apache.spark:spark-streaming-kafka_2.10:1.3.1 >> \ >> --jars >> /home/ubuntu/jars/metrics-core-2.2.0.jar,/home/ubuntu/jars/metrics-annotation-2.2.0.jar >> \ >> --conf >> spark.cassandra.connection.host=10.10.103.172,10.10.102.160,10.10.101.79 \ >> --master spark://127.0.0.1:7077 \ >> affected_hosts.py >> >> Now we're seeing data from the stream. Thanks again! >> >> On Mon, May 11, 2015 at 2:43 PM Sean Owen <[email protected]> wrote: >>> >>> Ah yes, the Kafka + streaming code isn't in the assembly, is it? you'd >>> have to provide it and all its dependencies with your app. You could >>> also build this into your own app jar. Tools like Maven will add in >>> the transitive dependencies. >>> >>> On Mon, May 11, 2015 at 10:04 PM, Lee McFadden <[email protected]> >>> wrote: >>> > Thanks Ted, >>> > >>> > The issue is that I'm using packages (see spark-submit definition) and >>> > I do >>> > not know how to add com.yammer.metrics:metrics-core to my classpath so >>> > Spark >>> > can see it. >>> > >>> > Should metrics-core not be part of the >>> > org.apache.spark:spark-streaming-kafka_2.10:1.3.1 package so it can >>> > work >>> > correctly? >>> > >>> > If not, any clues as to how I can add metrics-core to my project >>> > (bearing in >>> > mind that I'm using Python, not a JVM language) would be much >>> > appreciated. >>> > >>> > Thanks, and apologies for my newbness with Java/Scala. >>> > >>> > On Mon, May 11, 2015 at 1:42 PM Ted Yu <[email protected]> wrote: >>> >> >>> >> com.yammer.metrics.core.Gauge is in metrics-core jar >>> >> e.g., in master branch: >>> >> [INFO] | \- org.apache.kafka:kafka_2.10:jar:0.8.1.1:compile >>> >> [INFO] | +- com.yammer.metrics:metrics-core:jar:2.2.0:compile >>> >> >>> >> Please make sure metrics-core jar is on the classpath. >>> >> >>> >> On Mon, May 11, 2015 at 1:32 PM, Lee McFadden <[email protected]> >>> >> wrote: >>> >>> >>> >>> Hi, >>> >>> >>> >>> We've been having some issues getting spark streaming running >>> >>> correctly >>> >>> using a Kafka stream, and we've been going around in circles trying >>> >>> to >>> >>> resolve this dependency. >>> >>> >>> >>> Details of our environment and the error below, if anyone can help >>> >>> resolve this it would be much appreciated. >>> >>> >>> >>> Submit command line: >>> >>> >>> >>> /home/ubuntu/spark/spark-1.3.1/bin/spark-submit \ >>> >>> --packages >>> >>> >>> >>> TargetHolding/pyspark-cassandra:0.1.4,org.apache.spark:spark-streaming-kafka_2.10:1.3.1 >>> >>> \ >>> >>> --conf >>> >>> >>> >>> spark.cassandra.connection.host=10.10.103.172,10.10.102.160,10.10.101.79 >>> >>> \ >>> >>> --master spark://127.0.0.1:7077 \ >>> >>> affected_hosts.py >>> >>> >>> >>> When we run the streaming job everything starts just fine, then we >>> >>> see >>> >>> the following in the logs: >>> >>> >>> >>> 15/05/11 19:50:46 WARN TaskSetManager: Lost task 0.0 in stage 2.0 >>> >>> (TID >>> >>> 70, ip-10-10-102-53.us-west-2.compute.internal): >>> >>> java.lang.NoClassDefFoundError: com/yammer/metrics/core/Gauge >>> >>> at >>> >>> >>> >>> kafka.consumer.ZookeeperConsumerConnector.createFetcher(ZookeeperConsumerConnector.scala:151) >>> >>> at >>> >>> >>> >>> kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:115) >>> >>> at >>> >>> >>> >>> kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:128) >>> >>> at >>> >>> kafka.consumer.Consumer$.create(ConsumerConnector.scala:89) >>> >>> at >>> >>> >>> >>> org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100) >>> >>> at >>> >>> >>> >>> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121) >>> >>> at >>> >>> >>> >>> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106) >>> >>> at >>> >>> >>> >>> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:298) >>> >>> at >>> >>> >>> >>> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:290) >>> >>> at >>> >>> >>> >>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498) >>> >>> at >>> >>> >>> >>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498) >>> >>> at >>> >>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) >>> >>> at org.apache.spark.scheduler.Task.run(Task.scala:64) >>> >>> at >>> >>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) >>> >>> at >>> >>> >>> >>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) >>> >>> at >>> >>> >>> >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) >>> >>> at java.lang.Thread.run(Thread.java:745) >>> >>> Caused by: java.lang.ClassNotFoundException: >>> >>> com.yammer.metrics.core.Gauge >>> >>> at java.net.URLClassLoader$1.run(URLClassLoader.java:372) >>> >>> at java.net.URLClassLoader$1.run(URLClassLoader.java:361) >>> >>> at java.security.AccessController.doPrivileged(Native Method) >>> >>> at java.net.URLClassLoader.findClass(URLClassLoader.java:360) >>> >>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424) >>> >>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357) >>> >>> ... 17 more >>> >>> >>> >>> >>> >> >>> > --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
