Using Spark Streaming with Kafka 0.7.2

2014-07-25 Thread maddenpj
Hi all, currently we have Kafka 0.7.2 running in production and can't upgrade for external reasons; however, Spark Streaming (1.0.1) was built with Kafka 0.8.0. What is the best way to use Spark Streaming with older versions of Kafka? Currently I'm investigating trying to build spark streaming mysel

Re: memory issue on standalone master

2014-08-07 Thread maddenpj
It looks like your Java heap space is too low: -Xmx512m. It's only using 0.5 GB of RAM, so try bumping this up. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/memory-issue-on-standalone-master-tp11610p11711.html Sent from the Apache Spark User List mailing list ar

Spark Streaming worker underutilized?

2014-08-08 Thread maddenpj
I currently have a 4-node Spark setup, 1 master and 3 workers, running in Spark standalone mode. I am currently stress testing a Spark application I wrote that reads data from Kafka and puts it into Redshift. I'm pretty happy with the performance (reading about 6k messages per second out of Kafka)

Re: Kafka - streaming from multiple topics

2014-08-13 Thread maddenpj
Can you link to the JIRA issue? I'm having to work around this bug and it would be nice to monitor the JIRA so I can change my code when it's fixed.

Spark Streaming unable to handle production Kafka load

2014-09-24 Thread maddenpj
id of that piece of data (so if we've already seen that particular piece we just update the existing total in MySQL with the total Spark just computed in the current window). https://gist.github.com/maddenpj/74a4c8ce372888ade92d

Re: Spark Streaming unable to handle production Kafka load

2014-09-24 Thread maddenpj
Oh, I should add that I've tried a range of batch durations and reduce-by-window durations to no effect. I'm not too sure how to choose these. Today I've been testing with batch durations from 1 minute to 10 minutes and reduce window durations of 10 or 20 minutes.

Re: Spark Streaming unable to handle production Kafka load

2014-09-24 Thread maddenpj
Another update: actually, it just hit me that my problem is probably right here: https://gist.github.com/maddenpj/74a4c8ce372888ade92d#file-gistfile1-scala-L22 I'm creating a JDBC connection on every record, and that's probably what's killing the performance. I assume the fix is just bro
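A minimal sketch of the per-record versus per-partition connection pattern described above. It uses plain Scala collections in place of RDD partitions so it runs without Spark or a database; `ConnCounter`, `FakeConnection`, and the sample records are hypothetical stand-ins, not code from the gist. In Spark the second shape would typically sit inside `rdd.foreachPartition`, called from `foreachRDD`.

```scala
// Hypothetical counter that tracks how many connections were opened.
object ConnCounter {
  var opened = 0
}

// Hypothetical stand-in for a JDBC connection (no real database involved).
final class FakeConnection {
  ConnCounter.opened += 1
  def write(record: (String, Long)): Unit = () // pretend to run an UPDATE
  def close(): Unit = ()
}

object ConnectionPerPartition {
  // Each inner Seq plays the role of one RDD partition.
  val partitions: Seq[Seq[(String, Long)]] = Seq(
    Seq("a" -> 1L, "b" -> 2L),
    Seq("c" -> 3L, "d" -> 4L, "e" -> 5L)
  )

  // Anti-pattern: open and close a connection for every single record.
  def writePerRecord(): Int = {
    ConnCounter.opened = 0
    partitions.flatten.foreach { rec =>
      val conn = new FakeConnection
      try conn.write(rec) finally conn.close()
    }
    ConnCounter.opened
  }

  // Open one connection per partition and reuse it for all of its records.
  def writePerPartition(): Int = {
    ConnCounter.opened = 0
    partitions.foreach { records =>
      val conn = new FakeConnection
      try records.foreach(rec => conn.write(rec)) finally conn.close()
    }
    ConnCounter.opened
  }
}
```

With the sample data, the per-record version opens one connection per record, while the per-partition version opens one per partition, so connection setup cost scales with the number of partitions rather than the number of messages.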

Spark Streaming: No parallelism in writing to database (MySQL)

2014-09-25 Thread maddenpj
ng with several different values but it looks like only one worker is actually doing the writing to MySQL. Obviously this is not ideal because I need the parallelism to insert this data in a timely manner. Here's the code https://gist.github.com/maddenpj/5032c76aeb330371a6e6

Re: Spark Streaming: No parallelism in writing to database (MySQL)

2014-09-25 Thread maddenpj
Update for posterity, so once again I solved the problem shortly after posting to the mailing list. So updateStateByKey uses the default partitioner, which in my case seemed like it was set to one. Changing my call from .updateStateByKey[Long](updateFn) -> .updateStateByKey[Long](updateFn, numPart
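A rough illustration of why the partition count matters here, written in plain Scala so it runs without Spark. The `updateFn` below is an assumed running-sum function with the `(Seq[Long], Option[Long]) => Option[Long]` shape that `updateStateByKey[Long]` accepts; it is not the function from the gist. `partitionOf` mimics hash partitioning (non-negative modulo of the key's `hashCode`), and the sample keys are made up.

```scala
object UpdateStateSketch {
  // Assumed update function: add this batch's values to the previous total.
  val updateFn: (Seq[Long], Option[Long]) => Option[Long] =
    (newValues, state) => Some(state.getOrElse(0L) + newValues.sum)

  // Non-negative modulo of the key's hashCode, as a hash partitioner does.
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // The distinct partitions that a collection of keys ends up in.
  def partitionsUsed(keys: Seq[String], numPartitions: Int): Set[Int] =
    keys.map(k => partitionOf(k, numPartitions)).toSet
}
```

With a single partition every key maps to partition 0, so all state updates and any downstream writes run in one task. With more partitions the keys spread out, which is consistent with the behavior described in the post once a partition count is passed as the second argument.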

Re: Spark Streaming: No parallelism in writing to database (MySQL)

2014-09-25 Thread maddenpj
Yup it's all in the gist: https://gist.github.com/maddenpj/5032c76aeb330371a6e6 Lines 6-9 deal with setting up the driver specifically. This sets the driver up on each partition which keeps the connection pool around per record.

Re: Build spark with Intellij IDEA 13

2014-09-27 Thread maddenpj
I actually got this exact same issue compiling an unrelated project (not using Spark). Maybe it's a protobuf issue?

Re: shuffle memory requirements

2014-09-29 Thread maddenpj
Hey Ameet, Thanks for the info, I'm running into the same issue myself and my last attempt crashed and my ulimit was 16834. I'm going to up it and try again, but yea I would like to know the best practice for computing this. Can you talk about the worker nodes, what are their specs? At least 45 gi

Re: Kafka Spark Streaming job has an issue when the worker reading from Kafka is killed

2014-10-02 Thread maddenpj
I am seeing this same issue. Bumping for visibility.

Block removal causes Akka timeouts

2014-10-02 Thread maddenpj
I'm seeing a lot of Akka timeouts which eventually lead to job failure in spark streaming when removing blocks (Example stack trace below). It appears to be related to these issues: SPARK-3015 and SPARK-3139