Re: Improving performance of a kafka spark streaming app

2016-06-24 Thread Cody Koeninger
Unless I'm misreading the image you posted, it does show event counts for the single batch that is still running, with 1.7 billion events in it. The recent batches do show 0 events, but I'm guessing that's because they're actually empty. When you said you had a kafka topic with 1.7 billion events

Re: Improving performance of a kafka spark streaming app

2016-06-22 Thread Colin Kincaid Williams
Streaming UI tab showing empty events and very different metrics than on 1.5.2 On Thu, Jun 23, 2016 at 5:06 AM, Colin Kincaid Williams wrote: > After a bit of effort I moved from a Spark cluster running 1.5.2, to a > Yarn cluster running 1.6.1 jars. I'm still setting the maxRPP. The > completed b

Re: Improving performance of a kafka spark streaming app

2016-06-22 Thread Colin Kincaid Williams
After a bit of effort I moved from a Spark cluster running 1.5.2 to a Yarn cluster running 1.6.1 jars. I'm still setting the maxRPP. The completed batches are no longer showing the number of events processed in the Streaming UI tab. I'm getting around 4k inserts per second in HBase, but I haven't

Re: Improving performance of a kafka spark streaming app

2016-06-21 Thread Colin Kincaid Williams
Thanks @Cody, I will try that out. In the interim, I tried to validate my HBase cluster by running a random write test and saw 30-40K writes per second. This suggests there is noticeable room for improvement. On Tue, Jun 21, 2016 at 8:32 PM, Cody Koeninger wrote: > Take HBase out of the equation a
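Comparing the sink's benchmarked capacity against the pipeline's observed rate makes that headroom concrete. A quick sketch, assuming the ~4K inserts/second figure mentioned elsewhere in the thread against the 30-40K writes/second random-write benchmark:

```python
# Rough headroom estimate: how much faster the HBase sink can absorb writes
# than the streaming job is currently delivering them.
def headroom(sink_writes_per_s, observed_writes_per_s):
    return sink_writes_per_s / observed_writes_per_s

low, high = headroom(30_000, 4_000), headroom(40_000, 4_000)
print(f"{low:.1f}x to {high:.1f}x headroom")  # 7.5x to 10.0x
```

If the sink can take writes 7-10x faster than the job delivers them, the bottleneck is upstream of HBase.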

Re: Improving performance of a kafka spark streaming app

2016-06-21 Thread Cody Koeninger
Take HBase out of the equation and just measure what your read performance is by doing something like createDirectStream(...).foreach(_.println), not take() or print() On Tue, Jun 21, 2016 at 3:19 PM, Colin Kincaid Williams wrote: > @Cody I was able to bring my processing time down to a second b
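The point of the createDirectStream(...).foreach(...) suggestion is to time the read path alone, with no sink in the loop. The same isolation can be sketched outside Spark; here `range(records_per_batch)` is a hypothetical stand-in for a real Kafka fetch:

```python
import time

# Measure raw read throughput by consuming records and doing nothing else,
# mirroring the advice to take HBase out of the measurement.
def read_throughput(batches, records_per_batch=1_000):
    consumed = 0
    start = time.perf_counter()
    for _ in range(batches):
        batch = range(records_per_batch)  # stand-in for a Kafka fetch
        for _record in batch:
            consumed += 1  # no HBase work here: measure the read path only
    elapsed = time.perf_counter() - start
    return consumed, elapsed

count, secs = read_throughput(10)
print(count)  # 10000 records consumed
```

Once the read-only rate is known, any gap between it and the end-to-end rate is attributable to the write side.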

Re: Improving performance of a kafka spark streaming app

2016-06-21 Thread Colin Kincaid Williams
@Cody I was able to bring my processing time down to a second by setting maxRatePerPartition as discussed. My bad that I didn't recognize it as the cause of my scheduling delay. Since then I've tried experimenting with a larger Spark Context duration. I've been trying to get some noticeable improv

Re: Improving performance of a kafka spark streaming app

2016-06-20 Thread Colin Kincaid Williams
I'll try dropping the maxRatePerPartition=400, or maybe even lower. However even at application starts up I have this large scheduling delay. I will report my progress later on. On Mon, Jun 20, 2016 at 2:12 PM, Cody Koeninger wrote: > If your batch time is 1 second and your average processing tim
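spark.streaming.kafka.maxRatePerPartition caps records per partition per second, so the per-batch ceiling is rate x partitions x batch seconds. A sketch of the arithmetic, assuming the 66-partition topic and 1-second batch mentioned later in the thread:

```python
# Per-batch record ceiling imposed by spark.streaming.kafka.maxRatePerPartition.
def max_records_per_batch(max_rate_per_partition, num_partitions, batch_interval_s):
    return max_rate_per_partition * num_partitions * batch_interval_s

# maxRatePerPartition=400 with 66 partitions and a 1-second batch:
print(max_records_per_batch(400, 66, 1))  # 26400 records per batch at most
```

Lowering the rate shrinks each batch, which trades slower backlog draining for batches that finish inside the interval.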

Re: Improving performance of a kafka spark streaming app

2016-06-20 Thread Cody Koeninger
If your batch time is 1 second and your average processing time is 1.16 seconds, you're always going to be falling behind. That would explain why you've built up an hour of scheduling delay after eight hours of running. On Sat, Jun 18, 2016 at 4:40 PM, Colin Kincaid Williams wrote: > Hi Mich aga
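The arithmetic here checks out: each batch falls about 0.16 seconds further behind, which compounds to roughly an hour over eight hours. A minimal sketch, assuming a constant processing time (a simplification):

```python
# Estimate accumulated scheduling delay when average processing time
# exceeds the batch interval.
def scheduling_delay(batch_interval_s, processing_time_s, runtime_hours):
    deficit_per_batch = processing_time_s - batch_interval_s
    if deficit_per_batch <= 0:
        return 0.0  # the app keeps up; no delay accumulates
    batches = runtime_hours * 3600 / batch_interval_s
    return batches * deficit_per_batch

delay_s = scheduling_delay(1.0, 1.16, 8)
print(round(delay_s / 3600, 2))  # about 1.28 hours of delay after 8 hours
```

Any processing time above the batch interval grows the delay without bound; the only fixes are faster processing or a longer interval.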

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Colin Kincaid Williams
Hi Mich again, Regarding batch window, etc. I have provided the sources, but I'm not currently calling the window function. Did you see the program source? It's only 100 lines. https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877 Then I would expect I'm using defaults, other than wha

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Mich Talebzadeh
OK. What is the setup for these, please: batch window, window length, sliding interval? And also, in each batch window how much data do you get in (number of messages in the topic, whatever)? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUr

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Mich Talebzadeh
I believe you have an issue with performance? Have you checked the Spark GUI (default port 4040) for details, including shuffles etc.? HTH Dr Mich Talebzadeh

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Colin Kincaid Williams
I'm attaching a picture from the streaming UI. On Sat, Jun 18, 2016 at 7:59 PM, Colin Kincaid Williams wrote: > There are 25 nodes in the spark cluster. > > On Sat, Jun 18, 2016 at 7:53 PM, Mich Talebzadeh > wrote: >> how many nodes are in your cluster? >> >> --num-executors 6 \ >> --driver-mem

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Colin Kincaid Williams
There are 25 nodes in the spark cluster. On Sat, Jun 18, 2016 at 7:53 PM, Mich Talebzadeh wrote: > how many nodes are in your cluster? > > --num-executors 6 \ > --driver-memory 4G \ > --executor-memory 2G \ > --total-executor-cores 12 \

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Mich Talebzadeh
how many nodes are in your cluster? --num-executors 6 \ --driver-memory 4G \ --executor-memory 2G \ --total-executor-cores 12 \ Dr Mich Talebzadeh

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Colin Kincaid Williams
I updated my app to Spark 1.5.2 streaming so that it consumes from Kafka using the direct API and inserts content into an HBase cluster, as described in this thread. I was away from this project for a while due to events in my family. Currently my scheduling delay is high, but the processing time i

Re: Improving performance of a kafka spark streaming app

2016-05-03 Thread Colin Kincaid Williams
Thanks Cody, I can see that the partitions are well distributed... Then I'm in the process of using the direct api. On Tue, May 3, 2016 at 6:51 PM, Cody Koeninger wrote: > 60 partitions in and of itself shouldn't be a big performance issue > (as long as producers are distributing across partition

Re: Improving performance of a kafka spark streaming app

2016-05-03 Thread Cody Koeninger
60 partitions in and of itself shouldn't be a big performance issue (as long as producers are distributing across partitions evenly). On Tue, May 3, 2016 at 1:44 PM, Colin Kincaid Williams wrote: > Thanks again Cody. Regarding the details 66 kafka partitions on 3 > kafka servers, likely 8 core sy
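Whether producers are distributing evenly is easy to check from per-partition message counts (e.g. latest offset minus earliest offset for each partition). A sketch with hypothetical counts:

```python
# Check Kafka partition balance from per-partition message counts.
def skew_ratio(partition_counts):
    smallest, largest = min(partition_counts), max(partition_counts)
    return largest / smallest if smallest else float("inf")

counts = [25_800, 26_100, 25_950, 26_400]  # hypothetical per-partition counts
print(round(skew_ratio(counts), 3))  # close to 1.0 means well balanced
```

A ratio near 1.0 means the direct stream's tasks will get similar amounts of work; a large ratio means some tasks straggle regardless of partition count.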

Re: Improving performance of a kafka spark streaming app

2016-05-03 Thread Colin Kincaid Williams
Thanks again Cody. Regarding the details 66 kafka partitions on 3 kafka servers, likely 8 core systems with 10 disks each. Maybe the issue with the receiver was the large number of partitions. I had miscounted the disks and so 11*3*2 is how I decided to partition my topic on insertion, ( by my own,

Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Cody Koeninger
print() isn't really the best way to benchmark things, since it calls take(10) under the covers, but 380 records / second for a single receiver doesn't sound right in any case. Am I understanding correctly that you're trying to process a large number of already-existing kafka messages, not keep up
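The caveat about print() matters for benchmarking: since it only materializes the first ten records (take(10) under the covers), timing it says little about full-stream throughput. The difference is easy to see in plain Python, where `sample_10` stands in for print()/take(10):

```python
# Counting everything vs. sampling ten records: only the former
# reflects real throughput.
def count_all(records):
    return sum(1 for _ in records)

def sample_10(records):
    return [r for _, r in zip(range(10), records)]

records = list(range(100_000))
print(count_all(records))       # touches all 100000 records
print(len(sample_10(records)))  # touches only 10, like print()/take(10)
```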

Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Colin Kincaid Williams
Hello again. I searched for "backport kafka" in the list archives but couldn't find anything but a post from Spark 0.7.2. I was going to use accumulators to make a counter, but then saw on the Streaming tab the Receiver Statistics. Then I removed all other "functionality" except: JavaPairRec

Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Colin Kincaid Williams
Hi Cody, I'm going to use an accumulator right now to get an idea of the throughput. Thanks for mentioning the backported module. Also it looks like I missed this section: https://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#reducing-the-processing-time-of-each-batch from the do
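An accumulator used this way just sums per-batch record counts on the driver; dividing by elapsed wall-clock time gives records/second. The bookkeeping, sketched without Spark (the batch counts are made up):

```python
# Accumulator-style throughput counter: add each batch's record count,
# then derive records/sec from elapsed wall-clock time.
class ThroughputCounter:
    def __init__(self):
        self.total = 0

    def add(self, batch_count):
        self.total += batch_count

    def rate(self, elapsed_s):
        return self.total / elapsed_s

counter = ThroughputCounter()
for batch_count in [2_400, 2_500, 2_300]:  # hypothetical per-batch counts
    counter.add(batch_count)
print(counter.rate(3.0))  # 2400.0 records/sec over a hypothetical 3 seconds
```

In Spark the add() call would run inside foreachRDD with an accumulator, since updates must flow back to the driver.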

Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Colin Kincaid Williams
me extent. > > David Krieg | Enterprise Software Engineer > Early Warning > > -Original Message- > From: Colin Kincaid Williams [mailto:disc...@uw.edu] > Sent: Monday, May 02, 2016 10:55 AM

Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Cody Koeninger
Have you tested for read throughput (without writing to hbase, just deserialize)? Are you limited to using spark 1.2, or is upgrading possible? The kafka direct stream is available starting with 1.3. If you're stuck on 1.2, I believe there have been some attempts to backport it, search the maili

Improving performance of a kafka spark streaming app

2016-05-02 Thread Colin Kincaid Williams
I've written an application to get content from a kafka topic with 1.7 billion entries, get the protobuf serialized entries, and insert into hbase. Currently the environment that I'm running in is Spark 1.2. With 8 executors and 2 cores, and 2 jobs, I'm only getting between 0-2500 writes / second
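At the rates described, draining the topic would take on the order of a week, which frames why throughput is the central question of the thread. A back-of-the-envelope check, assuming the upper 2,500 writes/second figure:

```python
# How long it takes to work through 1.7 billion entries at a given write rate.
def drain_days(total_records, writes_per_second):
    return total_records / writes_per_second / 86_400

print(round(drain_days(1_700_000_000, 2_500), 1))  # ~7.9 days at 2,500 writes/s
```

The later HBase benchmark (30-40K writes/second) suggests the same backlog could in principle drain in well under a day.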