print() isn't really the best way to benchmark things, since it calls take(10) under the covers, but 380 records / second for a single receiver doesn't sound right in any case.
Am I understanding correctly that you're trying to process a large number of already-existing kafka messages, not keep up with an incoming stream? Can you give any details (e.g. hardware, number of topicpartitions, etc)?

Really though, I'd try to start with spark 1.6 and direct streams, or even just kafkacat, as a baseline.

On Mon, May 2, 2016 at 7:01 PM, Colin Kincaid Williams <disc...@uw.edu> wrote:
> Hello again. I searched for "backport kafka" in the list archives but
> couldn't find anything but a post from Spark 0.7.2. I was going to
> use accumulators to make a counter, but then saw on the Streaming tab
> the Receiver Statistics. Then I removed all other "functionality"
> except:
>
>
> // createStream(JavaStreamingContext jssc, Class<K> keyTypeClass,
> //     Class<V> valueTypeClass, Class<U> keyDecoderClass,
> //     Class<T> valueDecoderClass, java.util.Map<String,String> kafkaParams,
> //     java.util.Map<String,Integer> topics, StorageLevel storageLevel)
> JavaPairReceiverInputDStream<byte[], byte[]> dstream = KafkaUtils
>     .createStream(jssc, byte[].class, byte[].class,
>         kafka.serializer.DefaultDecoder.class,
>         kafka.serializer.DefaultDecoder.class, kafkaParamsMap, topicMap,
>         StorageLevel.MEMORY_AND_DISK_SER());
>
> dstream.print();
>
> Then in the Receiver Stats for the single receiver, I'm seeing around
> 380 records / second. Then to get anywhere near my 10% mentioned
> above, I'd need to run around 21 receivers, assuming 380 records /
> second, just using the print output. This seems awfully high to me,
> considering that I wrote 80000+ records a second to Kafka from a
> mapreduce job, and that my bottleneck was likely HBase. Again using
> the 380 estimate, I would need 200+ receivers to reach a similar
> amount of reads.
>
> Even given the issues with the 1.2 receivers, is this the expected way
> to use the Kafka streaming API, or am I doing something terribly
> wrong?
>
> My application looks like
> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>
> On Mon, May 2, 2016 at 6:09 PM, Cody Koeninger <c...@koeninger.org> wrote:
>> Have you tested for read throughput (without writing to hbase, just
>> deserialize)?
>>
>> Are you limited to using spark 1.2, or is upgrading possible? The
>> kafka direct stream is available starting with 1.3. If you're stuck
>> on 1.2, I believe there have been some attempts to backport it, search
>> the mailing list archives.
>>
>> On Mon, May 2, 2016 at 12:54 PM, Colin Kincaid Williams <disc...@uw.edu> wrote:
>>> I've written an application to get content from a kafka topic with 1.7
>>> billion entries, get the protobuf serialized entries, and insert into
>>> hbase. Currently the environment that I'm running in is Spark 1.2.
>>>
>>> With 8 executors and 2 cores, and 2 jobs, I'm only getting between
>>> 0-2500 writes / second. This will take much too long to consume the
>>> entries.
>>>
>>> I currently believe that the spark kafka receiver is the bottleneck.
>>> I've tried both 1.2 receivers, with the WAL and without, and didn't
>>> notice any large performance difference. I've tried many different
>>> spark configuration options, but can't seem to get better performance.
>>>
>>> I saw 80000 requests / second inserting these records into kafka using
>>> yarn / hbase / protobuf / kafka in a bulk fashion.
>>>
>>> While hbase inserts might not deliver the same throughput, I'd like to
>>> at least get 10%.
>>>
>>> My application looks like
>>> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>>>
>>> This is my first spark application. I'd appreciate any assistance.
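
For reference, the receiver and backlog arithmetic quoted in this thread can be checked with a few lines of Java. This is only a rough sketch: the class and method names are made up for illustration, the rates (380 records / second per receiver, 80000 records / second written to kafka, 2500 writes / second, 1.7 billion entries) are the figures reported above, and it assumes throughput scales linearly with the number of receivers, which may well not hold in practice.

```java
// Hypothetical helper, not part of Spark or Kafka: back-of-envelope
// estimates from the rates reported in the thread.
class ThroughputEstimate {

    // Receivers needed to reach a target rate, rounded up.
    // Assumes each receiver sustains the same rate (linear scaling).
    static long receiversNeeded(long targetPerSec, long perReceiverPerSec) {
        return (targetPerSec + perReceiverPerSec - 1) / perReceiverPerSec;
    }

    // Days needed to consume a backlog at a steady rate.
    static double daysToDrain(long backlog, long recordsPerSec) {
        return backlog / (double) recordsPerSec / 86_400.0;
    }

    public static void main(String[] args) {
        long perReceiver = 380;            // records / second seen in Receiver Stats
        long kafkaWriteRate = 80_000;      // records / second written by the mapreduce job
        long target = kafkaWriteRate / 10; // the 10% goal
        long backlog = 1_700_000_000L;     // entries in the topic

        System.out.println("receivers for 10% goal: "
                + receiversNeeded(target, perReceiver));
        System.out.println("receivers to match write rate: "
                + receiversNeeded(kafkaWriteRate, perReceiver));
        System.out.printf("days to drain at 2500/s: %.1f%n",
                daysToDrain(backlog, 2_500));
        System.out.printf("days to drain at 10%% goal: %.1f%n",
                daysToDrain(backlog, target));
    }
}
```

Rounding up gives slightly more than the "around 21" and "200+" receivers mentioned above, and at the best observed rate of 2500 writes / second the 1.7 billion entries would take roughly a week to consume.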