I've written an application that consumes from a Kafka topic with 1.7 billion entries, deserializes the protobuf-encoded records, and inserts them into HBase. The environment I'm running in is Spark 1.2.
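The ingestion side is receiver-based, roughly like the sketch below. The topic name, ZooKeeper address, and receiver count here are placeholders, not my actual values:

  import kafka.serializer.DefaultDecoder
  import org.apache.spark.SparkConf
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  val conf = new SparkConf().setAppName("KafkaToHBase")
  val ssc = new StreamingContext(conf, Seconds(5))

  val kafkaParams = Map(
    "zookeeper.connect" -> "zk1:2181",   // placeholder
    "group.id" -> "hbase-loader")        // placeholder

  // Each receiver pins one core on one executor, so several
  // receivers are unioned to parallelize ingestion.
  val numReceivers = 4
  val streams = (1 to numReceivers).map { _ =>
    KafkaUtils.createStream[Array[Byte], Array[Byte],
        DefaultDecoder, DefaultDecoder](
      ssc, kafkaParams, Map("myTopic" -> 1),
      StorageLevel.MEMORY_AND_DISK_SER)
  }
  val messages = ssc.union(streams).map(_._2)  // raw protobuf bytes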
With 8 executors, 2 cores each, and 2 jobs, I'm only getting between 0 and 2,500 writes/second. At that rate it will take far too long to consume the topic. I currently believe the Spark Kafka receiver is the bottleneck. I've tried both of the 1.2 receivers, with the WAL and without, and didn't notice any large performance difference. I've also tried many different Spark configuration options, but can't seem to get better performance.

For comparison, I saw 80,000 requests/second when inserting these records into Kafka in a bulk fashion, on the same YARN / HBase / protobuf / Kafka stack. While HBase inserts might not deliver the same throughput, I'd like to get at least 10% of that.

My application looks like https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877 (the write path is roughly the shape sketched below). This is my first Spark application, so I'd appreciate any assistance.
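For anyone who doesn't want to open the gist, the HBase side is roughly this shape. The table name, column family, protobuf class (MyRecord), and its getId field are placeholders, not my real names:

  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.client.{HTable, Put}
  import org.apache.hadoop.hbase.util.Bytes

  messages.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // One table handle per partition; buffer puts client-side
      // rather than doing a round trip per record.
      val table = new HTable(HBaseConfiguration.create(), "myTable")
      table.setAutoFlush(false)
      table.setWriteBufferSize(8 * 1024 * 1024)
      records.foreach { bytes =>
        val msg = MyRecord.parseFrom(bytes)  // hypothetical protobuf class
        val put = new Put(Bytes.toBytes(msg.getId))  // hypothetical key field
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("payload"), bytes)
        table.put(put)
      }
      table.close()  // flushes the write buffer
    }
  }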