I've written an application that consumes a Kafka topic containing 1.7
billion entries, deserializes the protobuf-encoded records, and inserts
them into HBase. The environment I'm running in is Spark 1.2.
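
For context, the job is shaped roughly like this (a minimal sketch; the
topic name, ZooKeeper quorum, group id, and the Entry protobuf class are
stand-ins for what's in the gist linked below):

    import kafka.serializer.{DefaultDecoder, StringDecoder}
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("kafka-to-hbase")
    val ssc = new StreamingContext(conf, Seconds(5)) // batch interval illustrative

    val kafkaParams = Map(
      "zookeeper.connect" -> "zk1:2181",   // illustrative
      "group.id"          -> "hbase-loader")

    // Receiver-based stream (the Spark 1.2 API); values arrive as the
    // raw protobuf bytes.
    val raw = KafkaUtils.createStream[String, Array[Byte], StringDecoder, DefaultDecoder](
      ssc, kafkaParams, Map("entries" -> 1), StorageLevel.MEMORY_AND_DISK_SER)

    // Entry stands in for the generated protobuf class.
    val parsed = raw.map { case (_, bytes) => Entry.parseFrom(bytes) }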

With 8 executors of 2 cores each, and 2 concurrent jobs, I'm only
getting between 0 and 2,500 writes per second. At that rate it will
take far too long to consume the topic.
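
For reference, I'm submitting roughly like this (memory figure and jar
name are illustrative):

    spark-submit --master yarn-cluster \
      --num-executors 8 \
      --executor-cores 2 \
      --executor-memory 4g \
      kafka-to-hbase.jar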

I currently believe the Spark Kafka receiver is the bottleneck. I've
tried both of the 1.2 receivers, with the write-ahead log (WAL) and
without, and didn't notice any significant performance difference. I've
also tried many different Spark configuration options, but can't seem
to get better throughput.
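
This is how I've been toggling the WAL (config key as I understand it
from the 1.2 docs; checkpoint path is illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("kafka-to-hbase")
      // when enabled, 1.2 uses the reliable Kafka receiver and journals
      // received blocks; when disabled, it uses the original receiver
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("hdfs:///checkpoints/kafka-to-hbase") // required for the WAL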

I previously saw 80,000 requests per second when bulk-inserting these
records into Kafka, on the same YARN / HBase / protobuf / Kafka stack.

While HBase inserts might not deliver the same throughput, I'd like to
reach at least 10% of that rate.
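
The HBase writes themselves are batched per partition, roughly like
this (table name, column family/qualifier, and the getId accessor are
stand-ins; parsed is the deserialized stream from the first sketch):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{HTable, Put}
    import org.apache.hadoop.hbase.util.Bytes
    import scala.collection.JavaConverters._

    parsed.foreachRDD { rdd =>
      rdd.foreachPartition { entries =>
        // one table handle per partition, with client-side buffering on,
        // so each batch costs one round trip rather than one per record
        val table = new HTable(HBaseConfiguration.create(), "entries")
        table.setAutoFlush(false)
        val puts = entries.map { e =>
          val put = new Put(Bytes.toBytes(e.getId)) // row key accessor is a stand-in
          put.add(Bytes.toBytes("d"), Bytes.toBytes("v"), e.toByteArray)
          put
        }.toList.asJava
        table.put(puts)
        table.close() // flushes the write buffer
      }
    }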

My application looks like this:
https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877

This is my first Spark application. I'd appreciate any assistance.
