+1
We're on CDH, and it will probably be a while before they support Kafka
0.10. At the same time, we don't use their Spark and we're looking forward
to upgrading to 2.0.x and using structured streaming.
I was just going to write our own Kafka Source implementation that uses
the existing KafkaRDD.
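For what it's worth, a rough outline of what that could look like against the
Spark 2.0.x internal Source trait is below. This is only a sketch under my own
assumptions: the class name, schema, and offset bookkeeping are placeholders,
and the batch read goes through the 0.8-compatible KafkaUtils.createRDD.

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.{Offset, Source}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}
import kafka.serializer.StringDecoder

// Rough outline only. Source is an internal, evolving API in 2.0.x; the class
// name, schema, and offset handling here are illustrative, not a final design.
class LegacyKafkaSource(
    sqlContext: SQLContext,
    kafkaParams: Map[String, String],
    topic: String) extends Source {

  override def schema: StructType =
    StructType(Seq(StructField("key", StringType), StructField("value", StringType)))

  // Would ask the brokers for the latest offset of each partition (stubbed here).
  override def getOffset: Option[Offset] = ???

  override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
    // Translate the start/end offsets into OffsetRanges and read them with the
    // existing 0.8-compatible KafkaRDD via KafkaUtils.createRDD.
    val ranges: Array[OffsetRange] = ??? // derived from `start` and `end`
    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sqlContext.sparkContext, kafkaParams, ranges)
    import sqlContext.implicits._
    rdd.toDF("key", "value")
  }

  override def stop(): Unit = ()
}

You'd also need a StreamSourceProvider so it can be wired up through
spark.readStream.format(...), and since the Source API is internal it may need
touch-ups on every Spark upgrade.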
Take a look at how the messages are actually distributed across the
partitions. If the message keys have low cardinality, you might get poor
distribution (e.g. all the messages actually land in only two of the five
partitions, which would explain what you see in Spark).
If you take a look at the Kafka data directly, you can confirm how the
messages are spread across the partitions.
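One quick way to check this from the Spark side: with the direct Kafka stream,
each RDD partition corresponds to one Kafka topic-partition, so per-partition
record counts expose the skew directly. A small sketch (the helper name is mine):

import org.apache.spark.rdd.RDD

// Count records per partition; with the direct Kafka stream each RDD
// partition maps 1:1 to a Kafka topic-partition, so this shows key skew.
def partitionCounts[T](rdd: RDD[T]): Array[(Int, Long)] =
  rdd.mapPartitionsWithIndex((i, it) => Iterator((i, it.size.toLong))).collect()

// e.g. inside foreachRDD:
//   partitionCounts(rdd).foreach { case (p, n) => println(s"partition $p: $n records") }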
I've found that a similar issue happens when there is a memory leak in the Spark
application (or, in my case, in one of the libraries the application uses).
Gradually, unclaimed objects make their way into the old or permanent
generation, shrinking the available heap. Eventually it causes "GC overhead
limit exceeded" errors.
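If you want to confirm that's what's happening, one option is to enable GC
logging and heap dumps on the executors and watch old gen grow across batches.
A minimal sketch; the exact flag set is just what I'd start with:

import org.apache.spark.SparkConf

// Illustrative only: GC logging plus a heap dump on OOM for the executors,
// so you can see old gen usage climbing over time.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps " +
    "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp")

Running jmap -histo against an executor PID between batches also shows which
classes are accumulating.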