Question regarding cached partitions

2017-10-30 Thread Alex Sulimanov
Hi, I started a Spark Streaming job with 96 executors that reads from 96 Kafka partitions and applies mapWithState to the incoming DStream. Why would it cache only 77 partitions? Do I have to allocate more memory? Currently each executor gets 10 GB, and it is not clear why it can't cache all 96.
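For reference, a minimal sketch of the kind of job described above: a direct Kafka stream (spark-streaming-kafka-0-10) whose records are keyed and run through mapWithState. The topic name, group id, checkpoint path, and the running-count state function are illustrative placeholders, not the poster's actual code. One tentative note on the caching question: the state RDDs produced by mapWithState are persisted in memory only, so partitions that do not fit in executor storage memory are simply not cached (they get recomputed instead); the Storage tab in the Spark UI shows which partitions actually stayed cached.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object MapWithStateSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("mapWithState-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/checkpoint")            // mapWithState requires a checkpoint dir

    // Placeholder Kafka settings -- adjust to the real cluster.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "example-group",
      "auto.offset.reset"  -> "latest")

    // With the direct stream, each of the 96 Kafka partitions becomes one RDD partition.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // Illustrative state function: running count per key.
    def update(key: String, value: Option[String], state: State[Long]): (String, Long) = {
      val count = state.getOption.getOrElse(0L) + 1
      state.update(count)
      (key, count)
    }

    val stateful = stream
      .map(r => (r.key, r.value))
      .mapWithState(StateSpec.function(update _).numPartitions(96))

    stateful.print()
    ssc.start()
    ssc.awaitTermination()
  }
}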

Distinct for Avro Key/Value PairRDD

2017-03-09 Thread Alex Sulimanov
Good day, everyone! Have you tried to de-duplicate records based on Avro-generated classes? These classes extend SpecificRecord, which provides equals and hashCode implementations, yet when I try to use .distinct on my PairRDD (both key and value are Avro classes), it eliminates records which are…
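Not the original poster's code, but one workaround sketch for this kind of problem: instead of relying on the generated classes' equals/hashCode inside distinct(), deduplicate on the records' canonical Avro binary encoding. The SpecificRecord key/value classes are assumed to be the usual Avro-generated types, and the pairs must still be serializable under the job's configured serializer (e.g. Kryo) for the shuffle.

import java.io.ByteArrayOutputStream
import scala.reflect.ClassTag
import org.apache.avro.io.EncoderFactory
import org.apache.avro.specific.{SpecificDatumWriter, SpecificRecordBase}
import org.apache.spark.rdd.RDD

object AvroDistinctSketch {

  // Serialize one SpecificRecord to its Avro binary form.
  private def toBytes[T <: SpecificRecordBase](record: T): Array[Byte] = {
    val out     = new ByteArrayOutputStream()
    val writer  = new SpecificDatumWriter[T](record.getSchema)
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    writer.write(record, encoder)
    encoder.flush()
    out.toByteArray
  }

  // Deduplicate (key, value) pairs by their byte-level encoding instead of the
  // generated equals/hashCode. Array[Byte] compares by reference, so the bytes
  // are wrapped in a Seq, which has value-based equality and a stable hashCode.
  def distinctAvroPairs[K <: SpecificRecordBase : ClassTag,
                        V <: SpecificRecordBase : ClassTag](rdd: RDD[(K, V)]): RDD[(K, V)] = {
    rdd
      .map { case (k, v) => ((toBytes(k).toSeq, toBytes(v).toSeq), (k, v)) }
      .reduceByKey((first, _) => first)   // keep one representative per distinct encoding
      .values
  }
}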