Hi,
I would just do a repartition on the initial direct DStream, since
otherwise each RDD in the stream has exactly as many partitions as the
Kafka topic has (in your case, 1). That way receiving is still done in
only one thread, but at least the processing further down is done in
parallel.
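A minimal sketch of what I mean, assuming the Spark 1.6 direct Kafka API
(the broker address, topic name, and target parallelism of 8 are
placeholders, and ssc is your StreamingContext):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Direct stream: each batch RDD has exactly one Spark partition per
    // Kafka partition, so a 1-partition topic yields 1-partition RDDs.
    val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder](
      ssc, Map("metadata.broker.list" -> "broker:9092"), Set("my-topic"))

    // Shuffle once so everything downstream runs with more tasks; the
    // Kafka read itself is still a single task per batch.
    val repartitioned = stream.repartition(8)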
Hi,
I have a Spark 1.6.2 streaming job with multiple output operations (jobs)
making idempotent changes to different repositories.
The problem is that I want to somehow pass errors from one output
operation to another, so that in the current output operation I only
update previously successful messages.
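For context, a minimal sketch of that kind of setup (stream,
writeToRepositoryA, and writeToRepositoryB are hypothetical stand-ins).
Each foreachRDD registers its own output operation, and Spark schedules
one job per output operation per batch, so by default a failure in the
first job is not visible to the second:

    // Two independent output operations on the same DStream; Spark runs
    // each one as a separate job for every batch.
    stream.foreachRDD { rdd =>
      rdd.foreachPartition(records => writeToRepositoryA(records)) // idempotent write #1
    }
    stream.foreachRDD { rdd =>
      rdd.foreachPartition(records => writeToRepositoryB(records)) // idempotent write #2
    }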
Hi all,
It is currently difficult to understand, from the Spark docs or the
materials I came across online, how the updateStateByKey and mapWithState
operators in Spark Streaming scale with the size of the state, and how to
reason about sizing the cluster appropriately.
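For what it's worth, the one scaling difference that is documented:
updateStateByKey runs the update function over every key in the state on
every batch, whereas mapWithState only touches keys that appear in the
current batch (plus keys whose timeout expires). A minimal sketch of the
1.6 mapWithState API, where keyedDStream is a placeholder DStream of
(String, Int) pairs:

    import org.apache.spark.streaming.{State, StateSpec}

    // Running count per key: the function runs only for keys seen in the
    // current batch, so per-batch cost tracks batch size, not state size.
    val mappingFunc = (key: String, value: Option[Int], state: State[Int]) => {
      val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)
      (key, sum)
    }

    val counts = keyedDStream.mapWithState(StateSpec.function(mappingFunc))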
According to this article
Hi everyone,
Running on EMR 4.3 with Spark 1.6.0 and the provided S3N native driver, I
manage to process approx. 1 TB of strings inside gzipped Parquet in about
50 minutes on a 20-node cluster (8 cores, 60 GB RAM). That's about
17 MB/sec per node.
This seems suboptimal.
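For reference, the arithmetic behind that figure: 1 TB is roughly
1,000,000 MB read in 50 min = 3,000 s, i.e. about 333 MB/s aggregate,
which spread over 20 nodes is about 17 MB/s per node.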
The processing is very basic