Re: Spark streaming takes longer time to read json into dataframes

2016-07-16 Thread Martin Eden
Hi, I would just do a repartition on the initial direct DStream, since otherwise each RDD in the stream has exactly as many partitions as you have partitions in the Kafka topic (in your case 1). That way receiving is still done in only 1 thread, but at least the processing further down is done in p…
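
A minimal sketch of that suggestion, using the Spark 1.6-era direct Kafka API. The broker address, topic name, batch interval and the choice of 8 partitions are placeholders, not from the original post:

// A direct Kafka stream starts with one RDD partition per Kafka partition,
// so with a 1-partition topic all downstream work runs in a single task
// unless we repartition each micro-batch.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import kafka.serializer.StringDecoder

object RepartitionedKafkaStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("json-stream")
    val ssc  = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker:9092") // placeholder broker
    val topics      = Set("events")                                // placeholder topic

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Spread each batch across 8 partitions so parsing and further processing
    // can use more than one core, even though the topic has a single partition.
    val repartitioned = stream.map(_._2).repartition(8)

    repartitioned.foreachRDD { rdd =>
      // downstream processing (e.g. parsing the JSON into DataFrames) goes here
      rdd.foreach(_ => ())
    }

    ssc.start()
    ssc.awaitTermination()
  }
}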

Spark Streaming multiple output operations failure semantics / error propagation

2016-07-14 Thread Martin Eden
Hi, I have a Spark 1.6.2 streaming job with multiple output operations (jobs) doing idempotent changes in different repositories. The problem is that I want to somehow pass errors from one output operation to another, such that in the current output operation I only update previously successful m…
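
A rough illustration of the setup being described, not of the poster's actual code or a solution; the two repositories are hypothetical:

import org.apache.spark.streaming.dstream.DStream

def process(events: DStream[String]): Unit = {
  // Output operation 1: idempotent writes to the first repository.
  events.foreachRDD { rdd =>
    rdd.foreachPartition { it =>
      it.foreach { e => /* write e to repository A */ }
    }
  }

  // Output operation 2: idempotent writes to the second repository.
  // Each output operation is scheduled as its own Spark job per batch,
  // so a failure in the first one is not visible here by default --
  // which is the crux of the question about propagating errors.
  events.foreachRDD { rdd =>
    rdd.foreachPartition { it =>
      it.foreach { e => /* write e to repository B */ }
    }
  }
}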

How does Spark Streaming updateStateByKey or mapWithState scale with state size?

2016-06-23 Thread Martin Eden
Hi all, From the Spark docs and the materials I have come across online, it is currently difficult to understand how the updateStateByKey and mapWithState operators in Spark Streaming scale with the size of the state, and how to reason about sizing the cluster appropriately. According to this artic…
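
For context on the operator being asked about, a minimal mapWithState example against the Spark 1.6 API; the running-count logic is generic and not taken from the post:

import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

def countsByKey(pairs: DStream[(String, Int)]): DStream[(String, Long)] = {
  val spec = StateSpec.function(
    (key: String, value: Option[Int], state: State[Long]) => {
      val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0)
      state.update(newCount) // the state for every tracked key is held on the executors
      (key, newCount)        // mapped value emitted downstream each batch
    })
  // The scaling question is about how this per-key state behaves as it grows.
  pairs.mapWithState(spec)
}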

S3n performance (@AaronDavidson)

2016-04-12 Thread Martin Eden
Hi everyone, Running on EMR 4.3 with Spark 1.6.0 and the provided S3N native driver, I manage to process approx 1 TB of strings inside gzipped Parquet in about 50 mins on a 20-node cluster (8 cores, 60 GB RAM each). That's about 17 MB/s per node, which seems suboptimal. The processing is very basic…
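
A sketch of the kind of read being described, against the Spark 1.6 DataFrame API; the bucket path and the count are placeholders, and the actual processing is not shown in the post:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object S3nParquetRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("s3n-parquet-read"))
    val sqlContext = new SQLContext(sc)

    // Path uses the S3N scheme, as in the post; the bucket/prefix is a placeholder.
    val df = sqlContext.read.parquet("s3n://my-bucket/path/to/parquet/")

    // Stand-in for the "very basic" processing mentioned.
    println(df.count())
  }
}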