I use Spark Streaming to read messages from Kafka; the producer creates
about 1500 messages per second:
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils
import scala.util.hashing.MurmurHash3

def hash(x: String): Int = {
  MurmurHash3.stringHash(x)
}

val stream = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap,
  StorageLevel.MEMORY_ONLY_SER).map(_._2) // keep only the message payload
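Presumably the hash is then applied to each message; a minimal sketch of that step (an assumption on my part, not shown in the original mail):

val hashed = stream.map(hash) // DStream[Int] of message hashes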
> The sliding window is just 3 seconds, so you will process 60 seconds of
> data every 3 seconds. If the processing latency is larger than the
> sliding window, then maybe your computation power cannot reach the QPS
> you want.
>
> I think you need to identify the bottleneck.
Thanks, Shao
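For reference, a 60-second window sliding every 3 seconds would be declared as below; a minimal sketch reusing the stream from the first mail (the count is only illustrative):

import org.apache.spark.streaming.Seconds

// Window length 60s, slide interval 3s: a new job covering the last
// 60 seconds of data is launched every 3 seconds, so each job must
// finish in under 3 seconds or the following jobs queue up behind it.
val windowed = stream.window(Seconds(60), Seconds(3))
windowed.count().print()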
On Wed, Mar 18, 2015 at 3:34 PM, Shao, Saisai wrote:
> Yeah, as I said, your job's processing time is much larger than the sliding
> window, and streaming jobs are executed one by one in sequence, so the next
> job will wait until the first job is finished, and the total latency will
> keep accumulating.
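One way to confirm this queuing is to attach a StreamingListener and compare each batch's scheduling delay with its processing time; a minimal sketch (the DelayLogger class name is mine):

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Hypothetical helper: prints how long each batch waited in the queue
// (scheduling delay) and how long it actually ran (processing time).
class DelayLogger extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    println(s"waited ${info.schedulingDelay.getOrElse(-1L)} ms, " +
      s"ran ${info.processingDelay.getOrElse(-1L)} ms")
  }
}

ssc.addStreamingListener(new DelayLogger)

If the scheduling delay keeps growing while the processing time stays flat, the jobs are simply queuing behind one another.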
I've already done that:
From the Spark UI, the Environment tab's Spark properties show:

spark.shuffle.spill    false
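For reference, that property would be set on the SparkConf before the StreamingContext is created; a minimal sketch (the app name and 3-second batch interval are only illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("kafka-stream")          // illustrative name
  .set("spark.shuffle.spill", "false") // stop spilling shuffle data to disk
val ssc = new StreamingContext(conf, Seconds(3))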
On Wed, Mar 18, 2015 at 6:34 PM, Akhil Das wrote:
> I think you can disable it with spark.shuffle.spill=false
>
> Thanks
> Best Regards
On Wed, Mar 18, 2015 at 8:31 PM, Shao, Saisai wrote:
> From the log you pasted, I think this (-rw-r--r-- 1 root root 80K Mar
> 18 16:54 shuffle_47_519_0.data) is not spilled shuffle data, but the
> final shuffle result.
>
Why is the shuffle result written to disk?
val aDstream = ...
val distinctStream = aDstream.transform(_.distinct())
but the elements in distinctStream are not distinct.
Did I use it wrong?
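For context, transform applies its function to the RDD of each batch interval separately, so the distinct() above deduplicates within one batch only; a minimal sketch of what the call expands to (assuming a string stream):

// transform runs once per batch: rdd is that single batch's RDD,
// so distinct() removes duplicates within that batch, not across batches.
val distinctPerBatch = aDstream.transform(rdd => rdd.distinct())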