Re: Optimize the first map reduce of DStream

2015-03-24 Thread Zoltán Zvara
AFAIK Spark Streaming cannot work in a way like this. Transformations are defined on DStreams, and DStreams basically hold (time, allocatedBlocksForBatch) pairs. Blocks are allocated to a batch by the JobGenerator; unallocated block infos are collected by the ReceivedBlockTracker. In Spark Stre…
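A rough mental model of the bookkeeping Zoltán describes, as a self-contained Python sketch (the class and method names mirror Spark's but this is an illustration, not Spark's actual internals):

```python
# Toy model: a tracker collects unallocated block infos from receivers;
# at each batch boundary the job generator allocates every pending block
# to that batch, so the stream effectively holds
# (batchTime -> allocated blocks) pairs. Names are illustrative.

class ReceivedBlockTracker:
    def __init__(self):
        self.unallocated = []   # block infos not yet tied to a batch
        self.allocated = {}     # batchTime -> list of block infos

    def add_block(self, block_info):
        self.unallocated.append(block_info)

    def allocate_blocks_to_batch(self, batch_time):
        # Called by the "JobGenerator" at a batch boundary: all blocks
        # received since the last boundary belong to this batch.
        self.allocated[batch_time] = self.unallocated
        self.unallocated = []
        return self.allocated[batch_time]

tracker = ReceivedBlockTracker()
tracker.add_block("block-0")
tracker.add_block("block-1")
batch = tracker.allocate_blocks_to_batch(1000)  # batch boundary at t=1000
tracker.add_block("block-2")                    # goes to the next batch
```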

Re: Optimize the first map reduce of DStream

2015-03-24 Thread Bin Wang
I'm not looking to limit the block size. Here is another example. Say we want to count the lines from the stream in one hour. In a normal program, we might write it like this:

    int sum = 0
    while (line = getFromStream()) {
      store(line)  // store the line into storage instead of memory.
      sum++
    }
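Bin's pseudocode, as a runnable Python sketch of the incremental pattern he has in mind: persist each line to external storage as it arrives and keep only the running count in memory (`store` is a stand-in for a write to storage):

```python
def count_incrementally(stream, store):
    # Consume the stream one line at a time: write each line out via
    # `store` and keep only the running count in memory, instead of
    # buffering the whole batch before counting.
    total = 0
    for line in stream:
        store(line)   # write to storage instead of holding in RAM
        total += 1
    return total

stored = []
n = count_incrementally(iter(["a", "b", "c"]), stored.append)
```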

Re: Optimize the first map reduce of DStream

2015-03-24 Thread Zoltán Zvara
There is a BlockGenerator on each worker node, next to the ReceiverSupervisorImpl, which generates Blocks out of an ArrayBuffer at each block interval (block_interval). These Blocks are passed to the ReceiverSupervisorImpl, which puts them into the BlockManager for storage. BlockInfos are passed…
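A simplified, self-contained sketch of that mechanism: records accumulate in a buffer, and at every block interval the buffer is sealed into a new block and handed off (timer, storage, and block-info reporting omitted; names are illustrative):

```python
class BlockGenerator:
    # Simplified model: records accumulate in a buffer; every block
    # interval the buffer is cut into a block and handed off for
    # storage (the hand-off is modeled as appending to `blocks`).
    def __init__(self):
        self.buffer = []
        self.blocks = []   # stands in for passing blocks to the supervisor

    def add(self, record):
        self.buffer.append(record)

    def on_block_interval(self):
        # Fired by a timer every block interval in the real system.
        if self.buffer:
            self.blocks.append(list(self.buffer))
            self.buffer.clear()

gen = BlockGenerator()
gen.add("r1")
gen.add("r2")
gen.on_block_interval()   # first block interval fires
gen.add("r3")
gen.on_block_interval()   # second interval fires
```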

Re: Optimize the first map reduce of DStream

2015-03-24 Thread Arush Kharbanda
The block interval is configurable, so I think you can reduce it to keep each block in memory only for a limited interval? Is that what you are looking for? On Tue, Mar 24, 2015 at 1:38 PM, Bin Wang wrote: > Hi, > > I'm learning Spark and I find there could be some optimiz…
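For reference, the knob Arush is pointing at is Spark Streaming's block interval property (`spark.streaming.blockInterval`, whose documented default is 200ms):

```
# spark-defaults.conf fragment: emit a new block every 200ms (the default).
# Smaller values mean more, smaller blocks per batch interval, and hence
# more tasks per batch.
spark.streaming.blockInterval  200ms
```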

Optimize the first map reduce of DStream

2015-03-24 Thread Bin Wang
Hi, I'm learning Spark and I find there could be some room for optimization in the current streaming implementation. Correct me if I'm wrong. The current streaming implementation puts the data of one batch into memory (as an RDD). But that seems unnecessary. For example, if I want to count the lines which conta…
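Bin's point, sketched in self-contained Python: the behavior he describes materializes the whole batch in memory before counting, while his proposal counts matching lines as they arrive (plain Python stand-ins for illustration, not Spark API):

```python
def count_matches_buffered(stream, word):
    # Behavior Bin describes (simplified): hold the entire batch in
    # memory first, then run the count over it.
    batch = list(stream)   # whole batch materialized in memory
    return sum(1 for line in batch if word in line)

def count_matches_incremental(stream, word):
    # Proposed alternative: keep only the running count in memory.
    total = 0
    for line in stream:
        if word in line:
            total += 1
    return total

lines = ["spark streaming", "hello", "spark core"]
a = count_matches_buffered(iter(lines), "spark")
b = count_matches_incremental(iter(lines), "spark")
```

Both return the same count; the difference is peak memory, which is the optimization being proposed.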