Re: Optimize the first map reduce of DStream

2015-03-24 Thread Zoltán Zvara
AFAIK Spark Streaming cannot work in a way like this. Transformations are made on DStreams, and DStreams basically hold (time, allocatedBlocksForBatch) pairs. Blocks are allocated to batches by the JobGenerator; unallocated block infos are collected by the ReceivedBlockTracker. In Spark Streaming…
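
For illustration only, a minimal sketch of the bookkeeping described above. SimpleBlockTracker, BlockInfo, and the method names here are invented for this sketch; they are not Spark's actual classes:

    // Hypothetical sketch: each batch time maps to the blocks allocated to it.
    import scala.collection.mutable

    case class BlockInfo(blockId: String, numRecords: Long)

    class SimpleBlockTracker {
      // Block infos reported by receivers but not yet assigned to a batch
      private val unallocated = mutable.Queue[BlockInfo]()
      // Batch time -> blocks allocated to that batch
      private val allocated = mutable.Map[Long, Seq[BlockInfo]]()

      def addBlock(info: BlockInfo): Unit = unallocated.enqueue(info)

      // Called once per batch interval (in Spark, the JobGenerator drives this)
      def allocateBlocksToBatch(batchTime: Long): Unit = {
        val blocks = unallocated.toList
        unallocated.clear()
        allocated(batchTime) = blocks
      }

      def blocksFor(batchTime: Long): Seq[BlockInfo] =
        allocated.getOrElse(batchTime, Seq.empty)
    }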

Re: Optimize the first map reduce of DStream

2015-03-24 Thread Bin Wang
I'm not looking to limit the block size. Here is another example: say we want to count the lines from the stream over one hour. In a normal program, we might write it like this:

    int sum = 0;
    while ((line = getFromStream()) != null) {
        store(line); // store the line into storage instead of memory
        sum++;
    }
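
For comparison, a hedged sketch of how such a running count is typically expressed in Spark Streaming itself, using updateStateByKey. The master setting, socket source, port, and checkpoint path are placeholder assumptions:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object RunningLineCount {
      def main(args: Array[String]): Unit = {
        // local[2] is an assumption for running this sketch locally
        val conf = new SparkConf().setAppName("RunningLineCount").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint("/tmp/checkpoint") // required by updateStateByKey

        val lines = ssc.socketTextStream("localhost", 9999)
        // Key every line with a single key so we keep one global counter
        val counts = lines.map(_ => ("lines", 1L))
          .updateStateByKey[Long]((newVals: Seq[Long], state: Option[Long]) =>
            Some(state.getOrElse(0L) + newVals.sum))

        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }

With this approach only the running counter is kept as state across batches; the lines themselves do not have to stay in memory, which is the concern raised above.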

Re: Optimize the first map reduce of DStream

2015-03-24 Thread Zoltán Zvara
There is a BlockGenerator on each worker node next to the ReceiverSupervisorImpl, which generates Blocks out of an ArrayBuffer on each interval (block_interval). These Blocks are passed to the ReceiverSupervisorImpl, which hands them to the BlockManager for storage. BlockInfos are passed…
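
A simplified, hypothetical sketch of that batching behavior (buffer records, cut a block every block interval, hand it off for storage). The names and structure are assumptions for illustration, not Spark's actual implementation:

    import java.util.concurrent.{Executors, TimeUnit}
    import scala.collection.mutable.ArrayBuffer

    // Hypothetical stand-in for Spark's BlockGenerator
    class SimpleBlockGenerator(blockIntervalMs: Long, storeBlock: Seq[String] => Unit) {
      private val buffer = ArrayBuffer[String]()
      private val timer = Executors.newSingleThreadScheduledExecutor()

      def addRecord(record: String): Unit = synchronized { buffer += record }

      def start(): Unit = timer.scheduleAtFixedRate(new Runnable {
        def run(): Unit = {
          // Snapshot and clear the buffer: this snapshot is one "block"
          val block = synchronized {
            val snapshot = buffer.toList
            buffer.clear()
            snapshot
          }
          if (block.nonEmpty) storeBlock(block) // e.g. push toward the BlockManager
        }
      }, blockIntervalMs, blockIntervalMs, TimeUnit.MILLISECONDS)

      def stop(): Unit = timer.shutdown()
    }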

Re: Optimize the first map reduce of DStream

2015-03-24 Thread Arush Kharbanda
The block interval is configurable, so I think you can reduce it to keep each block in memory only for a limited interval. Is that what you are looking for?

On Tue, Mar 24, 2015 at 1:38 PM, Bin Wang wrote:
> Hi,
>
> I'm learning Spark and I find there could be some optimiz…
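
The interval referred to above is controlled by the configuration key spark.streaming.blockInterval. A minimal sketch of setting it (the app name and value are just placeholders):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("BlockIntervalExample")
      .set("spark.streaming.blockInterval", "200ms") // 200ms is the default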