AFAIK Spark Streaming cannot work in a way like this. Transformations are
made on DStreams, where DStreams basically hold (time,
allocatedBlocksForBatch) pairs. Blocks are allocated to a batch by the
JobGenerator, while the infos of not-yet-allocated blocks are collected by
the ReceivedBlockTracker. In Spark Stre
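
As a rough mental model (a sketch only; BlockInfo and BatchBlocks below are
made-up stand-ins, not the real ReceivedBlockTracker types), the bookkeeping
keyed by batch time looks roughly like this:

import scala.collection.mutable

// Sketch only: simplified stand-ins for the internal bookkeeping that pairs
// a batch time with the blocks allocated to that batch.
case class BlockInfo(blockId: String, numRecords: Long)
case class BatchBlocks(batchTime: Long, blocks: Seq[BlockInfo])

// Blocks reported by receivers but not yet assigned to any batch.
val unallocatedBlocks = mutable.Queue[BlockInfo]()

// Batch time -> the blocks pinned to that batch.
val allocatedBlocksByTime = mutable.Map[Long, BatchBlocks]()

def allocateBlocksToBatch(batchTime: Long): BatchBlocks = {
  // When a batch is generated, everything received so far goes into it.
  val blocks = unallocatedBlocks.toList
  unallocatedBlocks.clear()
  val batch = BatchBlocks(batchTime, blocks)
  allocatedBlocksByTime(batchTime) = batch
  batch
}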
I'm not looking to limit the block size.
Here is another example. Say we want to count the lines coming from the
stream over one hour. In a normal program, we might write it like this:
int sum = 0;
String line;
while ((line = getFromStream()) != null) {
  store(line);  // store the line into external storage instead of keeping it in memory
  sum++;
}
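
For comparison, the closest Spark Streaming version of that count I can
think of is a windowed count like the sketch below (the socket source, the
one-hour window and the 10-second slide are just example values):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

// Sketch only: source and durations are arbitrary example choices.
val conf = new SparkConf().setAppName("line-count-sketch")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/tmp/line-count-checkpoint")  // window operations need checkpointing

val lines = ssc.socketTextStream("localhost", 9999)

// Count all lines seen in the last hour, updated every 10 seconds.
// The windowed data is kept around as RDDs, which is exactly the
// memory behavior being discussed here.
val hourlyCount = lines.countByWindow(Minutes(60), Seconds(10))
hourlyCount.print()

ssc.start()
ssc.awaitTermination()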
There is a BlockGenerator on each worker node next to the
ReceiverSupervisorImpl, which generates Blocks out of an ArrayBuffer in
each interval (block_interval). These Blocks are passed to the
ReceiverSupervisorImpl, which pushes them into the BlockManager
for storage. BlockInfos are passed
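
In simplified form (a sketch only, not the real BlockGenerator code; the
class and method names below are made up), the idea is:

import scala.collection.mutable.ArrayBuffer

// Sketch of the BlockGenerator idea: records are buffered, and every
// block interval the buffer is cut into a block and handed off
// (in real Spark, to the ReceiverSupervisorImpl / BlockManager).
class SimpleBlockGenerator(blockIntervalMs: Long, pushBlock: Seq[String] => Unit) {
  private val buffer = new ArrayBuffer[String]()

  def addRecord(record: String): Unit = buffer.synchronized {
    buffer += record
  }

  // Would normally be driven by a recurring timer firing every blockIntervalMs.
  def cutBlock(): Unit = {
    val block = buffer.synchronized {
      val b = buffer.toList
      buffer.clear()
      b
    }
    if (block.nonEmpty) pushBlock(block)
  }
}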
The block interval is configurable, so I think you can reduce it to keep
each block in memory only for a limited interval. Is that what you are
looking for?
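
For reference, the relevant setting should be spark.streaming.blockInterval
(default 200ms), e.g. as in this sketch (depending on the Spark version the
value is a duration string or a number of milliseconds):

import org.apache.spark.SparkConf

// Sketch: a lower block interval means each block covers less data and sits
// in the receiver's buffer for a shorter time before being stored.
// 50ms is just an example value; the default is 200ms.
val conf = new SparkConf()
  .setAppName("small-blocks-example")
  .set("spark.streaming.blockInterval", "50ms")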
On Tue, Mar 24, 2015 at 1:38 PM, Bin Wang wrote:
> Hi,
>
> I'm learning Spark and I find there could be some optimiz
Hi,
I'm learning Spark and I think there could be some optimization of the
current streaming implementation. Correct me if I'm wrong.
The current streaming implementation puts the data of one batch into memory
(as an RDD), but that seems unnecessary.
For example, if I want to count the lines which conta
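
For concreteness, a minimal sketch of the kind of job I mean (the socket
source and the filter string are only placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only: source and filter predicate are placeholders.
val conf = new SparkConf().setAppName("filter-count-sketch")
val ssc = new StreamingContext(conf, Seconds(60))

val lines = ssc.socketTextStream("localhost", 9999)

// Each batch arrives as an RDD held in memory, yet the count itself only
// needs a running number, which is the point raised above.
lines.filter(_.contains("ERROR")).count().print()

ssc.start()
ssc.awaitTermination()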