AFAIK Spark Streaming cannot work that way. Transformations are made on DStreams, and a DStream basically holds (time, allocatedBlocksForBatch) pairs. Blocks are allocated to a batch by the JobGenerator, while unallocated block infos are collected by the ReceivedBlockTracker. In Spark Streaming you define transformations and actions on DStreams: the operators define RDD chains, and the tasks are created by spark-core. You manipulate DStreams, not single units of data. Flink, for example, uses a continuous model and optimizes for memory usage and latency. Read the Spark Streaming paper and the Spark paper for more reference.
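To make the DStream-level view concrete, here is a minimal, untested sketch of how the "count lines containing 'Spark'" example from your first mail would look against the DStream API (the socket source on localhost:9999 and the 1-second batch interval are just illustrative choices, not part of your setup):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkLineCount {
  def main(args: Array[String]): Unit = {
    // Batch interval: every 1 second the JobGenerator allocates the
    // blocks received so far to a batch and schedules a job over them.
    val conf = new SparkConf().setAppName("SparkLineCount")
    val ssc  = new StreamingContext(conf, Seconds(1))

    // Illustrative source: a plain text socket on localhost:9999.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Transformations are declared on the DStream, not on single records;
    // each batch is evaluated as one RDD chain by spark-core.
    val sparkLines = lines.filter(_.contains("Spark"))
    sparkLines.count().print() // per-batch count of matching lines

    ssc.start()
    ssc.awaitTermination()
  }
}

Your one-hour count would be expressed similarly with a windowed operation (e.g. countByWindow over a 60-minute window), but the unit Spark schedules and stores is still the per-batch RDD built from the allocated blocks, not the individual line.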
Zvara Zoltán

mail, hangout, skype: zoltan.zv...@gmail.com
mobile, viber: +36203129543
bank: 10918001-00000021-50480008
address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a
elte: HSKSJZ (ZVZOAAI.ELTE)

2015-03-24 15:03 GMT+01:00 Bin Wang <wbi...@gmail.com>:

> I'm not looking to limit the block size.
>
> Here is another example. Say we want to count the lines from the stream
> in one hour. In a normal program, we might write it like this:
>
> int sum = 0
> while (line = getFromStream()) {
>     store(line) // store the line into storage instead of memory.
>     sum++
> }
>
> This could be seen as a reduce. The only memory used here is the variable
> named "line"; we need not keep all the lines in memory (as long as the
> lines are not used elsewhere). If we want to provide fault tolerance, we
> may just write the lines to storage instead of holding them in memory.
> Could Spark Streaming work this way? Does Flink work like this?
>
> On Tue, Mar 24, 2015 at 7:04 PM Zoltán Zvara <zoltan.zv...@gmail.com>
> wrote:
>
>> There is a BlockGenerator on each worker node next to the
>> ReceiverSupervisorImpl, which generates Blocks out of an ArrayBuffer in
>> each interval (block_interval). These Blocks are passed to the
>> ReceiverSupervisorImpl, which puts them into the BlockManager for
>> storage. BlockInfos are passed to the driver. Mini-batches are created
>> by the JobGenerator component on the driver every batch_interval. I
>> guess what you are looking for is provided by a continuous model like
>> Flink's. We are creating mini-batches to provide fault tolerance.
>>
>> Zvara Zoltán
>>
>> mail, hangout, skype: zoltan.zv...@gmail.com
>> mobile, viber: +36203129543
>> bank: 10918001-00000021-50480008
>> address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a
>> elte: HSKSJZ (ZVZOAAI.ELTE)
>>
>> 2015-03-24 11:55 GMT+01:00 Arush Kharbanda <ar...@sigmoidanalytics.com>:
>>
>>> The block size is configurable, and that way I think you can reduce the
>>> block interval to keep each block in memory only for a limited
>>> interval. Is that what you are looking for?
>>>
>>> On Tue, Mar 24, 2015 at 1:38 PM, Bin Wang <wbi...@gmail.com> wrote:
>>>
>>> > Hi,
>>> >
>>> > I'm learning Spark and I think there could be an optimization for the
>>> > current streaming implementation. Correct me if I'm wrong.
>>> >
>>> > The current streaming implementation puts the data of one batch into
>>> > memory (as an RDD), but that seems unnecessary.
>>> >
>>> > For example, if I want to count the lines which contain the word
>>> > "Spark", I just need to map every line to see whether it contains the
>>> > word, then reduce with a sum function. After that, there is no need
>>> > to keep the line in memory.
>>> >
>>> > That is to say, if the DStream only has one map and/or reduce
>>> > operation on it, it is not necessary to keep all the batch data in
>>> > memory. Something like a pipeline should be OK.
>>> >
>>> > Is it difficult to implement on top of the current implementation?
>>> >
>>> > Thanks.
>>> >
>>> > ---
>>> > Bin Wang
>>>
>>> --
>>> *Arush Kharbanda* || Technical Teamlead
>>> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
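One more note on Arush's block-interval suggestion: the rate at which the receiver-side BlockGenerator cuts blocks is governed by the spark.streaming.blockInterval property, while the batch interval is fixed when the StreamingContext is constructed. A rough sketch with illustrative values (200 ms blocks, 1 s batches, not a recommendation):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Block interval (milliseconds): how often the BlockGenerator next to the
// ReceiverSupervisorImpl turns its buffered records into a Block.
val conf = new SparkConf()
  .setAppName("IntervalTuning")
  .set("spark.streaming.blockInterval", "200")

// Batch interval: how often the driver-side JobGenerator allocates the
// received blocks to a new mini-batch and schedules a job over them.
// With a 1 s batch and 200 ms blocks, each batch RDD has ~5 blocks
// (partitions) per receiver.
val ssc = new StreamingContext(conf, Seconds(1))

Shrinking the block interval only changes how finely received data is partitioned; it does not change the mini-batch model itself, so a batch's blocks still sit in the BlockManager at least until the job over that batch has run.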