See responses inline.

On Wed, Mar 11, 2015 at 6:58 AM, Zoltán Zvara <zoltan.zv...@gmail.com> wrote:
> I'm trying to understand the block allocation mechanism Spark uses to
> generate batch jobs and a JobSet.
>
> JobGenerator.generateJobs tries to allocate received blocks to a batch;
> effectively, ReceivedBlockTracker.allocateBlocksToBatch creates a
> streamIdToBlocks map in which stream IDs (Int) are mapped to
> Seq[ReceivedBlockInfo] using getReceivedBlockQueue. This is where it gets
> tricky for me.
>
> getReceivedBlockQueue of class ReceivedBlockTracker reads
> streamIdToUnallocatedBlockQueues, which should be populated with
> ReceivedBlockQueues. Who inserts these ReceivedBlockQueues into
> streamIdToUnallocatedBlockQueues, and where does it get written? I've only
> found usages where the value is read.

Inserted here:
https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceivedBlockTracker.scala#L84

> At one point streamIdToBlocks gets packed into a case class,
> AllocatedBlocks. Why is that necessary?

It is just a container that captures all the blocks allocated to a batch. It
is used both for tracking in memory and for writing out to the write-ahead
log.

> Also, at JobGenerator.generateJobs, on the line where receivedBlockInfos is
> created, shouldn't it be empty, because streamIdToUnallocatedBlockQueues
> never got written to? Where am I missing the point? How is
> JobGenerator.generateJobs able to retrieve the received block infos?

I think the line number above answers that. See also the simplified sketch at
the end of this message.

> Thanks,
>
> ZZ
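In case it helps, here is a simplified, self-contained sketch of the
mechanism. These are not the real Spark classes: ReceivedBlockInfo and Time
are stand-ins, the class name SketchedBlockTracker is made up, and the
write-ahead log and cleanup logic are omitted. The two things to notice are
that getReceivedBlockQueue lazily creates the per-stream queue via
getOrElseUpdate (which is why you never see an explicit insert of a queue
into streamIdToUnallocatedBlockQueues), and that allocateBlocksToBatch simply
drains those queues into an AllocatedBlocks container keyed by batch time.

import scala.collection.mutable

// Stand-ins for the real Spark types, for illustration only.
case class ReceivedBlockInfo(streamId: Int, numRecords: Long)
case class Time(milliseconds: Long)

// Mirrors the role of AllocatedBlocks: just a container holding the
// stream-id -> blocks map for one batch. The real tracker keeps it in
// memory and also writes it to the write-ahead log.
case class AllocatedBlocks(
    streamIdToAllocatedBlocks: Map[Int, Seq[ReceivedBlockInfo]]) {
  def getBlocksOfStream(streamId: Int): Seq[ReceivedBlockInfo] =
    streamIdToAllocatedBlocks.getOrElse(streamId, Seq.empty)
}

class SketchedBlockTracker(streamIds: Seq[Int]) {
  private type ReceivedBlockQueue = mutable.Queue[ReceivedBlockInfo]

  private val streamIdToUnallocatedBlockQueues =
    new mutable.HashMap[Int, ReceivedBlockQueue]
  private val timeToAllocatedBlocks =
    new mutable.HashMap[Time, AllocatedBlocks]

  // getOrElseUpdate lazily creates the queue on first access, so no
  // explicit "put a queue into the map" call exists anywhere else.
  private def getReceivedBlockQueue(streamId: Int): ReceivedBlockQueue =
    streamIdToUnallocatedBlockQueues
      .getOrElseUpdate(streamId, new ReceivedBlockQueue)

  // Driver side: a receiver reported a new block (cf. addBlock).
  def addBlock(info: ReceivedBlockInfo): Unit = synchronized {
    getReceivedBlockQueue(info.streamId) += info
  }

  // Called from the job generation path: drain every unallocated
  // queue and pin the drained blocks to this batch time.
  def allocateBlocksToBatch(batchTime: Time): Unit = synchronized {
    val streamIdToBlocks = streamIds.map { id =>
      (id, getReceivedBlockQueue(id).dequeueAll(_ => true).toSeq)
    }.toMap
    timeToAllocatedBlocks(batchTime) = AllocatedBlocks(streamIdToBlocks)
  }

  // What the job generator ultimately reads back for the batch.
  def getBlocksOfBatch(batchTime: Time): Map[Int, Seq[ReceivedBlockInfo]] =
    timeToAllocatedBlocks.get(batchTime)
      .map(_.streamIdToAllocatedBlocks)
      .getOrElse(Map.empty)
}

So in this sketch the flow is: addBlock appends to the per-stream queue,
allocateBlocksToBatch(Time(1000)) empties the queues into an AllocatedBlocks
for that batch, and getBlocksOfBatch(Time(1000)) returns the map the batch
jobs are built from. That is why the queues look "never written to" if you
only search for writes to the map itself.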