Thanks for the reply, Chris.  I got solution 1 more or less implemented while 
getting my bearings.  I then started looking into solution 2 and made some 
progress, but now I'm starting to wonder how well the shared state store fits 
our particular use-case.  As I mentioned, we need to use a bootstrap stream 
rather than a changelog to build the shared store because we need to manipulate 
the data first.  But the shared state store design calls for it to be 
read-only.  So my flow now looks like this:

1) Some task gets assigned the single-partition bootstrap stream and consumes 
it.
2) Since the shared state store is read-only, the "bootstrap task" has to 
instead write data to a changelog topic that backs the shared store.
3) Something is continually reading the changelog and writing to the shared 
store to make updates available to StreamTasks.
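To make the flow above concrete, here is a minimal single-threaded sketch in plain Python. It does not use any Samza API: `transform`, `SharedStore`, `apply_changelog`, and `run_loop` are hypothetical names invented for illustration, the changelog is just an in-memory list, and the "manipulation" of the data is a stand-in. It only shows the shape of steps 1 through 3, with step 3 done by the same loop that consumes the bootstrap stream rather than by a second thread.

```python
from types import MappingProxyType


def transform(key, value):
    # Hypothetical stand-in for the manipulation applied to bootstrap data.
    # A value of None is passed through as a delete marker.
    return key, (value.upper() if value is not None else None)


class SharedStore:
    """In-memory stand-in for a shared store (not a Samza class)."""

    def __init__(self):
        self._data = {}
        # Tasks only ever see this read-only view of the underlying dict.
        self.read_only = MappingProxyType(self._data)
        self._offset = 0  # next changelog position to apply

    def apply_changelog(self, changelog):
        # Step 3: apply changelog entries that have not been applied yet.
        applied = 0
        for key, value in changelog[self._offset:]:
            if value is None:
                self._data.pop(key, None)
            else:
                self._data[key] = value
            applied += 1
        self._offset += applied
        return applied


def run_loop(bootstrap_stream, changelog, store):
    # Step 1: a single "bootstrap task" consumes the bootstrap stream.
    for key, value in bootstrap_stream:
        # Step 2: write the transformed record to the changelog, not the store.
        changelog.append(transform(key, value))
        # Step 3, done in the same thread: catch the store up on the changelog.
        store.apply_changelog(changelog)


changelog = []
store = SharedStore()
run_loop([("a", "x"), ("b", "y"), ("a", "z")], changelog, store)
print(dict(store.read_only))
```

The store itself is never written by the tasks here; the only writer is the changelog consumer, which is what keeps the view handed to tasks read-only.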

Number 3 is where I'm kinda stuck now.  Given the desire to keep the 
SamzaContainer single-threaded, I'm not sure how best to continually consume 
the changelog, short of routing it through the RunLoop like all other streams 
and using a dedicated TaskInstance to consume it. That seems like a big change 
though, all to get to a flow that feels pretty convoluted.  Can you think of 
a better way to implement that?


On 04/07/2015 05:32 PM, Chris Riccomini wrote:

Hey Tommy,

Your summary sounds pretty accurate. One other way, which requires no
change to Samza, would be to repartition the input topic properly for each
task. This is kind of hacky, though.

(2) is the ideal solution. It is a bit of work, but it might not be so bad.
I think most of the changes would be isolated to the TaskStorageManager.
We'd also need to make the KV store read-only, which is pretty easy to do.
If you're not comfortable with it, though, then (1) would be your next-best
bet.

Cheers,
Chris

On Tue, Apr 7, 2015 at 10:16 AM, Tommy Becker 
<tobec...@tivo.com> wrote:



We have a Kafka topic containing data needed by several Samza jobs. These
jobs will essentially read the data and build up state that will be used
for processing their inputs. Ideally, we would use the topic as a bootstrap
stream to build up this state. The problem with that is the topic
containing the data has a single partition but the topics these jobs are
processing as input have multiple partitions. So my understanding is that
only one task instance in the job would actually process the bootstrap
stream, and therefore any state it built up would be local to that task. So
I'm thinking my options are the following:

1) Implement SAMZA-353 and allow the bootstrap SSP to be assigned to each
task instance
2) Implement the shared state store component of SAMZA-402
3) Layer the shared state on top of Samza in our tasks themselves, maybe
by using something like RocksDB directly.

Number 1 seems easiest to implement at the cost of having the entire state
duplicated for each task.  I'd prefer not to do number 3 given the
existence of this feature on Samza's roadmap, but I am a bit concerned
about the scope of work with number 2, and the fact that this is mostly
Scala code.
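As a rough illustration of the problem and of option 1, here is a small plain-Python sketch. It assumes tasks are formed by grouping system-stream-partitions (SSPs) by partition number, which is my reading of the behavior described above; it is not Samza code. The stream names `input` and `bootstrap` and both function names are hypothetical. The second function shows the effect option 1 is after: the single bootstrap SSP is handed to every task, so each task builds its own full copy of the state.

```python
from collections import defaultdict


def group_by_partition(ssps):
    # One task per partition number; each task gets every SSP with that number.
    tasks = defaultdict(set)
    for stream, partition in ssps:
        tasks[f"Partition {partition}"].add((stream, partition))
    return dict(tasks)


def group_with_broadcast(ssps, broadcast_streams):
    # Option 1 sketch: group the regular streams by partition, then add the
    # SSPs of the broadcast (bootstrap) streams to every task.
    regular = [ssp for ssp in ssps if ssp[0] not in broadcast_streams]
    shared = {ssp for ssp in ssps if ssp[0] in broadcast_streams}
    tasks = group_by_partition(regular)
    for assigned in tasks.values():
        assigned |= shared
    return tasks


# Four input partitions, one bootstrap partition (hypothetical stream names).
ssps = [("input", p) for p in range(4)] + [("bootstrap", 0)]

for name, assigned in sorted(group_by_partition(ssps).items()):
    print("default  ", name, sorted(assigned))
for name, assigned in sorted(group_with_broadcast(ssps, {"bootstrap"}).items()):
    print("broadcast", name, sorted(assigned))
```

With the plain grouping only one task ever sees the bootstrap data; with the broadcast variant all four do, at the cost of holding the state four times.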

Are there any alternatives that I'm missing?  Note that we need to process
the data stream as a bootstrap stream.  Using it as a changelog is
insufficient because we need to be able to manipulate the data before
building up the state store.

--
Tommy Becker
Senior Software Engineer

Digitalsmiths
A TiVo Company

www.digitalsmiths.com
tobec...@tivo.com

________________________________

This email and any attachments may contain confidential and privileged
material for the sole use of the intended recipient. Any review, copying,
or distribution of this email (or any attachments) by others is prohibited.
If you are not the intended recipient, please contact the sender
immediately and permanently delete this email and any attachments. No
employee or agent of TiVo Inc. is authorized to conclude any binding
agreement on behalf of TiVo Inc. by email. Binding agreements with TiVo
Inc. may only be made by a signed written agreement.






--
Tommy Becker
Senior Software Engineer

Digitalsmiths
A TiVo Company

www.digitalsmiths.com
tobec...@tivo.com

