Hi Yan, thanks for the reply. Yes, you are correct: which partition a message hits would not be random. We would use a partition key (sorry, I missed that).
The "data" I was referring to is the local KV-store data for each task. Is there a way to synchronize or replicate the KV-store data across the tasks, so that each task contains the same information in its own KV-store?

Specifically, I'm referring to this passage on the State Management page:

"If you have some data that you want to share between tasks (across partition boundaries), you need to go to some additional effort to repartition and distribute the data. Each task will need its own copy of the data, so this may use more space overall."

Is there a simple means to ensure each task gets a copy of the data?

Thanks!

On Tue, May 5, 2015 at 2:44 PM, Yan Fang <yanfang...@gmail.com> wrote:

> Hi Andreas,
>
> Not quite understand this part
>
> "Because the messages coming into the input stream are random (i.e. can hit
> any partition and therefore any task), each task will need its own copy of
> the data (i.e. the data needs to be duplicated across each task)."
>
> Messages come into the input stream based on the partition key (not totally
> random). Why does each task need its own copy of the data? Do you mean the
> copy of the data in other partitions?
>
> Cheers,
>
> Fang, Yan
> yanfang...@gmail.com
>
> On Tue, May 5, 2015 at 11:47 AM, Andreas Simanowski <aesim...@gmail.com>
> wrote:
>
> > Hello Samza community:
> >
> > I am very new to Samza and currently looking at how to use Samza and its
> > key-value store. I have run into the following and was hoping someone could
> > point me in the right direction.
> >
> > Say we have an input stream being consumed by more than one task (one task
> > per partition). Each task has a local key-value store which it will
> > reference when processing the messages. Because the messages coming into
> > the input stream are random (i.e. can hit any partition and therefore any
> > task), each task will need its own copy of the data (i.e. the data needs to
> > be duplicated across each task). From time-to-time this local data would
> > also need to be updated with changes. What approaches are there to share
> > data between the tasks to keep them up to date?
> >
> > Thanks for the help!
> >
> > -Andreas