Hi Andreas,

Are you describing a use case where the *same* copy of the data is shared among all tasks? That will depend on a lot of factors:

1. Is your data size huge?
2. Can your data be partitioned to work with a single partition of the input stream?
3. Do you have a means to bootstrap the data from a stream? And how often do you need to do the bootstrap?
The first question to answer is actually question 3. If you have a way to bootstrap the data from a stream, Samza can always bootstrap the local stores from that ingestion stream. Then, based on your total data size, how often you need to bootstrap, and whether your "shared" data can be partitioned to work with a single partition of the input stream, we can find the solution that suits your use case best.

On Tue, May 5, 2015 at 3:38 PM, Andreas Simanowski <aesim...@gmail.com> wrote:

> Hi Yan, thanks for the reply.
>
> So yes, you are correct: it would not be random which partition a message
> hits. We would use a partition key (sorry I missed that).
>
> The "data" I was referring to is the local KV-store data for each task. Is
> there a way to synchronize or replicate the data from the KV-store across
> the tasks, so that each task contains the same information in its
> respective KV-store? Specifically, I'm referring to the page on State
> Management:
>
> "If you have some data that you want to share between tasks (across
> partition boundaries), you need to go to some additional effort to
> repartition and distribute the data. Each task will need its own copy of
> the data, so this may use more space overall."
>
> Is there a simple means to ensure each task gets a copy of the data?
>
> Thanks!
>
> On Tue, May 5, 2015 at 2:44 PM, Yan Fang <yanfang...@gmail.com> wrote:
>
> > Hi Andreas,
> >
> > I don't quite understand this part:
> >
> > "Because the messages coming into the input stream are random (i.e. can
> > hit any partition and therefore any task), each task will need its own
> > copy of the data (i.e. the data needs to be duplicated across each
> > task)."
> >
> > Messages come into the input stream based on the partition key (not
> > totally random). Why does each task need its own copy of the data? Do
> > you mean the copy of the data in other partitions?
> >
> > Cheers,
> >
> > Fang, Yan
> > yanfang...@gmail.com
> >
> > On Tue, May 5, 2015 at 11:47 AM, Andreas Simanowski
> > <aesim...@gmail.com> wrote:
> >
> > > Hello Samza community:
> > >
> > > I am very new to Samza and currently looking at how to use Samza and
> > > its key-value store. I have run into the following and was hoping
> > > someone could point me in the right direction.
> > >
> > > Say we have an input stream being consumed by more than one task (one
> > > task per partition). Each task has a local key-value store which it
> > > will reference when processing the messages. Because the messages
> > > coming into the input stream are random (i.e. can hit any partition
> > > and therefore any task), each task will need its own copy of the data
> > > (i.e. the data needs to be duplicated across each task). From
> > > time-to-time this local data would also need to be updated with
> > > changes. What approaches are there to share data between the tasks to
> > > keep them up to date?
> > >
> > > Thanks for the help!
> > >
> > > -Andreas
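[Editor's sketch] The bootstrap approach Yan describes at the top of the thread might look roughly like the following Samza 0.9-era job configuration. The stream name `shared-data`, input `page-views`, and store name `shared-store` are hypothetical; the property keys are from the Samza configuration reference of that era, but treat this as an illustrative sketch rather than a tested job:

```properties
# Consume the normal input plus the shared-data stream (hypothetical names).
task.inputs=kafka.page-views,kafka.shared-data

# Mark shared-data as a bootstrap stream: Samza reads it fully to the head
# before processing any other input, so each task can rebuild its local
# store from it on (re)start.
systems.kafka.streams.shared-data.samza.bootstrap=true
systems.kafka.streams.shared-data.samza.reset.offset=true
systems.kafka.streams.shared-data.samza.offset.default=oldest

# Local key-value store that each task populates from the bootstrap stream.
stores.shared-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
stores.shared-store.key.serde=string
stores.shared-store.msg.serde=string
```

Note that if the shared data cannot be partitioned the same way as the input, the shared-data stream would need a single partition (or the full data replicated into every partition) so that every task sees the whole data set.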
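[Editor's sketch] Yan's point that keyed input is "not totally random" can be illustrated with a toy partitioner (not the actual Samza/Kafka implementation; the function name and hash choice are illustrative): the partition is a deterministic function of the key, so a given key always lands on the same partition and therefore the same task.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Toy keyed partitioner: deterministically map a key to a partition.

    Real systems use their own hash (e.g. Kafka's default partitioner),
    but the property is the same: same key -> same partition, always.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same key always reaches the same partition (and hence the same task),
# so per-key state only needs to live in that one task's local store.
assert partition_for("user-42", 8) == partition_for("user-42", 8)
```

This is exactly why per-key state needs no cross-task sharing, and why sharing is only an issue for data that all tasks must see regardless of key.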