Yi, thanks for the input. I've been out sick, so please excuse the delayed response. I am still working out the use case with my team and will report back next week.
Thanks!

On Tue, May 5, 2015 at 4:01 PM, Yi Pan <nickpa...@gmail.com> wrote:

> Hi, Andreas,
>
> Are you describing a use case where the *same* copy of data is shared among
> all tasks? That will depend on a lot of factors:
> 1. Is your data size huge?
> 2. Can your data be partitioned to work with a single partition of the
>    input stream?
> 3. Do you have a means to bootstrap the data from a stream? And how often
>    do you need to do the bootstrap?
>
> The first question to be answered actually seems to be question 3. If
> you have a way to bootstrap the data from a stream, Samza can always
> bootstrap the local stores from that ingestion stream. Then, based on your
> total data size, how often you need to bootstrap, and whether your "shared"
> data can be partitioned to work with a single partition of the input
> stream, we can find the solution that suits your use case best.
>
> On Tue, May 5, 2015 at 3:38 PM, Andreas Simanowski <aesim...@gmail.com> wrote:
>
> > Hi Yan, thanks for the reply.
> >
> > So yes, you are correct: it would not be random which partition a message
> > hits. We would use a partition key (sorry, I missed that).
> >
> > The "data" I was referring to is the local KV-store data for each task.
> > Is there a way to synchronize or replicate the data from the KV-store
> > across the tasks, so that each task contains the same information in its
> > respective KV-store? Specifically, I'm referring to the page on State
> > Management:
> >
> > "If you have some data that you want to share between tasks (across
> > partition boundaries), you need to go to some additional effort to
> > repartition and distribute the data. Each task will need its own copy of
> > the data, so this may use more space overall."
> >
> > Is there a simple means to ensure each task gets a copy of the data?
> >
> > Thanks!
> >
> > On Tue, May 5, 2015 at 2:44 PM, Yan Fang <yanfang...@gmail.com> wrote:
> >
> > > Hi Andreas,
> > >
> > > I don't quite understand this part:
> > >
> > > "Because the messages coming into the input stream are random (i.e. can
> > > hit any partition and therefore any task), each task will need its own
> > > copy of the data (i.e. the data needs to be duplicated across each
> > > task)."
> > >
> > > Messages come into the input stream based on the partition key (not
> > > totally random). Why does each task need its own copy of the data? Do
> > > you mean the copy of the data in other partitions?
> > >
> > > Cheers,
> > >
> > > Fang, Yan
> > > yanfang...@gmail.com
> > >
> > > On Tue, May 5, 2015 at 11:47 AM, Andreas Simanowski <aesim...@gmail.com> wrote:
> > >
> > > > Hello Samza community:
> > > >
> > > > I am very new to Samza and am currently looking at how to use Samza
> > > > and its key-value store. I have run into the following and was hoping
> > > > someone could point me in the right direction.
> > > >
> > > > Say we have an input stream being consumed by more than one task (one
> > > > task per partition). Each task has a local key-value store which it
> > > > will reference when processing the messages. Because the messages
> > > > coming into the input stream are random (i.e. can hit any partition
> > > > and therefore any task), each task will need its own copy of the data
> > > > (i.e. the data needs to be duplicated across each task). From time to
> > > > time this local data would also need to be updated with changes. What
> > > > approaches are there to share data between the tasks to keep them up
> > > > to date?
> > > >
> > > > Thanks for the help!
> > > >
> > > > -Andreas
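
[Editor's note, for readers landing on this thread later: a minimal sketch of the bootstrap-stream configuration Yi describes, using Samza's documented job properties. The system name "kafka" and the stream name "shared-data" are hypothetical placeholders for this example.]

```properties
# Consume the shared-data stream alongside the regular input(s).
task.inputs=kafka.shared-data

# Mark the stream as a bootstrap stream: Samza reads it to the head
# before processing any other input, so each task's local store is
# fully populated first.
systems.kafka.streams.shared-data.samza.bootstrap=true

# Re-read the stream from the beginning on every container start,
# rather than resuming from a checkpointed offset.
systems.kafka.streams.shared-data.samza.reset.offset=true
systems.kafka.streams.shared-data.samza.offset.default=oldest
```

Note that this only populates each task's store from the partitions assigned to that task; sharing the *same* full copy across all tasks additionally requires the data (or a repartitioned copy of it) to be reachable from every task, as discussed above.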