Yi, thanks for the input. I've been out sick, so please excuse the delayed
response. I am still working out the use case with my team and will report
back next week.

Thanks!

On Tue, May 5, 2015 at 4:01 PM, Yi Pan <nickpa...@gmail.com> wrote:

> Hi, Andreas,
>
> Are you describing a use case where the *same* copy of data is shared among
> all tasks? That will depend on a lot of factors:
> 1. Is your data size huge?
> 2. Can your data be partitioned to work with a single partition of input
> stream?
> 3. Do you have a means to bootstrap the data from a stream? And how often
> do you need to do the bootstrap?
>
> The first question to be answered actually seems to be the question 3. If
> you have a way to bootstrap the data from a stream, Samza can always
> bootstrap the local stores from that ingestion stream. Then, based on your
> total data size, how often you need to bootstrap, and whether your "shared"
> data can be partitioned to work with a single partition of the input
> stream, we can find the solution that best suits your use case.
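>
> A minimal sketch of the bootstrap-stream approach in the job config (the
> stream name "shared-data" and system name "kafka" are illustrative
> placeholders, not from this thread):
>
> ```
> # Consume the shared data alongside the normal input
> task.inputs=kafka.page-views,kafka.shared-data
> # Mark the stream as a bootstrap stream: each task replays it fully
> # (from the oldest offset) before processing any other input
> systems.kafka.streams.shared-data.samza.bootstrap=true
> systems.kafka.streams.shared-data.samza.offset.default=oldest
> ```
>
> With this config, every task's process() sees the full shared-data stream
> on startup and can load it into its local store before regular messages
> arrive.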
>
> On Tue, May 5, 2015 at 3:38 PM, Andreas Simanowski <aesim...@gmail.com>
> wrote:
>
> > Hi Yan, thanks for the reply.
> >
> > So yes, you are correct it would not be random which partition a message
> > hits. We would use a partition key (sorry I missed that).
> >
> > The "data" I was referring to is the local KV-store data for each task.
> Is
> > there a way to synchronize or replicate the data from the KV-store across
> > the tasks, so that each task contains the same information in its
> > respective KV-store? Specifically I'm referring to the page on State
> > Management:
> >
> > "If you have some data that you want to share between tasks (across
> > partition boundaries), you need to go to some additional effort to
> > repartition and distribute the data. Each task will need its own copy of
> > the data, so this may use more space overall."
> >
> > Is there a simple means to ensure each task gets a copy of the data?
> >
> > Thanks!
> >
> > On Tue, May 5, 2015 at 2:44 PM, Yan Fang <yanfang...@gmail.com> wrote:
> >
> > > Hi Andreas,
> > >
> > > I don't quite understand this part:
> > >
> > > "Because the messages coming into the input stream are random (i.e. can
> > hit
> > > any partition and therefore any task), each task will need its own copy
> > of
> > > the data (i.e. the data needs to be duplicated across each task)."
> > >
> > > Messages come into the input stream based on the partition key (not
> > totally
> > > random). Why does each task need its own copy of the data? Do you mean
> > the
> > > copy of the data in other partitions?
> > >
> > > Cheers,
> > >
> > > Fang, Yan
> > > yanfang...@gmail.com
> > >
> > > On Tue, May 5, 2015 at 11:47 AM, Andreas Simanowski <
> aesim...@gmail.com>
> > > wrote:
> > >
> > > > Hello Samza community:
> > > >
> > > > I am very new to Samza and currently looking at how to use Samza and
> > its
> > > > key-value store. I have run into the following and was hoping someone
> > > could
> > > > point me in the right direction.
> > > >
> > > > Say we have an input stream being consumed by more than one task (one
> > > task
> > > > per partition). Each task has a local key-value store which it will
> > > > reference when processing the messages. Because the messages coming
> > into
> > > > the input stream are random (i.e. can hit any partition and therefore
> > any
> > > > task), each task will need its own copy of the data (i.e. the data
> > needs
> > > to
> > > > be duplicated across each task). From time to time this local data
> > would
> > > > also need to be updated with changes. What approaches are there to
> > share
> > > > data between the tasks to keep them up to date?
> > > >
> > > > Thanks for the help!
> > > >
> > > > -Andreas
> > > >
> > >
> >
>