Re: [DISCUSS] SEP-5: Enable partition expansion of input streams

Jacob Maes Mon, 05 Jun 2017 11:18:29 -0700

Hey Dong,

Thanks for the SEP. Supporting partition changes is critically important
for stateful Samza jobs, so it's great to see some ideas on that front!


Sorry for the late feedback, but I have a few thoughts to contribute.

Big +1 on Navina's comment:

> My biggest gripe with this SEP is that it seems like a tailor-made
> solution
> that relies on the semantics of the Kafka system and yet, we are trying to
> masquerade that as operational requirements for other systems interacting
> with Samza. (Not to say that this is the first time such a choice is being
> made in the Samza design). I am not seeing how this can a "general"
> solution for all input systems. That's my two cents. I would like to hear
> alternative points of view for this from other devs.


Two examples of this:
1. This is mostly a hypothetical, but some message brokers may use key
range assignment rather than hash+modulo.
2. Kafka can't reduce the number of partitions, but it can happen on other
systems. For example, it may be cheaper to reduce the number of partitions
on a hosted service where the cost model depends on the number of
partitions/shards.

It seems to me that a solution which doesn't depend on partition key
assignment in the message broker. Here are a few alternatives that weren't
discussed and I think should be considered:

Alternatives in order of increasing preference:
1. Samza manages the partition hash (via some new contract with the
brokers) and guarantees correct routing of keys among the new partitions.
2. Samza detects a task count change, creates a new changelog with correct
partitions, and *somehow* reshuffles all existing changelog data into the
new topic and then uses the new topic from then on. (doesn't work without
changelog, but in that case durability isn't paramount, so we can just wipe)
3. Use RPC in between stages and samza fully manages key assignment among
tasks. No on-disk topic data to clean up. Mandatory repartitioning in the
first stage to pre-scaled tasks in next stage.
4. Combined 2-3 solution

Finally, some questions about the current proposal:
1. "An alternative solution is to allow task number to increase after
partition expansion and uses a proper task-to-container assignment to make
sure the Samza output is correct." What does the container have to do with
stateful processing or output in general?
2. When you use "Join" as an example, you basically mean multiple
co-partitioned streams, right? This is opposed to multiple,
independently-partitioned streams or a single stream. Would be nice to
formulate the proposal in these more general terms.
3. When switching SSP groupers, how will the users avoid the
org.apache.samza.checkpoint.kafka.DifferingSystemStreamPartitionGrouperFactoryValues
exception?
4. Partition to task assignment is meaningless without key to partition
mapping. The real semantics are captured in the external requirement for
partitioning via hash+modulo. But in that case, iiuc, only the partition
count matters. So why not just store the original partition count rather
than the whole mapping?

-Jake

On Sun, Jun 4, 2017 at 12:32 PM, Dong Lin <[email protected]> wrote:

> Hey Yi, Navina,
>
> I have updated the SEP-5 document based on our discussion. The difference
> can be found here
> <https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?
> pageId=70255476&selectedPageVersions=14&selectedPageVersions=15>.
> Here is the summary of changes:
>
> - Add new interface that extends the existing interface
> SystemStreamPartitionGrouper. Newly-added grouper class should implement
> this interface.
> - Explained in the Rejected Alternative Section why we don't add new method
> in the existing interface
> - Explained in the Rejected Alternative Section why we don't config/class
> for user to specify new-partition to old-partition mapping.
>
> Can you take another look at the proposal and let me know if there is any
> concern?
>
> Cheers,
> Dong
>
>
> On Thu, Jun 1, 2017 at 12:58 AM, Dong Lin <[email protected]> wrote:
>
> > Hey Yi,
> >
> > Thanks much for the comment. I have updated the doc to address all your
> > comments except the one related to the interface. I am not sure I
> > understand your suggestion of the new interface. Will discuss tomorrow.
> >
> > Thanks,
> > Dong
> >
> > On Wed, May 31, 2017 at 4:29 PM, Yi Pan <[email protected]> wrote:
> >
> >> Hi, Don,
> >>
> >> Thanks for the detailed design doc for a long-waited feature in Samza!
> >> Really appreciate it! I did a quick pass and have the following
> comments:
> >>
> >> - minor: "limit the maximum size of partition" ==> "limit the maximum
> size
> >> of each partition"
> >> - "However, Samza currently is not able to handle partition expansion of
> >> the input streams"==>better point out "for stateful jobs". For stateless
> >> jobs, simply bouncing the job now can pick up the new partitions.
> >> - "it is possible (e.g. with Kafka) that messages with a given key
> exists
> >> in both partition 1 an 3. Because GroupByPartition will assign
> partition 1
> >> and 3 to different tasks, messages with the same key may be handled by
> >> different task/container/process and their state will be stored in
> >> different changelog partition." The problem statement is not super clear
> >> here. The issues with stateful jobs is: after GroupByPartition assign
> >> partition 1 and 3 to different tasks, the new task handling partition 3
> >> does not have the previous state to resume the work. e.g. a page-key
> based
> >> counter would start from 0 in the new task for a specific key, instead
> of
> >> resuming the previous count 50 held by task 1.
> >> - minor rewording: "the first solution in this doc" ==> "the solution
> >> proposed in this doc"
> >> - "Thus additional development work is needed in Kafka to meet this
> >> requirement" It would be good to link to a KIP if and when it exists
> >> - Instead of touching/deprecating the interface
> >> SystemStreamPartitionGrouper, I would recommend to have a different
> >> implementation class of the interface, which in the constructor of the
> >> grouper, takes two parameters: a) the previous task number read from the
> >> coordinator stream; b) the configured new-partition to old-partition
> >> mapping policy. Then, the grouper's interface method stays the same and
> >> the
> >> behavior of the grouper is more configurable which is good to support a
> >> broader set of use cases in addition to Kafka's built-in partition
> >> expansion policies.
> >> - Minor renaming suggestion to the new grouper class names:
> >> GroupByPartitionWithFixedTaskNum
> >> and GroupBySystemStreamPartitionWithFixedTaskNum
> >>
> >> Thanks!
> >>
> >> - Yi
> >>
> >> On Wed, May 31, 2017 at 10:33 AM, Dong Lin <[email protected]> wrote:
> >>
> >> > Hey Navina,
> >> >
> >> > Thanks much for the comment. Please see my response below.
> >> >
> >> > Regarding your biggest gripe with the SEP, I personally think the
> >> > operational requirement proposed in the KIP are pretty general and
> >> could be
> >> > easily enforced by other systems. The reason is that the module
> >> operation
> >> > is pretty standard and the default option when we choose partition.
> And
> >> > usually the underlying system allows user to select arbitrary
> partition
> >> > number if it supports partition expansion. Do you know any system that
> >> does
> >> > not meet these two requirement?
> >> >
> >> > Regarding your comment of the Motivation section, I renamed the first
> >> > section as "Problem and Goal" and specified that "*The goal of this
> >> > proposal is to enable partition expansion of the input streams*.". I
> >> also
> >> > put a sentence at the end of the Motivation section that "*The feature
> >> of
> >> > task expansion is out of the scope of this proposal and will be
> >> addressed
> >> > in a future SEP*". The second paragraph in the Motivation section is
> >> mainly
> >> > used to explain the thinking process that we have gone through, what
> >> other
> >> > alternative we have considered, and we plan to do in Samza in the nex
> >> step.
> >> >
> >> > To answer your question why increasing the partition number will
> >> increase
> >> > the throughput of the kafka consumer in the container, Kafka consumer
> >> can
> >> > potentially fetch more data in one FetchResponse with more partitions
> in
> >> > the FetchRequest. This is because we limit the maximum amount of data
> >> that
> >> > can be fetch for a given partition in the FetchResponse. This by
> >> default is
> >> > set to 1 MB. And there is reason that we can not arbitrarily bump up
> >> this
> >> > limit.
> >> >
> >> > To answer your question how partition expansion in Kafka impacts the
> >> > clients, Kafka consumer is able to automatically detect new partition
> of
> >> > the topic and reassign all (both old and new) partitions across
> >> consumers
> >> > in the consumer group IF you tell consumer the topic to be subscribed.
> >> But
> >> > consumer in Samza's container uses another way of subscription.
> Instead
> >> of
> >> > subscribing to the topic, the consumer in Samza's container subscribes
> >> to
> >> > the specific partitions of the topic. In this case, if new partitions
> >> have
> >> > been added, Samza will need to explicitly subscribe to the new
> >> partitions
> >> > of the topic. The "Handle partition expansion while tasks are running"
> >> > section in the SEP addresses this issue in Samza -- it recalculates
> the
> >> job
> >> > model and restart container so that consumer can subscribe to the new
> >> > partitions.
> >> >
> >> > I will ask other dev to take a look at the proposal. I will start the
> >> > voting thread tomorrow if there is no further concern with the SEP.
> >> >
> >> > Thanks!
> >> > Dong
> >> >
> >> >
> >> > On Wed, May 31, 2017 at 12:01 AM, Navina Ramesh (Apache) <
> >> > [email protected]>
> >> > wrote:
> >> >
> >> > > Hey Dong,
> >> > >
> >> > > >  I have updated the motivation section to clarify this.
> >> > >
> >> > > Thanks for updating the motivation. Couple of notes here:
> >> > >
> >> > > 1.
> >> > > > "The motivation of increasing partition number of Kafka topic
> >> includes
> >> > 1)
> >> > > limit the maximum size of a partition in order to improve broker
> >> > > performance and 2) increase throughput of Kafka consumer in the
> Samza
> >> > > container."
> >> > >
> >> > > It's unclear to me how increasing the partition number will increase
> >> the
> >> > > throughput of the kafka consumer in the container? Theoretically,
> you
> >> > will
> >> > > still be consuming the same amount of data in the container,
> >> irrespective
> >> > > of whether it is coming from one partition or more than one expanded
> >> > > partitions. Can you please explain it for me here, what you mean by
> >> that?
> >> > >
> >> > > 2. I believe the second paragraph under motivation is simply talking
> >> > about
> >> > > the scope of the current SEP. It will be easier to read if what
> >> solution
> >> > is
> >> > > included in this SEP and what is left out as not in scope. (for
> >> example,
> >> > > expansions for stateful jobs is supported or not).
> >> > >
> >> > > > We need to persist the task-to-sspList mapping in the
> >> > > coordinator stream so that the job can derive the original number of
> >> > > partitions of each input stream regardless of how many times the
> >> > partition
> >> > > has expanded. Does this make sense?
> >> > >
> >> > > Yes. It does!
> >> > >
> >> > > > I am not sure how this is related to the locality though. Can you
> >> > clarify
> >> > > your question if I haven't answered your question?
> >> > >
> >> > > It's not related. I just meant to give an example of yet another
> >> > > coordinator message that is persisted. Your ssp-to-task mapping is
> >> > > following a similar pattern for persisting. Just wanted to clarify
> >> that.
> >> > >
> >> > > > Can you let me know if this, together with the answers in the
> >> previous
> >> > > email, addresses all your questions?
> >> > >
> >> > > Yes. I believe you have addressed most of my questions. Thanks for
> >> taking
> >> > > time to do that.
> >> > >
> >> > > > Is there specific question you have regarding partition
> >> > > expansion in Kafka?
> >> > >
> >> > > I guess my questions are on how partition expansion in Kafka impacts
> >> the
> >> > > clients. Iiuc, partition expansions are done manually in Kafka based
> >> on
> >> > the
> >> > > bytes-in rate of the partition. Do the existing kafka clients handle
> >> this
> >> > > expansion automatically? if yes, how does it work? If not, are there
> >> > plans
> >> > > to support it in the future?
> >> > >
> >> > > > Thus user's job should not need to bootstrap key/value store from
> >> the
> >> > > changelog topic.
> >> > >
> >> > > Why is this discussion relevant here? Key/value store / changelog
> >> topic
> >> > > partition is scoped with the context of a task. Since we are not
> >> changing
> >> > > the number of tasks, I don't think it is required to mention it
> here.
> >> > >
> >> > > > The new method takes the SystemStreamPartition-to-Task assignment
> >> from
> >> > > the previous job model which can be read from the coordinator
> stream.
> >> > >
> >> > > Jobmodel is currently not persisted to coordinator stream. In your
> >> > design,
> >> > > you talk about writing separate coordinator messages for ssp-to-task
> >> > > assignments. Hence, please correct this statement. It is kind of
> >> > misleading
> >> > > to the reader.
> >> > >
> >> > > My biggest gripe with this SEP is that it seems like a tailor-made
> >> > solution
> >> > > that relies on the semantics of the Kafka system and yet, we are
> >> trying
> >> > to
> >> > > masquerade that as operational requirements for other systems
> >> interacting
> >> > > with Samza. (Not to say that this is the first time such a choice is
> >> > being
> >> > > made in the Samza design). I am not seeing how this can a "general"
> >> > > solution for all input systems. That's my two cents. I would like to
> >> hear
> >> > > alternative points of view for this from other devs.
> >> > >
> >> > > Please make sure you have enough eyes on this SEP. If you do, please
> >> > start
> >> > > a VOTE thread to approve this SEP.
> >> > >
> >> > > Thanks!
> >> > > Navina
> >> > >
> >> > >
> >> > > On Mon, May 29, 2017 at 12:32 AM, Dong Lin <[email protected]>
> >> wrote:
> >> > >
> >> > > > Hey Navina,
> >> > > >
> >> > > > I have updated the wiki based on your suggestion. More
> >> specifically, I
> >> > > have
> >> > > > made the following changes:
> >> > > >
> >> > > > - Improved Problem section and Motivation section to describe why
> we
> >> > use
> >> > > > the solution in this proposal instead of tackling the problem of
> >> task
> >> > > > expansion directly.
> >> > > >
> >> > > > - Illustrate the design in a way that doesn't bind to Kafka. Kafka
> >> is
> >> > > only
> >> > > > used as example to illustrate why we want to expand partition
> >> expansion
> >> > > and
> >> > > > whether the operational requirement can be supported when Kafka is
> >> used
> >> > > as
> >> > > > the input system. Note that the proposed solution should work for
> >> any
> >> > > input
> >> > > > system that meets the operational requirement described in the
> wiki.
> >> > > >
> >> > > > - Fixed the problem in the figure.
> >> > > >
> >> > > > - Added a new class GroupBySystemStreamPartitionFixedTaskNum to
> the
> >> > > wiki.
> >> > > > Together with GroupByPartitionFixedTaskNum, it should ensure that
> we
> >> > > have a
> >> > > > solution to enable partition expansion for all users that are
> using
> >> > > > pre-defined grouper in Samza. Note that those users who use custom
> >> > > grouper
> >> > > > would need to update their implementation.
> >> > > >
> >> > > > Can you let me know if this, together with the answers in the
> >> previous
> >> > > > email, addresses all your questions? Thanks for taking time to
> >> review
> >> > the
> >> > > > proposal.
> >> > > >
> >> > > > Regards,
> >> > > > Dong
> >> > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Wed, May 24, 2017 at 11:15 PM, Dong Lin <[email protected]>
> >> > wrote:
> >> > > >
> >> > > > > Hey Navina,
> >> > > > >
> >> > > > > Thanks much for your comments. Please see my reply inline.
> >> > > > >
> >> > > > > On Wed, May 24, 2017 at 10:22 AM, Navina Ramesh (Apache) <
> >> > > > > [email protected]> wrote:
> >> > > > >
> >> > > > >> Thanks for the SEP, Dong. I have a couple of questions to
> >> understand
> >> > > > your
> >> > > > >> proposal better:
> >> > > > >>
> >> > > > >> * Under motivation, you mention that "_We expect this solution
> to
> >> > work
> >> > > > >> similarly with other input system as well._", yet I don't see
> any
> >> > > > >> discussion on how it will work with other input systems. That
> is,
> >> > what
> >> > > > >> kind
> >> > > > >> of contract does samza expect from other input systems ? If we
> >> are
> >> > not
> >> > > > >> planning to provide a generic solution, it might be worth
> >> calling it
> >> > > out
> >> > > > >> in
> >> > > > >> the SEP.
> >> > > > >>
> >> > > > >
> >> > > > > I think the contract we expect from other systems are exactly
> the
> >> > > > > operational requirement mentioned in the SEP, i.e. partitions
> >> should
> >> > > > always
> >> > > > > be doubled and the hash algorithm should module the number of
> >> > > partitions.
> >> > > > > SEP-5 should also allow partition expansion of all input systems
> >> that
> >> > > > meet
> >> > > > > these two requirements. I have updated the motivation section to
> >> > > clarify
> >> > > > > this.
> >> > > > >
> >> > > > >
> >> > > > >>
> >> > > > >> * I understand the partition mapping logic you have proposed.
> >> But I
> >> > > > think
> >> > > > >> the example explanation doesn't match the diagram. In the
> >> diagram,
> >> > > after
> >> > > > >> expansion, partiion-0 and partition-1 are pointing to bucket 0
> >> and
> >> > > > >> partition-3 and partition-4 are pointing to bucket 1. I think
> the
> >> > > former
> >> > > > >> has to be partition-0 and partition-2 and the latter, is
> >> partition-1
> >> > > and
> >> > > > >> partition-3. If I am wrong, please help me understand the logic
> >> :)
> >> > > > >>
> >> > > > >
> >> > > > > Good catch. I will update the figure to fix this problem.
> >> > > > >
> >> > > > >
> >> > > > >>
> >> > > > >> * I don't know how partition expansion in Kafka works. I am
> >> familiar
> >> > > > with
> >> > > > >> how shard splitting happens in Kinesis - there is hierarchical
> >> > > relation
> >> > > > >> between the parent and child shards. This way, it will also
> allow
> >> > the
> >> > > > >> shards to be merged back. Iiuc, Kafka only supports partition
> >> > > > "expansion",
> >> > > > >> as opposed to "splits". Can you provide some context or link
> >> related
> >> > > to
> >> > > > >> how
> >> > > > >> partition expansion works in Kafka?
> >> > > > >>
> >> > > > >
> >> > > > > I couldn't find any wiki on partition expansion in Kafka. The
> >> > partition
> >> > > > > expansion logic in Kafka is very simply -- it simply adds new
> >> > partition
> >> > > > to
> >> > > > > the existing topic. Is there specific question you have
> regarding
> >> > > > partition
> >> > > > > expansion in Kafka?
> >> > > > >
> >> > > > >
> >> > > > >>
> >> > > > >> * Are you only recommending that expansion can be supported for
> >> > samza
> >> > > > jobs
> >> > > > >> that use Kafka as input systems **and** configure the
> SSPGrouper
> >> as
> >> > > > >> GroupByPartitionFixedTaskNum? Sounds to me like this only
> applies
> >> > for
> >> > > > >> GroupByPartition. Please correct me if I am wrong. What is the
> >> > > > expectation
> >> > > > >> for custom SSP Groupers?
> >> > > > >>
> >> > > > >
> >> > > > > The expansion can be supported for Samza jobs if the input
> system
> >> > meets
> >> > > > > the operational requirement mentioned above. It doesn't have to
> >> use
> >> > > Kafka
> >> > > > > as input system.
> >> > > > >
> >> > > > > The current proposal provided solution for jobs that currently
> use
> >> > > > > GroupByPartition. The proposal can be extended to support jobs
> >> that
> >> > use
> >> > > > > other grouper that are pre-defined in Samza. The custom SSP
> >> grouper
> >> > > needs
> >> > > > > to handle partition expansion similar to how
> >> > > GroupByPartitionFixedTaskNum
> >> > > > > handles it and it is users' responsibility to update their
> custom
> >> > > grouper
> >> > > > > implementation.
> >> > > > >
> >> > > > >
> >> > > > >>
> >> > > > >> * Regarding storing SSP-to-Task assignment to coordinator
> stream:
> >> > > Today,
> >> > > > >> the JobModel encapsulates the data model in samza which also
> >> > includes
> >> > > > >> **TaskModels**. TaskModel, typically shows the task-to-sspList
> >> > > mapping.
> >> > > > >> What is the reason for using a separate coordinator stream
> >> message
> >> > > > >> *SetSSPTaskMapping*? Is it because the JobModel itself is not
> >> > > persisted
> >> > > > in
> >> > > > >> the coordinator stream today?  The reason locality exists
> >> outside of
> >> > > the
> >> > > > >> jobmodel is because *locality* information is written by each
> >> > > container,
> >> > > > >> where as it is consumed only by the leader jobcoordinator/AM.
> In
> >> > this
> >> > > > >> case,
> >> > > > >> the writer of the mapping information and the reader is still
> the
> >> > > leader
> >> > > > >> jobcoordinator/AM. So, I want to understand the motivation for
> >> this
> >> > > > >> choice.
> >> > > > >>
> >> > > > >
> >> > > > > Yes, the reason for using a separate coordinate stream message
> is
> >> > > because
> >> > > > > the task-to-sspList mapping is not currently persisted in the
> >> > > coordinator
> >> > > > > stream. We wouldn't need to create this new stream message if
> >> > JobModel
> >> > > is
> >> > > > > persisted. We need to persist the task-to-sspList mapping in the
> >> > > > > coordinator stream so that the job can derive the original
> number
> >> of
> >> > > > > partitions of each input stream regardless of how many times the
> >> > > > partition
> >> > > > > has expanded. Does this make sense?
> >> > > > >
> >> > > > > I am not sure how this is related to the locality though. Can
> you
> >> > > clarify
> >> > > > > your question if I haven't answered your question?
> >> > > > >
> >> > > > > Thanks!
> >> > > > > Dong
> >> > > > >
> >> > > > >
> >> > > > >>
> >> > > > >> Cheers!
> >> > > > >> Navina
> >> > > > >>
> >> > > > >> On Tue, May 23, 2017 at 5:45 PM, Dong Lin <[email protected]
> >
> >> > > wrote:
> >> > > > >>
> >> > > > >> > Hi all,
> >> > > > >> >
> >> > > > >> > We created SEP-5: Enable partition expansion of input
> streams.
> >> > > Please
> >> > > > >> find
> >> > > > >> > the SEP wiki in the link
> >> > > > >> > https://cwiki.apache.org/confluence/display/SAMZA/SEP-
> >> > > > >> > 5%3A+Enable+partition+expansion+of+input+streams
> >> > > > >> > .
> >> > > > >> >
> >> > > > >> > You feedback is appreciated!
> >> > > > >> >
> >> > > > >> > Thanks,
> >> > > > >> > Dong
> >> > > > >> >
> >> > > > >>
> >> > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Re: [DISCUSS] SEP-5: Enable partition expansion of input streams

Reply via email to