On Tue, Dec 19, 2017, at 02:16, Jan Filipiak wrote: > Sorry for coming back at this so late. > > > > On 11.12.2017 07:12, Colin McCabe wrote: > > On Sun, Dec 10, 2017, at 22:10, Colin McCabe wrote: > >> On Fri, Dec 8, 2017, at 01:16, Jan Filipiak wrote: > >>> Hi, > >>> > >>> sorry for the late reply, busy times :-/ > >>> > >>> I would ask you one thing maybe. Since the timeout > >>> argument seems to be settled I have no further argument > >>> from your side except the "I don't want to". > >>> > >>> Can you see that connections.max.idle.ms is the exact time > >>> that expresses "We expect the client to be away for this long, > >>> and come back and continue"? > >> Hi Jan, > >> > >> Sure, connections.max.idle.ms is the exact time that we want to keep > >> around a TCP session. TCP sessions are relatively cheap, so we can > >> afford to keep them around for 10 minutes by default. Incremental fetch > >> state is less cheap, so we want to set a shorter timeout for it. We > >> also want new TCP sessions to be able to reuse an existing incremental > >> fetch session rather than creating a new one and waiting for the old one > >> to time out. > >> > >>> also clarified some stuff inline > >>> > >>> Best Jan > >>> > >>> > >>> > >>> > >>> On 05.12.2017 23:14, Colin McCabe wrote: > >>>> On Tue, Dec 5, 2017, at 13:13, Jan Filipiak wrote: > >>>>> Hi Colin > >>>>> > >>>>> Addressing the topic of how to manage slots from the other thread. > >>>>> With TCP connections all this comes for free essentially. > >>>> Hi Jan, > >>>> > >>>> I don't think that it's accurate to say that cache management "comes for > >>>> free" by coupling the incremental fetch session with the TCP session. > >>>> When a new TCP session is started by a fetch request, you still have to > >>>> decide whether to grant that request an incremental fetch session or > >>>> not. If your answer is that you always grant the request, I would argue > >>>> that you do not have cache management. > >>> First I would say, the client has a big say in this. If the client > >>> is not going to issue incremental fetches, it shouldn't ask for a cache; > >>> when the client asks for the cache, we still have all options to deny it. > >> To put it simply, we have to have some cache management above and beyond > >> just giving out an incremental fetch session to anyone who has a TCP > >> session. Therefore, caching does not become simpler if you couple the > >> fetch session to the TCP session. > Simply giving out a fetch session for everyone with a connection is too > simple, > but I think it plays well into the idea of consumers choosing to use the > feature, > therefore only enabling it where it brings maximum gains > (replicas, MirrorMakers). > >> > >>>> I guess you could argue that timeouts are cache management, but I don't > >>>> find that argument persuasive. Anyone could just create a lot of TCP > >>>> sessions and use a lot of resources, in that case. So there is > >>>> essentially no limit on memory use. In any case, TCP sessions don't > >>>> help us implement fetch session timeouts. > >>> We still have all the options to deny the request to keep the state. > >>> What you want seems like a max-connections-per-IP safeguard. > >>> I can currently take down a broker with too many connections easily.
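To make the cache-management point concrete: whether the fetch state is keyed by the TCP connection or by a session ID, the broker still needs admission logic along the lines of the minimal sketch below, bounding the number of cached sessions and deciding who gets evicted. The class and method names here are hypothetical and are not taken from the KIP.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: a bounded pool of incremental fetch session slots.
// Followers are always admitted (evicting the oldest session if necessary);
// consumers are only admitted while free slots remain.
public class FetchSessionSlots {
    private final int maxSlots;
    private final Deque<Long> grantedSessions = new ArrayDeque<>();

    public FetchSessionSlots(int maxSlots) {
        this.maxSlots = maxSlots;
    }

    // Returns true if the caller may keep incremental fetch state on the broker.
    public synchronized boolean maybeGrant(long sessionId, boolean isFollower) {
        if (grantedSessions.size() < maxSlots) {
            grantedSessions.addLast(sessionId);
            return true;
        }
        if (isFollower) {
            grantedSessions.pollFirst();       // evict the oldest session
            grantedSessions.addLast(sessionId);
            return true;
        }
        return false;                          // the consumer falls back to full fetch requests
    }
}
```

A policy of this kind, rather than the existence of a TCP connection, is what actually limits broker memory use.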
> >>> > >>> > >>>>> I still would argue we disable it by default and make a flag in the > >>>>> broker to ask the leader to maintain the cache while replicating and > >>>>> also only > >>>>> have it optional in consumers (default to off) so one can turn it on > >>>>> where it really hurts. MirrorMaker and audit consumers prominently. > >>>> I agree with Jason's point from earlier in the thread. Adding extra > >>>> configuration knobs that aren't really necessary can harm usability. > >>>> Certainly asking people to manually turn on a feature "where it really > >>>> hurts" seems to fall in that category, when we could easily enable it > >>>> automatically for them. > >>> This doesn't make much sense to me. > >> There are no tradeoffs to think about from the client's point of view: > >> it always wants an incremental fetch session. So there is no benefit to > >> making the clients configure an extra setting. Updating and managing > >> client configurations is also more difficult than managing broker > >> configurations for most users. > >> > >>> You also wanted to implement > >>> a "turn of in case of bug"-knob. Having the client indicate if the > >>> feauture will be used seems reasonable to me., > >> True. However, if there is a bug, we could also roll back the client, > >> so having this configuration knob is not strictly required. > >> > >>>>> Otherwise I left a few remarks in-line, which should help to understand > >>>>> my view of the situation better > >>>>> > >>>>> Best Jan > >>>>> > >>>>> > >>>>> On 05.12.2017 08:06, Colin McCabe wrote: > >>>>>> On Mon, Dec 4, 2017, at 02:27, Jan Filipiak wrote: > >>>>>>> On 03.12.2017 21:55, Colin McCabe wrote: > >>>>>>>> On Sat, Dec 2, 2017, at 23:21, Becket Qin wrote: > >>>>>>>>> Thanks for the explanation, Colin. A few more questions. > >>>>>>>>> > >>>>>>>>>> The session epoch is not complex. It's just a number which > >>>>>>>>>> increments > >>>>>>>>>> on each incremental fetch. The session epoch is also useful for > >>>>>>>>>> debugging-- it allows you to match up requests and responses when > >>>>>>>>>> looking at log files. > >>>>>>>>> Currently each request in Kafka has a correlation id to help match > >>>>>>>>> the > >>>>>>>>> requests and responses. Is epoch doing something differently? > >>>>>>>> Hi Becket, > >>>>>>>> > >>>>>>>> The correlation ID is used within a single TCP session, to uniquely > >>>>>>>> associate a request with a response. The correlation ID is not > >>>>>>>> unique > >>>>>>>> (and has no meaning) outside the context of that single TCP session. > >>>>>>>> > >>>>>>>> Keep in mind, NetworkClient is in charge of TCP sessions, and > >>>>>>>> generally > >>>>>>>> tries to hide that information from the upper layers of the code. So > >>>>>>>> when you submit a request to NetworkClient, you don't know if that > >>>>>>>> request creates a TCP session, or reuses an existing one. > >>>>>>>>>> Unfortunately, this doesn't work. Imagine the client misses an > >>>>>>>>>> increment fetch response about a partition. And then the > >>>>>>>>>> partition is > >>>>>>>>>> never updated after that. The client has no way to know about the > >>>>>>>>>> partition, since it won't be included in any future incremental > >>>>>>>>>> fetch > >>>>>>>>>> responses. And there are no offsets to compare, since the > >>>>>>>>>> partition is > >>>>>>>>>> simply omitted from the response. > >>>>>>>>> I am curious about in which situation would the follower miss a > >>>>>>>>> response > >>>>>>>>> of a partition. If the entire FetchResponse is lost (e.g. 
timeout), > >>>>>>>>> the > >>>>>>>>> follower would disconnect and retry. That will result in sending a > >>>>>>>>> full > >>>>>>>>> FetchRequest. > >>>>>>>> Basically, you are proposing that we rely on TCP for reliable > >>>>>>>> delivery > >>>>>>>> in a distributed system. That isn't a good idea for a bunch of > >>>>>>>> different reasons. First of all, TCP timeouts tend to be very long. > >>>>>>>> So > >>>>>>>> if the TCP session timing out is your error detection mechanism, you > >>>>>>>> have to wait minutes for messages to timeout. Of course, we add a > >>>>>>>> timeout on top of that after which we declare the connection bad and > >>>>>>>> manually close it. But just because the session is closed on one end > >>>>>>>> doesn't mean that the other end knows that it is closed. So the > >>>>>>>> leader > >>>>>>>> may have to wait quite a long time before TCP decides that yes, > >>>>>>>> connection X from the follower is dead and not coming back, even > >>>>>>>> though > >>>>>>>> gremlins ate the FIN packet which the follower attempted to > >>>>>>>> translate. > >>>>>>>> If the cache state is tied to that TCP session, we have to keep that > >>>>>>>> cache around for a much longer time than we should. > >>>>>>> Hi, > >>>>>>> > >>>>>>> I see this from a different perspective. The cache expiry time > >>>>>>> has the same semantic as idle connection time in this scenario. > >>>>>>> It is the time range we expect the client to come back an reuse > >>>>>>> its broker side state. I would argue that on close we would get an > >>>>>>> extra shot at cleaning up the session state early. As opposed to > >>>>>>> always wait for that duration for expiry to happen. > >>>>>> Hi Jan, > >>>>>> > >>>>>> The idea here is that the incremental fetch cache expiry time can be > >>>>>> much shorter than the TCP session timeout. In general the TCP session > >>>>>> timeout is common to all TCP connections, and very long. To make these > >>>>>> numbers a little more concrete, the TCP session timeout is often > >>>>>> configured to be 2 hours on Linux. (See > >>>>>> https://www.cyberciti.biz/tips/linux-increasing-or-decreasing-tcp-sockets-timeouts.html > >>>>>> ) The timeout I was proposing for incremental fetch sessions was one > >>>>>> or > >>>>>> two minutes at most. > >>>>> Currently this is taken care of by > >>>>> connections.max.idle.ms on the broker and defaults to something of few > >>>>> minutes. > >>>> It is 10 minutes by default, which is longer than what we want the > >>>> incremental fetch session timeout to be. There's no reason to couple > >>>> these two things. > >>>> > >>>>> Also something we could let the client change if we really wanted to. > >>>>> So there is no need to worry about coupling our implementation to some > >>>>> timeouts given by the OS, with TCP one always has full control over the > >>>>> worst > >>>>> times + one gets the extra shot cleaning up early when the close comes > >>>>> through. > >>>>> Which is the majority of the cases. > >>>> In the majority of cases, the TCP session will be re-established. In > >>>> that case, we have to send a full fetch request rather than an > >>>> incremental fetch request. > >>> I actually have a hard time believing this. Do you have any numbers of > >>> any existing production system? Is it the virtualisation layer cutting > >>> all the connections? > >>> We see this only on application crashes and restarts where the app needs > >>> todo the full anyways > >>> as it probably continues with stores offsets. 
> >> Yes, TCP connections get dropped. It happens very often in production > >> clusters, actually. When I was working on Hadoop, one of the most > >> common questions I heard from newcomers was "why do I see so many > >> EOFException messages in the logs"? The other thing that happens a lot > >> is DNS outages or slowness. Public clouds seem to have even more > >> unstable networks than the on-premise clusters. I am not sure why that > >> is. > Hadoop has a wiki page on exactly this: > https://wiki.apache.org/hadoop/EOFException > > Besides user errors, they have servers crashing and actual loss of > connection high on their list. > In the case of "server goes away" the cache goes with it, so there is nothing to > argue about the cache being reused by > a new connection. > > Can you make an argument for the point at which the epoch would be updated on the > broker side to maximise reuse of the cache on > lost connections? In many cases the epoch would go out of sync and we > would need a full fetch anyway. Am I mistaken here?
The current proposal is that the server can accept multiple requests in a row with the same sequence number. Colin > > > > > >> > >>>>>>>> Secondly, from a software engineering perspective, it's not a good > >>>>>>>> idea > >>>>>>>> to try to tightly tie together TCP and our code. We would have to > >>>>>>>> rework how we interact with NetworkClient so that we are aware of > >>>>>>>> things > >>>>>>>> like TCP sessions closing or opening. We would have to be careful > >>>>>>>> preserve the ordering of incoming messages when doing things like > >>>>>>>> putting incoming requests on to a queue to be processed by multiple > >>>>>>>> threads. It's just a lot of complexity to add, and there's no > >>>>>>>> upside. > >>>>>>> I see the point here. And I had a small chat with Dong Lin already > >>>>>>> making me aware of this. I tried out the approaches and propose the > >>>>>>> following: > >>>>>>> > >>>>>>> The client start and does a full fetch. It then does incremental > >>>>>>> fetches. > >>>>>>> The connection to the broker dies and is re-established by > >>>>>>> NetworkClient > >>>>>>> under the hood. > >>>>>>> The broker sees an incremental fetch without having state => returns > >>>>>>> error: > >>>>>>> Client sees the error, does a full fetch and goes back to > >>>>>>> incrementally > >>>>>>> fetching. > >>>>>>> > >>>>>>> having this 1 additional error round trip is essentially the same as > >>>>>>> when something > >>>>>>> with the sessions or epoch changed unexpectedly to the client (say > >>>>>>> expiry). > >>>>>>> > >>>>>>> So its nothing extra added but the conditions are easier to evaluate. > >>>>>>> Especially since we do everything with NetworkClient. Other > >>>>>>> implementers > >>>>>>> on the > >>>>>>> protocol are free to optimizes this and do not do the errornours > >>>>>>> roundtrip on the > >>>>>>> new connection. > >>>>>>> Its a great plus that the client can know when the error is gonna > >>>>>>> happen. instead of > >>>>>>> the server to always have to report back if something changes > >>>>>>> unexpectedly for the client > >>>>>> You are assuming that the leader and the follower agree that the TCP > >>>>>> session drops at the same time. When there are network problems, this > >>>>>> may not be true. The leader may still think the previous TCP session > >>>>>> is > >>>>>> active. In that case, we have to keep the incremental fetch session > >>>>>> state around until we learn otherwise (which could be up to that 2 hour > >>>>>> timeout I mentioned). And if we get a new incoming incremental fetch > >>>>>> request, we can't assume that it replaces the previous one, because the > >>>>>> IDs will be different (the new one starts a new session). > >>>>> As mentioned, no reason to fear some time-outs out of our control > >>>>>>>> Imagine that I made an argument that client IDs are "complex" and > >>>>>>>> should > >>>>>>>> be removed from our APIs. After all, we can just look at the remote > >>>>>>>> IP > >>>>>>>> address and TCP port of each connection. Would you think that was a > >>>>>>>> good idea? The client ID is useful when looking at logs. For > >>>>>>>> example, > >>>>>>>> if a rebalance is having problems, you want to know what clients were > >>>>>>>> having a problem. So having the client ID field to guide you is > >>>>>>>> actually much less "complex" in practice than not having an ID. > >>>>>>> I still cant follow why the correlation idea will not help here. > >>>>>>> Correlating logs with it usually works great. 
Even with primitive > >>>>>>> tools > >>>>>>> like grep > >>>>>> The correlation ID does help somewhat, but certainly not as much as a > >>>>>> unique 64-bit ID. The correlation ID is not unique in the broker, just > >>>>>> unique to a single NetworkClient. Simiarly, the correlation ID is not > >>>>>> unique on the client side, if there are multiple Consumers, etc. > >>>>> Can always bump entropy in correlation IDs, never had a problem > >>>>> of finding to many duplicates. Would be a different KIP though. > >>>>>>>> Similarly, if metadata responses had epoch numbers (simple > >>>>>>>> incrementing > >>>>>>>> numbers), we would not have to debug problems like clients > >>>>>>>> accidentally > >>>>>>>> getting old metadata from servers that had been partitioned off from > >>>>>>>> the > >>>>>>>> network for a while. Clients would know the difference between old > >>>>>>>> and > >>>>>>>> new metadata. So putting epochs in to the metadata request is much > >>>>>>>> less > >>>>>>>> "complex" operationally, even though it's an extra field in the > >>>>>>>> request. > >>>>>>>> This has been discussed before on the mailing list. > >>>>>>>> > >>>>>>>> So I think the bottom line for me is that having the session ID and > >>>>>>>> session epoch, while it adds two extra fields, reduces operational > >>>>>>>> complexity and increases debuggability. It avoids tightly coupling > >>>>>>>> us > >>>>>>>> to assumptions about reliable ordered delivery which tend to be > >>>>>>>> violated > >>>>>>>> in practice in multiple layers of the stack. Finally, it avoids the > >>>>>>>> necessity of refactoring NetworkClient. > >>>>>>> So there is stacks out there that violate TCP guarantees? And software > >>>>>>> still works? How can this be? Can you elaborate a little where this > >>>>>>> can be violated? I am not very familiar with virtualized environments > >>>>>>> but they can't really violate TCP contracts. > >>>>>> TCP's guarantees of reliable, in-order transmission certainly can be > >>>>>> violated. For example, I once had to debug a cluster where a certain > >>>>>> node had a network card which corrupted its transmissions occasionally. > >>>>>> With all the layers of checksums, you would think that this was not > >>>>>> possible, but it happened. We occasionally got corrupted data written > >>>>>> to disk on the other end because of it. Even more frustrating, the > >>>>>> data > >>>>>> was not corrupted on disk on the sending node-- it was a bug in the > >>>>>> network card driver that was injecting the errors. > >>>>> true, but your broker might aswell read a corrupted 600GB as size from > >>>>> the network and die with OOM instantly. > >>>> If you read 600 GB as the size from the network, you will not "die with > >>>> OOM instantly." That would be a bug. Instead, you will notice that 600 > >>>> GB is greater than max.message.bytes, and close the connection. > >>> We only check max.message.bytes to late to guard against consumer > >>> stalling. > >>> we dont have a notion of max.networkpacket.size before we allocate the > >>> bytebuffer to read it into. > >> "network packets" are not the same thing as "kafka RPCs." One Kafka RPC > >> could take up mutiple ethernet packets. > >> > >> Also, max.message.bytes has nothing to do with "consumer stalling" -- > >> you are probably thinking about some of the fetch request > >> configurations. max.message.bytes is used by the RPC system to figure > >> out whether to read the full incoming RP > > Whoops, this is incorrect. 
I was thinking about > > "socket.request.max.bytes" rather than "max.message.bytes." Sorry about > > that. See Ismael's email as well. > > > > best, > > Colin > > > >> best, > >> Colin > >> > >>>>> Optimizing for still having functional > >>>>> software under this circumstances is not reasonable. > >>>>> You want to get rid of such a > >>>>> node ASAP and pray that zookeepers ticks get corrupted often enough > >>>>> that it finally drops out of the cluster. > >>>>> > >>>>> There is a good reason that these kinda things > >>>>> https://issues.apache.org/jira/browse/MESOS-4105 > >>>>> don't end up as kafka Jiras. In the end you can't run any software in > >>>>> these containers anymore. Application layer checksums are a neat thing > >>>>> to > >>>>> fail fast but trying to cope with this probably causes more bad than > >>>>> good. So I would argue that we shouldn't try this for the fetch > >>>>> requests. > >>>> One of the goals of Apache Kafka is to be "a streaming platform... > >>>> [that] lets you store streams of records in a fault-tolerant way." For > >>>> more information, see https://kafka.apache.org/intro . Fault-tolerance > >>>> is explicitly part of the goal of Kafka. Prayer should be optional, not > >>>> required, when running the software. > >>> Yes, we need to fail ASAP when we read corrupted packages. It seemed > >>> to me like you tried to make the case for pray and try to stay alive. > >>> Fault > >>> tolerance here means. I am a fishy box i am going to let a good box > >>> handle > >>> it and be silent until i get fixed up. > >>>> Crashing because someone sent you a bad packet is not reasonable > >>>> behavior. It is a bug. Similarly, bringing down the whole cluster, > >>>> which could a hundred nodes, because someone had a bad network adapter > >>>> is not reasonable behavior. It is perhaps reasonable for the cluster to > >>>> perform worse when hardware is having problems. But that's a different > >>>> discussion. > >>> See above. > >>>> best, > >>>> Colin > >>>> > >>>>>> However, my point was not about TCP's guarantees being violated. My > >>>>>> point is that TCP's guarantees are only one small building block to > >>>>>> build a robust distributed system. TCP basically just says that if you > >>>>>> get any bytes from the stream, you will get the ones that were sent by > >>>>>> the sender, in the order they were sent. TCP does not guarantee that > >>>>>> the bytes you send will get there. It does not guarantee that if you > >>>>>> close the connection, the other end will know about it in a timely > >>>>>> fashion. > >>>>> These are very powerful grantees and since we use TCP we should > >>>>> piggy pack everything that is reasonable on to it. IMO there is no > >>>>> need to reimplement correct sequencing again if you get that from > >>>>> your transport layer. It saves you the complexity, it makes > >>>>> you application behave way more naturally and your api easier to > >>>>> understand. > >>>>> > >>>>> There is literally nothing the Kernel wont let you decide > >>>>> especially not any timings. Only noticeable exception being TIME_WAIT > >>>>> of usually 240 seconds but that already has little todo with the broker > >>>>> itself and > >>>>> if we are running out of usable ports because of this then expiring > >>>>> fetch requests > >>>>> wont help much anyways. > >>>>> > >>>>> I hope I could strengthen the trust you have in userland TCP connection > >>>>> management. It is really powerful and can be exploited for maximum gains > >>>>> without much risk in my opinion. 
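On the corrupted "600 GB size" example discussed above, the guard is to validate the size prefix of an incoming request against a configured bound, such as socket.request.max.bytes, before allocating any buffer. A rough sketch of that check follows; the class and field names are made up for illustration and this is not the broker's actual networking code.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

// Hypothetical sketch: reject an absurd size prefix (e.g. a corrupted
// "600 GB" length field) before allocating any buffer for the request.
public class SizePrefixedReader {
    private final int maxRequestBytes; // e.g. the value of socket.request.max.bytes

    public SizePrefixedReader(int maxRequestBytes) {
        this.maxRequestBytes = maxRequestBytes;
    }

    public ByteBuffer readRequest(InputStream in) throws IOException {
        DataInputStream din = new DataInputStream(in);
        int size = din.readInt();                      // 4-byte size prefix of the RPC
        if (size < 0 || size > maxRequestBytes) {
            // Fail the connection instead of allocating: a corrupted or
            // malicious size field must not cause an OutOfMemoryError.
            throw new IOException("Request size " + size + " exceeds limit " + maxRequestBytes);
        }
        byte[] payload = new byte[size];
        din.readFully(payload);
        return ByteBuffer.wrap(payload);
    }
}
```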
> >>>>> > >>>>> > >>>>> > >>>>>> It does not guarantee that the bytes will be received in a > >>>>>> certain timeframe, and certainly doesn't guarantee that if you send a > >>>>>> byte on connection X and then on connection Y, that the remote end will > >>>>>> read a byte on X before reading a byte on Y. > >>>>> Noone expects this from two independent paths of any kind. > >>>>> > >>>>>> best, > >>>>>> Colin > >>>>>> > >>>>>>> Hope this made my view clearer, especially the first part. > >>>>>>> > >>>>>>> Best Jan > >>>>>>> > >>>>>>> > >>>>>>>> best, > >>>>>>>> Colin > >>>>>>>> > >>>>>>>> > >>>>>>>>> If there is an error such as NotLeaderForPartition is > >>>>>>>>> returned for some partitions, the follower can always send a full > >>>>>>>>> FetchRequest. Is there a scenario that only some of the partitions > >>>>>>>>> in a > >>>>>>>>> FetchResponse is lost? > >>>>>>>>> > >>>>>>>>> Thanks, > >>>>>>>>> > >>>>>>>>> Jiangjie (Becket) Qin > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Sat, Dec 2, 2017 at 2:37 PM, Colin McCabe<cmcc...@apache.org> > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> On Fri, Dec 1, 2017, at 11:46, Dong Lin wrote: > >>>>>>>>>>> On Thu, Nov 30, 2017 at 9:37 AM, Colin McCabe<cmcc...@apache.org> > >>>>>>>>>> wrote: > >>>>>>>>>>>> On Wed, Nov 29, 2017, at 18:59, Dong Lin wrote: > >>>>>>>>>>>>> Hey Colin, > >>>>>>>>>>>>> > >>>>>>>>>>>>> Thanks much for the update. I have a few questions below: > >>>>>>>>>>>>> > >>>>>>>>>>>>> 1. I am not very sure that we need Fetch Session Epoch. It > >>>>>>>>>>>>> seems that > >>>>>>>>>>>>> Fetch > >>>>>>>>>>>>> Session Epoch is only needed to help leader distinguish between > >>>>>>>>>>>>> "a > >>>>>>>>>> full > >>>>>>>>>>>>> fetch request" and "a full fetch request and request a new > >>>>>>>>>> incremental > >>>>>>>>>>>>> fetch session". Alternatively, follower can also indicate "a > >>>>>>>>>>>>> full > >>>>>>>>>> fetch > >>>>>>>>>>>>> request and request a new incremental fetch session" by setting > >>>>>>>>>>>>> Fetch > >>>>>>>>>>>>> Session ID to -1 without using Fetch Session Epoch. Does this > >>>>>>>>>>>>> make > >>>>>>>>>> sense? > >>>>>>>>>>>> Hi Dong, > >>>>>>>>>>>> > >>>>>>>>>>>> The fetch session epoch is very important for ensuring > >>>>>>>>>>>> correctness. It > >>>>>>>>>>>> prevents corrupted or incomplete fetch data due to network > >>>>>>>>>>>> reordering > >>>>>>>>>> or > >>>>>>>>>>>> loss. > >>>>>>>>>>>> > >>>>>>>>>>>> For example, consider a scenario where the follower sends a fetch > >>>>>>>>>>>> request to the leader. The leader responds, but the response is > >>>>>>>>>>>> lost > >>>>>>>>>>>> because of network problems which affected the TCP session. In > >>>>>>>>>>>> that > >>>>>>>>>>>> case, the follower must establish a new TCP session and re-send > >>>>>>>>>>>> the > >>>>>>>>>>>> incremental fetch request. But the leader does not know that the > >>>>>>>>>>>> follower didn't receive the previous incremental fetch response. > >>>>>>>>>>>> It is > >>>>>>>>>>>> only the incremental fetch epoch which lets the leader know that > >>>>>>>>>>>> it > >>>>>>>>>>>> needs to resend that data, and not data which comes afterwards. > >>>>>>>>>>>> > >>>>>>>>>>>> You could construct similar scenarios with message reordering, > >>>>>>>>>>>> duplication, etc. Basically, this is a stateful protocol on an > >>>>>>>>>>>> unreliable network, and you need to know whether the follower > >>>>>>>>>>>> got the > >>>>>>>>>>>> previous data you sent before you move on. 
And you need to > >>>>>>>>>>>> handle > >>>>>>>>>>>> issues like duplicated or delayed requests. These issues do not > >>>>>>>>>>>> affect > >>>>>>>>>>>> the full fetch request, because it is not stateful-- any full > >>>>>>>>>>>> fetch > >>>>>>>>>>>> request can be understood and properly responded to in isolation. > >>>>>>>>>>>> > >>>>>>>>>>> Thanks for the explanation. This makes sense. On the other hand I > >>>>>>>>>>> would > >>>>>>>>>>> be interested in learning more about whether Becket's solution > >>>>>>>>>>> can help > >>>>>>>>>>> simplify the protocol by not having the echo field and whether > >>>>>>>>>>> that is > >>>>>>>>>>> worth doing. > >>>>>>>>>> Hi Dong, > >>>>>>>>>> > >>>>>>>>>> I commented about this in the other thread. A solution which > >>>>>>>>>> doesn't > >>>>>>>>>> maintain session information doesn't work here. > >>>>>>>>>> > >>>>>>>>>>>>> 2. It is said that Incremental FetchRequest will include > >>>>>>>>>>>>> partitions > >>>>>>>>>> whose > >>>>>>>>>>>>> fetch offset or maximum number of fetch bytes has been changed. > >>>>>>>>>>>>> If > >>>>>>>>>>>>> follower's logStartOffet of a partition has changed, should this > >>>>>>>>>>>>> partition also be included in the next FetchRequest to the > >>>>>>>>>>>>> leader? > >>>>>>>>>>>> Otherwise, it > >>>>>>>>>>>>> may affect the handling of DeleteRecordsRequest because leader > >>>>>>>>>>>>> may > >>>>>>>>>> not > >>>>>>>>>>>> know > >>>>>>>>>>>>> the corresponding data has been deleted on the follower. > >>>>>>>>>>>> Yeah, the follower should include the partition if the > >>>>>>>>>>>> logStartOffset > >>>>>>>>>>>> has changed. That should be spelled out on the KIP. Fixed. > >>>>>>>>>>>> > >>>>>>>>>>>>> 3. In the section "Per-Partition Data", a partition is not > >>>>>>>>>>>>> considered > >>>>>>>>>>>>> dirty if its log start offset has changed. Later in the section > >>>>>>>>>>>> "FetchRequest > >>>>>>>>>>>>> Changes", it is said that incremental fetch responses will > >>>>>>>>>>>>> include a > >>>>>>>>>>>>> partition if its logStartOffset has changed. It seems > >>>>>>>>>>>>> inconsistent. > >>>>>>>>>> Can > >>>>>>>>>>>>> you update the KIP to clarify it? > >>>>>>>>>>>>> > >>>>>>>>>>>> In the "Per-Partition Data" section, it does say that > >>>>>>>>>>>> logStartOffset > >>>>>>>>>>>> changes make a partition dirty, though, right? The first bullet > >>>>>>>>>>>> point > >>>>>>>>>>>> is: > >>>>>>>>>>>> > >>>>>>>>>>>>> * The LogCleaner deletes messages, and this changes the log > >>>>>>>>>>>>> start > >>>>>>>>>> offset > >>>>>>>>>>>> of the partition on the leader., or > >>>>>>>>>>>> > >>>>>>>>>>> Ah I see. I think I didn't notice this because statement assumes > >>>>>>>>>>> that the > >>>>>>>>>>> LogStartOffset in the leader only changes due to LogCleaner. In > >>>>>>>>>>> fact the > >>>>>>>>>>> LogStartOffset can change on the leader due to either log > >>>>>>>>>>> retention and > >>>>>>>>>>> DeleteRecordsRequest. I haven't verified whether LogCleaner can > >>>>>>>>>>> change > >>>>>>>>>>> LogStartOffset though. It may be a bit better to just say that a > >>>>>>>>>>> partition is considered dirty if LogStartOffset changes. > >>>>>>>>>> I agree. It should be straightforward to just resend the > >>>>>>>>>> partition if > >>>>>>>>>> logStartOffset changes. > >>>>>>>>>> > >>>>>>>>>>>>> 4. In "Fetch Session Caching" section, it is said that each > >>>>>>>>>>>>> broker > >>>>>>>>>> has a > >>>>>>>>>>>>> limited number of slots. How is this number determined? 
Does > >>>>>>>>>>>>> this > >>>>>>>>>> require > >>>>>>>>>>>>> a new broker config for this number? > >>>>>>>>>>>> Good point. I added two broker configuration parameters to > >>>>>>>>>>>> control > >>>>>>>>>> this > >>>>>>>>>>>> number. > >>>>>>>>>>>> > >>>>>>>>>>> I am curious to see whether we can avoid some of these new > >>>>>>>>>>> configs. For > >>>>>>>>>>> example, incremental.fetch.session.cache.slots.per.broker is > >>>>>>>>>>> probably > >>>>>>>>>> not > >>>>>>>>>>> necessary because if a leader knows that a FetchRequest comes > >>>>>>>>>>> from a > >>>>>>>>>>> follower, we probably want the leader to always cache the > >>>>>>>>>>> information > >>>>>>>>>>> from that follower. Does this make sense? > >>>>>>>>>> Yeah, maybe we can avoid having > >>>>>>>>>> incremental.fetch.session.cache.slots.per.broker. > >>>>>>>>>> > >>>>>>>>>>> Maybe we can discuss the config later after there is agreement on > >>>>>>>>>>> how the > >>>>>>>>>>> protocol would look like. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>>>> What is the error code if broker does > >>>>>>>>>>>>> not have new log for the incoming FetchRequest? > >>>>>>>>>>>> Hmm, is there a typo in this question? Maybe you meant to ask > >>>>>>>>>>>> what > >>>>>>>>>>>> happens if there is no new cache slot for the incoming > >>>>>>>>>>>> FetchRequest? > >>>>>>>>>>>> That's not an error-- the incremental fetch session ID just gets > >>>>>>>>>>>> set to > >>>>>>>>>>>> 0, indicating no incremental fetch session was created. > >>>>>>>>>>>> > >>>>>>>>>>> Yeah there is a typo. You have answered my question. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>>>> 5. Can you clarify what happens if follower adds a partition to > >>>>>>>>>>>>> the > >>>>>>>>>>>>> ReplicaFetcherThread after receiving LeaderAndIsrRequest? Does > >>>>>>>>>>>>> leader > >>>>>>>>>>>>> needs to generate a new session for this ReplicaFetcherThread or > >>>>>>>>>> does it > >>>>>>>>>>>> re-use > >>>>>>>>>>>>> the existing session? If it uses a new session, is the old > >>>>>>>>>>>>> session > >>>>>>>>>>>>> actively deleted from the slot? > >>>>>>>>>>>> The basic idea is that you can't make changes, except by sending > >>>>>>>>>>>> a full > >>>>>>>>>>>> fetch request. However, perhaps we can allow the client to > >>>>>>>>>>>> re-use its > >>>>>>>>>>>> existing session ID. If the client sets sessionId = id, epoch = > >>>>>>>>>>>> 0, it > >>>>>>>>>>>> could re-initialize the session. > >>>>>>>>>>>> > >>>>>>>>>>> Yeah I agree with the basic idea. We probably want to understand > >>>>>>>>>>> more > >>>>>>>>>>> detail about how this works later. > >>>>>>>>>> Sounds good. I updated the KIP with this information. A > >>>>>>>>>> re-initialization should be exactly the same as an initialization, > >>>>>>>>>> except that it reuses an existing ID. > >>>>>>>>>> > >>>>>>>>>> best, > >>>>>>>>>> Colin > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>>>> BTW, I think it may be useful if the KIP can include the example > >>>>>>>>>> workflow > >>>>>>>>>>>>> of how this feature will be used in case of partition change > >>>>>>>>>>>>> and so > >>>>>>>>>> on. > >>>>>>>>>>>> Yeah, that might help. > >>>>>>>>>>>> > >>>>>>>>>>>> best, > >>>>>>>>>>>> Colin > >>>>>>>>>>>> > >>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>> Dong > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Wed, Nov 29, 2017 at 12:13 PM, Colin > >>>>>>>>>>>>> McCabe<cmcc...@apache.org> > >>>>>>>>>>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> I updated the KIP with the ideas we've been discussing. 
> >>>>>>>>>>>>>> > >>>>>>>>>>>>>> best, > >>>>>>>>>>>>>> Colin > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Tue, Nov 28, 2017, at 08:38, Colin McCabe wrote: > >>>>>>>>>>>>>>> On Mon, Nov 27, 2017, at 22:30, Jan Filipiak wrote: > >>>>>>>>>>>>>>>> Hi Colin, thank you for this KIP, it can become a really > >>>>>>>>>> useful > >>>>>>>>>>>> thing. > >>>>>>>>>>>>>>>> I just scanned through the discussion so far and wanted to > >>>>>>>>>> start a > >>>>>>>>>>>>>>>> thread to make as decision about keeping the > >>>>>>>>>>>>>>>> cache with the Connection / Session or having some sort of > >>>>>>>>>>>>>>>> UUID > >>>>>>>>>>>> indN > >>>>>>>>>>>>>> exed > >>>>>>>>>>>>>>>> global Map. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Sorry if that has been settled already and I missed it. In > >>>>>>>>>>>>>>>> this > >>>>>>>>>>>> case > >>>>>>>>>>>>>>>> could anyone point me to the discussion? > >>>>>>>>>>>>>>> Hi Jan, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> I don't think anyone has discussed the idea of tying the cache > >>>>>>>>>> to an > >>>>>>>>>>>>>>> individual TCP session yet. I agree that since the cache is > >>>>>>>>>>>> intended to > >>>>>>>>>>>>>>> be used only by a single follower or client, it's an > >>>>>>>>>>>>>>> interesting > >>>>>>>>>>>> thing > >>>>>>>>>>>>>>> to think about. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> I guess the obvious disadvantage is that whenever your TCP > >>>>>>>>>> session > >>>>>>>>>>>>>>> drops, you have to make a full fetch request rather than an > >>>>>>>>>>>> incremental > >>>>>>>>>>>>>>> one. It's not clear to me how often this happens in practice > >>>>>>>>>>>>>>> -- > >>>>>>>>>> it > >>>>>>>>>>>>>>> probably depends a lot on the quality of the network. From a > >>>>>>>>>> code > >>>>>>>>>>>>>>> perspective, it might also be a bit difficult to access data > >>>>>>>>>>>> associated > >>>>>>>>>>>>>>> with the Session from classes like KafkaApis (although we > >>>>>>>>>>>>>>> could > >>>>>>>>>>>> refactor > >>>>>>>>>>>>>>> it to make this easier). > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> It's also clear that even if we tie the cache to the session, > >>>>>>>>>>>>>>> we > >>>>>>>>>>>> still > >>>>>>>>>>>>>>> have to have limits on the number of caches we're willing to > >>>>>>>>>> create. > >>>>>>>>>>>>>>> And probably we should reserve some cache slots for each > >>>>>>>>>> follower, so > >>>>>>>>>>>>>>> that clients don't take all of them. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Id rather see a protocol in which the client is hinting the > >>>>>>>>>> broker > >>>>>>>>>>>>>> that, > >>>>>>>>>>>>>>>> he is going to use the feature instead of a client > >>>>>>>>>>>>>>>> realizing that the broker just offered the feature > >>>>>>>>>>>>>>>> (regardless > >>>>>>>>>> of > >>>>>>>>>>>>>>>> protocol version which should only indicate that the feature > >>>>>>>>>>>>>>>> would be usable). > >>>>>>>>>>>>>>> Hmm. I'm not sure what you mean by "hinting." I do think > >>>>>>>>>>>>>>> that > >>>>>>>>>> the > >>>>>>>>>>>>>>> server should have the option of not accepting incremental > >>>>>>>>>> requests > >>>>>>>>>>>> from > >>>>>>>>>>>>>>> specific clients, in order to save memory space. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> This seems to work better with a per > >>>>>>>>>>>>>>>> connection/session attached Metadata than with a Map and > >>>>>>>>>>>>>>>> could > >>>>>>>>>>>> allow > >>>>>>>>>>>>>> for > >>>>>>>>>>>>>>>> easier client implementations. > >>>>>>>>>>>>>>>> It would also make Client-side code easier as there wouldn't > >>>>>>>>>> be any > >>>>>>>>>>>>>>>> Cache-miss error Messages to handle. 
> >>>>>>>>>>>>>>> It is nice not to have to handle cache-miss responses, I > >>>>>>>>>>>>>>> agree. > >>>>>>>>>>>>>>> However, TCP sessions aren't exposed to most of our > >>>>>>>>>>>>>>> client-side > >>>>>>>>>> code. > >>>>>>>>>>>>>>> For example, when the Producer creates a message and hands it > >>>>>>>>>> off to > >>>>>>>>>>>> the > >>>>>>>>>>>>>>> NetworkClient, the NC will transparently re-connect and > >>>>>>>>>>>>>>> re-send a > >>>>>>>>>>>>>>> message if the first send failed. The higher-level code will > >>>>>>>>>> not be > >>>>>>>>>>>>>>> informed about whether the TCP session was re-established, > >>>>>>>>>> whether an > >>>>>>>>>>>>>>> existing TCP session was used, and so on. So overall I would > >>>>>>>>>> still > >>>>>>>>>>>> lean > >>>>>>>>>>>>>>> towards not coupling this to the TCP session... > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> best, > >>>>>>>>>>>>>>> Colin > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Thank you again for the KIP. And again, if this was > >>>>>>>>>>>>>>>> clarified > >>>>>>>>>>>> already > >>>>>>>>>>>>>>>> please drop me a hint where I could read about it. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Best Jan > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> On 21.11.2017 22:02, Colin McCabe wrote: > >>>>>>>>>>>>>>>>> Hi all, > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> I created a KIP to improve the scalability and latency of > >>>>>>>>>>>>>> FetchRequest: > >>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP- > >>>>>>>>>>>>>> 227%3A+Introduce+Incremental+FetchRequests+to+Increase+ > >>>>>>>>>>>>>> Partition+Scalability > >>>>>>>>>>>>>>>>> Please take a look. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> cheers, > >>>>>>>>>>>>>>>>> Colin >
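Finally, as a rough sketch of the example workflow Dong asked for, a full fetch that establishes a session, incremental fetches that advance the epoch, and re-initialization by reusing the session ID with epoch 0, under the assumption that session ID 0 / epoch 0 denotes "no session". The constants and strings below are illustrative only, not the KIP's wire format.

```java
// Hypothetical sketch of the client-side request sequence, not the KIP's
// final wire format. sessionId 0 / epoch 0 is assumed to mean "full fetch,
// no session"; reusing an existing sessionId with epoch 0 re-initializes it.
public class FetchSessionState {
    public static final int NO_SESSION_ID = 0;
    public static final int INITIAL_EPOCH = 0;

    private int sessionId = NO_SESSION_ID;
    private int epoch = INITIAL_EPOCH;

    // 1. First request: a full fetch that asks the broker for a session.
    public String nextFullFetch() {
        epoch = INITIAL_EPOCH;
        return "FetchRequest(sessionId=" + sessionId + ", epoch=" + epoch + ", allPartitions)";
    }

    // 2. The broker granted a session in its response.
    public void onSessionGranted(int grantedSessionId) {
        sessionId = grantedSessionId;
    }

    // 3. Subsequent requests: incremental, sending only changed partitions.
    public String nextIncrementalFetch() {
        epoch++;
        return "FetchRequest(sessionId=" + sessionId + ", epoch=" + epoch + ", changedPartitionsOnly)";
    }

    // 4. After an error (e.g. the broker evicted the session), re-initialize
    //    by reusing the existing sessionId with epoch 0 and a full partition list.
    public String reinitialize() {
        epoch = INITIAL_EPOCH;
        return "FetchRequest(sessionId=" + sessionId + ", epoch=" + epoch + ", allPartitions)";
    }
}
```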