[ https://issues.apache.org/jira/browse/KAFKA-6144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Navinder Brar updated KAFKA-6144:
---------------------------------
    Description: 
Currently, when expanding the KS cluster, the new node's partitions will be unavailable during the rebalance, which for large states can take a very long time, while for small state stores even more than a few ms can be a deal-breaker for microservice use cases.

One workaround is to allow stale data to be read from the state stores when the use case allows it. Adding the use case from KAFKA-8994 as it is more descriptive.

"Consider the following scenario in a three-node Streams cluster with node A, node B and node R, executing a stateful sub-topology/topic group with 1 partition and `_num.standby.replicas=1_`:
* *t0*: A is the active instance owning the partition, B is the standby that keeps replicating A's state into its local disk, and R just routes Streams IQs to the active instance using StreamsMetadata.
* *t1*: IQs pick node R as the router, R forwards the query to A, and A responds back to R, which forwards the results back to the caller.
* *t2*: Active instance A is killed and a rebalance begins. IQs to A start failing.
* *t3*: The rebalance assignment happens and standby B is promoted to active instance. IQs continue to fail.
* *t4*: B fully catches up to the changelog tail and rewinds offsets to A's last commit position. IQs continue to fail.
* *t5*: IQs to R get routed to B, which is now ready to serve results. IQs start succeeding again.

Depending on the Kafka consumer group session/heartbeat timeouts, steps t2-t3 can take a few seconds (~10 seconds based on default values). Depending on how laggy the standby B was prior to A being killed, t4 can take anywhere from a few seconds to minutes.

While this behavior favors consistency over availability at all times, the long unavailability window might be undesirable for certain classes of applications (e.g. simple caches or dashboards).

This issue aims to also expose information about standby B to R during each rebalance, such that queries can be routed by an application to a standby to serve stale reads, choosing availability over consistency."


  was:
Currently, when expanding the KS cluster, the new node's partitions will be unavailable during the rebalance, which for large states can take a very long time, while for small state stores even more than a few ms can be a deal-breaker for microservice use cases.

One workaround is to allow stale data to be read from the state stores when the use case allows it.

Relates to KAFKA-6145 - Warm up new KS instances before migrating tasks - potentially a two-phase rebalance.

This is the description from KAFKA-6031 (keeping this JIRA as the title is more descriptive):
{quote}
Currently reads for a key are served by a single replica, which has 2 drawbacks:
- if the replica is down, there is downtime in serving reads for the keys it was responsible for until a standby replica takes over
- in case of semantic partitioning some replicas might become hot, and there is no easy way to scale the read load

If standby replicas had endpoints exposed in StreamsMetadata, it would enable serving reads from several replicas, which would mitigate the above drawbacks. Due to the lag between replicas, reading from multiple replicas simultaneously would have weaker (eventual) consistency compared to reads from a single replica. This, however, should be an acceptable tradeoff in many cases.
{quote}
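For context, this is roughly how an application routes key-based IQs today, before any standby information is exposed: the per-key metadata only identifies the single active host. A minimal sketch; the "word-counts" store name, key type, and the findActiveHost helper are made up for illustration:

{code:java}
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.state.HostInfo;
import org.apache.kafka.streams.state.StreamsMetadata;

public class ActiveOnlyRouting {

    // Locate the only instance that can currently serve an IQ for `key`
    // in the (hypothetical) "word-counts" store.
    static HostInfo findActiveHost(KafkaStreams streams, String key) {
        StreamsMetadata md =
                streams.metadataForKey("word-counts", key, Serdes.String().serializer());
        // Only the active owner (node A in the scenario) is exposed. If A is down,
        // the router (node R) has nowhere else to send the query until the rebalance
        // completes and the promoted standby finishes catching up (t2-t4 above).
        return md.hostInfo();
    }
}
{code}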
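And this is a sketch of the kind of standby-aware routing the issue asks for, roughly as it later took shape under KIP-535 (Kafka 2.5+): the per-key metadata also lists standby hosts, and the serving instance can opt into stale reads. Method names follow the newer accessors (older releases expose getActiveHost()/getStandbyHosts()); the store name and helper methods are again hypothetical:

{code:java}
import java.util.Set;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyQueryMetadata;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.HostInfo;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class StandbyAwareRouting {

    // Router side (node R): discover both the active owner and its standbys for `key`.
    static void pickHost(KafkaStreams streams, String key) {
        KeyQueryMetadata md =
                streams.queryMetadataForKey("word-counts", key, Serdes.String().serializer());
        HostInfo active = md.activeHost();           // node A in the scenario
        Set<HostInfo> standbys = md.standbyHosts();  // node B in the scenario
        // Try `active` first; if it is unreachable, fall back to one of `standbys`,
        // accepting a possibly stale answer (availability over consistency).
    }

    // Serving side (whichever instance the router picked): open the store with
    // stale reads enabled so a standby or restoring task may answer the query.
    static Long readPossiblyStale(KafkaStreams streams, String key) {
        ReadOnlyKeyValueStore<String, Long> store = streams.store(
                StoreQueryParameters
                        .fromNameAndType("word-counts", QueryableStoreTypes.<String, Long>keyValueStore())
                        .enableStaleStores());
        return store.get(key);
    }
}
{code}

On top of this, the same KIP added KafkaStreams#allLocalStorePartitionLags(), which reports per-partition offset lag for local stores, so a router can skip standbys that have fallen too far behind, which is the "in-sync" part of this issue's title.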
> Allow serving interactive queries from in-sync Standbys
> --------------------------------------------------------
>
>                 Key: KAFKA-6144
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6144
>             Project: Kafka
>          Issue Type: New Feature
>          Components: streams
>            Reporter: Antony Stubbs
>            Assignee: Navinder Brar
>            Priority: Major
>              Labels: kip-535
>         Attachments: image-2019-10-09-20-33-37-423.png, image-2019-10-09-20-47-38-096.png
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)