[ 
https://issues.apache.org/jira/browse/KAFKA-9137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Bradstreet updated KAFKA-9137:
------------------------------------
    Description: 
We have recently seen cases where brokers end up in a bad state in which fetch 
session evictions occur at a high rate (> 16 per second) after a roll. This 
increase in the eviction rate was accompanied by the following pattern in our logs:

 
{noformat}
broker 6: October 31st 2019, 17:52:45.496 Created a new incremental 
FetchContext for session id 2046264334, epoch 9790: added (), updated (), 
removed ()

broker 6: October 31st 2019, 17:52:45.496 Created a new incremental 
FetchContext for session id 2046264334, epoch 9791: added (), updated (), 
removed ()

broker 6: October 31st 2019, 17:52:45.500 Created a new incremental 
FetchContext for session id 2046264334, epoch 9792: added (), updated 
(lkc-7nv6o_tenant_soak_topic_144p-67), removed () 

broker 6: October 31st 2019, 17:52:45.501 Created a new incremental 
FetchContext for session id 2046264334, epoch 9793: added (), updated 
(lkc-7nv6o_tenant_soak_topic_144p-59, lkc-7nv6o_tenant_soak_topic_144p-123, 
lkc-7nv6o_tenant_soak_topic_144p-11, lkc-7nv6o_tenant_soak_topic_144p-3, 
lkc-7nv6o_tenant_soak_topic_144p-67, lkc-7nv6o_tenant_soak_topic_144p-115), 
removed () 

broker 6: October 31st 2019, 17:52:45.501 Evicting stale FetchSession 
2046264334. 

broker 6: October 31st 2019, 17:52:45.502 Session error for 2046264334: no such 
session ID found. 

broker 4: October 31st 2019, 17:52:45.813 [ReplicaFetcher replicaId=4, 
leaderId=6, fetcherId=0] Node 6 was unable to process the fetch request with 
(sessionId=2046264334, epoch=9793): FETCH_SESSION_ID_NOT_FOUND.  
{noformat}
This pattern appears to be problematic for two reasons. Firstly, the replica 
fetcher for broker 4 was clearly able to send multiple incremental fetch 
requests to broker 6 and receive replies, right up to the point where broker 6 
evicted its fetch session within milliseconds of those requests. Secondly, 
replica fetchers are considered privileged for the fetch session cache and 
should not be evicted by consumer fetch sessions. This cluster has only 12 
brokers and 1000 fetch session cache slots (the default for 
max.incremental.fetch.session.cache.slots), so the number of replica fetcher 
sessions is far below the slot count and it is very unlikely that this session 
should have been evicted by another replica fetcher session.
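For reference, the behaviour we expect from the cache is roughly: once the 
cache is full, a follower's (privileged) session should only ever be displaced 
by another privileged session, never by a consumer session. A minimal sketch of 
that intended policy (class and method names here are illustrative, not the 
actual FetchSessionCache code):
{code:scala}
// Sketch only: a hypothetical model of the eviction rule described above,
// NOT the broker's actual FetchSessionCache implementation.
case class CachedSession(id: Int, privileged: Boolean, lastUsedMs: Long)

class SessionCacheSketch(maxSlots: Int) {
  private val sessions = scala.collection.mutable.Map.empty[Int, CachedSession]

  // Which existing session, if any, should be dropped to admit `candidate`?
  // A consumer (unprivileged) candidate must never displace a follower session.
  def evictionTarget(candidate: CachedSession): Option[CachedSession] =
    if (sessions.size < maxSlots) None                        // free slot, nothing to evict
    else {
      val evictable = sessions.values.filter(s => candidate.privileged || !s.privileged)
      if (evictable.isEmpty) None                              // reject the candidate instead
      else Some(evictable.minBy(_.lastUsedMs))                 // drop the least recently used
    }

  def add(s: CachedSession): Unit = sessions(s.id) = s
  def remove(id: Int): Unit = sessions.remove(id)
}
{code}
Under a rule like that, with only a handful of privileged sessions per broker 
in a 12-broker cluster, broker 4's replica fetcher session should never have 
been the eviction victim.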

This cluster also appears to be stuck in cycles of fetch session evictions, 
never stabilizing into a state where sessions stop being evicted. The above 
logs are the clearest example I could find of a session being evicted when it 
clearly should not have been.
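
To tell whether a cluster is stuck in this eviction cycle, the fetch session 
cache metrics added by KIP-227 can be polled over JMX. A quick sketch, assuming 
the metric names below and a broker JMX port of 9999 (both worth verifying 
against the running broker version):
{code:scala}
import javax.management.ObjectName
import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}

object FetchSessionEvictionCheck {
  def main(args: Array[String]): Unit = {
    // Assumed JMX endpoint for the broker under investigation.
    val url  = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker-6:9999/jmxrmi")
    val conn = JMXConnectorFactory.connect(url)
    val mbs  = conn.getMBeanServerConnection

    // Metric names as introduced by KIP-227 (verify for your version).
    val evictions = new ObjectName(
      "kafka.server:type=FetchSessionCache,name=IncrementalFetchSessionEvictionsPerSec")
    val cached = new ObjectName(
      "kafka.server:type=FetchSessionCache,name=NumIncrementalFetchSessions")

    val rate  = mbs.getAttribute(evictions, "OneMinuteRate") // evictions/sec, 1-minute average
    val count = mbs.getAttribute(cached, "Value")            // sessions currently cached

    println(s"evictions/sec (1m avg): $rate, cached sessions: $count")
    conn.close()
  }
}
{code}
A sustained eviction rate anywhere near the observed > 16 per second, with the 
cache pinned at max.incremental.fetch.session.cache.slots, suggests the cluster 
is thrashing rather than converging.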


> Maintenance of FetchSession cache causing FETCH_SESSION_ID_NOT_FOUND in live 
> sessions
> -------------------------------------------------------------------------------------
>
>                 Key: KAFKA-9137
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9137
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>            Reporter: Lucas Bradstreet
>            Priority: Major


