I fixed some ambiguous expression in the output of unit test, and also open a bug in JIRA <https://issues.apache.org/jira/browse/KAFKA-13772> with a pull request <https://issues.apache.org/jira/browse/KAFKA-13772>
Thank you for your help and suggestions! Feiyan Yu <dbtr...@gmail.com> 于2022年3月26日周六 16:18写道: > Thank you, I will make it an issue. > > Luke Chen <show...@gmail.com> 于2022年3月26日周六 16:03写道: > >> Hi Feiyan, >> >> It did look like a bug. Could you open a bug in JIRA here >> < >> https://issues.apache.org/jira/projects/KAFKA/issues/KAFKA-13735?filter=allopenissues >> > >> ? >> One thing I don't fully understand from the unit test is that, you have >> output "error distributions: $tpsWithoutResize". >> I thought the bug is about the unevenly partition distribution after >> resizing fetch thread, what do you mean by "error distribution"? >> Could you please also elaborate in JIRA ticket? >> >> Also, welcome to submit a PR to it, since you've already had fix and test >> ready! :) >> >> Thank you. >> Luke >> >> >> >> On Sat, Mar 26, 2022 at 2:56 PM Feiyan Yu <dbtr...@gmail.com> wrote: >> >> > Howdy! >> > >> > I found that the method of resizing fetching thread had one potential >> bug >> > which lead to imbalanced partitions distribution for each thread after >> > changing thread number on dynamic configuration. >> > >> > To figure it out, I added a primitive unit test method "testResize" in >> > "AbstractFetcherThreadTest" to simulate the replica fetchers resizing >> > process from 10 threads to 60 threads with originally 10 topics and 100 >> > partitions each. I designed the test to show that after the resizing >> > process, all partitions should be redistributed correctly based on the >> new >> > thread number. However, the test failed because when I tried to compare >> the >> > "fetcherThreadMap" with the fetcherId for each topic-partition, the >> > fetcherId mismatched! The unit test I added is in this commit ( >> > >> https://github.com/yufeiyan1220/kafka/commit/eb99b7499b416cdeb44c2ccd3ea55a1e38ea3d60 >> ), >> > and the standard output of the unit test showed in attachment. >> > >> > I doubt that maybe it is because the method "addFetcherForPartitions" >> > which maybe adds some new fetchers to "fetcherThreadMap" called in the >> > block of iterating the "fetcherThreadMap", and the iterator ignore some >> > fetchers, which leads to some of the fetchers remain their topic >> > partitions. And it leads to the imbalanced partition distribution >> pattern >> > in resizing. >> > >> > To solve this issue I make a mutable map to store all partitions and its >> > fetch offset, and then add it back once out of the iteration. I make >> > another commit ( >> > >> https://github.com/yufeiyan1220/kafka/commit/0a793dfca2ab9b8e8b45ba5693359960f3e306eb >> ), >> > and the new resize method passed the unit test. >> > >> > I'm not sure whether it is an issue that Kafka Community need to fix. >> But >> > for me, it affects the fetching efficiency when I try to deploy Kafka >> cross >> > regions with high network latency. >> > >> > I'd really appreciate to hear from you! >> > >> > >> >