[ https://issues.apache.org/jira/browse/KAFKA-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484816#comment-14484816 ]
Ewen Cheslack-Postava commented on KAFKA-2102: ---------------------------------------------- Just my 2 cents, but I'm wary of how complicated this makes reasoning about this class. One immediate example is the dual use of lastRefreshMs, which now acts as both the needUpdate flag and the last refresh time. At a minimum, unusual usage like that really needs to be documented clearly. But I think you already have an error in your code -- timeToNextUpdate() uses the value directly, but really needs to take the absolute value or use lastUpdate() to get the correct value. It's not unsurprising that these issues aren't caught by the current unit tests since previously nobody would have thought that whether a metadata update was needed would affect the return value of timeToNextUpdate() in the way it would with this patch. I also found another similar issue, but is a bit trickier. requestUpdate now updates the lastRefreshMs to be negative to indicate we need an update, then returns the version. This can be called by, e.g., KafkaConsumer#subscribe, which would be running on a different thread than the network thread that would call update(), which replaces metadata and bumps the version. Since update sets lastRefreshMs, you can get an incorrect interleaving where requestUpdate toggles lastRefreshMs to be negative, then update() sets lastRefreshMs ignoring the previous value, requestUpdate() returns the version N, and update() sets the version to N+1. The request for an update has been lost and a subsequent call to awaitUpdate never exits since the update request was effectively erased. I wouldn't necessarily be against making these changes given compelling reasons like significant performance improvements, but the complexity is concerning. Do you have info on how big the improvement is? And is it possible to get the improvement by only applying some of the patch? If this is alleviating lock contention, it would be nice to know where the contention was showing up in the first place (what was locked, what was trying to acquire the lock) to better understand if there's a subset of this patch that is simpler to reason about but would still provide most of the benefit. > Remove unnecessary synchronization when managing metadata > --------------------------------------------------------- > > Key: KAFKA-2102 > URL: https://issues.apache.org/jira/browse/KAFKA-2102 > Project: Kafka > Issue Type: Improvement > Reporter: Tim Brooks > Assignee: Tim Brooks > Attachments: KAFKA-2102.patch, KAFKA-2102_2015-04-08_00:20:33.patch > > > Usage of the org.apache.kafka.clients.Metadata class is synchronized. It > seems like the current functionality could be maintained without > synchronizing the whole class. > I have been working on improving this by moving to finer grained locks and > using atomic operations. My initial benchmarking of the producer is that this > will improve latency (using HDRHistogram) on submitting messages. > I have produced an initial patch. I do not necessarily believe this is > complete. And I want to definitely produce some more benchmarks. However, I > wanted to get early feedback because this change could be deceptively tricky. > I am interested in knowing if this is: > 1. Something that is of interest to the maintainers/community. > 2. Along the right track > 3. If there are any gotchas that make my current approach naive. -- This message was sent by Atlassian JIRA (v6.3.4#6332)