[ https://issues.apache.org/jira/browse/KAFKA-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484816#comment-14484816 ]

Ewen Cheslack-Postava commented on KAFKA-2102:
----------------------------------------------

Just my 2 cents, but I'm wary of how complicated this makes reasoning about 
this class. One immediate example is the dual use of lastRefreshMs, which now 
acts as both the needUpdate flag and the last refresh time. At a minimum, 
unusual usage like that really needs to be documented clearly. But I think you 
already have an error in your code -- timeToNextUpdate() uses the value 
directly, but really needs to take the absolute value or use lastUpdate() to 
get the correct value. It's not surprising that these issues aren't caught by 
the current unit tests, since previously nobody would have thought that whether 
a metadata update was needed would affect the return value of 
timeToNextUpdate() in the way it would with this patch.
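
To make the failure mode concrete, here is a minimal sketch of my reading of 
the scheme -- not the patch's actual code, and names like refreshBackoffMs and 
lastUpdate() are illustrative -- showing why timeToNextUpdate() has to go 
through the magnitude once the sign of lastRefreshMs doubles as the needUpdate 
flag:

{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only -- not the patch's actual code. A negative lastRefreshMs
// is taken to mean "an update has been requested".
public class MetadataSketch {
    private final AtomicLong lastRefreshMs = new AtomicLong(0L);
    private final long refreshBackoffMs = 100L;

    // The real last refresh time, regardless of the pending-update marker.
    public long lastUpdate() {
        return Math.abs(lastRefreshMs.get());
    }

    // Buggy: uses the raw (possibly negative) value, so once an update has
    // been requested the backoff is computed from a nonsense timestamp.
    public long timeToNextUpdateBuggy(long nowMs) {
        return Math.max(lastRefreshMs.get() + refreshBackoffMs - nowMs, 0L);
    }

    // Correct: compute the backoff from the magnitude (or via lastUpdate()).
    public long timeToNextUpdate(long nowMs) {
        return Math.max(lastUpdate() + refreshBackoffMs - nowMs, 0L);
    }
}
{code}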

I also found another similar issue, but it is a bit trickier. requestUpdate now 
updates the lastRefreshMs to be negative to indicate we need an update, then 
returns the version. This can be called by, e.g., KafkaConsumer#subscribe, 
which would be running on a different thread than the network thread that would 
call update(), which replaces metadata and bumps the version. Since update sets 
lastRefreshMs, you can get an incorrect interleaving where requestUpdate 
toggles lastRefreshMs to be negative, then update() sets lastRefreshMs ignoring 
the previous value, requestUpdate() returns the version N, and update() sets 
the version to N+1. The request for an update has been lost and a subsequent 
call to awaitUpdate never exits since the update request was effectively erased.
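
A rough sketch of that interleaving -- again with illustrative names rather 
than the patch's actual code, and assuming lastRefreshMs and version are two 
separate atomics:

{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only -- two independent atomics cannot be read and written as
// one unit, so the pending-update marker can be silently overwritten.
public class MetadataRaceSketch {
    private final AtomicLong lastRefreshMs = new AtomicLong(1_000L); // positive: no update pending
    private final AtomicLong version = new AtomicLong(0L);

    // User thread, e.g. via KafkaConsumer#subscribe.
    public long requestUpdate() {
        long current = Math.abs(lastRefreshMs.get());
        lastRefreshMs.set(-current);   // (1) mark "update needed"
        return version.get();          // (3) returns version N
    }

    // Network thread, on receiving a metadata response.
    public void update(long nowMs) {
        lastRefreshMs.set(nowMs);      // (2) overwrites the marker written in (1)
        version.incrementAndGet();     // (4) version becomes N + 1
    }
}
// With the interleaving (1), (2), (3), (4), the marker from (1) is gone before
// anyone has acted on it: the requested refresh is never scheduled, and a
// caller blocking in awaitUpdate() on that request never exits.
{code}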

I wouldn't necessarily be against making these changes given compelling reasons 
like significant performance improvements, but the complexity is concerning. Do 
you have info on how big the improvement is? And is it possible to get the 
improvement by only applying some of the patch? If this is alleviating lock 
contention, it would be nice to know where the contention was showing up in the 
first place (what was locked, what was trying to acquire the lock) to better 
understand if there's a subset of this patch that is simpler to reason about 
but would still provide most of the benefit.

> Remove unnecessary synchronization when managing metadata
> ---------------------------------------------------------
>
>                 Key: KAFKA-2102
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2102
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Tim Brooks
>            Assignee: Tim Brooks
>         Attachments: KAFKA-2102.patch, KAFKA-2102_2015-04-08_00:20:33.patch
>
>
> Usage of the org.apache.kafka.clients.Metadata class is synchronized. It 
> seems like the current functionality could be maintained without 
> synchronizing the whole class.
> I have been working on improving this by moving to finer-grained locks and 
> using atomic operations. My initial benchmarking of the producer (using 
> HDRHistogram) indicates that this will improve latency when submitting messages.
> I have produced an initial patch. I do not believe it is necessarily complete, 
> and I definitely want to produce some more benchmarks. However, I wanted to get 
> early feedback because this change could be deceptively tricky.
> I am interested in knowing:
> 1. If this is something that is of interest to the maintainers/community.
> 2. If I am on the right track.
> 3. If there are any gotchas that make my current approach naive.


