[ 
https://issues.apache.org/jira/browse/KAFKA-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880717#comment-13880717
 ] 

Jay Kreps edited comment on KAFKA-1227 at 1/24/14 4:54 AM:
-----------------------------------------------------------

Hey Jun, quick responses. I'll try to get a patch up with some of the minor 
things, though a few of the others I'd like to do post-commit.

1. WRT closing connections. Yes, this is true. I agree it is needed but decided 
to punt on it for the first pass. It is an important follow-up item. There are 
two cases to handle: metadata fetches and leadership handoffs. 
Obviously the Selector will not handle these special cases which are specific 
to this use case. Theoretically this could all be done in the Sender logic but 
it would be a bit complex. I think the best solution is just to have us time 
out idle connections after some configurable period of disuse (30 seconds, 
say). 
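
A minimal sketch of what I mean by timing out idle connections (the class and
method names here are just illustrative, not the actual producer/Selector code):
track when each node's connection was last used and close anything that has been
quiet for longer than the configured timeout.

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical helper; nothing like this exists in the patch yet.
public class IdleConnectionReaper {
    private final long idleTimeoutMs;                            // e.g. 30000 for the 30s suggested above
    private final Map<Integer, Long> lastUsed = new HashMap<>(); // nodeId -> last use (ms)

    public IdleConnectionReaper(long idleTimeoutMs) {
        this.idleTimeoutMs = idleTimeoutMs;
    }

    // Call whenever a request or response flows on the connection to this node.
    public void markUsed(int nodeId, long nowMs) {
        lastUsed.put(nodeId, nowMs);
    }

    // Call periodically from the I/O loop; closes any connection idle too long.
    public void maybeCloseIdle(long nowMs, ConnectionCloser closer) {
        Iterator<Map.Entry<Integer, Long>> it = lastUsed.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, Long> entry = it.next();
            if (nowMs - entry.getValue() > idleTimeoutMs) {
                closer.close(entry.getKey());
                it.remove();
            }
        }
    }

    // Hook into whatever actually owns the sockets (e.g. the Selector).
    public interface ConnectionCloser {
        void close(int nodeId);
    }
}

This keeps the Selector generic and puts the policy in one small place that the
Sender/network client can drive.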

2. I think the VIP problem can be handled by just timing out idle connections. 
Special cases related to metadata won't help because non-metadata related 
connections can also be idle. Not retaining the bootstrap URLs is intentional: 
future metadata requests should use the full broker set the producer is 
connecting to. You mention that this will cause the producer to prefer to fetch 
metadata from a broker to which it already has a connection for subsequent 
metadata fetches after the initial bootstrap, but this was the idea--no need to 
set up and then time out another connection if we already have one.

3. WRT initializing the partitioner to 0: Yeah, we can initialize to something 
random. The problem would be a bit pathological anyway, as you would have to 
start all your producers at the same instant and send exactly the same number 
of messages through them for it to persist.
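
Roughly the fix I have in mind, as a sketch (hypothetical class, not the real
partitioner code): seed the round-robin counter randomly so producers started at
the same instant don't march through the same partition sequence.

import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;

public class RoundRobinCounter {
    // Start at a random offset instead of 0 to avoid lockstep across producers.
    private final AtomicInteger counter = new AtomicInteger(new Random().nextInt(Integer.MAX_VALUE));

    public int nextPartition(int numPartitions) {
        // Mask the sign bit so the result stays non-negative after overflow.
        return (counter.getAndIncrement() & Integer.MAX_VALUE) % numPartitions;
    }
}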

4. I included numPartitions even though it is easily computable from the 
cluster: all partitioners will need to mod by the number of partitions, but the 
vast majority won't otherwise need to use the cluster. So it seemed more 
intuitive than making the user figure out that they can get it by calling into 
the cluster, and worry about the underlying performance of that call, just to 
give it to them.
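
To make the trade-off concrete, here is a hedged sketch of the shape I'm
describing (the interface and names are illustrative, not the committed API):
numPartitions is handed to the partitioner directly, so the common case never
has to consult the cluster at all.

public interface SimplePartitioner<K> {
    // Cluster metadata could be added as another argument for the rare
    // partitioner that needs it; most only need the partition count.
    int partition(K key, int numPartitions);
}

class HashPartitioner<K> implements SimplePartitioner<K> {
    @Override
    public int partition(K key, int numPartitions) {
        // The overwhelmingly common case: mod a hash by the partition count.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}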

5.1 Yes, that is a bug.
5.2 It is a bit slangy :-)

6. I don't prevent this: a zero hash code will be recomputed each time, but 
this is an unlikely case and recomputing is what would happen in all cases if 
we didn't cache.
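
For reference, this is the standard lazy-caching pattern in question (field
names are hypothetical, but the behavior matches the comment above): 0 doubles
as the "not yet computed" sentinel, so a value whose real hash is 0 simply gets
recomputed on each call, which is still correct.

public final class CachedHashExample {
    private final byte[] payload;
    private int hash; // defaults to 0, meaning "not yet computed"

    public CachedHashExample(byte[] payload) {
        this.payload = payload;
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) { // either unset, or the real hash happens to be 0
            h = java.util.Arrays.hashCode(payload);
            hash = h; // if h is 0 this caches nothing and we recompute next time
        }
        return h;
    }
}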

7. Good point, I'll improve the error message.

8.1 I'll try to think of a better name.
8.2 Yes, we can do that. I think that would be good for latency in the case 
where we had to allocate a non-standard size.

9. I think you could argue either way in terms of the preferable sequencing. 
However, I wanted to reuse the RecordSend object as the argument to the callback 
rather than introduce another object, which means I do need to complete the 
record send first; otherwise the callback will block trying to access fields 
in the send.
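
A rough sketch of that ordering constraint (simplified, hypothetical types, not
the actual classes): complete the send first, then hand the same object to the
callback, because the callback blocks on the send's completion when it reads
from it.

import java.util.concurrent.CountDownLatch;

class RecordSendSketch {
    private final CountDownLatch latch = new CountDownLatch(1);
    private volatile long offset = -1L;

    void complete(long offset) {
        this.offset = offset;
        latch.countDown(); // unblock anyone waiting on this send
    }

    long offset() throws InterruptedException {
        latch.await();     // callers, including callbacks, block until complete
        return offset;
    }
}

interface CompletionCallback {
    void onCompletion(RecordSendSketch send);
}

class ResponseHandlerSketch {
    // Complete first, then invoke the callback with the now-completed send;
    // reversing these two lines would deadlock a callback that reads the offset.
    void handleResponse(RecordSendSketch send, long offset, CompletionCallback callback) {
        send.complete(offset);
        if (callback != null)
            callback.onCompletion(send);
    }
}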

10. Ah, very good point.

11. Thanks

12. I am not too picky. 2 spaces is the recommended style in Scala and 4 spaces 
is the classic "Sun Java style". I would like to get our style formally 
specified in an IDE formatter. That is what I am using for Eclipse and it is 
very nice: it does all the formatting for you and ensures a very consistent 
style. I will start a thread on this one as likely everyone has an opinion.

13. I had the same thought. I'm not sure if it is better to give the current 
value or the average over the window. Thoughts? Since we mostly look at graphs 
polled every 30 seconds, an instantaneous measurement amounts to just a single 
data point for the whole 30 seconds, but that may be okay...
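
To illustrate the two options (hypothetical metric class, not the actual
metrics code): report either the single most recent observation or the average
of everything recorded in the current window.

import java.util.ArrayDeque;
import java.util.Deque;

class WindowedMetric {
    private final long windowMs;
    private final Deque<double[]> samples = new ArrayDeque<>(); // {timestampMs, value}
    private double latest = Double.NaN;

    WindowedMetric(long windowMs) {
        this.windowMs = windowMs;
    }

    void record(double value, long nowMs) {
        latest = value;
        samples.addLast(new double[] { nowMs, value });
        // Drop samples that have aged out of the window.
        while (!samples.isEmpty() && nowMs - samples.peekFirst()[0] > windowMs)
            samples.removeFirst();
    }

    // Instantaneous: polled every 30s this is one data point per interval.
    double current() {
        return latest;
    }

    // Window average: smooths over everything seen since the window started.
    double windowAverage() {
        double sum = 0;
        int n = 0;
        for (double[] s : samples) {
            sum += s[1];
            n++;
        }
        return n == 0 ? Double.NaN : sum / n;
    }
}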

14. I figured we would need to discuss logging, so I just punted for now. The 
standard, insofar as there is one, is really slf4j, which I consider kind of 
silly. There are really just a couple of places that need logging, so maybe it 
would be fine to just use java.util.logging, which comes with the JVM. I'll 
start a thread on this.
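
For what it's worth, the java.util.logging option really is this small--no
extra dependency, which matters when only a handful of call sites need logging
(example usage, not code from the patch):

import java.util.logging.Level;
import java.util.logging.Logger;

public class JulExample {
    private static final Logger log = Logger.getLogger(JulExample.class.getName());

    public static void main(String[] args) {
        log.info("metadata fetch starting");
        log.log(Level.WARNING, "connection to node {0} timed out", 3);
    }
}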

15. In a client library I think this is something we should leave to the 
application. Printing lots of messages in their logs is a bit rude. I think it 
is better to give an API to get information about the configs.
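
A sketch of what that API could look like (hypothetical names, not the real
config classes): track which supplied keys were actually read and let the
application ask for the leftovers, instead of the library printing warnings
into its logs.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ParsedConfig {
    private final Map<String, Object> originals;
    private final Set<String> used = new HashSet<>();

    ParsedConfig(Map<String, Object> originals) {
        this.originals = new HashMap<>(originals);
    }

    Object get(String key) {
        used.add(key);
        return originals.get(key);
    }

    // Keys the caller supplied that the client never looked at.
    Set<String> unusedKeys() {
        Set<String> unused = new HashSet<>(originals.keySet());
        unused.removeAll(used);
        return unused;
    }
}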



> Code dump of new producer
> -------------------------
>
>                 Key: KAFKA-1227
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1227
>             Project: Kafka
>          Issue Type: New Feature
>            Reporter: Jay Kreps
>            Assignee: Jay Kreps
>         Attachments: KAFKA-1227.patch
>
>
> The plan is to take a dump of the producer code "as is" and then do a series 
> of post-commit reviews to get it into shape. This bug tracks just the code 
> dump.


