Re: [DISCUSS] KIP-594: Expose output topic names from TopologyTestDriver

Matthias J. Sax Thu, 23 Apr 2020 14:05:10 -0700

Thanks Andi!

I buy your argument. For exposing topics names via `TopologyDescription`
I create a JIRA for tracking:
https://issues.apache.org/jira/browse/KAFKA-9913



-Matthias

On 4/20/20 8:07 AM, Andy Coates wrote:
> Hey all,
> 
>> One other point, I’m actually mildly concerned about what you say
> regarding the outputs to the repartition and changelog topics being
> captured and verified later. I would consider the data in these topics to
> be a private interface, and would strongly discourage user code to depend
> on it in any way, including in tests. The reason is that we need to
> maintain the flexibility to change how those topics are used at any time,
> and frequently do to implement features, fix bugs, etc. if we needed a KIP
> ad deprecation period for that data, we would be in for a lot of trouble.
> The only reason I wouldn’t consider actually removing those topics from TTD
> is that they’re exposed in the brokers anyway. But seeing this in the
> motivation gives me a little heartburn.
> 
> I'd disagree strongly here. If Streams was smart enough, (and this is not a
> criticism of Streams!), to be able to migrate from one set of internal
> topics to another after an upgrade or a change in the topology, then I'd be
> more inclined to agree that they are an internal implementation
> detail. However, that is not the case. Changing your topology can change
> the name or contents of these 'internal' topics, and so it's crucial that
> users can test which are written to and what they contain. Otherwise, how
> can users make changes to their topologies or upgrade Streams with any
> confidence the whole thing won't go up in smoke when they release to
> Production?
> 
> For each change we make to KsqlDB we need to ensure we've not
> inadvertently changed the names of internal topics or changed the schema or
> format of the data being produced to them. If we had and didn't catch it,
> then anyone upgrading to this new version of ksqlDB would likely see issues
> with any queries updating materialized views. This may 'just' be data loss,
> due to a topic name change, with data left orphaned in the old internal
> topic, but could equally be data corruption due to a change in the schema
> of the data being persisted.
> 
> *Function naming*
> I honestly don't care what it's called. Don't get me wrong - I know naming
> is important. What I mean is that you know better than I. Let me know the
> name and I'll update.
> 
> However, I did add a justification for the naming last week, which it looks
> like you haven't read.  Namely that the set of names returned is the exact
> same set of names you can use to call `createOutputTopic`. So it seems
> symmetrical to me that the method would be called `getOutputTopicNames` or
> `outputTopicNames`.  But, like I said... whatever you want.
> 
> *TopologyDescription vs TTD*
> 
> Personally, as a user who's working with TTD, I'd prefer this method on the
> TDD. This makes the API of TDD more complete IMHO.
> 
> That said, TDD could internally delegate to TopologyDescription, except it
> doesn't handle the case where there are dynamic bindings, right?  So, it
> seems the more complete solution is also the TDD approach, if I understand
> correctly.
> 
> Personally, I'm happy with purely the first phase mentioned in the KIP,
> i.e. just exposing from TDD, and I'm happy to remove the second.  It feels
> to me like the enhancement to TopologyDescription is complimentary by
> separation to this KIP.
> 
> Thoughts?
> 
> Andy
> 
> 
> On Fri, 17 Apr 2020 at 02:17, John Roesler <j...@vvcephei.org> wrote:
> 
>> Thanks Matthias,
>>
>> It sounds like your proposal would be instead to add methods to
>> TopologyDescription: getOuputTopicNames(), getInternalTopicNames(), and
>> getInputTopicNames(). This sounds good to me. For one thing, it nicely
>> resolved the ambiguity with the method name.
>>
>> What do you think about this idea, Andy?
>>
>> Thanks,
>> John
>>
>> On Thu, Apr 16, 2020, at 19:22, Matthias J. Sax wrote:
>>> John,
>>>
>>> the "dynamic routing" case is certainly interesting. However, your
>>> argument about "simulating brokers" and verify which topics are created
>>> does not really hold up IMHO, because the names of all internal topics
>>> are contained in the TopologyDescription, too.
>>>
>>> To be fair, the `TopologyDescription` does not make is easy to get all
>>> those topic names, as one would need to iterator over all
>>> sub-topologies, all nodes (filter sources for
>>> output/internal-repartition topics) and all nodes to check if they have
>>> a state store etc.
>>>
>>> However, if this is the main use case, we could simply add methods to
>>> `TopologyDescription` like `inputTopics(), `outputTopics()`, and
>>> `internalTopics()` (or similar). Details could be discussed further.
>>>
>>>
>>> I agree that the motivation of the KIP is unclear: which topics should
>>> actually be returned? The proposed method name is
>>> `getOutputTopicNames()` while the current PR exposes all topics the
>>> application did write into. Adding the input topic names is even more
>>> confusing. Maybe Andy can clarify.
>>>
>>> The method name "allProducedTopics()" might work, but it would (by its
>>> name) exclude input topic (with the exception that one has a loop in the
>>> topology and for "through()" topics). It would also not contain "unused"
>>> topics (eg, a filter() prevents writing into an output topic). Not sure
>>> if there is a use case to get only the topic names of "used topics"?
>>>
>>> Overall, besides the "dynamic routing" case, I still no 100% sure if we
>>> should expose topic names at the TTD level but would rather extend the
>>> TopologyDescription. For the dynamic routing case, we could add a method
>>> to TTD that _only_ exposes those dynamic topic names, like
>>> `dynamicOutputTopic()`?
>>>
>>>
>>> But I guess it's best to wait for Andy to clarify. The discussion is
>>> getting somewhat speculative :)
>>>
>>>
>>>
>>> -Matthias
>>>
>>>
>>>
>>>
>>> On 4/15/20 9:13 PM, John Roesler wrote:
>>>> Hi Andy,
>>>>
>>>> Thanks for the KIP!
>>>>
>>>> To Matthias’s concern, I have to agree that the motivation doesn’t
>> build a particularly strong case for the KIP, and people often look back to
>> these documents to understand why something was done, and done a particular
>> way. So, it would be nice to flesh it out a bit more. I have some thoughts
>> below on how to do this.
>>>>
>>>> Matthias also raised a good point that many applications can get the
>> same information just by looking in the TopologyDescription. At the least,
>> this alternative deserves to be called out in the “Rejected Alternatives”
>> section, with an explanation of why the chosen approach is preferable.
>>>>
>>>> For my part, I can think of a couple of reasons to add the method to
>> TopologyTestDriver instead of TopologyDescription.
>>>>
>>>> First, although it’s rare, some output topics are determined
>> dynamically by the data. In particular, we offer sink nodes with a
>> TopicNameExtractor. Such topics could only be verified by looking at the
>> actual topics used, rather than the ones specified in the topology.
>>>>
>>>> Second is a more philosophical justification, that the
>> TopologyTestDriver performs some functions of KafkaStreams (processing,
>> serving IQ), but also some functions of the brokers (simulating input and
>> output topics). A common integration testing strategy is to run an
>> application and then query the brokers for the list of topics that the app
>> creates/uses, to make sure the app won’t run afoul of ACLs in production,
>> or just to make sure it doesn’t contain a bug that produces to unexpected
>> topics, or any number of other reasons. Your proposal allows the TTD to
>> partly fill this role, which seems natural considering its present
>> broker-esque capabilities.
>>>>
>>>> As for my own feedback, I’m mostly concerned about the method name and
>> contract. The proposed name mentions “output” topics, but I think most
>> people would associate this with sink topics, probably not with repartition
>> topics, and certainly not with changelog topics. I’m sure this also crossed
>> your mind, and I’m afraid I don’t have a brilliant suggestion. Maybe
>> getProducedTopicNames? Or, we could include the input topics as well and go
>> with getTopicNames() as a listing of all the topics that the app “touches”?
>>>>
>>>> One other point, I’m actually mildly concerned about what you say
>> regarding the outputs to the repartition and changelog topics being
>> captured and verified later. I would consider the data in these topics to
>> be a private interface, and would strongly discourage user code to depend
>> on it in any way, including in tests. The reason is that we need to
>> maintain the flexibility to change how those topics are used at any time,
>> and frequently do to implement features, fix bugs, etc. if we needed a KIP
>> ad deprecation period for that data, we would be in for a lot of trouble.
>> The only reason I wouldn’t consider actually removing those topics from TTD
>> is that they’re exposed in the brokers anyway. But seeing this in the
>> motivation gives me a little heartburn.
>>>>
>>>> Finally, a minor point: the Javadoc you proposed contradicts your
>> proposal for Phase 2, in that the doc says it prints all the topics to
>> which records have been output, but Phase 2 says we’ll include all the
>> topics from the description up front, before any outputs happen.
>>>>
>>>> Anyway, thanks again for the KIP. It does seem useful to me, and I
>> hope my feedback helps speed you to a successful vote!
>>>>
>>>> Thanks,
>>>> John
>>>>
>>>> On Tue, Apr 14, 2020, at 19:49, Matthias J. Sax wrote:
>>>>> Andy,
>>>>>
>>>>> thanks for the KIP. The motivation is a little unclear to me:
>>>>>
>>>>>> This information allows all the outputs of a test run to be captured
>> without prior knowledge of the output topics.
>>>>>
>>>>> Given that the TTD users writes the `Topology` themselves, they should
>>>>> always have prior knowledge what output topics they use. So why would
>>>>> this be useful?
>>>>>
>>>>> Also, there is `Topology#describe()` to get all topic names (even the
>>>>> name of internal topics -- to be fair, changelog topic names are not
>>>>> exposed, only store names, but those can we used to infer changelog
>>>>> topic names, too).
>>>>>
>>>>> Can you elaborate about the motivation? So far, it's not convincing
>> to me.
>>>>>
>>>>>
>>>>>
>>>>> -Matthias
>>>>>
>>>>>
>>>>> On 4/14/20 8:09 AM, Andy Coates wrote:
>>>>>> Hey all,
>>>>>> I would like to start off the discussion for KIP-594:
>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-594%3A+Expose+output+topic+names+from+TopologyTestDriver
>>>>>>
>>>>>> This KIP proposes to expose the names of the topics a topology
>> produces
>>>>>> records during a test run from the TopologyTestDriver class.
>>>>>>
>>>>>> Let me know your thoughts!
>>>>>> Andy
>>>>>>
>>>>>
>>>>>
>>>>> Attachments:
>>>>> * signature.asc
>>>
>>>
>>> Attachments:
>>> * signature.asc
>>
>

signature.asc
Description: OpenPGP digital signature

Re: [DISCUSS] KIP-594: Expose output topic names from TopologyTestDriver

Reply via email to