Thanks Andi! I buy your argument. For exposing topics names via `TopologyDescription` I create a JIRA for tracking: https://issues.apache.org/jira/browse/KAFKA-9913
-Matthias On 4/20/20 8:07 AM, Andy Coates wrote: > Hey all, > >> One other point, I’m actually mildly concerned about what you say > regarding the outputs to the repartition and changelog topics being > captured and verified later. I would consider the data in these topics to > be a private interface, and would strongly discourage user code to depend > on it in any way, including in tests. The reason is that we need to > maintain the flexibility to change how those topics are used at any time, > and frequently do to implement features, fix bugs, etc. if we needed a KIP > ad deprecation period for that data, we would be in for a lot of trouble. > The only reason I wouldn’t consider actually removing those topics from TTD > is that they’re exposed in the brokers anyway. But seeing this in the > motivation gives me a little heartburn. > > I'd disagree strongly here. If Streams was smart enough, (and this is not a > criticism of Streams!), to be able to migrate from one set of internal > topics to another after an upgrade or a change in the topology, then I'd be > more inclined to agree that they are an internal implementation > detail. However, that is not the case. Changing your topology can change > the name or contents of these 'internal' topics, and so it's crucial that > users can test which are written to and what they contain. Otherwise, how > can users make changes to their topologies or upgrade Streams with any > confidence the whole thing won't go up in smoke when they release to > Production? > > For each change we make to KsqlDB we need to ensure we've not > inadvertently changed the names of internal topics or changed the schema or > format of the data being produced to them. If we had and didn't catch it, > then anyone upgrading to this new version of ksqlDB would likely see issues > with any queries updating materialized views. This may 'just' be data loss, > due to a topic name change, with data left orphaned in the old internal > topic, but could equally be data corruption due to a change in the schema > of the data being persisted. > > *Function naming* > I honestly don't care what it's called. Don't get me wrong - I know naming > is important. What I mean is that you know better than I. Let me know the > name and I'll update. > > However, I did add a justification for the naming last week, which it looks > like you haven't read. Namely that the set of names returned is the exact > same set of names you can use to call `createOutputTopic`. So it seems > symmetrical to me that the method would be called `getOutputTopicNames` or > `outputTopicNames`. But, like I said... whatever you want. > > *TopologyDescription vs TTD* > > Personally, as a user who's working with TTD, I'd prefer this method on the > TDD. This makes the API of TDD more complete IMHO. > > That said, TDD could internally delegate to TopologyDescription, except it > doesn't handle the case where there are dynamic bindings, right? So, it > seems the more complete solution is also the TDD approach, if I understand > correctly. > > Personally, I'm happy with purely the first phase mentioned in the KIP, > i.e. just exposing from TDD, and I'm happy to remove the second. It feels > to me like the enhancement to TopologyDescription is complimentary by > separation to this KIP. > > Thoughts? > > Andy > > > On Fri, 17 Apr 2020 at 02:17, John Roesler <j...@vvcephei.org> wrote: > >> Thanks Matthias, >> >> It sounds like your proposal would be instead to add methods to >> TopologyDescription: getOuputTopicNames(), getInternalTopicNames(), and >> getInputTopicNames(). This sounds good to me. For one thing, it nicely >> resolved the ambiguity with the method name. >> >> What do you think about this idea, Andy? >> >> Thanks, >> John >> >> On Thu, Apr 16, 2020, at 19:22, Matthias J. Sax wrote: >>> John, >>> >>> the "dynamic routing" case is certainly interesting. However, your >>> argument about "simulating brokers" and verify which topics are created >>> does not really hold up IMHO, because the names of all internal topics >>> are contained in the TopologyDescription, too. >>> >>> To be fair, the `TopologyDescription` does not make is easy to get all >>> those topic names, as one would need to iterator over all >>> sub-topologies, all nodes (filter sources for >>> output/internal-repartition topics) and all nodes to check if they have >>> a state store etc. >>> >>> However, if this is the main use case, we could simply add methods to >>> `TopologyDescription` like `inputTopics(), `outputTopics()`, and >>> `internalTopics()` (or similar). Details could be discussed further. >>> >>> >>> I agree that the motivation of the KIP is unclear: which topics should >>> actually be returned? The proposed method name is >>> `getOutputTopicNames()` while the current PR exposes all topics the >>> application did write into. Adding the input topic names is even more >>> confusing. Maybe Andy can clarify. >>> >>> The method name "allProducedTopics()" might work, but it would (by its >>> name) exclude input topic (with the exception that one has a loop in the >>> topology and for "through()" topics). It would also not contain "unused" >>> topics (eg, a filter() prevents writing into an output topic). Not sure >>> if there is a use case to get only the topic names of "used topics"? >>> >>> Overall, besides the "dynamic routing" case, I still no 100% sure if we >>> should expose topic names at the TTD level but would rather extend the >>> TopologyDescription. For the dynamic routing case, we could add a method >>> to TTD that _only_ exposes those dynamic topic names, like >>> `dynamicOutputTopic()`? >>> >>> >>> But I guess it's best to wait for Andy to clarify. The discussion is >>> getting somewhat speculative :) >>> >>> >>> >>> -Matthias >>> >>> >>> >>> >>> On 4/15/20 9:13 PM, John Roesler wrote: >>>> Hi Andy, >>>> >>>> Thanks for the KIP! >>>> >>>> To Matthias’s concern, I have to agree that the motivation doesn’t >> build a particularly strong case for the KIP, and people often look back to >> these documents to understand why something was done, and done a particular >> way. So, it would be nice to flesh it out a bit more. I have some thoughts >> below on how to do this. >>>> >>>> Matthias also raised a good point that many applications can get the >> same information just by looking in the TopologyDescription. At the least, >> this alternative deserves to be called out in the “Rejected Alternatives” >> section, with an explanation of why the chosen approach is preferable. >>>> >>>> For my part, I can think of a couple of reasons to add the method to >> TopologyTestDriver instead of TopologyDescription. >>>> >>>> First, although it’s rare, some output topics are determined >> dynamically by the data. In particular, we offer sink nodes with a >> TopicNameExtractor. Such topics could only be verified by looking at the >> actual topics used, rather than the ones specified in the topology. >>>> >>>> Second is a more philosophical justification, that the >> TopologyTestDriver performs some functions of KafkaStreams (processing, >> serving IQ), but also some functions of the brokers (simulating input and >> output topics). A common integration testing strategy is to run an >> application and then query the brokers for the list of topics that the app >> creates/uses, to make sure the app won’t run afoul of ACLs in production, >> or just to make sure it doesn’t contain a bug that produces to unexpected >> topics, or any number of other reasons. Your proposal allows the TTD to >> partly fill this role, which seems natural considering its present >> broker-esque capabilities. >>>> >>>> As for my own feedback, I’m mostly concerned about the method name and >> contract. The proposed name mentions “output” topics, but I think most >> people would associate this with sink topics, probably not with repartition >> topics, and certainly not with changelog topics. I’m sure this also crossed >> your mind, and I’m afraid I don’t have a brilliant suggestion. Maybe >> getProducedTopicNames? Or, we could include the input topics as well and go >> with getTopicNames() as a listing of all the topics that the app “touches”? >>>> >>>> One other point, I’m actually mildly concerned about what you say >> regarding the outputs to the repartition and changelog topics being >> captured and verified later. I would consider the data in these topics to >> be a private interface, and would strongly discourage user code to depend >> on it in any way, including in tests. The reason is that we need to >> maintain the flexibility to change how those topics are used at any time, >> and frequently do to implement features, fix bugs, etc. if we needed a KIP >> ad deprecation period for that data, we would be in for a lot of trouble. >> The only reason I wouldn’t consider actually removing those topics from TTD >> is that they’re exposed in the brokers anyway. But seeing this in the >> motivation gives me a little heartburn. >>>> >>>> Finally, a minor point: the Javadoc you proposed contradicts your >> proposal for Phase 2, in that the doc says it prints all the topics to >> which records have been output, but Phase 2 says we’ll include all the >> topics from the description up front, before any outputs happen. >>>> >>>> Anyway, thanks again for the KIP. It does seem useful to me, and I >> hope my feedback helps speed you to a successful vote! >>>> >>>> Thanks, >>>> John >>>> >>>> On Tue, Apr 14, 2020, at 19:49, Matthias J. Sax wrote: >>>>> Andy, >>>>> >>>>> thanks for the KIP. The motivation is a little unclear to me: >>>>> >>>>>> This information allows all the outputs of a test run to be captured >> without prior knowledge of the output topics. >>>>> >>>>> Given that the TTD users writes the `Topology` themselves, they should >>>>> always have prior knowledge what output topics they use. So why would >>>>> this be useful? >>>>> >>>>> Also, there is `Topology#describe()` to get all topic names (even the >>>>> name of internal topics -- to be fair, changelog topic names are not >>>>> exposed, only store names, but those can we used to infer changelog >>>>> topic names, too). >>>>> >>>>> Can you elaborate about the motivation? So far, it's not convincing >> to me. >>>>> >>>>> >>>>> >>>>> -Matthias >>>>> >>>>> >>>>> On 4/14/20 8:09 AM, Andy Coates wrote: >>>>>> Hey all, >>>>>> I would like to start off the discussion for KIP-594: >>>>>> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-594%3A+Expose+output+topic+names+from+TopologyTestDriver >>>>>> >>>>>> This KIP proposes to expose the names of the topics a topology >> produces >>>>>> records during a test run from the TopologyTestDriver class. >>>>>> >>>>>> Let me know your thoughts! >>>>>> Andy >>>>>> >>>>> >>>>> >>>>> Attachments: >>>>> * signature.asc >>> >>> >>> Attachments: >>> * signature.asc >> >
signature.asc
Description: OpenPGP digital signature