Hi Kyle, I believe Guozhang has now given you permission to edit the KIP wiki at https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals. Could you see if you can add this there?
Many thanks Eno On Wed, Apr 26, 2017 at 6:00 PM, Kyle Winkelman <winkelman.k...@gmail.com> wrote: > Thank you for your reply. > > I have attached my first attempt at writing a KIP and I was wondering if > you could review it and share your thoughts. > > Going forward I would like to create this KIP. I was wondering whom I > should ask to get the necessary permissions on the wiki. Username: > winkelman.kyle > > > > On Fri, Apr 21, 2017 at 3:15 PM, Eno Thereska <eno.there...@gmail.com> > wrote: > >> Hi Kyle, >> >> Sorry for the delay in replying. I think it's worth doing a KIP for this >> one. One super helpful thing with KIPs is to list a few more scenarios that >> would benefit from this approach. In particular it seems the main benefit >> is from reducing the number of state stores. Does this necessarily reduce >> the number of IOs to the stores (number of puts/gets), or the extra space >> overheads with multiple stores. Quantifying that a bit would help. >> >> To answer your original questions: >> >> >The problem I am having with this approach is understanding if there is >> a race condition. Obviously the source topics would be copartitioned. But >> would it be multithreaded and possibly cause one of the processors to grab >> patient 1 at the same time a different processor has grabbed patient 1? >> >> >> I don't think there will be a problem here. A processor cannot be >> accessed by multiple threads in Kafka Streams. >> >> >> >My understanding is that for each partition there would be a single >> complete set of processors and a new incoming record would go completely >> through the processor topology from a source node to a sink node before the >> next one is sent through. Is this correct? >> >> This is mostly true, however if caching is enabled (for dedupping, see >> KIP-63), then a record may reside in a cache before going to the sink. >> Meanwhile another record can come in. So multiple records can be in the >> topology at the same time. >> >> Thanks >> Eno >> >> >> >> >> >> On Fri, Apr 14, 2017 at 8:16 PM, Kyle Winkelman <winkelman.k...@gmail.com >> > wrote: >> >>> Eno, >>> Thanks for the response. The figure was just a restatement of my >>> questions. I have made an attempt at a low level processor and it appears >>> to work but it isn't very pretty and was hoping for something at the >>> streams api level. >>> >>> I have written some code to show an example of how I see the Cogroup >>> working in kafka. >>> >>> First the KGroupedStream would have a cogroup method that takes the >>> initializer and the aggregator for that specific KGroupedStream. This would >>> return a KCogroupedStream that has 2 methods one to add more >>> KGroupedStream, Aggregator pairs and one to complete the construction and >>> return a KTable. >>> >>> builder.stream("topic").groupByKey ().cogroup(Initializer, Aggregator, >>> aggValueSerde, storeName).cogroup(groupedStream1, >>> Aggregator1).cogroup(groupedStream2, Aggregator2).aggregate(); >>> >>> Behind the scenes we create a KStreamAggregate for each KGroupedStream, >>> Aggregator pair. Then a final pass through processor to pass on the >>> aggregate values. This gives us a KTable backed by a single store that is >>> used in all of the processors. >>> >>> Please let me know if this is something you think would add value to >>> kafka streams. And I will try to create a KIP to foster more communication. >>> >>> You can take a look at what I have. I think it's missing a fair amount >>> but it's a good start. I took the doAggregate method in KGroupedStream as >>> my starting point and expanded on it for multiple streams: >>> https://github.com/KyleWinkelman/kafka/tree/cogroup >>> >>> >> >