Hi,

If you need multiple polls to receive more data before you start
processing, you should disable auto commit (via
`enable.auto.commit=false` on the consumer). Thus, no commit happens on
poll() -- of course, you then need to commit offsets manually. Kafka
Streams also uses this strategy internally.
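
A minimal sketch of that pattern with the plain Java consumer (broker
address, group id, and topic name are placeholders; the processing step
is just a stub):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitExample {

    public static Properties consumerConfig() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "batching-app");            // hypothetical group id
        props.put("enable.auto.commit", "false");         // no commit on poll()
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }

    public static void consumeLoop() {
        try (KafkaConsumer<String, String> consumer =
                     new KafkaConsumer<>(consumerConfig())) {
            consumer.subscribe(Collections.singletonList("input-topic"));
            while (true) {
                // Multiple polls may happen before a commit; nothing is
                // committed implicitly because auto commit is disabled.
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    process(r);
                }
                // Commit only after processing succeeded.
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> r) {
        // application-specific processing goes here
    }
}
```

If processing fails before commitSync(), the offsets are never
committed, so the records are re-delivered after a restart -- that is
the at-least-once behavior discussed below.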

About KTable: if a task fails and gets restarted on the same host, and
if the local state store is not corrupted, it will just reuse the local
state store.
Thus, the changelog will only be read from the beginning if the state
store is corrupted or the task was moved to a different host (ie, a
different application instance). In both cases, the local state store is
rebuilt on startup before actual processing begins.
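
For reference, this is what a changelog-backed KTable looks like in the
Streams DSL (topic and store names are made up, and the
StreamsBuilder/Materialized API shown here is from newer Kafka Streams
releases than the one under discussion -- a sketch, not the exact 0.10
API):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;

public class ChangelogRestoreSketch {

    public static Topology topology() {
        StreamsBuilder builder = new StreamsBuilder();
        // The KTable is materialized into a local state store ("user-table")
        // backed by a compacted changelog topic. On restart, the local store
        // is reused if intact; otherwise it is restored from the changelog
        // before processing resumes.
        builder.table("users",                          // hypothetical topic
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.as("user-table"));         // hypothetical store
        return builder.build();
    }
}
```

The restore-from-changelog step is handled entirely by the library; the
application code above does not change between a clean restart and a
recovery.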


-Matthias



On 08/26/2016 10:52 AM, Abhishek Agarwal wrote:
> Thanks Matthias and Eno. For at-least-once guarantees to be effective in
> any system, the source (the Kafka receiver) needs to know that the message
> has been successfully processed. What I understood is that since processing
> happens in a single thread, if another record is polled, it is implicitly
> assumed that the previous record has been successfully processed. Given that,
> 
> - When I am batching records in memory for the Kafka producer, is there a
> possibility of data loss, since the records have not been committed yet and
> the source has already polled for more records?
> 
> - Similarly for operations such as aggregations, records have to be
> buffered and they may be lost if the source has already committed the
> offset. The documentation suggests that the buffered memory is actually a
> KTable backed by a local state store, and each single update to the KTable
> is separately written into the changelog topic (doesn't that add latency?)
> 
> 
>    - When the task fails and re-spawns, does it read the change-log topic
>    from beginning?
> 
> Thank you for your patience.
> 
> 
> 
> 
> 
> 
> 
> On Aug 26, 2016 3:35 AM, "Matthias J. Sax" <matth...@confluent.io> wrote:
> 
>> Just want to add something:
>>
>> If you use the Kafka Streams DSL, the library is Kafka-centric.
>>
>> However, you could use the low-level Processor API to get data into your
>> topology from other systems. The problem is the missing fault-tolerance,
>> which you would need to code yourself. When reading from Kafka,
>> fault-tolerance comes "for free" because Kafka Streams uses the Java
>> KafkaConsumer client internally.
>>
>> So as Eno said, it is recommended to use Kafka Connect to get data in
>> and out from the Kafka cluster. Of course, you can also use a different
>> tool than Connect for data import and export to/from Kafka.
>>
>> From Eno's answer
>>
>>> Multiple threads would make sense to run separate DAGs.
>>
>> If I understand this correctly, he means that you can write multiple
>> applications and connect them via a topic to get multiple threads for
>> different processors (ie, the downstream application consumes the output
>> of the upstream application).
>>
>>
>> -Matthias
>>
>> On 08/25/2016 07:45 PM, Eno Thereska wrote:
>>> Good question. All of them would run in a single thread. That is the
>> model. Multiple threads would make sense to run separate DAGs.
>>>
>>> Eno
>>>
>>>
>>>> On 25 Aug 2016, at 18:32, Abhishek Agarwal <abhishc...@gmail.com>
>> wrote:
>>>>
>>>> Hi Eno,
>>>>
>>>> Thanks for your reply. If my application DAG has three stream
>> processors,
>>>> first of which is the source, would all of them run in a single thread? There
>> may
>>>> be scenarios wherein I want to have different number of threads for
>>>> different processors since some may be CPU bound and some may be IO
>> bound.
>>>>
>>>> On Thu, Aug 25, 2016 at 10:49 PM, Eno Thereska <eno.there...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Abhishek,
>>>>>
>>>>> - Correct on connecting to external stores. You can use Kafka Connect
>>>>> to get things in or out. (Note that in the 0.10.1 release, KIP-67 allows
>>>>> you to directly query Kafka Streams' stores, so for some kinds of data
>>>>> you don't need to move them to an external store. This is pushed to trunk.)
>>>>>
>>>>> - You can definitely use more threads than partitions, but that will
>> not
>>>>> buy you much since some threads will be idle. No two threads will work
>> on
>>>>> the same partition, so you don't have to worry about them repeating
>> work.
>>>>>
>>>>> Hope this helps.
>>>>> Eno
>>>>>
>>>>>> On 25 Aug 2016, at 16:50, Abhishek Agarwal <abhishc...@gmail.com>
>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>> I was reading up on kafka streams for a project and came across this
>> blog
>>>>>> https://softwaremill.com/kafka-streams-how-does-it-fit-stream-landscape/
>>>>>> I wanted to validate some assertions made in blog, with kafka
>> community
>>>>>>
>>>>>> - Kafka Streams is a kafka-in, kafka-out application. Does the user need
>>>>>> Kafka Connect to transfer data from Kafka to an external store?
>>>>>> - No support for asynchronous processing - can I use more threads than
>>>>>> the number of partitions for processors without sacrificing at-least-once
>>>>>> guarantees?
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Abhishek Agarwal
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Abhishek Agarwal
>>>
>>
>>
> 
