Hi,

If you need multiple polls to receive more data before you start processing, you should disable auto commit (via `enable.auto.commit=false` on the Java consumer). With auto commit disabled, no commit happens on poll(); of course, you then need to commit offsets manually. Kafka Streams also uses this strategy internally.
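Roughly, the pattern looks like this (a minimal sketch against the 0.10-era consumer API; topic name, group id, batch size, and the `process()` helper are placeholders):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder
        props.put("group.id", "my-group");                  // placeholder
        props.put("enable.auto.commit", "false");           // no commit on poll()
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("my-topic"));

        List<ConsumerRecord<String, String>> buffer = new ArrayList<>();
        final int minBatchSize = 200;  // tune to your workload

        while (true) {
            // poll() does NOT commit offsets, because auto commit is disabled
            ConsumerRecords<String, String> records = consumer.poll(100);
            for (ConsumerRecord<String, String> record : records) {
                buffer.add(record);
            }
            if (buffer.size() >= minBatchSize) {
                process(buffer);        // your processing logic
                consumer.commitSync();  // commit only after processing succeeded
                buffer.clear();
            }
        }
    }

    private static void process(List<ConsumerRecord<String, String>> batch) {
        // placeholder for actual processing
    }
}
```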
About KTable: if a task fails and gets restarted on the same host, and if the local state store is not corrupted, it will simply re-use the local state store. Thus, the changelog is only read (from the beginning) if the state store is corrupted or the task was moved to a different host (ie, application instance). In both cases, the local state store is rebuilt on start-up before actual processing begins. (A sketch of such a changelog-backed aggregation follows at the end of this thread.)

-Matthias

On 08/26/2016 10:52 AM, Abhishek Agarwal wrote:
> Thanks Matthias and Eno. For at-least-once guarantees to be effective in
> any system, the source (Kafka receiver) needs to know that the message
> has been successfully processed. What I understood is that, since
> processing happens in a single thread, if another record is polled, it
> is implicitly assumed that the previous record has been successfully
> processed. Given that,
>
> - When I am batching records in memory for the Kafka producer, is there
> a possibility of data loss, since records have not been committed yet
> and the source has already polled for more records?
>
> - Similarly, for operations such as aggregations, records have to be
> buffered and they may be lost if the source has already committed the
> offset. The documentation suggests that the buffered memory is actually
> a KTable backed by a local state store, and that each single update to
> the KTable is written separately into the changelog topic (doesn't that
> add latency?).
>
> - When the task fails and re-spawns, does it read the changelog topic
> from the beginning?
>
> Thank you for your patience.
>
> On Aug 26, 2016 3:35 AM, "Matthias J. Sax" <matth...@confluent.io> wrote:
>
>> Just want to add something:
>>
>> If you use the Kafka Streams DSL, the library is Kafka-centric.
>>
>> However, you could use the low-level Processor API to get data into
>> your topology from other systems. The problem will be the missing
>> fault-tolerance that you would need to code yourself. When reading from
>> Kafka, fault-tolerance comes "for free" because Kafka Streams uses the
>> Java KafkaConsumer client internally.
>>
>> So as Eno said, it is recommended to use Kafka Connect to get data into
>> and out of the Kafka cluster. Of course, you can also use a tool other
>> than Connect for data import and export to/from Kafka.
>>
>> From Eno's answer:
>>
>>> Multiple threads would make sense to run separate DAGs.
>>
>> If I understand this correctly, he means that you can write multiple
>> applications and connect them via a topic to get multiple threads for
>> different processors (ie, the downstream application consumes the
>> output of the upstream application).
>>
>> -Matthias
>>
>> On 08/25/2016 07:45 PM, Eno Thereska wrote:
>>> Good question. All of them would run in a single thread. That is the
>>> model. Multiple threads would make sense to run separate DAGs.
>>>
>>> Eno
>>>
>>>> On 25 Aug 2016, at 18:32, Abhishek Agarwal <abhishc...@gmail.com> wrote:
>>>>
>>>> Hi Eno,
>>>>
>>>> Thanks for your reply. If my application DAG has three stream
>>>> processors, the first of which is the source, would all of them run
>>>> in a single thread? There may be scenarios wherein I want to have a
>>>> different number of threads for different processors, since some may
>>>> be CPU-bound and some may be IO-bound.
>>>>
>>>> On Thu, Aug 25, 2016 at 10:49 PM, Eno Thereska <eno.there...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Abhishek,
>>>>>
>>>>> - Correct on connecting to external stores. You can use Kafka
>>>>> Connect to get things in or out. (Note that in the 0.10.1 release,
>>>>> KIP-67 allows you to directly query Kafka Streams' stores, so for
>>>>> some kinds of data you don't need to move them to an external store.
>>>>> This is pushed to trunk.)
>>>>>
>>>>> - You can definitely use more threads than partitions, but that will
>>>>> not buy you much, since some threads will be idle. No two threads
>>>>> will work on the same partition, so you don't have to worry about
>>>>> them repeating work.
>>>>>
>>>>> Hope this helps.
>>>>> Eno
>>>>>
>>>>>> On 25 Aug 2016, at 16:50, Abhishek Agarwal <abhishc...@gmail.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>> I was reading up on Kafka Streams for a project and came across
>>>>>> this blog:
>>>>>> https://softwaremill.com/kafka-streams-how-does-it-fit-stream-landscape/
>>>>>> I wanted to validate some assertions made in the blog with the
>>>>>> Kafka community:
>>>>>>
>>>>>> - Kafka Streams is a kafka-in, kafka-out application. Does the user
>>>>>> need Kafka Connect to transfer data from Kafka to any external store?
>>>>>> - No support for asynchronous processing - can I use more threads
>>>>>> than the number of partitions for processors without sacrificing
>>>>>> at-least-once guarantees?
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Abhishek Agarwal
>>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Abhishek Agarwal
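To make the changelog-backed KTable concrete, here is a minimal aggregation sketch (written against the current Kafka Streams DSL for illustration; the 0.10.0-era method names differ, and all topic and store names are made up):

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.state.KeyValueStore;

public class WordCountExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-example"); // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("text-input");

        // The count lives in a local state store ("counts-store"); updates are
        // also sent to an internal, compacted changelog topic. On failover, the
        // store is rebuilt from that changelog before processing resumes.
        KTable<String, Long> counts = lines
            .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
            .groupBy((key, word) -> word)
            .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts-store"));

        counts.toStream().to("counts-output", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```

On the latency question from the thread: changelog records are written through a regular Kafka producer and are batched like any other produce requests, so the per-update overhead is amortized rather than paid as one round-trip per update.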
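And on the threads-vs-partitions point: the thread count per application instance is a single config. A sketch (application id, servers, and topics are placeholders); as Eno says, threads beyond the total partition count simply sit idle:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ThreadConfigExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app"); // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Partitions are balanced across all threads of all running instances;
        // no two threads ever process the same partition concurrently.
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("some-input").to("some-output"); // trivial topology placeholder

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```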