[GitHub] samza pull request #349: Initial version of Table API

2017-11-01 Thread weisong44
GitHub user weisong44 opened a pull request: https://github.com/apache/samza/pull/349 Initial version of Table API Initial version of table API, it includes - Core table API (Table, TableDescriptor, TableSpec) - Local table implementation for in-memory and RocksDb - Th

[GitHub] samza pull request #348: SAMZA-1479 : Refactor KafkaCheckpointManager, Kafka...

2017-11-01 Thread vjagadish1989
GitHub user vjagadish1989 opened a pull request: https://github.com/apache/samza/pull/348 SAMZA-1479 : Refactor KafkaCheckpointManager, KafkaCheckpointLogKey and their tests Notable changes: * Rewrite `KafkaCheckpointLogKey` into two classes - an immutable class, and a SerDe

Re: Per task/topic checkpoint?

2017-11-01 Thread Jacob Maes
Ahh, sorry, I misunderstood the problem. Does the application use InMemoryKeyValueStore to accumulate the deltas? If so, I would think you could enable the changelog on that store and commit as often as you like, because having the deltas backed up durably should allow you to decouple commit() fro

Re: Per task/topic checkpoint?

2017-11-01 Thread Gaurav Agarwal
Thanks Jake. In this application we control the checkpointing explicitly to accumulate certain amount of delta in memory before committing them to stores, and checkpointing. This is to reduce the commit counts and some other business case like deduplication of deltas. The scenario that I was tryi

Re: Per task/topic checkpoint?

2017-11-01 Thread Jacob Maes
Hey Gaurav, Samza automatically keeps track of the offsets your job has successfully processed for each SSP. When your task requests a checkpoint, Samza will write the offset of the latest successfully-processed message for each SSP that task consumes. So if task0 consumes partition 0 of two topi

Re: Per task/topic checkpoint?

2017-11-01 Thread Gaurav Agarwal
Thanks, I'll check it out. I have a samza application that is consuming a lot of different types of messages (these messages are related to each other but do not require join - think of these like different configuration and metric information of virtual machines that modify some central sates lik