--> On Thu, Feb 8, 2018 at 2:16 AM, Tzu-Li (Gordon) Tai <tzuli...@apache.org> wrote:
> Regarding the two hooks you would like to be available: > > > - Provide hook to override discovery (not to hit Kinesis from every > subtask) > > Yes, I think we can easily provide a way, for example setting -1 for > SHARD_DISCOVERY_INTERVAL_MILLIS, to disable shard discovery. > Though, the user would then have to savepoint and restore in order to pick > up new shards after a Kinesis stream reshard (which is in practice the best > way to by-pass the Kinesis API rate limitations). > +1 to provide that. > I'm considering a customization of KinesisDataFetcher with override for discoverNewShardsToSubscribe. We still want shards to be discovered, just not by hitting Kinesis from every subtask. > > > - Provide hook to support custom watermark generation (somewhere > around KinesisDataFetcher.emitRecordAndUpdateState) > > Per-partition watermark generation on the Kinesis side is slightly more > complex than Kafka, due to how Kinesis’s dynamic resharding works. > I think we need to additionally allow new shards to be consumed only after > its parent shard is fully read, otherwise “per-shard time characteristics” > can be broken because of this out-of-orderness consumption across the > boundaries of a closed parent shard and its child. > There theses JIRAs [1][2] which has a bit more details on the topic. > Otherwise, in general I’m also +1 to providing this also in the Kinesis > consumer. > Here I'm thinking to customize emitRecordAndUpdateState (method would need to be made non-final). Using getSubscribedShardsState with additional transient state to keep track of watermark per shard and emit watermark as appropriate. That's the idea - haven't written any code for it yet. Thanks, Thomas