Re: Kinesis getRecords read timeout and retry

Tzu-Li (Gordon) Tai Sun, 15 Apr 2018 20:57:07 -0700

Hi Thomas,

Thanks for your PRs!

I understand and fully agree with both points that you pointed out.

What I'm still a bit torn with is the current proposed solutions for these
issues (and other similar connector issues).

This might actually call for a good opportunity to bring some thoughts up
about connector contributions. My arguments would be the following:

The solutions actually break some fundamental designs of the connector code.
For example, in recent PRs for the Kinesis connector we've been proposing
to relax access of the `KinesisProxy` constructor.
AFAIK, this fix was triggered by an inefficiency in the
`KinesisProxy#getShardsOfStream` method which influences shard discovery
performance.
First of all, such a change breaks the fact that the class is an internal
class (it is marked as @Internal). It was made private as it handles
critical paths such as record fetching and shard listing, and is not
intended to be modified at all.
Second of all, the fix in the end did not fix the inefficiency at all -
only for advanced users who perhaps have saw the corresponding JIRA and
would bother to do the same and override the inefficient implementations by
themselves.
If there is a fix that would have benefited all users of the connector in
general, I would definitely be more in favor of that.
This goes the same for https://github.com/apache/flink/pull/5803 - I'm not
sure that allowing overrides on the retry logic is ideal. For example, we
previously introduced in the Elasticsearch connector a RetryHandler
user-facing API to allow such customizations.

On one hand, I do understand that solving these connector issues properly
would perhaps require a more thorough design-wise ground-work and could be
more time-consuming.
On the other hand, I also understand that we need to find a good balance to
allow production users of these connectors to be able to quickly iterate
what issues the current code has and unblock encountered problems.

My main concern is that our current approach to fixing these issues, IMO,
actually do not encourage good fixes to be contributed back to the
connector code, and they would therefore remain problematic as they are.

What do you think? I may also be missing thing in a bigger picture here, so
feedback would be highly appreciated.

Cheers,
Gordon

On Tue, Apr 3, 2018 at 1:25 PM, Thomas Weise <t...@apache.org> wrote:

> PR to provide the hooks:  https://github.com/apache/flink/pull/5803
>
>
> On Mon, Apr 2, 2018 at 6:14 PM, Thomas Weise <t...@apache.org> wrote:
>
> > Hi,
> >
> > I’m working on implementing retry for getRecords in FlinkKinesisConsumer.
> > We occasionally get transient socket read timeouts. Instead of bubbling
> up
> > the exception and forcing a topology reset to checkpoint, we want to
> retry
> > getRecords. We also want to work with a lower socket read timeout than
> the
> > 50s default.
> >
> > Looking at the current KinesisProxy implementation, I’m aiming to remove
> > some baked in assumptions that get into the way of customizing this:
> >
> > 1) AWSUtil.createKinesisClient - statically wired to use default
> > ClientConfiguration. The user should be able to control all settings that
> > the SDK exposes instead.
> >
> > 2) Retry in KinesisProxy.getRecords limited to AmazonServiceException.
> > Perhaps it is OK as default, but the user should be able to retry on
> other
> > exceptions if desired.
> >
> > For 1) a generic option could be to set properties on ClientConfiguration
> > using reflection (the class isn’t serializable but follows the Java Bean
> > conventions). Something like BeanUtils would make it straightforward to
> > process user supplied properties with a specific prefix. Is there any
> other
> > place in the Flink codebase where this style of configuration approach is
> > used and a preferred alternative to BeanUtils?
> >
> > Thanks
> >
> >
>

Re: Kinesis getRecords read timeout and retry

Reply via email to