Hey,

It seems this KIP is very difficult, and we actually had a lot of background discussion about it over the last weeks. I believe the problem with this KIP is that we have three starting points for looking at the problem, but these three starting points don't align:


1. Producer API: we want a clean API design for developers using the producer directly (clean semantics, no footguns...)

2. Kafka Streams: as a power user of the producer, we want advanced capabilities; given how KS works internally, we need a "power API" on the producer

3. Kafka Connect: also a power user of the producer. However, Connect is a framework, not a programming API, and thus prefers a config-based approach


I also think we got one idea wrong: letting the user code / handler take care of retries. (I guess that's on me; I started with the idea of having a third return code RETRY...) The handler does not have enough context information, and making this information available leads to a very clumsy interface. (This defeats (1) from above.)

I believe that if we moved forward with the handler, we would need to let the producer do the retries, and only call the handler after all retries/timeouts are exhausted. However, for this to work, we would need a producer config for Connect, which basically defeats the purpose of (2), i.e., having a programmatic solution (it seems somewhat redundant).

Also, the idea of making the handler configurable seems, in hindsight, like a poor approach / bad compromise: it addresses (3) without sacrificing (2), but is a problem for (1).


We also discussed the "missing metadata" case, and we actually believe we can address it without a public API change. Alieh has already put up a PR for this: https://github.com/apache/kafka/pull/16344


This leaves us with the "producer error state" problem for EOS, but it might be better to solve this differently. Alieh has now started KIP-1059 for this case.


Thus, it seems we should DISCARD this KIP, and the Connect team can do a follow-up KIP to add the producer configs they need for their own situation.

Splitting the problem into solutions tailored to the different situations seems to lead to an overall cleaner result.

Thoughts?


-Matthias



On 5/15/24 12:30 AM, Federico Valeri wrote:
Hello Alieh, thanks for this useful KIP.

There is a typo in the motivation when you talk about the
UnknownTopicOrPartitionException. It's delivery.timeout.ms, not
deliver.timeout.ms.

In the past, I did some work to improve and clean up the official Kafka
examples, which I think are useful for new Kafka users. I was
wondering if it is worth improving them in order to show the correct
usage of this new interface. If you agree, maybe we could mention this
in the proposed changes.

The accepted responses for RecordTooLargeException are FAIL and SWALLOW. 
Therefore, RETRY will be interpreted and executed as FAIL.

Why do we need this javadoc note? I think it's not possible to return
RETRY in the current form.

When we talk about swallowing in the default implementation, I think
we will log an error/warning and drop the record, right? If yes, should
we clarify this and improve the DROP_INVALID_LARGE_RECORDS_DOC by
mentioning the logging part?

Should we mention somewhere which logic takes precedence when both the
interface and configs are used?

On Tue, May 14, 2024 at 4:45 PM Chris Egerton <chr...@aiven.io.invalid> wrote:

Hi Alieh,

Thank you for all the updates! One final question--how will the retry
timeout for unknown topic partition errors be implemented? I think it would
be best if this could be done with an implementation of the error handler,
but I don't see a way to track the necessary information with the
current ProducerExceptionHandler interface.

Cheers,

Chris

On Tue, May 14, 2024 at 9:10 AM Alieh Saeedi <asae...@confluent.io.invalid>
wrote:

Thanks Andrew. Done :)

@Chris: I changed the config parameter type from boolean to integer, which
defines the timeout for retrying. I thought reusing `max.block.ms` was not
reasonable as you mentioned.

So if the KIP looks good, let's skip to the good part ;-) VOTING :)

Bests,
Alieh





On Tue, May 14, 2024 at 12:26 PM Andrew Schofield <
andrew_schofi...@live.com>
wrote:

Hi Alieh,
Just one final comment.

[AJS5] Existing classes use Retriable, not Retryable. For example:


https://kafka.apache.org/21/javadoc/org/apache/kafka/common/errors/RetriableException.html

I suggest RetriableResponse and NonRetriableResponse.

Thanks,
Andrew

On 13 May 2024, at 23:17, Alieh Saeedi <asae...@confluent.io.INVALID>
wrote:

Hi all,


Thanks for all the valid points you listed.


KIP updates and addressing concerns:


1) The KIP now suggests two Response types: `RetryableResponse` and
`NonRetryableResponse`


2) `custom.exception.handler` is changed to
`custom.exception.handler.class`


3) The KIP clarifies that `In the case of an implemented handler for the specified exception, the handler takes precedence.`


4)  There is now a `default` implementation for both handle() methods.


5)  @Chris: for `UnknownTopicOrPartition`, the default is already retrying for 60s (in fact, the default value of `max.block.ms`). If the handler instructs to FAIL or SWALLOW, there will be no retry, and if the handler instructs to RETRY, that will be the default behavior, which follows the values of already existing config parameters such as `max.block.ms`. Does that make sense?


Hope the changes and explanations are convincing :)


Cheers,

Alieh

On Mon, May 13, 2024 at 6:40 PM Justine Olshan
<jols...@confluent.io.invalid>
wrote:

Oh I see. The type isn't the error type but a newly defined type for the response. Makes sense and works for me.

Justine

On Mon, May 13, 2024 at 9:13 AM Chris Egerton <
fearthecel...@gmail.com>
wrote:

If we have dedicated methods for each kind of exception (handleRecordTooLarge, handleUnknownTopicOrPartition, etc.), doesn't that provide sufficient constraint? I'm not suggesting we eliminate these methods, just that we change their return types to something more flexible.

On Mon, May 13, 2024, 12:07 Justine Olshan
<jols...@confluent.io.invalid

wrote:

I'm not sure I agree with the Retriable and NonRetriableResponse comment. This doesn't limit the blast radius or enforce that certain errors are used. I think we might disagree on how controlled these interfaces can be...

Justine

On Mon, May 13, 2024 at 8:40 AM Chris Egerton
<chr...@aiven.io.invalid

wrote:

Hi Alieh,

Thanks for the updates! I just have a few more thoughts:

- I don't think a boolean property is sufficient to dictate retries for unknown topic partitions, though. These errors can occur if a topic has just been created, which can occur if, for example, automatic topic creation is enabled for a multi-task connector. This is why I proposed a timeout instead of a boolean (and see my previous email for why reducing max.block.ms for a producer is not a viable alternative). If it helps, one way to reproduce this yourself is to add the line `fooProps.put(TASKS_MAX_CONFIG, "10");` to the integration test here:
https://github.com/apache/kafka/blob/5439914c32fa00d634efa7219699f1bc21add839/connect/runtime/src/test/java/org/apache/kafka/connect/integration/SourceConnectorsIntegrationTest.java#L134
and then check the logs afterward for messages like "Error while fetching metadata with correlation id <n> : {foo-topic=UNKNOWN_TOPIC_OR_PARTITION}".

- I also don't think we need custom XxxResponse enums for every possible method; it seems like this will lead to a lot of duplication and cognitive overhead if we want to expand the error handler in the future. Something more flexible like RetriableResponse and NonRetriableResponse could suffice.

- Finally, the KIP still doesn't state how the handler will or won't take precedence over existing retry properties. If I set `retries` or `delivery.timeout.ms` or `max.block.ms` to low values, will that cause retries to cease even if my custom handler would otherwise keep returning RETRY for an error?

Cheers,

Chris

On Mon, May 13, 2024 at 11:02 AM Andrew Schofield <
andrew_schofi...@live.com>
wrote:

Hi Alieh,
Just a few more comments on the KIP. It is looking much less risky now the scope is tighter.

[AJS1] It would be nice to have default implementations of the handle methods so an implementor would not need to implement both themselves.

[AJS2] Producer configurations which are class names usually end in “.class”. I suggest “custom.exception.handler.class”.

[AJS3] If I implemented a handler, and I set a non-default value for one of the new configurations, what happens? I would expect that the handler takes precedence. I wasn’t quite clear what “the control will follow the handler instructions” meant.

[AJS4] Because you now have an enum for the RecordTooLargeExceptionResponse, I don’t think you need to state in the comment for ProducerExceptionHandler that RETRY will be interpreted as FAIL.

Thanks,
Andrew

On 13 May 2024, at 14:53, Alieh Saeedi
<asae...@confluent.io.INVALID

wrote:

Hi all,


Thanks for the very interesting discussion during my PTO.


KIP updates and addressing concerns:


1) Two handle() methods are defined in ProducerExceptionHandler for the two exceptions, with different input parameters, so that we have handle(RecordTooLargeException e, ProducerRecord record) and handle(UnknownTopicOrPartitionException e, ProducerRecord record) (a rough sketch of this shape is included below)


2) The ProducerExceptionHandler extends `Closeable` as well.


3) The KIP suggests having two more configuration parameters with boolean values:

- `drop.invalid.large.records` with a default value of `false`, for swallowing too-large records.

- `retry.unknown.topic.partition` with a default value of `true`, which performs RETRY for `max.block.ms` ms when encountering the UnknownTopicOrPartitionException.
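
Purely for illustration, a rough sketch of the shape described in (1)-(3) might look like the following. The `Response` enum name, the `byte[]` generics, and the default return values are my own guesses for readability, not part of the KIP text:

import java.io.Closeable;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.RecordTooLargeException;
import org.apache.kafka.common.errors.UnknownTopicOrPartitionException;

// Sketch only -- names and defaults are illustrative, not the KIP's final API.
public interface ProducerExceptionHandler extends Closeable {

    enum Response { FAIL, SWALLOW, RETRY }

    // Too-large records can only be failed or swallowed; RETRY is treated as FAIL.
    default Response handle(RecordTooLargeException e, ProducerRecord<byte[], byte[]> record) {
        return Response.FAIL;
    }

    // Missing metadata may be transient, so RETRY is meaningful here.
    default Response handle(UnknownTopicOrPartitionException e, ProducerRecord<byte[], byte[]> record) {
        return Response.RETRY;
    }
}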


Hope the main concerns are addressed so that we can go forward with voting.


Cheers,

Alieh

On Thu, May 9, 2024 at 11:25 PM Artem Livshits
<alivsh...@confluent.io.invalid> wrote:

Hi Mathias,

[AL1] While I see the point, I would think having a different callback for every exception might not really be elegant?

I'm not sure how to assess the level of elegance of the proposal, but I can comment on the technical characteristics:

1. Having specific interfaces that codify the logic that is currently prescribed in the comments reduces the chance of making a mistake. Comments may get ignored, misunderstood, etc., but if the contract is codified, the compiler will help to enforce the contract.
2. Given that the logic is trickier than it seems (the record-too-large case is an example that can easily confuse someone who's not intimately familiar with the nuances of the batching logic), having a few more hoops to jump through would give a greater chance that whoever tries to add a new case pauses and thinks a bit more.
3. As Justine pointed out, having different methods will be a forcing function to go through a KIP rather than smuggle new cases through the implementation.
4. Sort of a consequence of the previous 3 -- all those things reduce the chance of someone writing code that works with 2 errors and then, when more errors are added in the future, suddenly and incorrectly ignores the new errors (the example I gave in the previous email).

[AL2 cont.] Similar to AL1, I see such a handler to some extent as business logic. If a user puts a bad filter condition in their KS app, and drops messages

I agree that there is always a chance to get a bug and lose messages, but there is generally a separation of concerns with different risk profiles: the filtering logic may be more rigorously tested and rarely changed (say, an application developer does it), but setting the topics to produce to may be done via configuration (e.g. a user of the application does it), and it's generally an expectation that users would get an error when the configuration is incorrect.

What could be worse is that UnknownTopicOrPartitionException can be an intermittent error, i.e. with a generally correct configuration, there could be a metadata propagation problem on the cluster and then a random set of records could get lost.

[AL3] Maybe I misunderstand what you are saying, but to me, checking the size of the record upfront is exactly what the KIP proposes? No?

It achieves the same result but solves it differently. My proposal:

1. Application checks the validity of a record (maybe via a new validateRecord method) before producing it, and can just exclude it or return an error to the user.
2. Application produces the record -- at this point there are no records that could return record too large; they were either skipped at step 1 or we didn't get here because step 1 failed.

Vs. the KIP's proposal:

1. Application produces the record.
2. Application gets a callback.
3. Application returns the action on how to proceed.

The advantage of the former is the clarity of semantics -- the record is invalid (a property of the record, not a function of server state or server configuration) and we can clearly know that it is the record that is bad and can never succeed.

The KIP-proposed way actually has a very tricky point: it actually handles only a subset of record-too-large exceptions. The broker can return record-too-large and reject the whole batch (but we don't want to ignore those, because then we could skip random records that just happened to be in the same batch); in some sense we use the same error for 2 different conditions, and understanding that requires a pretty deep understanding of Kafka internals.
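
To make step 1 above concrete: a validateRecord method does not exist today, but an application-side pre-check along those lines could look roughly like the sketch below. It approximates "too large" by comparing serialized key/value sizes against the producer's max.request.size and ignores batch/header overhead, so treat it as an illustration of the idea rather than an exact equivalent of the producer's own checks.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class UpfrontSizeCheck {
    // Mirror of the producer's max.request.size config (1 MB default).
    private static final int MAX_REQUEST_SIZE = 1_048_576;

    // Returns true if the record was sent, false if it was skipped as too large.
    public static boolean sendIfSmallEnough(KafkaProducer<byte[], byte[]> producer,
                                            ProducerRecord<byte[], byte[]> record) {
        int keySize = record.key() == null ? 0 : record.key().length;
        int valueSize = record.value() == null ? 0 : record.value().length;
        if (keySize + valueSize > MAX_REQUEST_SIZE) {
            // Invalid record: skip it (or surface an error) instead of producing it.
            return false;
        }
        producer.send(record);
        return true;
    }
}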

-Artem


On Wed, May 8, 2024 at 9:47 AM Justine Olshan
<jols...@confluent.io.invalid

wrote:

My concern with respect to it being fragile: the code that ensures the error type is internal to the producer. Someone may see it and say, I want to add such and such error. This looks like internal code, so I don't need a KIP, and then they can change it to whatever they want, thinking it is within the typical Kafka improvement protocol.

Relying on an internal change to enforce an external API is fragile in my opinion. That's why I sort of agreed with Artem on enforcing the error in the method signature -- part of the public API.

Chris's comments on requiring more information in the handler again make me wonder if we are solving a problem of lack of information at the application level with a more powerful solution than we need. (I.e., if we had more information, could the application close and restart the transaction rather than having to drop records?) But I am happy to compromise with a handler that we can agree is sufficiently controlled and documented.

Justine

On Wed, May 8, 2024 at 7:20 AM Chris Egerton
<chr...@aiven.io.invalid

wrote:

Hi Alieh,

Continuing prior discussions:

1) Regarding the "flexibility" discussion, my overarching point is that I don't see the point in allowing for this kind of pluggable logic without also covering more scenarios. Take example 2 in the KIP: if we're going to implement retries only on "important" topics when a topic partition isn't found, why wouldn't we also want to be able to do this for other errors? Again, taking authorization errors as an example, why wouldn't we want to be able to fail when we can't write to "important" topics because the producer principal lacks sufficient ACLs, and drop the record if the topic isn't "important"? In a security-conscious environment with runtime-dependent topic routing (which is a common feature of many source connectors, such as the Debezium connectors), this seems fairly likely.

2) As far as changing the shape of the API goes, I like Artem's idea of splitting out the interface based on specific exceptions. This may be a little laborious to expand in the future, but if we really want to limit the exceptions that we cover with the handler and move slowly and cautiously, then IMO it'd be reasonable to reflect that in the interface. I also acknowledge that there's no way to completely prevent people from shooting themselves in the foot by implementing the API incorrectly, but I think it's worth it to do what we can--including leveraging the Java language's type system--to help them, so IMO there's value to eliminating the implicit behavior of failing when a policy returns RETRY for a non-retriable error. This can take a variety of shapes and I'm not going to insist on anything specific, but I do want to again raise my concerns with the current proposal and request that we find something a little better.

3) Concerning the default implementation--actually, I meant what I wrote :) I don't want a "second" default, I want an implementation of this interface to be used as the default if no others are specified. The behavior of this default implementation would be identical to existing behavior (so there would be no backwards compatibility concerns like the ones raised by Matthias), but it would be possible to configure this default handler class to behave differently for a basic set of scenarios. This would mirror (pun intended) the approach we've taken with Mirror Maker 2 and its ReplicationPolicy interface [1]. There is a default implementation available [2] that recognizes a handful of basic configuration properties [3] for simple tweaks, but if users want, they can also implement their own replication policy for more fine-grained logic if those properties aren't flexible enough.

More concretely, I'm imagining something like this for the producer exception handler:

- Default implementation class of org.apache.kafka.clients.producer.DefaultProducerExceptionHandler
- This class would recognize two properties:
  - drop.invalid.large.records: Boolean property, defaults to false. If "false", then causes the handler to return FAIL whenever a RecordTooLargeException is encountered; if "true", then causes SWALLOW/SKIP/DROP to be returned instead
  - unknown.topic.partition.retry.timeout.ms: Integer property, defaults to INT_MAX. Whenever an UnknownTopicOrPartitionException is encountered, causes the handler to return FAIL if that record has been pending for more than the retry timeout; otherwise, causes RETRY to be returned
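
For concreteness, here is a hedged sketch of how such a default handler might be wired up by a user. Neither the handler class nor the two property names exist today -- they are exactly the names proposed above and could change if the KIP evolves:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class DefaultHandlerConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        // Proposed (not yet existing) default handler plus its two tweakable properties:
        props.put("custom.exception.handler.class",
                "org.apache.kafka.clients.producer.DefaultProducerExceptionHandler");
        props.put("drop.invalid.large.records", "true");                // drop oversized records
        props.put("unknown.topic.partition.retry.timeout.ms", "60000"); // give up after 60s of retries
        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            // Use the producer as usual; with these settings, oversized records would be
            // skipped and unknown-topic retries would stop after the configured timeout.
        }
    }
}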

I think this is worth addressing now instead of later because it forces us to evaluate the usefulness of this interface and it addresses a long-standing issue not just with Kafka Connect, but with the Java producer in general. For reference, here are a few tickets I collected after briefly skimming our Jira showing that this is a real pain point for users: https://issues.apache.org/jira/browse/KAFKA-10340, https://issues.apache.org/jira/browse/KAFKA-12990, https://issues.apache.org/jira/browse/KAFKA-13634. Although this is frequently reported with Kafka Connect, it applies to anyone who configures a producer to use a high retry timeout. I am aware of the max.block.ms property, but it's painful and IMO poor behavior to require users to reduce the value of this property just to handle the single scenario of trying to write to topics that don't exist, since it would also limit the retry timeout for other operations that are legitimately retriable.

Raising new points:

5) I don't see the interplay between this handler and existing retry-related properties mentioned anywhere in the KIP. I'm assuming that properties like "retries", "max.block.ms", and "delivery.timeout.ms" would take precedence over the handler and, once they are exhausted, the record/batch will fail no matter what? If so, it's probably worth briefly mentioning this (no more than a sentence or two) in the KIP, and if not, I'm curious what you have in mind.

6) I also wonder if the API provides enough information in its current form. Would it be possible to provide handlers with some way of tracking how long a record has been pending for (i.e., how long it's been since the record was provided as an argument to Producer::send)? One way to do this could be to add a method like `onNewRecord(ProducerRecord)` and allow/require the handler to do its own bookkeeping, probably with a matching `onRecordSuccess(ProducerRecord)` method so that the handler doesn't eat up an ever-increasing amount of memory trying to track records. An alternative could be to include information about the initial time the record was received by the producer and the number of retries that have been performed for it as parameters in the handle method(s), but I'm not sure how easy this would be to implement and it might clutter things up a bit too much.
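
As an illustration of the bookkeeping idea (the `onNewRecord`/`onRecordSuccess` hooks above are hypothetical, and this sketch glosses over the fact that two records with equal contents would share an entry):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch: remember when a record was first seen and use the elapsed time to
// decide between RETRY and FAIL, removing entries on success to bound memory.
public class TimeBoundedRetryBookkeeping {
    private final Map<ProducerRecord<?, ?>, Long> firstSeen = new ConcurrentHashMap<>();
    private final long retryTimeoutMs;

    public TimeBoundedRetryBookkeeping(long retryTimeoutMs) {
        this.retryTimeoutMs = retryTimeoutMs;
    }

    public void onNewRecord(ProducerRecord<?, ?> record) {
        firstSeen.putIfAbsent(record, System.currentTimeMillis());
    }

    public void onRecordSuccess(ProducerRecord<?, ?> record) {
        firstSeen.remove(record); // stop tracking, so memory does not grow without bound
    }

    public boolean shouldKeepRetrying(ProducerRecord<?, ?> record) {
        Long start = firstSeen.get(record);
        return start != null && System.currentTimeMillis() - start < retryTimeoutMs;
    }
}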

7) A small request--can we add Closeable (or, if you prefer, AutoCloseable) as a superinterface for the handler interface?

[1] - https://kafka.apache.org/37/javadoc/org/apache/kafka/connect/mirror/ReplicationPolicy.html
[2] - https://kafka.apache.org/37/javadoc/org/apache/kafka/connect/mirror/DefaultReplicationPolicy.html
[3] - https://kafka.apache.org/37/javadoc/org/apache/kafka/connect/mirror/DefaultReplicationPolicy.html#SEPARATOR_CONFIG,
https://kafka.apache.org/37/javadoc/org/apache/kafka/connect/mirror/DefaultReplicationPolicy.html#INTERNAL_TOPIC_SEPARATOR_ENABLED_CONFIG

Cheers,

Chris

On Wed, May 8, 2024 at 12:37 AM Matthias J. Sax <mj...@apache.org> wrote:

Very interesting discussion.

Seems a central point is the question about "how generic" we approach this, and some people think we need to be conservative and others think we should try to be as generic as possible.

Personally, I think following a limited scope for this KIP (by explicitly saying we only cover RecordTooLarge and UnknownTopicOrPartition) might be better. We have concrete tickets that we address, while for other exceptions (like authorization) we don't know if people want to handle them to begin with. Boiling the ocean might not get us too far, and being somewhat pragmatic might help to move this KIP forward. -- I also agree with Justine and Artem that we want to be careful anyway to not allow users to break stuff too easily.

At the same time, I agree that we should set up this change in a forward-looking way, and thus having a single generic handler allows us to later extend the handler more easily. This should also simplify follow-up KIPs that might add new error cases. (I actually mentioned one more to Alieh already, but we both agreed that it might be best to exclude it from the KIP right now, to make the 3.8 deadline. Doing a follow-up KIP is not the end of the world.)



@Chris:

(2) This sounds fair to me, but I'm not sure how "bad" it actually would be? If the contract is clearly defined, what the KIP proposes seems to be fine, and given that such a handler is an expert API, and we can provide "best practices" (cf. my other comment below in [AL1]), being a little bit pragmatic sounds fine to me.

Not sure if I understand Justine's argument on this question?


(3) About having a default impl or not: I am fine with adding one, even if I am not sure why Connect could not just add its own one and plug it in (and we would add corresponding configs for Connect, but not for the producer itself)? For this case, we could also do this as a follow-up KIP, but I'm happy to include it in this KIP to provide value to Connect right away (even if the value might not come right away if we miss the 3.8 deadline due to expanded KIP scope...). -- For KS, we would for sure plug in our own impl, and lock down the config such that users cannot set their own handler on the internal producer to begin with. Might be good to elaborate on why the producer should have a default? We might actually want to add this to the KIP right away?

The key for a default impl would be to not change the current behavior, and having no default seems to achieve this. For the two cases you mentioned, it's unclear to me what default value of "upper bound on retries" for UnknownTopicOrPartitionException we should set? Seems it would need to be the same as `delivery.timeout.ms`? However, if users have `delivery.timeout.ms` actually overwritten, we would need to set this config somewhat "dynamically"? Is this feasible? If we hard-code 2 minutes, it might not be backward compatible. I have the impression we might introduce some undesired coupling? -- For the "record too large" case, the config seems to be boolean, and setting it to `false` by default seems to provide backward compatibility.



@Artem:

[AL1] While I see the point, I would think having a different callback for every exception might not really be elegant? In the end, the handler is a very advanced feature anyway, and if it's implemented in a bad way, well, it's a user error -- we cannot protect users from everything. To me, a handler like this is to some extent "business logic", and if a user gets business logic wrong, it's hard to protect them. -- We would of course provide best-practice guidance in the JavaDocs, and explain that a handler should have explicit `if` statements for the stuff it wants to handle, and only a single default which returns FAIL.


[AL2] Yes, but for KS we would retry at the application layer. I.e., the TX is not completed, we clean up and set up our task from scratch, to ensure the pending transaction is completed before we resume. If the TX was indeed aborted, we would retry from the older offset and thus just hit the same error again, and the loop begins again.


[AL2 cont.] Similar to AL1, I see such a handler to some extent as business logic. If a user puts a bad filter condition in their KS app and drops messages, there is nothing we can do about it, and this handler, IMHO, has a similar purpose. This is also the line of thinking I apply to EOS, to address Justine's concern about "should we allow to drop for EOS", and my answer is "yes", because it's more business logic than actual error handling IMHO. And by default, we fail... So users opt in to add business logic to drop records. It's an application-level decision how to write the code.


[AL3] Maybe I misunderstand what you are saying, but to me, checking the size of the record upfront is exactly what the KIP proposes? No?



@Justine:

I saw the sample code -- is it just an if statement checking for the error before the handler is invoked? That seems a bit fragile.

What do you mean by fragile? Not sure if I see your point.




-Matthias

On 5/7/24 5:33 PM, Artem Livshits wrote:
Hi Alieh,

Thanks for the KIP. The motivation talks about very specific cases, but the interface is generic.

[AL1]
If the interface evolves in the future, I think we could have the following confusion:

1. A user implemented the SWALLOW action for both RecordTooLargeException and UnknownTopicOrPartitionException. For simplicity they just return SWALLOW from the function, because it elegantly handles all known cases.
2. The interface has evolved to support a new exception.
3. The user has upgraded their Kafka client.

Now a new kind of error gets dropped on the floor without the user's intention, and it would be super hard to detect and debug.

To avoid the confusion, I think we should use handlers for specific exceptions. Then if a new exception is added it won't get silently swallowed, because the user would need to add new functionality to handle it.

I also have some higher level comments:

[AL2]
it throws a TimeoutException, and the user can only blindly retry, which may result in an infinite retry loop

If the TimeoutException happens during transactional processing (exactly once is the desired semantics), then the client should not retry when it gets a TimeoutException, because without knowing the reason for the TimeoutException, the client cannot know whether the message actually got produced or not, and retrying the message may result in duplicates.

The thrown TimeoutException "cuts" the connection to the underlying root cause of missing metadata

Maybe we should fix the error handling and return the proper underlying message? Then the application can properly handle the message based on its preferences.

From the product perspective, it's not clear how safe it is to blindly ignore UnknownTopicOrPartitionException. This could lead to situations where a simple typo could lead to massive data loss (part of the data would effectively be produced to a "black hole" and the user may not notice it for a while).

In which situations would you recommend the user to "black hole" messages in case of misconfiguration?

[AL3]

If the custom handler decides on SWALLOW for RecordTooLargeException,

Is my understanding correct that this KIP proposes functionality that would only be able to SWALLOW a RecordTooLargeException that happens because the producer cannot produce the record (if the broker rejects the batch, the error won't get to the handler, because we cannot know which other records get ignored)? In this case, why not just check the locally configured max record size upfront and not produce the record in the first place? Maybe we can expose a validation function from the producer that could validate the records locally, so we don't need to produce the record in order to know that it's invalid.

-Artem

On Tue, May 7, 2024 at 2:07 PM Justine Olshan
<jols...@confluent.io.invalid>
wrote:

Alieh and Chris,

Thanks for clarifying 1), but I saw the motivation. I guess I just didn't understand how that would be ensured on the producer side. I saw the sample code -- is it just an if statement checking for the error before the handler is invoked? That seems a bit fragile.

Can you clarify what you mean by `since the code does not reach the KS interface and breaks somewhere in producer`? If we surfaced this error to the application in a better way, would that also be a solution to the issue?

Justine

On Tue, May 7, 2024 at 1:55 PM Alieh Saeedi
<asae...@confluent.io.invalid>
wrote:

Hi,


Thank you, Chris and Justine, for the feedback.


@Chris

1) Flexibility: it has two meanings. The first meaning is the one you mentioned. We are going to cover more exceptions in the future, but as Justine mentioned, we must be very conservative about adding more exceptions. Additionally, flexibility mainly means that the user is able to develop their own code. As mentioned in the motivation section and the examples, sometimes the user decides on dropping a record based on the topic, for example.


2) Defining two separate methods for retriable and non-retriable exceptions: although the idea is brilliant, the user may still make a mistake by implementing the wrong method and see unexpected behaviour. For example, he may implement handleRetriable() for RecordTooLargeException and define SWALLOW for the exception, but in practice he sees no change in default behaviour since he implemented the wrong method. I think we can never reduce the user’s mistakes to 0.



3) Default implementation for Handler: the default behaviour is already preserved with NO need of implementing any handler or setting the corresponding config parameter `custom.exception.handler`. What you mean is actually having a second default, which requires having both the interface and config parameters. About UnknownTopicOrPartitionException: the producer already offers the config parameter `max.block.ms`, which determines the duration of retrying. The main purpose of the user who needs the handler is to get the root cause of the TimeoutException and handle it in the way he intends. The KIP explains the necessity of it for KS users.


4) Naming issue: By SWALLOW, we meant actually swallowing the error, while SKIP means skipping the record, I think. If it makes sense for more ppl, I can change it to SKIP.


@Justine

1) was addressed by Chris.

2 and 3) The problem is exactly what you mentioned. Currently, there is no way to handle these issues application-side. Even KS users who implement the KS ProductionExceptionHandler are not able to handle the exceptions as they intend, since the code does not reach the KS interface and breaks somewhere in the producer.

Cheers,
Alieh

On Tue, May 7, 2024 at 8:43 PM Chris Egerton <
fearthecel...@gmail.com>
wrote:

Hi Justine,

The method signatures for the interface are indeed open-ended, but the KIP states that its uses will be limited. See the motivation section:

We believe that the user should be able to develop custom exception handlers for managing producer exceptions. On the other hand, this will be an expert-level API, and using that may result in strange behaviour in the system, making it hard to find the root cause. Therefore, the custom handler is currently limited to handling RecordTooLargeException and UnknownTopicOrPartitionException.

Cheers,

Chris


On Tue, May 7, 2024, 14:37 Justine Olshan
<jols...@confluent.io.invalid

wrote:

Hi Alieh,

I was out for KSB and then was also sick. :(

To your point 1) Chris, I don't think it is limited to two specific scenarios, since the interface accepts a generic Exception e and can be implemented to check if that e is an instanceof any exception. I didn't see anywhere that specific errors are enforced. I'm a bit concerned about this actually. I'm concerned about the open-endedness and the contract we have with transactions. We are allowing the client to make decisions that are somewhat invisible to the server. As an aside, can we build in log messages when the handler decides to skip etc. a message? I'm really concerned about messages being silently dropped.

I do think Chris's point 2) about retriable vs non-retriable errors is fair. I'm a bit concerned about skipping an unknown topic or partition exception too early, as there are cases where it can be transient.

I'm still a little bit wary of allowing dropping records as part of EOS generally as, in many cases, these errors signify an issue with the original data. I understand that Streams and Connect/MirrorMaker may have reasons they want to progress past these messages, but I'm wondering if there is a way that can be done application-side. I'm willing to accept this sort of proposal if we can make it clear that this sort of thing is happening and we limit the blast radius for what we can do.

Justine

On Tue, May 7, 2024 at 9:55 AM Chris Egerton
<chr...@aiven.io.invalid

wrote:

Hi Alieh,

Sorry for the delay, I've been out sick. I still have some thoughts that I'd like to see addressed before voting.

1) If flexibility is the motivation for a pluggable interface, why are we only limiting the uses for this interface to two very specific scenarios? Why not also allow, e.g., authorization errors to be handled as well (allowing users to drop records destined for some off-limits topics, or retry for a limited duration in case there's a delay in the propagation of ACL updates)? It'd be nice to see some analysis of other errors that could be handled with this new API, both to avoid the follow-up work of another KIP to address them in the future, and to make sure that we're not painting ourselves into a corner with the current API in a way that would make future modifications difficult.

2) Something feels a bit off with how retriable vs. non-retriable errors are handled with the interface. Why not introduce two separate methods to handle each case separately? That way there's no ambiguity or implicit behavior when, e.g., attempting to retry on a RecordTooLargeException. This could be something like `NonRetriableResponse handle(ProducerRecord, Exception)` and `RetriableResponse handleRetriable(ProducerRecord, Exception)`, though the exact names and shape can obviously be toyed with a bit (a rough sketch follows below).
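
As a purely illustrative sketch of that split-by-retriability shape (names, enum members, and parameter order are placeholders, not a settled API):

import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch only: separate return types make it impossible to ask for RETRY on a
// non-retriable error, which is the ambiguity described above.
public interface ProducerExceptionHandler {

    enum NonRetriableResponse { FAIL, SWALLOW }

    enum RetriableResponse { FAIL, SWALLOW, RETRY }

    // Called for errors that can never succeed on retry (e.g. record too large).
    NonRetriableResponse handle(ProducerRecord<?, ?> record, Exception exception);

    // Called for errors that may be transient (e.g. unknown topic or partition).
    RetriableResponse handleRetriable(ProducerRecord<?, ?> record, Exception exception);
}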

3) Although the flexibility of a pluggable interface may benefit some users' custom producer applications and Kafka Streams applications, it comes at significant deployment cost for other low-/no-code environments, including but not limited to Kafka Connect and MirrorMaker 2. Can we add a default implementation of the exception handler that allows for some simple behavior to be tweaked via configuration property? Two things that would be nice to have would be A) an upper bound on the retry time for unknown-topic-partition exceptions and B) an option to drop records that are large enough to trigger a record-too-large exception.

4) I'd still prefer to see "SKIP" or "DROP" instead of the proposed "SWALLOW" option, which IMO is opaque and non-obvious, especially when trying to guess the behavior for retriable errors.

Cheers,

Chris

On Fri, May 3, 2024 at 11:23 AM Alieh Saeedi
<asae...@confluent.io.invalid

wrote:

Hi all,


A summary of the KIP and the discussions:


The KIP introduces a handler interface for the Producer in order to handle two exceptions: RecordTooLargeException and UnknownTopicOrPartitionException. The handler handles the exceptions per record.


- Do we need this handler?  [Motivation and Examples sections]


RecordTooLargeException: 1) In transactions, the producer collects multiple records in batches. Then a RecordTooLargeException related to a single record leads to failing the entire batch. A custom exception handler in this case may decide on dropping the record and continuing the processing. See Example 1, please. 2) Moreover, in Kafka Streams, a record that is too large is a poison pill record, and there is no way to skip over it. A handler would allow us to react to this error inside the producer, i.e., local to where the error happens, and thus simplify the overall code significantly. Please read the Motivation section for more explanation.


UnknownTopicOrPartitionException: For this case, the producer handles this exception internally, only issues a WARN log about missing metadata, and retries internally. Later, when the producer hits "delivery.timeout.ms", it throws a TimeoutException, and the user can only blindly retry, which may result in an infinite retry loop. The thrown TimeoutException "cuts" the connection to the underlying root cause of missing metadata (which could indeed be a transient error but is persistent for a non-existing topic). Thus, there is no programmatic way to break the infinite retry loop. Kafka Streams also blindly retries for this case, and the application gets stuck.



- Having an interface vs. a configuration option: [Motivation, Examples, and Rejected Alternatives sections]

Our solution is introducing an interface due to the full flexibility that it offers. Sometimes users, especially Kafka Streams ones, determine the handler's behaviour based on the situation. For example, facing UnknownTopicOrPartitionException, the user may want to raise an error for some topics but retry it for other topics. Having a configuration option with a fixed set of possibilities does not serve the user's needs. See Example 2, please.


- Note on RecordTooLargeException: [Public Interfaces section]

If the custom handler decides on SWALLOW for RecordTooLargeException, then this record will not be a part of the batch of transactions and will also not be sent to the broker in non-transactional mode. So no worries about getting a RecordTooLargeException from the broker in this case, as the record will never ever be sent to the broker. SWALLOW means drop the record and continue/swallow the error.


- What if the handle() method implements RETRY for RecordTooLargeException? [Proposed Changes section]

We have to limit the user to only have FAIL or SWALLOW for RecordTooLargeException. Actually, RETRY must be equal to FAIL. This is well documented/informed in the javadoc.


- What if the handle() method of the handler throws an exception?

The handler is expected to have correct code. If it throws an exception, everything fails.



This is a PoC PR <https://github.com/apache/kafka/pull/15846> ONLY for RecordTooLargeException. The code changes related to UnknownTopicOrPartitionException will be added to this PR LATER.


Looking forward to your feedback again.


Cheers,

Alieh

On Thu, Apr 25, 2024 at 11:46 PM Kirk True <
k...@kirktrue.pro>
wrote:

Hi Alieh,

Thanks for the updates!

Comments inline...


On Apr 25, 2024, at 1:10 PM, Alieh Saeedi <asae...@confluent.io.INVALID> wrote:

Hi all,

Thanks a lot for the constructive feedbacks!



Addressing some of the main concerns:


- The `RecordTooLargeException` can be thrown by the broker, the producer, and the consumer. Of course, the `ProducerExceptionHandler` interface is introduced to affect only the exceptions thrown from the producer. This KIP very specifically means to provide a possibility to manage the `RecordTooLargeException` thrown from the Producer.send() method. Please see the "Proposed Changes" section for more clarity. I investigated the issue there thoroughly. I hope it can explain the concern about how we handle the errors as well.



- The problem with Callback: Methods of Callback are called when the record sent to the server is acknowledged, while this is not the desired time for all exceptions. We intend to handle exceptions beforehand.

I guess it makes sense to keep the expectation for when Callback is invoked as-is vs. shoehorning more into it.

- What if the custom handler returns RETRY for `RecordTooLargeException`? I assume changing the producer configuration at runtime is possible. If so, RETRY for a too-large record is valid because maybe in the next try, the too-large record is not poisoning any more. I am not 100% sure about the technical details, though. Otherwise, we can consider the RETRY as FAIL for this exception. Another solution would be to consider a constant number of times for RETRY, which can be useful for other exceptions as well.

It’s not presently possible to change the configuration of an existing Producer at runtime. So if a record hits a RecordTooLargeException once, no amount of retrying (with the current Producer) will change that fact. So I’m still a little stuck on how to handle a response of RETRY for an “oversized” record.

- What if the handle() method itself throws an exception? I think, rationally and pragmatically, the behaviour must be exactly like when no custom handler is defined, since the user actually did not have a working handler.

I’m not convinced that ignoring an errant handler is the right choice. It then becomes a silent failure that might have repercussions, depending on the business logic. A user would have to proactively trawl through the logs for WARN/ERROR messages to catch it.

Throwing a hard error is pretty draconian, though…

- Why not use config parameters instead of an interface? As explained in the “Rejected Alternatives” section, we assume that the handler will be used for a greater number of exceptions in the future. Defining a configuration parameter for each exception may make the configuration a bit messy. Moreover, the handler offers more flexibility.

Agreed that the logic-via-configuration approach is weird and limiting. Forget I ever suggested it ;)

I’d think additional background in the Motivation section would help me understand how users might use this feature beyond a) skipping “oversized” records, and b) not retrying missing topics.

Small change:

- ProductionExceptionHandlerResponse -> Response, for brevity and simplicity. Could’ve been HandlerResponse too, I think!

The name change sounds good to me.

Thanks Alieh!



I thank you all again for your useful questions/suggestions.

I would be happy to hear more of your concerns, as stated in some feedback.

Cheers,
Alieh

On Wed, Apr 24, 2024 at 12:31 AM Justine Olshan
<jols...@confluent.io.invalid> wrote:

Thanks Alieh for the updates.

I'm a little concerned about the design pattern here. It seems like we want specific usages, but we are packaging it as a generic handler. I think we tried to narrow down on the specific errors we want to handle, but it feels a little clunky as we have a generic thing for two specific errors.

I'm wondering if we are using the right patterns to solve these problems. I agree though that we will need something more than the error classes I'm proposing if we want to have different handling be configurable. My concern is that the open-endedness of a handler means that we are creating more problems than we are solving. It is still unclear to me how we expect to handle the errors. Perhaps we could include an example? It seems like there is a specific use case in mind and maybe we can



