Re: Retrying certain errors seen during failover

Justin Bertram Fri, 19 Jan 2024 10:35:31 -0800

> Is it expected that we’d hit this error during failover...

It's not necessarily surprising that you'd hit that error, although
generally when a broker fails the connection is terminated completely
rather than the current operation timing out. The timeout is ultimately
ambiguous. You can't reliably conclude that the broker has failed due to a
timeout like this. It could be the result of a network issue or a broker
slow-down for some reason (e.g. long GC pause). The broker may have
received what you sent but simply failed to send a response back within the
timeout or it may not have received anything. You can retry the operation,
but if you're sending a message, the retry may result in a duplicate. That's
why we have duplicate detection [1].
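
As a rough illustration of retrying safely, here is a minimal sketch. The
Sender interface and the sendWithRetry helper are hypothetical names used
only for this example; they are not part of the Artemis API. The point is
that the duplicate ID is generated once and reused on every attempt, so the
broker can discard a second copy if the first send actually arrived. With
the JMS client, the ID would be set on the message as the "_AMQ_DUPL_ID"
string property described in [1].

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

final class RetrySendSketch {
    /** Hypothetical stand-in for "set the duplicate ID on the message, then call producer.send()". */
    interface Sender {
        void send(String duplicateId, String body) throws Exception;
    }

    /**
     * Sends the body, retrying up to maxAttempts times.
     * The duplicate ID is created once, before the first attempt,
     * and reused unchanged on every retry.
     */
    static String sendWithRetry(Sender sender, String body, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        String duplicateId = UUID.randomUUID().toString();
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                sender.send(duplicateId, body);
                return duplicateId;
            } catch (Exception e) {
                // Outcome is unknown: the broker may or may not have the message.
                last = e;
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        List<String> seenIds = new ArrayList<>();
        // Fake sender that fails twice, then succeeds.
        Sender flaky = (dupId, body) -> {
            seenIds.add(dupId);
            if (seenIds.size() < 3) {
                throw new Exception("AMQ219014: simulated timeout");
            }
        };
        sendWithRetry(flaky, "hello", 5);
        // Every attempt carried the same duplicate ID.
        System.out.println(seenIds.stream().distinct().count() == 1);
    }
}
```

A real implementation would likely retry only on the specific error codes
you care about and pause between attempts; both are left out here to keep
the sketch short.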


> ...and that it takes this long to manifest?

The default callTimeout is 30,000 milliseconds so your observation fits.

> Is there a way to reduce the timeout value, and is that recommended?

You can pass "callTimeout=X" on your connection URL, e.g.:
tcp://host:61616?callTimeout=10000.

The general recommendation is to use the default (for simplicity's sake)
and adjust as necessary for your use-case. Lowering the timeout means you
will detect such issues sooner, but also that you will be more susceptible
to timeouts in the event of a broker slow-down. It's a balancing act.
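
To make the balancing act concrete, here is a back-of-the-envelope sketch.
It assumes each failed attempt blocks for the full callTimeout and ignores
reconnect time, so it is an upper bound on attempts rather than a
guarantee. The RetryBudget class and the 45-second application deadline are
hypothetical values chosen for the example.

```java
final class RetryBudget {
    /**
     * Upper bound on the number of send attempts that can each time out
     * in full before the caller's own deadline expires.
     */
    static long maxAttemptsWithin(long rpcDeadlineMs, long callTimeoutMs) {
        if (rpcDeadlineMs <= 0 || callTimeoutMs <= 0) {
            throw new IllegalArgumentException("timeouts must be positive");
        }
        return rpcDeadlineMs / callTimeoutMs;
    }

    public static void main(String[] args) {
        long rpcDeadlineMs = 45_000; // hypothetical application-level RPC timeout
        // Default callTimeout: only one attempt fits, so there is no room to retry.
        System.out.println(maxAttemptsWithin(rpcDeadlineMs, 30_000));
        // callTimeout=10000: several attempts fit before the deadline.
        System.out.println(maxAttemptsWithin(rpcDeadlineMs, 10_000));
    }
}
```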

> Are there yet more codes we should retry?

I can't think of any additional codes.
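
For reference, the range check you describe (AMQ219011 - AMQ219016) could
be sketched roughly as below. The RetryableCodes class is a hypothetical
helper, and it assumes the code appears in the exception message in the
form "AMQ" followed by six digits, as in the log line you posted. Only the
first code found in the message is considered.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RetryableCodes {
    // Matches codes such as "AMQ219014" anywhere in the message.
    private static final Pattern AMQ_CODE = Pattern.compile("AMQ(\\d{6})");

    // Range taken from the original question, not an official list.
    static final int FIRST_RETRYABLE = 219011;
    static final int LAST_RETRYABLE = 219016;

    /** Returns true if the first AMQ code in the message falls in the retry range. */
    static boolean isRetryable(String errorMessage) {
        if (errorMessage == null) {
            return false;
        }
        Matcher m = AMQ_CODE.matcher(errorMessage);
        if (!m.find()) {
            return false;
        }
        int code = Integer.parseInt(m.group(1));
        return code >= FIRST_RETRYABLE && code <= LAST_RETRYABLE;
    }

    public static void main(String[] args) {
        System.out.println(isRetryable(
            "AMQ219014: Timed out after waiting 30000 ms for response when sending packet 71"));
        System.out.println(isRetryable("some unrelated error"));
    }
}
```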


Justin

[1]
https://activemq.apache.org/components/artemis/documentation/latest/duplicate-detection.html#duplicate-message-detection

On Fri, Jan 19, 2024 at 11:37 AM John Lilley
<john.lil...@redpointglobal.com.invalid> wrote:

> Greetings,
>
>
>
> Lino already posted this, but I think it got buried in the larger
> discussion of HA configuration.
>
>
>
> When a failover happens, we occasionally hit errors like
>
>
>
> 2024-01-18T22:46:13.436 [http-nio-9910-exec-7]
> RpcExceptionMapper.toResponse:79 [] INFO - Error in RPC response
> RpcException: httpCode=500, errorMessage=error sending message: AMQ219014:
> Timed out after waiting 30000 ms for response when sending packet 71
>
> The exception is thrown from
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send().  See
> Lino’s previous post for the entire stack trace.
>
>
>
> I’ve seen this kind of thing before and added logic to retry the send()
> call for “AMQ219016 Connection failure detected. Unblocking a blocking
> call that will never get a response”.
>
>
>
> I don’t have a problem retrying this code as well (in fact I’ve added the
> range AMQ219011 - AMQ219016 to the retry logic), but the 30 second delay is
> quite long, and it starts to trigger our own RPC timeouts by the time all
> of the reconnect is performed.
>
>
>
> So my questions are:
>
>    - Is it expected that we’d hit this error during failover and that it
>    takes this long to manifest?
>    - Is there a way to reduce the timeout value, and is that recommended?
>    - Are there yet more codes we should retry?
>
> Thanks
>
> john
>
>
>
>
>
> John Lilley
>
> Data Management Chief Architect, Redpoint Global Inc.
>
> 34 Washington Street, Suite 205 Wellesley Hills, MA 02481
>
> *M: *+1 7209385761 | john.lil...@redpointglobal.com
>
> PLEASE NOTE: This e-mail from Redpoint Global Inc. (“Redpoint”) is
> confidential and is intended solely for the use of the individual(s) to
> whom it is addressed. If you believe you received this e-mail in error,
> please notify the sender immediately, delete the e-mail from your computer
> and do not copy, print or disclose it to anyone else. If you properly
> received this e-mail as a customer, partner or vendor of Redpoint, you
> should maintain its contents in confidence subject to the terms and
> conditions of your agreement(s) with Redpoint.
>
