A shot in the dark, but could it be the DNS caching issue described here:
https://mux.com/blog/5-years-of-flink-at-mux/ ?
> The JVM will cache DNS entries forever by default. This is
> undesirable in Kubernetes deployments where there’s an expectation that
> DNS entries can and do change frequently as pod deployments move between
> nodes. We’ve seen Flink applications suddenly unable to talk to other
> services in the cluster after pods are upgraded.
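If that turns out to be the cause, one way to rule it out is to cap the JVM's DNS cache lifetime explicitly. The following is only a minimal sketch: the class name `DnsCacheTtl` and the 30-second value are made up for illustration, and it has to run before the job's first hostname lookup. Note that, according to the `java.net.InetAddress` documentation, the cache-forever default applies when a security manager is installed; otherwise the lifetime is implementation-specific, so it is worth checking what your JVM actually does.

```java
import java.security.Security;

public class DnsCacheTtl {
    // Hypothetical helper: caps the JVM-wide cache for successful DNS lookups
    // at ttlSeconds and returns the value that is now in effect.
    static String apply(int ttlSeconds) {
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(ttlSeconds));
        return Security.getProperty("networkaddress.cache.ttl");
    }

    public static void main(String[] args) {
        // 30 seconds is an arbitrary illustrative value, not a recommendation.
        System.out.println("networkaddress.cache.ttl=" + apply(30));
    }
}
```

The same limit can, as far as I know, also be set with the `sun.net.inetaddr.ttl` system property on the JVM command line, which avoids touching the job code.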
Best regards,
Dario
On 16.12.21 19:09, Julian Cardarelli wrote:
The job connects to HTTP REST-based microservices that are outside a
Kubernetes HA setup for Flink. At arbitrary, inconsistent intervals (it
could be 10 days, it could be 28 days), the calls stop going out on this
one job, but not on others.
Recycling the job brings it back. The job and state all appear intact
at the time of the cessation, with the job in running state and no
discernible exceptions.
I suppose it could be something in the network layer, but because other
jobs aren’t impacted I feel something else must be going on.
The code throws nothing during this time period.
Is there any instrumentation we should be enabling to find out more
detail? It’s a bit troublesome to reproduce, so we want to have all of
that in place for the next time it happens.
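As an illustration of the kind of instrumentation asked about here, the sketch below wraps each outbound call and keeps simple counters. It is a plain-Java sketch under assumptions, not something taken from the job itself: `OutboundCallStats` and `track` are hypothetical names. The idea is that a job which stays in running state while throwing nothing would show up either as attempts that stop increasing (nothing is reaching the call site) or as a non-zero in-flight count that never drains (a call is blocked).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicLong;

public class OutboundCallStats {
    final AtomicLong attempts = new AtomicLong();
    final AtomicLong successes = new AtomicLong();
    final AtomicLong failures = new AtomicLong();
    // Epoch millis of the last call that returned normally; 0 until the first success.
    final AtomicLong lastSuccessMillis = new AtomicLong();

    // Runs the call, updates the counters, and rethrows any exception unchanged.
    <T> T track(Callable<T> call) throws Exception {
        attempts.incrementAndGet();
        try {
            T result = call.call();
            successes.incrementAndGet();
            lastSuccessMillis.set(System.currentTimeMillis());
            return result;
        } catch (Exception e) {
            failures.incrementAndGet();
            throw e;
        }
    }

    // Calls that have started but neither returned nor thrown an exception yet.
    long inFlight() {
        return attempts.get() - successes.get() - failures.get();
    }

    public static void main(String[] args) throws Exception {
        OutboundCallStats stats = new OutboundCallStats();
        stats.track(() -> "ok");
        try {
            stats.track(() -> { throw new IllegalStateException("simulated failure"); });
        } catch (IllegalStateException expected) {
            // Already counted as a failure inside track().
        }
        System.out.println("attempts=" + stats.attempts + " successes=" + stats.successes
                + " failures=" + stats.failures + " inFlight=" + stats.inFlight());
    }
}
```

In a Flink job these values could presumably be exposed through the operator's metric group rather than printed, so they are visible alongside the other job metrics. A thread dump of the TaskManager taken while the job is stalled (for example with `jstack`) would complement this by showing whether a thread is sitting in a socket read.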
___
Julian Cardarelli
CEO
T
*(800) 961-1549* <tel:(800)%20961-1549>
E
*jul...@thentia.com* <mailto:jul...@thentia.com>
*LinkedIn* <https://www.linkedin.com/in/julian-cardarelli/>
Thentia Website
<https://www.thentia.com/?utm_source=signature&utm_medium=banner&utm_campaign=evergreen>
DISCLAIMER
Neither Thentia Corporation, nor its directors, officers,
shareholders, representatives, employees, non-arms length companies,
subsidiaries, parent, affiliated brands and/or agencies are licensed
to provide legal advice. This e-mail may contain among other things
legal information. We disclaim any and all responsibility for the
content of this e-mail. YOU MUST NOT rely on any of our communications
as legal advice. Only a licensed legal professional may give you
advice. Our communications are never provided as legal advice, because
we are not licensed to provide legal advice nor do we possess the
knowledge, skills or capacity to provide legal advice. We disclaim any
and all responsibility related to any action you might take based upon
our communications and emphasize the need for you to never rely on our
communications as the basis of any claim or proceeding.
CONFIDENTIALITY
This email and any files transmitted with it are confidential and
intended solely for the use of the individual or entity to whom they
are addressed. If you have received this email in error please notify
the system manager. This message contains confidential information and
is intended only for the individual(s) named. If you are not the named
addressee(s) you should not disseminate, distribute or copy this
e-mail. Please notify the sender immediately by e-mail if you have
received this e-mail by mistake and delete this e-mail from your
system. If you are not the intended recipient you are notified that
disclosing, copying, distributing or taking any action in reliance on
the contents of this information is strictly prohibited.
------------------------------------------------------------------------
*From:* Chesnay Schepler <ches...@apache.org>
*Sent:* Wednesday, December 15, 2021 10:09:32 AM
*To:* Julian Cardarelli <jul...@thentia.com>; user@flink.apache.org
<user@flink.apache.org>
*Subject:* [EXTERNAL] Re: Periodic Job Failure
How are you deploying the job and the external services? Is the period
in which this happens usually the same?
Is it just a connection issue with external services, or are there
other errors as well?
On 15/12/2021 15:47, Julian Cardarelli wrote:
Hello –
We have a job that seems to stop working after some period of time –
perhaps 10-12 days. The job itself appears in the running state, but
for some reason it just stops communicating with external services.
I know this e-mail will be like “we don’t know what’s wrong with your
code.” I get that part, but if we cancel the job and resubmit,
everything flows again.
There doesn’t seem to be a clear answer on this and there is nothing
in the stack trace.
So, my question is what’s the best practice for troubleshooting
unexplained job malfunction over a prolonged period of time?
Thanks!
-jc
*Disclaimer*
The information contained in this communication from the sender is
confidential. It is intended solely for use by the recipient and
others authorized to receive it. If you are not the recipient, you
are hereby notified that any disclosure, copying, distribution or
taking action in relation of the contents of this information is
strictly prohibited and may be unlawful.
This email has been scanned for viruses and malware, and may have
been automatically archived by Mimecast, a leader in email security
and cyber resilience. Mimecast integrates email defenses with brand
protection, security awareness training, web security, compliance and
other essential capabilities. Mimecast helps protect large and small
organizations from malicious activity, human error and technology
failure; and to lead the movement toward building a more resilient
world. To find out more, visit our website.