cansjt opened a new issue #21087:
URL: https://github.com/apache/airflow/issues/21087


   ### Apache Airflow version
   
   2.2.3 (latest released)
   
   ### What happened
   
   After upgrading Airflow to 2.2.3 (from 2.2.2) and the cncf.kubernetes 
provider to 3.0.1 (from 2.0.3), we started to see these errors in the logs:
   ```
   {"asctime": "2022-01-25 08:19:39", "levelname": "ERROR", "process": 565811, "name": "airflow.executors.kubernetes_executor.KubernetesJobWatcher", "funcName": "run", "lineno": 111, "message": "Unknown error in KubernetesJobWatcher. Failing", "exc_info": "Traceback (most recent call last):\n  File \"/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py\", line 102, in run\n    self.resource_version = self._run(\n  File \"/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py\", line 145, in _run\n    for event in list_worker_pods():\n  File \"/usr/local/lib/python3.9/site-packages/kubernetes/watch/watch.py\", line 182, in stream\n    raise client.rest.ApiException(\nkubernetes.client.exceptions.ApiException: (410)\nReason: Expired: too old resource version: 655595751 (655818065)\n"}
   Process KubernetesJobWatcher-6571:
   Traceback (most recent call last):
     File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
       self.run()
     File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 102, in run
       self.resource_version = self._run(
     File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 145, in _run
       for event in list_worker_pods():
     File "/usr/local/lib/python3.9/site-packages/kubernetes/watch/watch.py", line 182, in stream
       raise client.rest.ApiException(
   kubernetes.client.exceptions.ApiException: (410)
   Reason: Expired: too old resource version: 655595751 (655818065)
   ``` 
   Pods are created and run to completion, but the KubernetesJobWatcher 
seems unable to see that they completed. From there, Airflow comes to a 
complete halt.
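
To illustrate the failure mode, here is a rough sketch of how a watch loop could recover from a 410 instead of dying. It is not Airflow's code and does not call the real Kubernetes client: `ApiError` is a hypothetical stand-in for `kubernetes.client.exceptions.ApiException` (which exposes a `status` attribute), and `watch_with_reset` and `stream_from` are names invented for this example. Falling back to resource version `"0"` is an assumption about a reasonable recovery strategy, not a confirmed fix.

```python
from typing import Callable, Iterator, Tuple


class ApiError(Exception):
    """Hypothetical stand-in for kubernetes.client.exceptions.ApiException."""

    def __init__(self, status: int, reason: str = "") -> None:
        super().__init__(f"({status}) {reason}")
        self.status = status


def watch_with_reset(
    stream_from: Callable[[str], Iterator[Tuple[str, str]]],
    start_version: str = "0",
    max_resets: int = 3,
) -> Iterator[str]:
    """Yield events from stream_from, restarting the watch after a 410.

    stream_from(version) must yield (resource_version, event) pairs.
    """
    version = start_version
    resets = 0
    while True:
        try:
            for new_version, event in stream_from(version):
                version = new_version  # remember how far we have read
                yield event
            return  # the stream ended normally
        except ApiError as err:
            # Only an expired resource version is recoverable here.
            if err.status != 410 or resets >= max_resets:
                raise
            resets += 1
            version = "0"  # assumption: restart from the current state


def fake_stream(version: str) -> Iterator[Tuple[str, str]]:
    """Simulated API server that rejects the stale version from the log."""
    if version == "655595751":
        raise ApiError(410, "Expired: too old resource version")
    yield ("655818066", "pod-a Succeeded")
    yield ("655818067", "pod-b Succeeded")


events = list(watch_with_reset(fake_stream, start_version="655595751"))
```

With the simulated server above, the first attempt fails with 410, the loop resets the version, and the second attempt delivers both pod events. Any other error status is re-raised unchanged.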
   
   ### What you expected to happen
   
   No errors in the logs, and the job watcher does its job of collecting 
completed jobs.
   
   ### How to reproduce
   
   I wish I knew. I am trying to downgrade the cncf.kubernetes provider to 
previous versions to see if it helps.
   
   ### Operating System
   
   k8s (Airflow images are Debian based)
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-amazon 2.6.0
   apache-airflow-providers-cncf-kubernetes 3.0.1
   apache-airflow-providers-ftp 2.0.1
   apache-airflow-providers-http 2.0.2
   apache-airflow-providers-imap 2.1.0
   apache-airflow-providers-postgres 2.4.0
   apache-airflow-providers-sqlite 2.0.1
   
   ### Deployment
   
   Other
   
   ### Deployment details
   
   The deployment is on k8s v1.19.16, made with helm3.
   
   ### Anything else
   
   The symptoms look a lot like #17629, but the error happens in a different 
place.
   Redeploying, as suggested in that issue, seemed to help, but most jobs that 
were supposed to run last night got stuck again. All jobs use the same pod 
template, without any customization.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
