When I had a similar issue, it turned out that the way the task(s) were written, they'd rapidly open a large number of new RDS connections.
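One way to reduce that connection churn is to share a single pooled SQLAlchemy engine across tasks rather than connecting from scratch each time. This is a minimal sketch, not the poster's actual fix; the pool numbers are illustrative and `run_task` is a hypothetical helper, with SQLite standing in for the real RDS URL:

```python
# Sketch: bound connection (and DNS lookup) churn with a shared pool.
# Assumption: tasks previously opened a fresh DB connection each time.
from sqlalchemy import create_engine, text
from sqlalchemy.pool import QueuePool

engine = create_engine(
    "sqlite://",            # stand-in for the real RDS cluster endpoint URL
    poolclass=QueuePool,
    pool_size=5,            # keep at most 5 idle connections open
    max_overflow=5,         # allow 5 extra connections under burst load
    pool_recycle=1800,      # recycle connections older than 30 minutes
    pool_pre_ping=True,     # test a connection before handing it out
)

def run_task(query: str):
    # Each task borrows a connection from the shared pool instead of
    # resolving the endpoint and opening a brand-new connection.
    with engine.connect() as conn:
        return conn.execute(text(query)).scalar()
```

With a bounded pool, only a handful of fresh connections (and hence endpoint resolutions) can happen in any short window, which should keep you under the lookup throttle.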
AWS RDS, particularly if you're using the cluster endpoint, performs a DNS lookup (4 hops, if I recall correctly) before your connection request actually resolves to a real host. This lookup is throttled, and after a certain number of hits in a short time it will return the error above (which is annoying, as it makes it look like the DB just 'vanishes' from time to time).

Brian

On Sat, Aug 15, 2020 at 7:04 PM Ricky Shi <xiao.x....@gmail.com> wrote:

> Hi Everyone,
>
> We encountered a very strange issue with Airflow using AWS RDS as the backend.
> We found that when the number of tasks is big enough (>60), Airflow will
> fail with the following error message (MySQL RDS backend):
>
> sqlalchemy.exc.OperationalError:
> (MySQLdb._exceptions.OperationalError) (2005, "Unknown MySQL server
> host ... $AWS RDS address)
>
> or (Postgres RDS backend):
>
> psycopg2.OperationalError: could not translate host name $AWS RDS address
>
> When we restart Airflow, it becomes fine, and the job scheduler and
> website both run fine. However, it will fail again after a couple of
> days of smooth running, with the same error message.
>
> We found on Stack Overflow that other people are experiencing the same
> issue, but no solution has been found. Does anyone know how to resolve
> this?
>
> Thanks,
>
> --
> Ricky Shi