Hi David,

Thanks for bringing this up. I have several thoughts on this and would love to discuss them in more detail.
At this time there is a fair bit of cognitive load with the 3.2.0 release, but I will happily discuss right after that.

Best regards,
Vikram

On Thu, Apr 2, 2026 at 6:50 AM Blain David <[email protected]> wrote:

> Hi everyone,
>
> I would like to start a discussion about whether there would be interest
> in adding a SAS provider to Airflow, based on some challenges and
> solutions we've encountered in our environment. Let me explain.
>
> Until recently, we relied on the JdbcHook combined with GenericTransfer
> to extract data from SAS to our Postgres DWH (a reduced sketch of this
> setup is in the PS below). However, this approach introduces significant
> overhead due to the JVM dependency (via jaydebeapi). In practice, this
> leads to duplicated memory usage and instability during larger transfers
> (as our workers run on memory-limited K8S pods). Lately, we've
> experienced multiple failures on new imports of very large tables,
> including JVM crashes caused by OOM issues as well as intermittent
> backend errors from SAS during long runs (multiple hours, especially
> for the initial runs).
>
> We also evaluated saspy as an alternative, but since it relies on the
> same underlying JDBC/JVM mechanism (jaydebeapi), it does not resolve
> these issues.
>
> To address this, we implemented a native SASHook that uses SSH underneath
> to connect to the SAS environment. This allows us to execute SQL queries
> directly on the SAS server and retrieve results without involving a JVM.
> On top of that, we've built a DB-API 2.0-compatible cursor implementation
> that handles the interaction over SSH and integrates with pandas
> (leveraging read_sas for returning results as DataFrames), so it
> integrates transparently with the DbApiHook (a simplified sketch of the
> cursor is also in the PS below).
>
> This approach has proven to be much more stable in our DAGs. Since
> adopting it, we have not encountered any failures. While performance is
> not on par with native ODBC or database drivers, it is still an
> improvement over the JDBC bridge. For reference, we can fetch around 10k
> records in ~2 seconds. The initial query is slightly slower due to the
> SSH handshake and session setup, but overall the trade-off is acceptable
> given the stability gains. Performance can probably still be improved,
> and we will look into it further, but at least we resolved the
> instability issue and our pipelines are reliable again.
>
> I would like to get feedback on how to contribute this, if there is any
> interest of course.
>
> Should I split this into two parts? A standalone open-source project
> providing a generic SAS DB-API 2.0 implementation and an Airflow
> provider that depends on this library? Or would it be preferable to
> include the full implementation directly within the SAS Airflow provider
> (e.g., as a hook and cursor)?
>
> I'm happy to contribute the implementation and help maintain it, but I'd
> like to align with the community on the preferred approach before moving
> forward.
>
> The advantage of this solution is that it doesn't need any new
> dependencies; it relies purely on libraries (paramiko and pandas)
> already present in the Airflow ecosystem.
>
> Any thoughts or suggestions would be greatly appreciated.
>
> Best regards,
> David
>
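> PS: To make this more concrete, here is a reduced sketch of our current
> JdbcHook + GenericTransfer setup (connection ids and table names are
> illustrative; import paths may differ between Airflow versions):
>
>     from datetime import datetime
>
>     from airflow import DAG
>     from airflow.operators.generic_transfer import GenericTransfer
>
>     with DAG(
>         dag_id="sas_to_postgres",
>         start_date=datetime(2026, 1, 1),
>         schedule=None,
>         catchup=False,
>     ) as dag:
>         # GenericTransfer pulls the full result set into worker memory
>         # (get_records on the source hook, insert_rows on the destination),
>         # which on top of the JVM heap explains the OOM pressure we see
>         # on very large tables.
>         GenericTransfer(
>             task_id="copy_sas_table",
>             source_conn_id="sas_jdbc",           # JdbcHook, JVM via jaydebeapi
>             destination_conn_id="postgres_dwh",  # PostgresHook
>             sql="SELECT * FROM mylib.mytable",
>             destination_table="staging.mytable",
>             preoperator="TRUNCATE TABLE staging.mytable",
>         )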
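> And a very reduced sketch of the SSH-based cursor idea, to show the
> shape rather than the actual implementation (it assumes the remote host
> has sas on the PATH, that we can stage an XPORT file in /tmp, and the
> real code has proper error handling, typing and cleanup):
>
>     import io
>     import uuid
>
>     import pandas as pd
>     import paramiko
>
>     class SASCursor:
>         """Minimal DB-API 2.0-style cursor running PROC SQL over SSH."""
>
>         def __init__(self, client: paramiko.SSHClient):
>             self._client = client
>             self._df = None
>             self.description = None
>             self.rowcount = -1
>
>         def execute(self, sql, parameters=None):
>             remote_xpt = f"/tmp/{uuid.uuid4().hex}.xpt"
>             # PROC SQL materializes the result; PROC COPY writes it to a
>             # transport file that pandas.read_sas can parse locally.
>             program = (
>                 f"libname out xport '{remote_xpt}';\n"
>                 f"proc sql; create table work.result as {sql}; quit;\n"
>                 "proc copy in=work out=out; select result; run;\n"
>             )
>             stdin, stdout, stderr = self._client.exec_command(
>                 "sas -nodms -stdio"
>             )
>             stdin.write(program.encode())
>             stdin.channel.shutdown_write()
>             if stdout.channel.recv_exit_status() > 1:  # 0 ok, 1 warnings
>                 raise RuntimeError(stderr.read().decode())
>             sftp = self._client.open_sftp()
>             try:
>                 with sftp.open(remote_xpt, "rb") as fh:
>                     self._df = pd.read_sas(
>                         io.BytesIO(fh.read()), format="xport"
>                     )
>             finally:
>                 sftp.remove(remote_xpt)
>                 sftp.close()
>             self.rowcount = len(self._df)
>             self.description = [
>                 (col, None, None, None, None, None, None)
>                 for col in self._df.columns
>             ]
>
>         def fetchall(self):
>             return list(self._df.itertuples(index=False, name=None))
>
>         def close(self):
>             self._df = None
>
> The DbApiHook then only needs a connection object whose cursor() returns
> this class, so the standard DB-API machinery works unchanged.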
