With the New Year's holiday, I didn't see the discussion split. Sorry for the late catch-up!
As a maintainer of Airflow instances, it's great that I can bump any provider's version to get a bugfix quickly, without having to upgrade all the instances. But indeed, right now only the ODBC provider can handle pyodbc.Row serialization.

> I think the *only* possible way to make ODBC Row handled by serde is for
> serde to discover serializers from installed providers and use them from
> there.

Sounds like a nice solution! But then the ODBC hook would return a pyodbc.Row object, right? That would defeat the current initiative to have common tuples everywhere.

I'm going to expand the scope a bit and do a bit of fiction: IMO it would be great to have a real, usable "DbApiHook". Not an abstract class, like now, but a hook similar to the ODBC one that lets you install any DB-API 2.0 Python library and set up a Connection with it, without subclassing anything. It could live outside of the common.sql provider, in a dedicated one.

With this hook it would be possible to use the pyodbc package directly. Thus, two providers would need pyodbc.Row serialization. Putting the serializer somewhere in a provider and making it discoverable by serde would avoid coupling with Airflow, while letting both providers use it. Great!

Further fiction: maybe a mechanism to discover those DB-API 2.0 compliant packages, wrapping them into a thin provider? E.g. we could have a "pyodbc provider" with just a serializer for serde and pyodbc as a dependency?

Further further fiction: would that open the door to dropping the ODBC provider? Operators need to die, after all 😊

In this situation, do we care whether we return a tuple, a Databricks.Row or a pyodbc.Row? Those objects are compliant with Python DB-API 2.0 after all. Today, a user who adds the ODBC provider expects standard types, because they set up the "Airflow standardized" ODBC provider. In the fiction world, a user who sets up a "pyodbc provider" should expect to get pyodbc.Row objects.

Again, sorry for catching up late. I'll open a PR with a proposal for a pyodbc.Row serializer in the ODBC provider. The earlier the better, if that might become a dependency of other providers in the future.
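For illustration, here is a rough sketch of what such a serializer could look like, assuming it follows the same convention as the existing modules in airflow.serialization.serializers (module-level serializers/deserializers lists plus serialize/deserialize functions) - the actual PR may well differ:

    # Sketch only - module path and details are illustrative, not the final PR,
    # e.g. airflow/providers/odbc/serializers/pyodbc.py
    from __future__ import annotations

    from typing import Any

    serializers = ["pyodbc.Row"]   # type names this module claims to handle
    deserializers: list[str] = []  # a pyodbc.Row cannot be reinstated, so none
    __version__ = 1


    def serialize(o: object) -> tuple[Any, str, int, bool]:
        try:
            from pyodbc import Row
        except ImportError:
            return "", "", 0, False
        if not isinstance(o, Row):
            return "", "", 0, False
        # A Row is just a sequence of column values; a plain tuple keeps the
        # "common tuples everywhere" behaviour and needs no pyodbc at read time.
        # (Assumes the row holds JSON-friendly values; anything else would need
        # further handling.)
        return tuple(o), "pyodbc.Row", __version__, True

The deferred pyodbc import keeps the module importable even where pyodbc is not installed.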
Joffrey Bienvenu

-----Original Message-----
From: Jarek Potiuk <ja...@potiuk.com>
Sent: Monday, 11 December 2023 20:22
To: dev@airflow.apache.org
Subject: Re: [DISCUSS] Using serde for PyODBC case/other cases [was Serialization in Apache Airflow]

Tried to give others some opportunity to comment, but I see it's mostly me <> Bolke.

> > Yep. I rectified the implicitness by storing this as an explicit
> > decision record in the ADR proposal. I think it's been - similarly
> > as lack of
>
> <snip>

Happy to help here - since there is a good "vibe" on the ADR kind of approach, I am happy to "lead by example" and propose a PR with ADRs for serialization (learning from the whole discussion - and that's a good opportunity for me to learn some of the nitty-gritty details). Would that work for you, and can I count on some reviews?

> It was added to the DAG serializer at the time, I mentioned that it
> looked weird given the fact that it did not seem connected to DAG
> serialization (to me) and serde should be considered. So no, I wasn't
> insisting on making it part of serialization imho, it de facto was
> already.

Let's straighten it up after writing ADRs and rounding up the serialization case. We are aiming to enable AIP-44 and Pydantic in 2.9, so that's a good time to see and make some measurements there.

> > Going through either of the serializations of ours here is not
> > needed overhead at all - we can just purely rely on Rust and get ORM
> > on one side and Pydantic object on the other without touching any of
> > our serialization code.
>
> That's fine to me.

Yeah. We'll sort it out. Luckily it's a very localized place and we can simply open a draft PR showing how it will look and see if it looks simpler / whether we lose any of the properties that

> I don't think you want to ;-).

Happy also to take a look at helping to straighten up DAG serialization, possibly once I get a bit more familiar with the context.

> We could. It's probably just a three pass serializer then.

Interesting :).

> Exactly not. By adding it to Airflow 2.8.0 it would have made every
> previous version of this provider suddenly work without any
> adjustments against Airflow 2.8.0. Remember that the original bug
> report said "cannot serialize value of type PyODBC.Row".

I was more talking about the compatibility of the ODBC provider. If we want to directly return ODBC.Row, it will only be serializable in Airflow 2.8.0 (currently the ODBC provider requires Airflow >= 2.6.0). What I want to achieve is a fix where you can upgrade just the provider and get serializable output even if the user decides to stay with Airflow 2.8.0. This w

> You can argue both ways here architecturally.
>
> 1) Airflow 2.8 with a new feature that allows PyODBC.Row to be serialized,
> affecting past and future providers at large and (TaskFlow) returned
> variables.

This requires users to upgrade to Airflow 2.8.0. Many of our users delay such upgrades for months or years, so we are talking about making it possible for them to upgrade "easily", where selective upgrades of providers are the easy way. This also works better for some Airflow-as-a-service offerings where 2.8 might only become available months from now. Migration to a new Airflow version is far more involved, and uptake of new versions (especially .0 releases) is slow. We are basically talking about having something available now - for Airflow 2.6, 2.7 (and 2.8) - vs. many users having it months or years from now. And releasing it now fixes an actual issue people have raised with us.

Implementing it in the provider has an added benefit: IF we also implement a generic serializer as part of the ODBC provider and have a way for serde to discover serializers from providers, then we could implement it in the provider in a way that either uses internal "serialization" or lets serde discover it.

I really see no benefit in having Airflow know anything about ODBC. Are we going to implement all 80 providers' specific code, put it into Airflow, and take on all their dependencies (even optional ones) as Airflow dependencies, now that we made an enormous effort to separate these? Imagine ODBC code depends on odbc_library. What happens if Airflow has an optional dependency on odbc_library <3 and the provider depends on odbc_library >=3? Imagine that the provider returns an ODBCRow from version 3 that the serializer won't be able to handle. How do you install the two together? How do you make them work together? It really makes very little sense to have Airflow depend on a library and code that the provider already depends on, because this will create a whole new class of potential conflicts.
I think the *only* possible way to make ODBC Row handled by serde is for serde to discover serializers from installed providers and use them from there. This solves all the "dependency conflict" cases nicely and is long-term sustainable and maintainable.

> This is exactly the intention with the serializers in
> airflow.serialization.serializers, however it can be tough as not all
> expose the information in public APIs (e.g. deltalake) that allows
> reinstating the object.
>
> But yes: loose, and pragmatic public API / SemVer.

Yep.
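For what it's worth, a minimal sketch of what such provider-side discovery could look like. Everything here is hypothetical for illustration - the entry-point group name is invented, and the real mechanism could just as well be provider.yaml metadata or something else entirely:

    # Hypothetical discovery of serializer modules shipped by providers.
    from importlib import import_module
    from importlib.metadata import entry_points


    def discover_provider_serializers() -> dict[str, object]:
        """Map each advertised type name to the serializer module that handles it."""
        registry: dict[str, object] = {}
        # "apache_airflow.serializers" is an invented entry-point group name;
        # entry_points(group=...) requires Python 3.10+.
        for ep in entry_points(group="apache_airflow.serializers"):
            module = import_module(ep.value)  # ep.value is the module path
            for type_name in getattr(module, "serializers", []):
                registry[type_name] = module
        return registry

serde could then consult such a registry before falling back to its built-in serializers, so a provider-only upgrade would be enough to make a new type serializable.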