With the New Year's holiday, I didn't see the discussion split. Sorry for the late catch-up!
As a maintainer of Airflow instances, it's great that I can bump any provider's version to get a bugfix quickly, without having to upgrade all the instances. But indeed, right now only the ODBC provider can handle pyodbc.Row serialization.

> I think the *only* possible way to make ODBC Row handled by serde is for
> serde to discover serializers from installed providers and use them from
> there.

Sounds like a nice solution! But then the ODBC hook would return a pyodbc.Row object, right? That would defeat the current initiative to have common tuples everywhere.

I'm going to expand the scope a bit and do a bit of fiction: IMO it would be great to have a real, usable "DbApiHook". Not an abstract class, like now, but a hook similar to the ODBC one that lets you install any DB-API 2.0 Python library and set up a Connection with it, without subclassing anything. It could live outside of the common.sql provider, in a dedicated one.

With this hook it would be possible to use the pyodbc package directly. Thus, two providers would need pyodbc.Row serialization. Putting the serializer somewhere in a provider and making it discoverable by serde would avoid coupling with Airflow, while letting both providers use it. Great!

Further fiction: maybe a mechanism to discover those DB-API 2.0 compliant packages, wrapping them into a thin provider? E.g. we could have a "pyodbc provider" with just a serializer for serde and pyodbc as a dependency?

Further further fiction: would that open the door to dropping the ODBC provider? Operators need to die, after all 😊

In this situation, do we care whether we return a tuple, a Databricks.Row or a pyodbc.Row? Those objects are compliant with Python DB-API 2.0 after all. Today, a user who adds the ODBC provider expects standard types, because they set up the "Airflow standardized" ODBC provider. In the fiction world, a user who sets up a "pyodbc provider" should expect to get pyodbc.Row objects.

Again, sorry for catching up late. I'll open a PR with a proposal for a pyodbc.Row serializer in the ODBC provider. The earlier the better, if that might become a dependency of other providers in the future.
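For illustration, here is a rough sketch of what such a serializer could look like, assuming it follows the same convention as the existing modules in airflow.serialization.serializers (module-level serializers/deserializers lists plus serialize/deserialize functions) - the actual PR may well differ:

    # Sketch only - module path and details are illustrative, not the final PR,
    # e.g. airflow/providers/odbc/serializers/pyodbc.py
    from __future__ import annotations

    from typing import Any

    serializers = ["pyodbc.Row"]   # type names this module claims to handle
    deserializers: list[str] = []  # a pyodbc.Row cannot be reinstated, so none
    __version__ = 1


    def serialize(o: object) -> tuple[Any, str, int, bool]:
        try:
            from pyodbc import Row
        except ImportError:
            return "", "", 0, False
        if not isinstance(o, Row):
            return "", "", 0, False
        # A Row is just a sequence of column values; a plain tuple keeps the
        # "common tuples everywhere" behaviour and needs no pyodbc at read time.
        # (Assumes the row holds JSON-friendly values; anything else would need
        # further handling.)
        return tuple(o), "pyodbc.Row", __version__, True

The deferred pyodbc import keeps the module importable even where pyodbc is not installed.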
Joffrey Bienvenu

-----Original Message-----
From: Jarek Potiuk <ja...@potiuk.com>
Sent: Monday, 11 December 2023 20:22
To: dev@airflow.apache.org
Subject: Re: [DISCUSS] Using serde for PyODBC case/other cases [was Serialization in Apache Airflow]

Tried to give others some opportunity to comment, but I see it's mostly me <> Bolke.

> > Yep. I rectified the implicitness by storing this as an explicit
> > decision record in the ADR proposal. I think it's been - similarly
> > as lack of
>
> <snip>

Happy to help here - since there is a good "vibe" on the ADR kind of approach, I am happy to "lead by example" and propose a PR with ADRs for serialization (learning from the whole discussion - and that's a good opportunity for me to learn some of the nitty-gritty details). Would that work for you, and can I count on some reviews?

> It was added to the DAG serializer at the time, I mentioned that it
> looked weird given the fact that it did not seem connected to DAG
> serialization (to me) and serde should be considered. So no, I wasn't
> insisting on making it part of serialization imho, it de facto was
> already.

Let's straighten it up after writing ADRs and rounding up the serialization case. We are aiming to enable AIP-44 and Pydantic in 2.9, so that's a good time to see and make some measurements there.

> > Going through either of the serializations of ours here is not
> > needed overhead at all - we can just purely rely on Rust and get ORM
> > on one side and Pydantic object on the other without touching any of
> > our serialization code.
>
> That's fine to me.

Yeah. We'll sort it out. Luckily it's a very localized place and we can simply open a draft PR showing how it will look and see if it looks simpler / whether we lose any of the properties that

> I don't think you want to ;-).

Happy also to take a look at helping to straighten up DAG serialization, possibly once I get a bit more familiar with the context.

> We could. It's probably just a three pass serializer then.

Interesting :).

> Exactly not. By adding it to Airflow 2.8.0 it would have made every
> previous version of this provider suddenly work without any
> adjustments against Airflow 2.8.0. Remember that the original bug
> report said "cannot serialize value of type PyODBC.Row".

I was more talking about the compatibility of the ODBC provider. If we want to directly return ODBC.Row, it will only be serializable in Airflow 2.8.0 (currently the ODBC provider requires Airflow >= 2.6.0). What I want to achieve is a fix where you can upgrade just the provider and get serializable output even if the user decides to stay with Airflow 2.8.0. This w

> You can argue both ways here architecturally.
>
> 1) Airflow 2.8 with a new feature that allows PyODBC.Row to be serialized,
> affecting past and future providers at large and (TaskFlow) returned
> variables.

This requires users to upgrade to Airflow 2.8.0. Many of our users delay such upgrades for months or years, so we are talking about making it possible for them to upgrade "easily", where selective upgrades of providers are the easy way. This also works better for some Airflow-as-a-service offerings where 2.8 might only become available months from now. Migration to a new Airflow version is far more involved, and uptake of new versions (especially .0 releases) is slow. We are basically talking about having something available now - for Airflow 2.6, 2.7 (and 2.8) - vs. many users having it months or years from now. And releasing it now fixes an actual issue people have raised with us.

Implementing it in the provider has an added benefit: IF we also implement a generic serializer as part of the ODBC provider and have a way for serde to discover serializers from providers, then we could implement it in the provider in a way that either uses internal "serialization" or lets serde discover it.

I really see no benefit in having Airflow know anything about ODBC. Are we going to implement all 80 providers' specific code, put it into Airflow, and take on all their dependencies (even optional ones) as Airflow dependencies, now that we made an enormous effort to separate these? Imagine ODBC code depends on odbc_library. What happens if Airflow has an optional dependency on odbc_library <3 and the provider depends on odbc_library >=3? Imagine that the provider returns an ODBCRow from version 3 that the serializer won't be able to handle. How do you install the two together? How do you make them work together? It really makes very little sense to have Airflow depend on a library and code that the provider already depends on, because this will create a whole new class of potential conflicts.
I think the *only* possible way to make ODBC Row handled by serde is for serde to discover serializers from installed providers and use them from there. This solves all the "dependency conflict" cases nicely and is long-term sustainable and maintainable.

> This is exactly the intention with the serializers in
> airflow.serialization.serializers, however it can be tough as not all
> expose the information in public APIs (e.g. deltalake) that allows
> reinstating the object.
>
> But yes: loose, and pragmatic public API / SemVer.

Yep.
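For what it's worth, a minimal sketch of what such provider-side discovery could look like. Everything here is hypothetical for illustration - the entry-point group name is invented, and the real mechanism could just as well be provider.yaml metadata or something else entirely:

    # Hypothetical discovery of serializer modules shipped by providers.
    from importlib import import_module
    from importlib.metadata import entry_points


    def discover_provider_serializers() -> dict[str, object]:
        """Map each advertised type name to the serializer module that handles it."""
        registry: dict[str, object] = {}
        # "apache_airflow.serializers" is an invented entry-point group name;
        # entry_points(group=...) requires Python 3.10+.
        for ep in entry_points(group="apache_airflow.serializers"):
            module = import_module(ep.value)  # ep.value is the module path
            for type_name in getattr(module, "serializers", []):
                registry[type_name] = module
        return registry

serde could then consult such a registry before falling back to its built-in serializers, so a provider-only upgrade would be enough to make a new type serializable.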