Hey there,

Also separating this out as another topic.
> > It is a choice that needs to be made at the API level - here Common.sql.
> > In this case the particular form of the tuple was, somewhat implicitly,
> > part of the spec.

Yep. I rectified the implicitness by storing this as an explicit decision
record in the ADR proposal. I think it has been - similarly to the lack of
serialization decision records - an omission that resulted in this somewhat
heated debate :D

> The real question is, when you find something unserializable, do you then
> provide a custom serializer for serde or do you convert your object to
> something that can be serialized by default. The first one will make it
> available to all other places where this object needs to be serialized.
> The latter limits that scope to where the conversion happens - here
> common.sql. So to provide an example in the context of our discussion of
> PyODBC.Row (see the discussion of the linked PR), only common.sql can now
> deal with PyODBC.Row. If for some reason another provider returns a
> PyODBC.Row, that object remains unserializable.

Yes. That's fine for me. I think we do not always have to serialize things
the same way. Additionally, serializing something with serde uses a custom
format that requires a deserialization step before it is usable. The
DbApiHook response can be used "as is", as a tuple, by any other operator
that builds on Common SQL.

> I have explained what the downside is of the approach you took, although
> its impact in this case is probably low (not many other places will
> return PyODBC.Row, but consider DataFrame hypothetically).

Yes. For DataFrame I am fine. That is a "common" format worth having a
serializer for, because it can be produced and consumed by many parties.
That is quite unlikely to happen for ODBC, and DbApiHook simply has its own
"standard" - the tuple - which is the language DbApiHook speaks to its
users. In a way, returning PyODBC.Row violates assumptions made a year ago
(now documented in the ADR proposal) - that the tuple is the "API" of the
DB Hook. So while "make_serializable" was likely a wrong name, its goal was
to "produce a tuple that is directly serializable" (I sketch roughly the
kind of conversion I mean in a PS at the end of this mail). Just to give
another example - we do the same with Snowflake, which in some cases
returns a dictionary instead of tuples of values. In that case we cannot
rely on the Snowflake dictionary - we expect tuples as output, so we MUST
convert (and we cannot use serde here). Here it is mostly accidental that
PyODBC.Row also conforms to tuple semantics, though not fully. So,
similarly to the Snowflake dictionary, we need to bring it to the semantics
that the users of the DbApiHook (the Common SQL operators) will understand
and expect. That's all. Being serializable is one of those properties, and
what we've done is the same thing we do for other DbApiHook
implementations.

> I think you needed a serializable tuple so that it could go into XCom.
> This makes the need for an explicit choice imho.

Yep. That's the explicit choice we made.

> On speed. Afaik AIP-44 still makes use of the DAG Serialization code
> apart from Pydantic's rust implementation. The DAG serialization code is
> seriously slow. So, before claiming speed improvements here show the
> numbers please :-)

Yep. When we first added it we used Pydantic V1, and back then it did not
matter. But we do have it going through our serialization code - because at
some point you insisted on making it part of "a serialization" - even
though we have absolutely no need for that in AIP-44, and I think when we
go to prime time we will just skip it entirely.
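To sketch what skipping it and relying on Pydantic directly would look
like, here is a minimal, self-contained example with made-up models
(assuming Pydantic v2 and SQLAlchemy on the ORM side - this is not the
actual AIP-44 code):

```python
# Hypothetical ORM model and its Pydantic v2 counterpart - no custom
# serialization layer involved, just Pydantic's (Rust-backed) validation.
from pydantic import BaseModel, ConfigDict
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class TaskRow(Base):  # made-up ORM model standing in for a real one
    __tablename__ = "task_row"
    id = Column(Integer, primary_key=True)
    task_id = Column(String)
    state = Column(String)


class TaskRowPydantic(BaseModel):  # made-up pure-Python model
    # from_attributes lets Pydantic read attributes straight off the ORM object
    model_config = ConfigDict(from_attributes=True)
    id: int
    task_id: str
    state: str


orm_obj = TaskRow(id=1, task_id="extract", state="success")
model = TaskRowPydantic.model_validate(orm_obj)          # ORM -> Pydantic
payload = model.model_dump_json()                        # what goes over the wire
restored = TaskRowPydantic.model_validate_json(payload)  # Pydantic on the other side
assert restored == model
```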
Going through either of our serializations here is just unneeded overhead -
we can rely purely on the Rust core and get the ORM object on one side and
the Pydantic object on the other without touching any of our serialization
code. It just works out of the box, without any of our "logic" to maintain,
except explicitly creating the pure-Python models we are going to use
(which are verified via MyPy). More benchmarking and reasoning on the
Pydantic v2 speed improvements are here:
https://thedataquarry.com/posts/why-pydantic-v2-matters/ - and while it is
not the 16x I claimed (I mixed it up with the 16x improvement in emulation
of ARM processors), 5x is still very impressive if you ask me.

> We can of course keep both. However if you ever looked at the DAG
> serialization code you might want to reconsider ;-).

Yep. I do not feel strongly about it. I do not know that code intimately
either :).

> Alternatively, you could build on serde and provide the guarantees that
> you require there. It can be done. I have been able to serialize a DAG
> with serde (hence knowing the 1:10 ratio), but until now haven't added
> it, because its schema looks different.

Yeah, I am fine with that if we can pull it off with backwards
compatibility in place.

> BTW: There is no need for the XCom serializer / serde to deserialize into
> the original object. By convention we like to of course, but it is not
> required.

I understand, but the versioning feature, for example, assumes so, and it
adds complexity if the serialized form is to be used directly. From what I
understand (and what some of the serde tests verify), the API contract of
serde is that you do in fact get the same, comparable object back when you
do the serialize/deserialize roundtrip. And really, the only way to make
use of the serialized object is to deserialize it first: because of the
transparency of serde - it hides its details and design - you are not
supposed to care about or understand the "serialized" form, so you
effectively MUST deserialize to use it. So far - also looking at the tests
- I was under the impression that returning an equivalent object from the
roundtrip was an expected property of serde. But maybe I was mistaken here?
One more reason to have decision records captured somewhere, if that was
not the intention.

> For the record: the XCom serializer lazy loads its dependencies. So it
> does not make core Airflow dependent on client libraries arguably.

But it means that if the same libraries are used both in providers and in
core, we have to declare the dependencies in two places - once for the
provider extra and once for the serde "extension" extra. You cannot reuse
the provider dependencies if you want core to use the same client library,
for two reasons:

1) they might depend on different versions,
2) you would have to specify the provider extra and install the provider
   (which you might not need in this case).

And case 1) can lead to the situation where core and the provider depend on
different, conflicting versions of the libraries, and then we are f*d -
you might not be able to install a given version of the provider together
with a given version of core. If we move the serialization code to the
provider, we have no such problem: it always depends on the versions that
the provider requires.

> The con is that the API becomes strict. We cannot change it easily
> anymore. Of course we can extend that.

But I believe you wanted to make the ODBC provider depend on that very same
API already, didn't you? It seems I see a contradiction here.
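Just to spell out what that dependency would mean in practice, below is a
very rough sketch of a provider-shipped serde serializer for PyODBC.Row.
The module layout (the serializers/deserializers lists, __version__, and
the serialize/deserialize signatures) follows my reading of how the
built-in serde serializers are structured - treat the names and exact
signatures as my assumption, not as a settled API:

```python
# Hypothetical provider module that serde could discover - not actual code.
from __future__ import annotations

from typing import Any

serializers = ["pyodbc.Row"]   # class(es) this module claims to handle
deserializers = serializers
__version__ = 1


def serialize(o: Any) -> tuple[Any, str, int, bool]:
    # pyodbc.Row iterates like a tuple, so a plain list is enough as payload
    return list(o), "pyodbc.Row", __version__, True


def deserialize(classname: str, version: int, data: Any) -> Any:
    if classname != "pyodbc.Row" or version > __version__:
        raise TypeError(f"cannot deserialize {classname} (version {version})")
    # a real pyodbc.Row cannot be rebuilt without a cursor, so the closest
    # equivalent we can return is a plain tuple with the same values
    return tuple(data)
```

Note that even this route would hand back a plain tuple on
deserialization, which is exactly the tuple semantics discussed above.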
It looks like it is ok for the ODBC provider to depend on serde and
implement its serialization code there, but on the other hand it is not ok
for (say) the DeltaLake serializer to do the same and become part of the
Databricks provider? Or maybe I missed something? Maybe your idea was to
make the ODBC serializer part of Airflow 2.8 and add "apache-airflow >=
2.8.0" as a requirement for the ODBC provider? That's what I call
unnecessary coupling.

> > Maybe that's a bit of a different question for Pandas - as we have no
> > separate Pandas provider and we use and link to it already from Airflow
> > core. But... maybe it's also a sign we need a pandas provider in this
> > case? This question for me is open.
>
> To me this depends on what we want to do with data awareness (separate
> discussion). If you think:
>
> ObjectStoragePath : Table : Dataframe
>
> it is useful to have dataframe as part of core. This could be pandas or
> maybe polars which is so much faster :-).

Yeah - I am not too strong on that, and I believe there are benefits to
having both Pandas and Polars. Especially if we can make sure it is very
loosely coupled (using public APIs only, and being very clear about
SemVer/versioning for those, to stay as open as possible), with pandas and
polars as optional extras, I am quite fine with it. I think we do not need
a provider in this case.

> Case-by-case is fine to me as long as the choice is explicit and thus
> options are considered.

Yep. I think this discussion is mostly about that :). And hopefully it will
result in some decisions being recorded :)

> --
> Bolke de Bruin
> bdbr...@gmail.com
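PS. To make the "produce a tuple that is directly serializable" point from
earlier in this mail a bit more concrete, this is roughly the kind of
conversion I mean at the DbApiHook level, as opposed to the serde-serializer
route sketched above. The helper name and its placement are made up for
illustration - it is not the actual provider code:

```python
from typing import Any, Iterable


def _rows_to_tuples(rows: Iterable[Any]) -> list[tuple]:
    """Bring driver-specific rows into the plain-tuple 'API' that the
    Common SQL operators (and XCom) expect from a DbApiHook."""
    normalized: list[tuple] = []
    for row in rows:
        if isinstance(row, dict):
            # e.g. Snowflake can return dictionaries - keep just the values,
            # in the column order reported by the driver
            normalized.append(tuple(row.values()))
        else:
            # e.g. pyodbc.Row walks and quacks like a tuple but is not one
            # (and is not serializable as-is) - copy it into a real tuple
            normalized.append(tuple(row))
    return normalized


# hypothetical usage inside a hook method:
#     return _rows_to_tuples(cursor.fetchall())
```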