Hey there,

Also separating this out as another topic.
> > It is a choice that needs to be made at the API level - here Common.sql.
> > In this case the particular form of the tuple was, somewhat implicitly,
> > part of the spec.

Yep. I rectified the implicitness by storing this as an explicit decision
record in the ADR proposal. I think it has been - similarly to the lack of
serialization decision records - an omission that resulted in this somewhat
heated debate :D

> The real question is, when you find something unserializable, do you then
> provide a custom serializer for serde or do you convert your object to
> something that can be serialized by default. The first one will make it
> available to all other places where this object needs to be serialized.
> The latter limits that scope to where the conversion happens - here
> common.sql. So to provide an example in the context of our discussion of
> PyODBC.Row (see the discussion of the linked PR), only common.sql can now
> deal with PyODBC.Row. If for some reason another provider returns a
> PyODBC.Row, that object remains unserializable.

Yes. That's fine for me. I think we do not always have to serialize things
the same way. Additionally, serializing something with serde uses a custom
format that requires a deserialization step before it is usable. The
DbApiHook response can be used "as is", as a tuple, by any other operator
that builds on Common SQL.

> I have explained what the downside is of the approach you took, although
> its impact in this case is probably low (not many other places will
> return PyODBC.Row, but consider DataFrame hypothetically).

Yes. For DataFrame I am fine. That is a "common" format worth having a
serializer for, because it can be produced and consumed by many parties.
That is quite unlikely to happen for ODBC, and DbApiHook simply has its own
"standard" - the tuple - which is the language DbApiHook speaks to its
users. In a way, returning PyODBC.Row violates assumptions made a year ago
(now documented in the ADR proposal) - that the tuple is the "API" of the
DB Hook. So while "make_serializable" was likely a wrong name, its goal was
to "produce a tuple that is directly serializable" (I sketch roughly the
kind of conversion I mean in a PS at the end of this mail). Just to give
another example - we do the same with Snowflake, which in some cases
returns a dictionary instead of tuples of values. In that case we cannot
rely on the Snowflake dictionary - we expect tuples as output, so we MUST
convert (and we cannot use serde here). Here it is mostly accidental that
PyODBC.Row also conforms to tuple semantics, though not fully. So,
similarly to the Snowflake dictionary, we need to bring it to the semantics
that the users of the DbApiHook (the Common SQL operators) will understand
and expect. That's all. Being serializable is one of those properties, and
what we've done is the same thing we do for other DbApiHook
implementations.

> I think you needed a serializable tuple so that it could go into XCom.
> This makes the need for an explicit choice imho.

Yep. That's the explicit choice we made.

> On speed. Afaik AIP-44 still makes use of the DAG Serialization code
> apart from Pydantic's rust implementation. The DAG serialization code is
> seriously slow. So, before claiming speed improvements here show the
> numbers please :-)

Yep. When we first added it we used Pydantic V1, and back then it did not
matter. But we do have it going through our serialization code - because at
some point you insisted on making it part of "a serialization" - even
though we have absolutely no need for that in AIP-44, and I think when we
go to prime time we will just skip it entirely.
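To sketch what skipping it and relying on Pydantic directly would look
like, here is a minimal, self-contained example with made-up models
(assuming Pydantic v2 and SQLAlchemy on the ORM side - this is not the
actual AIP-44 code):

```python
# Hypothetical ORM model and its Pydantic v2 counterpart - no custom
# serialization layer involved, just Pydantic's (Rust-backed) validation.
from pydantic import BaseModel, ConfigDict
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class TaskRow(Base):  # made-up ORM model standing in for a real one
    __tablename__ = "task_row"
    id = Column(Integer, primary_key=True)
    task_id = Column(String)
    state = Column(String)


class TaskRowPydantic(BaseModel):  # made-up pure-Python model
    # from_attributes lets Pydantic read attributes straight off the ORM object
    model_config = ConfigDict(from_attributes=True)
    id: int
    task_id: str
    state: str


orm_obj = TaskRow(id=1, task_id="extract", state="success")
model = TaskRowPydantic.model_validate(orm_obj)          # ORM -> Pydantic
payload = model.model_dump_json()                        # what goes over the wire
restored = TaskRowPydantic.model_validate_json(payload)  # Pydantic on the other side
assert restored == model
```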
Going through either of our serializations here is just unneeded overhead -
we can rely purely on the Rust core and get the ORM object on one side and
the Pydantic object on the other without touching any of our serialization
code. It just works out of the box, without any of our "logic" to maintain,
except explicitly creating the pure-Python models we are going to use
(which are verified via MyPy). More benchmarking and reasoning on the
Pydantic v2 speed improvements are here:
https://thedataquarry.com/posts/why-pydantic-v2-matters/ - and while it is
not the 16x I claimed (I mixed it up with the 16x improvement in emulation
of ARM processors), 5x is still very impressive if you ask me.

> We can of course keep both. However if you ever looked at the DAG
> serialization code you might want to reconsider ;-).

Yep. I do not feel strongly about it. I do not know that code intimately
either :).

> Alternatively, you could build on serde and provide the guarantees that
> you require there. It can be done. I have been able to serialize a DAG
> with serde (hence knowing the 1:10 ratio), but until now haven't added
> it, because its schema looks different.

Yeah, I am fine with that if we can pull it off with backwards
compatibility in place.

> BTW: There is no need for the XCom serializer / serde to deserialize into
> the original object. By convention we like to of course, but it is not
> required.

I understand, but the versioning feature, for example, assumes so, and it
adds complexity if the serialized form is to be used directly. From what I
understand (and what some of the serde tests verify), the API contract of
serde is that you do in fact get the same, comparable object back when you
do the serialize/deserialize roundtrip. And really, the only way to make
use of the serialized object is to deserialize it first: because of the
transparency of serde - it hides its details and design - you are not
supposed to care about or understand the "serialized" form, so you
effectively MUST deserialize to use it. So far - also looking at the tests
- I was under the impression that returning an equivalent object from the
roundtrip was an expected property of serde. But maybe I was mistaken here?
One more reason to have decision records captured somewhere, if that was
not the intention.

> For the record: the XCom serializer lazy loads its dependencies. So it
> does not make core Airflow dependent on client libraries arguably.

But it means that if the same libraries are used both in providers and in
core, we have to declare the dependencies in two places - once for the
provider extra and once for the serde "extension" extra. You cannot reuse
the provider dependencies if you want core to use the same client library,
for two reasons:

1) they might depend on different versions,
2) you would have to specify the provider extra and install the provider
   (which you might not need in this case).

And case 1) can lead to the situation where core and the provider depend on
different, conflicting versions of the libraries, and then we are f*d -
you might not be able to install a given version of the provider together
with a given version of core. If we move the serialization code to the
provider, we have no such problem: it always depends on the versions that
the provider requires.

> The con is that the API becomes strict. We cannot change it easily
> anymore. Of course we can extend that.

But I believe you wanted to make the ODBC provider depend on that very same
API already, didn't you? It seems I see a contradiction here.
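Just to spell out what that dependency would mean in practice, below is a
very rough sketch of a provider-shipped serde serializer for PyODBC.Row.
The module layout (the serializers/deserializers lists, __version__, and
the serialize/deserialize signatures) follows my reading of how the
built-in serde serializers are structured - treat the names and exact
signatures as my assumption, not as a settled API:

```python
# Hypothetical provider module that serde could discover - not actual code.
from __future__ import annotations

from typing import Any

serializers = ["pyodbc.Row"]   # class(es) this module claims to handle
deserializers = serializers
__version__ = 1


def serialize(o: Any) -> tuple[Any, str, int, bool]:
    # pyodbc.Row iterates like a tuple, so a plain list is enough as payload
    return list(o), "pyodbc.Row", __version__, True


def deserialize(classname: str, version: int, data: Any) -> Any:
    if classname != "pyodbc.Row" or version > __version__:
        raise TypeError(f"cannot deserialize {classname} (version {version})")
    # a real pyodbc.Row cannot be rebuilt without a cursor, so the closest
    # equivalent we can return is a plain tuple with the same values
    return tuple(data)
```

Note that even this route would hand back a plain tuple on
deserialization, which is exactly the tuple semantics discussed above.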
It looks like it is ok for the ODBC provider to depend on serde and
implement its serialization code there, but on the other hand it is not ok
for (say) the DeltaLake serializer to do the same and become part of the
Databricks provider? Or maybe I missed something? Maybe your idea was to
make the ODBC serializer part of Airflow 2.8 and add "apache-airflow >=
2.8.0" as a requirement for the ODBC provider? That's what I call
unnecessary coupling.

> > Maybe that's a bit of a different question for Pandas - as we have no
> > separate Pandas provider and we use and link to it already from Airflow
> > core. But... maybe it's also a sign we need a pandas provider in this
> > case? This question for me is open.
>
> To me this depends on what we want to do with data awareness (separate
> discussion). If you think:
>
> ObjectStoragePath : Table : Dataframe
>
> it is useful to have dataframe as part of core. This could be pandas or
> maybe polars which is so much faster :-).

Yeah - I am not too strong on that, and I believe there are benefits to
having both Pandas and Polars. Especially if we can make sure it is very
loosely coupled (using public APIs only, and being very clear about
SemVer/versioning for those, to stay as open as possible), with pandas and
polars as optional extras, I am quite fine with it. I think we do not need
a provider in this case.

> Case-by-case is fine to me as long as the choice is explicit and thus
> options are considered.

Yep. I think this discussion is mostly about that :). And hopefully it will
result in some decisions being recorded :)

> --
> Bolke de Bruin
> bdbr...@gmail.com
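PS. To make the "produce a tuple that is directly serializable" point from
earlier in this mail a bit more concrete, this is roughly the kind of
conversion I mean at the DbApiHook level, as opposed to the serde-serializer
route sketched above. The helper name and its placement are made up for
illustration - it is not the actual provider code:

```python
from typing import Any, Iterable


def _rows_to_tuples(rows: Iterable[Any]) -> list[tuple]:
    """Bring driver-specific rows into the plain-tuple 'API' that the
    Common SQL operators (and XCom) expect from a DbApiHook."""
    normalized: list[tuple] = []
    for row in rows:
        if isinstance(row, dict):
            # e.g. Snowflake can return dictionaries - keep just the values,
            # in the column order reported by the driver
            normalized.append(tuple(row.values()))
        else:
            # e.g. pyodbc.Row walks and quacks like a tuple but is not one
            # (and is not serializable as-is) - copy it into a real tuple
            normalized.append(tuple(row))
    return normalized


# hypothetical usage inside a hook method:
#     return _rows_to_tuples(cursor.fetchall())
```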