Thanks for starting the discussion Andrey.

Some background on the choice for Pendulum at the time. In the early days
of Airflow it wasn't timezone aware. Originating from Airbnb which had a
reasonable mature data organization the view was everything needs to be in
UTC. According to Maxime the engineers would dream in UTC ;-). However, in
the real world which also needs to deal with legacy that didn't hold. Often
systems of record did not store timezone information but were localized
nevertheless. Cutoff times in banks happen in localized time and if you
want to meet those, Airflow needed to do better.

Doing timezones and being timezone aware proved to be exceptionally hard.
Many libraries get it wrong [1] and fail silently (i.e. Arrow) or apply DST
transitions wrongly (pytz). When dealing with payments that stuff cannot
happen. To make things worse, in Python timezone support is pretty
convoluted, while some standardization happened in 3.9 by using IANA
provided timezone information from the local system, its API is messy. For
example, we got recently bitten by datetime.tzname() (which is
supposed to 'time
zone name') returning short-hand notation timezones (e.g. PST) instead of
full timezone names (e.g. "Europe/Amsterdam") which makes deserialization
non deterministic.

So, what I am trying to say, is tread carefully when doing changes as
proposed in [2] (moving to zoneinfo seems to make sense though and will
also be in Pendulum 3). Make sure those changes are formally correct and
don't assume because they are now part of python itself (pytz was the
defacto standard for a long time). Pendulum has proven us in the past,
maybe we indeed should help the project if possible and if that isn't
possible verify formal correctness of any other library.

Bolke

[1] https://pendulum.eustace.io/faq/
[2] https://github.com/apache/airflow/issues/19450

On Thu, 28 Sept 2023 at 11:03, Andrey Anshin <andrey.ans...@taragol.is>
wrote:

> This discussion is more about the known problem of pendulum and how we
> could deal with it and maybe how we (as Community) might help autor.
>
> The library is mostly supported by a single author Sébastien Eustace (
> https://github.com/sdispater) and it seems like we bump into the situation
> which is described in xkcd #2347 (
> https://imgs.xkcd.com/comics/dependency.png). To be honest it is not
> something new when library mainly supported by one author so there is
> always a risk that the library will no longer be supported / abandoned
> And if takes in account that pendulum provides core functionality in
> Airflow it could have dramatical impact in the future.
>
> Pendulum is a really nice library which helps a lot of developers to work
> with dates/datetimes. However there is one major problem, the last release
> of this library happened more than 3 years ago (
> https://pypi.org/project/pendulum/#history) in the time when Airflow
> 1.10.11 was released
>
> Fortunately, the project is not abandoned and on a regular basis commits
> add into the master branch. However these commits are not included into any
> final release and that's why some things related to datetime don't work as
> expected in Airflow. There are list of known (for me) issues which are
> affect Airflow
>
> *Memory Leak on parse*:
> - https://github.com/sdispater/pendulum/issues/720, this one fixed  2
> years
> ago but not available yet (https://github.com/sdispater/pendulum/pull/563
> ).
> Since we use parse dates in airflow codebase: datetime parameters and
> datetime in logs this one could be a reason for memory leakage in Airflow:
> - https://github.com/apache/airflow/discussions/24694
> - https://github.com/apache/airflow/discussions/28597
>
> *Incorrect time zones*, known issues and should be already fixed in master
> branch
> - https://github.com/sdispater/pendulum/issues/700, Mexico do not use DST
> anymore
> - https://github.com/sdispater/pendulum/issues/706, Egypt reinstate DST
>
> We add clarification in https://github.com/apache/airflow/pull/30467,
> however it seems like there is no other way rather than patching Pendulum
> right now.
>
> All these issues should be solved as soon as pendulum 3 is released. The
> current announced estimation is end of september/ beginning of October:
> https://github.com/sdispater/pendulum/issues/600#issuecomment-1711299677
>
> So in theory we would have a fixed version of pendulum soon, and it might
> break something in Airflow but from my point of view it is better than
> current status.
>
> However there might be a situation where the release of the pendulum would
> be postponed, so maybe better to have a backup plan. What could we do in
> this case?
>
> Maybe we should start to use zoneinfo.ZoneInfo instead of pendulum
> datetime? https://github.com/apache/airflow/issues/19450
> Pros:
> - stdlib (python 3.9+)
> - In pendulum 3.0 Timezone based on zoneinfo.Zoneinfo
>
> Cons:
> - Current serialization model can't deal with backport packages. E.g.
> timezone which are serialized in backport_zoneinfo can't be deserialized in
> zoneinfo
>
> Maybe we should replace parse datetime with another solution. Does anyone
> know a good replacement?
>
> Maybe someone from Airflow Community could propose their help with
> maintenance of library:
> - https://github.com/sdispater/pendulum/issues/590
>
> Maybe we should get rid of the pendulum at all, as a last resort solution.
> I can't imagine how we could do that, because a lot of stuff depends on the
> pendulum and removing it would be a breaking change.
>
> ----
> Best Wishes
> *Andrey Anshin*
>


-- 

--
Bolke de Bruin
bdbr...@gmail.com

Reply via email to