We've got 56 votes (wow!). ExternalPythonOperator won with 41%, followed by PythonExternalenvOperator with 30% and PythonRunenvOperator with 26%.
I am fine with any of those. But - despite its slightly lower support - I think PythonExternalenvOperator better reflects the resemblance to PythonVirtualenvOperator, which I think is important.

Asking those who were very strong on ExternalPythonOperator - is PythonExternalenvOperator "good enough" for you as well? The poll allowed only one choice, but if it is an acceptable option for those who favoured "ExternalPythonOperator" - I personally have a slight preference for that one.

J.

On Wed, Aug 31, 2022 at 3:10 PM Jarek Potiuk <[email protected]> wrote:

> Just 5 hours left to change the world!
>
> You can become one of the people who influenced the decision on naming the
> new operator :D
>
> https://twitter.com/jarekpotiuk/status/1563602012100767746
>
> (Right, maybe changing the world just a little, but still)
>
> J.
>
> On Sat, Aug 27, 2022 at 9:01 PM Jarek Potiuk <[email protected]> wrote:
>
>> Seems we are only now at the stage that we need to choose the best name
>> for the operator
>>
>> I started a name poll on Twitter :)
>>
>> https://twitter.com/jarekpotiuk/status/1563602012100767746
>>
>> PR here: https://github.com/apache/airflow/pull/25780
>>
>> J.
>>
>> On Thu, Aug 18, 2022 at 1:53 AM Jarek Potiuk <[email protected]> wrote:
>>
>>> Draft PR - needs some more tests and review with typing changes - in
>>> https://github.com/apache/airflow/pull/25780
>>> Eventually PythonExternalOperator seems like a good name.
>>>
>>> J.
>>>
>>> On Wed, Aug 17, 2022 at 10:37 PM Jeambrun Pierre <[email protected]>
>>> wrote:
>>>
>>>> I also like the ability to use a specific interpreter.
>>>>
>>>> Maybe we could leave everything that is env related to the PVO (even
>>>> using an existing one) and let another one handle the interpreter.
>>>>
>>>> As Ash mentioned I also feel like an additional parameter
>>>> (python/interpreter etc.)
>>>> to the PO would make sense and is quite intuitive
>>>> rather than a complete new operator, but it might be harder to implement.
>>>>
>>>> Best
>>>> Pierre Jeambrun
>>>>
>>>> On Wed, Aug 17, 2022 at 20:46, Collin McNulty
>>>> <[email protected]> wrote:
>>>>
>>>>> I concur that this would be very useful. I can see a common pattern
>>>>> being to have a task to create an environment if it does not already exist
>>>>> and then subsequent tasks use that environment.
>>>>>
>>>>> On Wed, Aug 17, 2022 at 12:30 PM Jarek Potiuk <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Sounds like this is really in the middle between PVO and PO :).
>>>>>>
>>>>>> BTW. I spoke with a customer of mine today and they said they would
>>>>>> ABSOLUTELY love it. They were actually blocked from migrating to 2.3.3
>>>>>> because one of their teams needed a DBT environment while the other
>>>>>> team needed some other dependency and they are conflicting with each
>>>>>> other. They are using Nomad + Docker already and while extending the
>>>>>> image with another venv is super-easy for them, they were considering
>>>>>> building several Docker images to serve their users but it is an order
>>>>>> of magnitude more complex problem for them because they would have to
>>>>>> make a whole new pipeline to build and distribute multiple images and
>>>>>> implement a queue-based split between the teams or switch to using
>>>>>> DockerOperator.
>>>>>>
>>>>>> This one will allow them to do a limited version of multi-tenancy for
>>>>>> their teams - without the actual separation but with even more
>>>>>> fine-grained separation of envs - because they would be able to use
>>>>>> different deps even for different tasks in the same DAG.
>>>>>>
>>>>>> J.
>>>>>>
>>>>>> On Wed, Aug 17, 2022 at 6:21 PM Ash Berlin-Taylor <[email protected]>
>>>>>> wrote:
>>>>>> >
>>>>>> > Another option would be to change the PythonOperator/@task to take
>>>>>> > a `python` argument (which also does change the behaviour of _that_
>>>>>> > operator a lot with or without that argument if we did that.)
>>>>>> >
>>>>>> > On 17 August 2022 15:46:52 BST, Jarek Potiuk <[email protected]>
>>>>>> > wrote:
>>>>>> >>
>>>>>> >> Yeah. TP - I like that explicit separation. It's much cleaner. I
>>>>>> >> still have to think about the name though. While I see where
>>>>>> >> ExternalPythonOperator comes from, it sounds a bit less than
>>>>>> >> obvious. I think the name should somehow contain "Environment"
>>>>>> >> because very few people realise that running Python from a
>>>>>> >> virtualenv actually implicitly "activates" the venv.
>>>>>> >> I think maybe deprecating the old PythonVirtualenvOperator and
>>>>>> >> introducing two new operators: PythonInCreatedVirtualEnvOperator,
>>>>>> >> PythonInExistingVirtualEnvOperator? Not exactly those names - they
>>>>>> >> are too long - but something like that. Maybe we should get rid of
>>>>>> >> Python in the name altogether?
>>>>>> >>
>>>>>> >> BTW. I think we should generally do more of the discussions here
>>>>>> >> and express our thoughts about Airflow here. Even if there are no
>>>>>> >> answers or interest immediately, I think that it makes sense to do
>>>>>> >> a bit of a melting pot that sometimes might produce some cool (or
>>>>>> >> rather hot) stuff as a result.
>>>>>> >>
>>>>>> >> On Wed, Aug 17, 2022 at 8:45 AM Tzu-ping Chung <[email protected]> wrote:
>>>>>> >>>
>>>>>> >>> One thing I thought of (but never bothered to write about) is to
>>>>>> >>> introduce a separate operator instead, say ExternalPythonOperator
>>>>>> >>> (bike shedding on name is welcomed), that explicitly takes a path to
>>>>>> >>> the interpreter (say in a virtual environment) and just uses that to
>>>>>> >>> run the code. This also enables users to create a virtual environment
>>>>>> >>> upfront, but avoids needing to overload PythonVirtualenvOperator for
>>>>>> >>> the purpose. This also opens an extra use case that you can use any
>>>>>> >>> Python installation to run the code (say a custom-compiled
>>>>>> >>> interpreter), although nobody asked about that.
>>>>>> >>>
>>>>>> >>> TP
>>>>>> >>>
>>>>>> >>> On 13 Aug 2022, at 02:52, Jeambrun Pierre <[email protected]> wrote:
>>>>>> >>>
>>>>>> >>> I feel like this is a great alternative at the price of a very
>>>>>> >>> moderate effort. (I'd be glad to help with it).
>>>>>> >>>
>>>>>> >>> Mutually exclusive sounds good to me as well.
>>>>>> >>>
>>>>>> >>> Best,
>>>>>> >>> Pierre
>>>>>> >>>
>>>>>> >>> On Fri, Aug 12, 2022 at 15:23, Jarek Potiuk <[email protected]> wrote:
>>>>>> >>>>
>>>>>> >>>> Mutually exclusive. I think that has the nice property of forcing
>>>>>> >>>> people to prepare immutable venvs upfront.
>>>>>> >>>>
>>>>>> >>>> On Fri, Aug 12, 2022 at 3:15 PM Ash Berlin-Taylor <[email protected]> wrote:
>>>>>> >>>>>
>>>>>> >>>>> Yes, this has been on my background idea list for an age -- I'd
>>>>>> >>>>> love to see it happen!
>>>>>> >>>>>
>>>>>> >>>>> Have you thought about how it would behave when you specify an
>>>>>> >>>>> existing virtualenv and include requirements in the operator that
>>>>>> >>>>> are not already installed there? Or would they be mutually exclusive?
>>>>>> >>>>> (I don't mind either way, just wondering which way you are heading)
>>>>>> >>>>>
>>>>>> >>>>> -ash
>>>>>> >>>>>
>>>>>> >>>>> On Fri, Aug 12 2022 at 14:58:44 +02:00:00, Jarek Potiuk <[email protected]> wrote:
>>>>>> >>>>>
>>>>>> >>>>> Hello everyone,
>>>>>> >>>>>
>>>>>> >>>>> TL;DR: I propose to extend our PythonVirtualenvOperator with a
>>>>>> >>>>> "use existing venv" feature and make it a viable way of handling
>>>>>> >>>>> some multi-dependency sets using multiple pre-installed venvs.
>>>>>> >>>>>
>>>>>> >>>>> More context:
>>>>>> >>>>>
>>>>>> >>>>> I had this idea coming after a discussion in our Slack:
>>>>>> >>>>> https://apache-airflow.slack.com/archives/CCV3FV9KL/p1660233834355179
>>>>>> >>>>>
>>>>>> >>>>> My thoughts were - why don't we add support for "use existing
>>>>>> >>>>> venv" in PythonVirtualenvOperator as a first-class citizen?
>>>>>> >>>>>
>>>>>> >>>>> Currently (unless there are some tricks I am not aware of, or you
>>>>>> >>>>> extend PVO), the PVO will always attempt to create a virtualenv
>>>>>> >>>>> based on extra requirements. And while it gives the users a
>>>>>> >>>>> possibility of having some tasks use different dependencies, the
>>>>>> >>>>> drawback is that the venv is created dynamically when the task
>>>>>> >>>>> starts - potentially a lot of overhead for startup time and some
>>>>>> >>>>> unpleasant failure scenarios - like networking problems, PyPI or
>>>>>> >>>>> local repo not available, automated (and unnoticed) upgrade of
>>>>>> >>>>> dependencies.
>>>>>> >>>>>
>>>>>> >>>>> Those are basically the same problems that caused us to strongly
>>>>>> >>>>> discourage our users in our Helm Chart from using
>>>>>> >>>>> _PIP_ADDITIONAL_DEPENDENCIES in production and criticize the
>>>>>> >>>>> Community Helm Chart for dynamic dependency installation they
>>>>>> >>>>> promote as a "valid" approach. Yet our PVO currently does exactly
>>>>>> >>>>> this.
>>>>>> >>>>>
>>>>>> >>>>> We had some past discussions on how this can be improved - with
>>>>>> >>>>> caching, or using different images for different dependencies and
>>>>>> >>>>> similar - and we even have the
>>>>>> >>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-46+Runtime+isolation+for+airflow+tasks+and+dag+parsing
>>>>>> >>>>> proposal to use different images for different sets of requirements.
>>>>>> >>>>>
>>>>>> >>>>> Proposal:
>>>>>> >>>>>
>>>>>> >>>>> During the discussion yesterday I started to think that a simpler
>>>>>> >>>>> solution is possible - one that is rather simple for us to
>>>>>> >>>>> implement and for users to use.
>>>>>> >>>>>
>>>>>> >>>>> Why not have different venvs preinstalled and let the PVO choose
>>>>>> >>>>> the one that should be used?
>>>>>> >>>>>
>>>>>> >>>>> It does not invalidate AIP-46. AIP-46 serves a slightly different
>>>>>> >>>>> purpose, and some cases cannot be handled this way (when you need
>>>>>> >>>>> different "system level" dependencies, for example), but it might
>>>>>> >>>>> be much simpler from a deployment point of view and allow it to
>>>>>> >>>>> handle "multi-dependency sets" for Python libraries only, with
>>>>>> >>>>> minimal deployment overhead (which AIP-46 necessarily has). And I
>>>>>> >>>>> think it will be enough for a vast number of the
>>>>>> >>>>> "multi-dependency-sets" cases.
>>>>>> >>>>>
>>>>>> >>>>> Why don't we allow the users to prepare those venvs upfront and
>>>>>> >>>>> simply enable PVO to use them rather than create them dynamically?
>>>>>> >>>>>
>>>>>> >>>>> Advantages:
>>>>>> >>>>>
>>>>>> >>>>> * it nicely handles cases where some of your tasks need a
>>>>>> >>>>> different set of dependencies than others (for execution, not
>>>>>> >>>>> necessarily parsing, at least initially).
>>>>>> >>>>>
>>>>>> >>>>> * no startup time overhead, unlike with the current PVO
>>>>>> >>>>>
>>>>>> >>>>> * possible to run in both cases - "venv installation" and "docker
>>>>>> >>>>> image" installation
>>>>>> >>>>>
>>>>>> >>>>> * it has finer granularity level than AIP-46 - unlike in AIP-46
>>>>>> >>>>> you could use different sets of dependencies
>>>>>> >>>>>
>>>>>> >>>>> * very easy to pull off for the users without modifying their
>>>>>> >>>>> deployments. For a local venv, you just create the venvs. For the
>>>>>> >>>>> Docker image case, your custom image needs to add several lines
>>>>>> >>>>> similar to:
>>>>>> >>>>>
>>>>>> >>>>> RUN python -m venv --system-site-packages /opt/venv1 && /opt/venv1/bin/pip install PACKAGE1==NN PACKAGE2==NN
>>>>>> >>>>> RUN python -m venv --system-site-packages /opt/venv2 && /opt/venv2/bin/pip install PACKAGE1==NN PACKAGE2==NN
>>>>>> >>>>>
>>>>>> >>>>> and PythonVirtualenvOperator should have an extra
>>>>>> >>>>> "use_existing_venv=/opt/venv2" parameter
>>>>>> >>>>>
>>>>>> >>>>> * we only need to manage ONE image (!) even if you have multiple
>>>>>> >>>>> sets of dependencies (this has the advantage that it is actually
>>>>>> >>>>> LOWER overhead than having separate images for each env when it
>>>>>> >>>>> comes to various resource overheads: the same workers could handle
>>>>>> >>>>> multiple dependency sets, for example, the same image is reused by
>>>>>> >>>>> multiple Pods in K8S, etc.).
>>>>>> >>>>>
>>>>>> >>>>> * later (when AIP-43 (separate dag processor with ability to use
>>>>>> >>>>> different processors for different subdirectories) is completed
>>>>>> >>>>> and AIP-46 is approved/implemented), we could also extend DAG
>>>>>> >>>>> Parsing to be able to use those predefined venvs for parsing. That
>>>>>> >>>>> would eliminate the need for local imports and add support to even
>>>>>> >>>>> use different sets of libraries in top-level code (per DAG, not
>>>>>> >>>>> per task). It would not solve different "system" level
>>>>>> >>>>> dependencies - and for that AIP-46 is still a very valid case.
>>>>>> >>>>>
>>>>>> >>>>> Disadvantages:
>>>>>> >>>>>
>>>>>> >>>>> I thought very hard about this one and I actually could not find
>>>>>> >>>>> any disadvantages :)
>>>>>> >>>>>
>>>>>> >>>>> It's simple to implement, use and explain, it can be implemented
>>>>>> >>>>> very quickly (like - in a few hours with tests and documentation I
>>>>>> >>>>> think) and performance-wise it is better than any other solution
>>>>>> >>>>> (including AIP-46) provided that the case is limited to different
>>>>>> >>>>> Python dependencies.
>>>>>> >>>>>
>>>>>> >>>>> But possibly there are things that I missed. It all looks too
>>>>>> >>>>> good to be true, and I wonder why we do not have it already today -
>>>>>> >>>>> once I thought about it, it seems very obvious. So I probably
>>>>>> >>>>> missed something.
>>>>>> >>>>>
>>>>>> >>>>> WDYT?
>>>>>> >>>>>
>>>>>> >>>>> J.
>>>>>> >>>>>
>>>>>> >>>
>>>>>>
>>>>> --
>>>>>
>>>>> Collin McNulty
>>>>> Lead Airflow Engineer
>>>>>
>>>>> Email: [email protected] <[email protected]>
>>>>> Time zone: US Central (CST UTC-6 / CDT UTC-5)
>>>>>
>>>>> <https://www.astronomer.io/>
>>>>>
>>>>
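For readers following the thread, the core idea (run a task's code with the interpreter of a venv that was prepared upfront, instead of creating a venv at task start) can be sketched with the standard library alone. This is only an illustration and not the code from the PR: the names `run_in_interpreter`, `venv_python` and `TASK_SOURCE` are hypothetical, and passing arguments and results as JSON is an assumption made to keep the example small.

```python
import json
import subprocess
import sys
import textwrap
from pathlib import Path


def venv_python(env_dir: str) -> str:
    """Return the interpreter path inside a venv (the layout differs on Windows)."""
    sub = "Scripts/python.exe" if sys.platform == "win32" else "bin/python"
    return str(Path(env_dir).resolve() / sub)


def run_in_interpreter(python: str, source: str, func_name: str, *args):
    """Run func_name(*args) from `source` using the interpreter at `python`.

    Arguments and the return value cross the process boundary as JSON,
    so both must be JSON-serialisable (an assumption of this sketch).
    """
    if not Path(python).is_absolute():
        raise ValueError("expected an absolute path to a Python interpreter")
    # Append a small driver that reads the arguments and prints the result.
    driver = textwrap.dedent(source) + textwrap.dedent(
        f"""
        import json, sys
        _args = json.loads(sys.argv[1])
        print(json.dumps({func_name}(*_args)))
        """
    )
    completed = subprocess.run(
        [python, "-c", driver, json.dumps(args)],
        check=True,
        capture_output=True,
        text=True,
    )
    # The last stdout line carries the result; earlier lines are the task's own output.
    return json.loads(completed.stdout.strip().splitlines()[-1])


# Hypothetical task body; imports live inside the function because it runs
# in the other interpreter, with that environment's packages.
TASK_SOURCE = """
def which_env(x):
    import sys
    return {"prefix": sys.prefix, "doubled": x * 2}
"""

# Example call, assuming a venv was pre-built at /opt/venv2 as in the proposal:
# run_in_interpreter(venv_python("/opt/venv2"), TASK_SOURCE, "which_env", 21)
```

Nothing is installed at run time here, which is the property the proposal is after: the environment is immutable and the only per-task cost is starting a subprocess.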
