Just 5 hours left to change the world! You can become one of the people who influenced the decision on naming the new operator :D
https://twitter.com/jarekpotiuk/status/1563602012100767746 (Right, maybe changing the world just a little, but still) J. On Sat, Aug 27, 2022 at 9:01 PM Jarek Potiuk <[email protected]> wrote: > Seems we are only now at the stage that we need to choose the best name > for the operator > > I started a name poll on Twitter :) > > https://twitter.com/jarekpotiuk/status/1563602012100767746 > > PR here: https://github.com/apache/airflow/pull/25780 > > J. > > > > On Thu, Aug 18, 2022 at 1:53 AM Jarek Potiuk <[email protected]> wrote: > >> Draft PR - needs some more tests and review with typing changes - in >> https://github.com/apache/airflow/pull/25780 >> Eventually PythonExternalOperator seems like a good name. >> >> J. >> >> >> On Wed, Aug 17, 2022 at 10:37 PM Jeambrun Pierre <[email protected]> >> wrote: >> >>> I also like the ability to use a specific interpreter. >>> >>> Maybe we could leave everything that is env related to the PVO (even >>> using an existing one) and let another one handle the interpreter. >>> >>> As Ash mentioned I also feel like an additional parameter >>> (python/interpreter etc.) to the PO would make sense and is quite intuitive >>> rather than a complete new operator, but it might be harder to implement. >>> >>> Best >>> Pierre Jeambrun >>> >>> Le mer. 17 août 2022 à 20:46, Collin McNulty >>> <[email protected]> a écrit : >>> >>>> I concur that this would be very useful. I can see a common pattern >>>> being to have a task to create an environment if it does not already exist >>>> and then subsequent tasks use that environment. >>>> >>>> On Wed, Aug 17, 2022 at 12:30 PM Jarek Potiuk <[email protected]> wrote: >>>> >>>>> Sounds like this is really in the middle between PVO and PO :). >>>>> >>>>> BTW. I spoke with a customer of mine today and they said they would >>>>> ABSOLUTELY love it. They were actually blocked from migrating to 2.3.3 >>>>> because one of their teams needed a DBT environment while the other >>>>> team needed some other dependency and they are conflicting with each >>>>> other. They are using Nomad + Docker already and while extending the >>>>> image with another venv is super-easy for them, they were considering >>>>> building several Docker images to serve their users but it is an order >>>>> of magnitude more complex problem for them because they would have to >>>>> make a whole new pipeline to build a distribute multiple images and >>>>> implements queue-base split between the teams or switch to using >>>>> DockerOperator. >>>>> >>>>> This one will allow them to do limited version of multi-tenancy for >>>>> their teams - without the actual separation but with even more >>>>> fine-grained separation of envs - because they would be able to use >>>>> different deps even for different tasks in the same DAG. >>>>> >>>>> >>>>> J, >>>>> >>>>> On Wed, Aug 17, 2022 at 6:21 PM Ash Berlin-Taylor <[email protected]> >>>>> wrote: >>>>> > >>>>> > Another option would be to change the PythonOperator/@task to take a >>>>> `python` argument (which also does change the behaviour of _that_ operator >>>>> a lot with or without that argument if we did that.) >>>>> > >>>>> > On 17 August 2022 15:46:52 BST, Jarek Potiuk <[email protected]> >>>>> wrote: >>>>> >> >>>>> >> Yeah. TP - I like that explicit separation. It's much cleaner. I >>>>> still >>>>> >> have to think about the name though. While I see where >>>>> >> ExternalPythonOperator comes from, It sounds a bit less than >>>>> obvious. >>>>> >> I think the name should somehow contain "Environment" because very >>>>> few >>>>> >> people realise that running Python from a virtualenv actually >>>>> >> implicitly "activates" the venv. >>>>> >> I think maybe deprecating the old PythonVirtualenvOperator and >>>>> >> introducing two new operators: PythonInCreatedVirtualEnvOperator, >>>>> >> PythonInExistingVirtualEnvOperator ? Not exactly those names - they >>>>> >> are too long - but something like that. Maybe we should get rid of >>>>> >> Python in the name at all ? >>>>> >> >>>>> >> BTW. I think we should generally do more of the discussions here and >>>>> >> express our thoughts about Airflow here. Even if there are no >>>>> answers >>>>> >> or interest immediately, I think that it makes sense to do a bit of >>>>> a >>>>> >> melting pot that sometimes might produce some cool (or rather hot) >>>>> >> stuff as a result. >>>>> >> >>>>> >> On Wed, Aug 17, 2022 at 8:45 AM Tzu-ping Chung >>>>> <[email protected]> wrote: >>>>> >>> >>>>> >>> >>>>> >>> One thing I thought of (but never bothered to write about) is to >>>>> introduce a separate operator instead, say ExternalPythonOperator (bike >>>>> shedding on name is welcomed), that explicitly takes a path to the >>>>> interpreter (say in a virtual environment) and just use that to run the >>>>> code. This also enables users to create a virtual environment upfront, but >>>>> avoids needing to overload PythonVirtualenvOperator for the purpose. This >>>>> also opens an extra use case that you can use any Python installation to >>>>> run the code (say a custom-compiled interpreter), although nobody asked >>>>> about that. >>>>> >>> >>>>> >>> TP >>>>> >>> >>>>> >>> >>>>> >>> On 13 Aug 2022, at 02:52, Jeambrun Pierre <[email protected]> >>>>> wrote: >>>>> >>> >>>>> >>> I feel like this is a great alternative at the price of a very >>>>> moderate effort. (I'd be glad to help with it). >>>>> >>> >>>>> >>> Mutually exclusive sounds good to me as well. >>>>> >>> >>>>> >>> Best, >>>>> >>> Pierre >>>>> >>> >>>>> >>> Le ven. 12 août 2022 à 15:23, Jarek Potiuk <[email protected]> a >>>>> écrit : >>>>> >>>> >>>>> >>>> >>>>> >>>> Mutually exclusive. I think that has the nice property of >>>>> forcing people to prepare immutable venvs upfront. >>>>> >>>> >>>>> >>>> On Fri, Aug 12, 2022 at 3:15 PM Ash Berlin-Taylor < >>>>> [email protected]> wrote: >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> Yes, this has been on my background idea list for an age -- I'd >>>>> love to see it happen! >>>>> >>>>> >>>>> >>>>> Have you thought about how it would behave when you specify an >>>>> existing virtualenv and include requirements in the operator that are not >>>>> already installed there? Or would they be mutually exclusive? (I don't >>>>> mind >>>>> either way, just wondering which way you are heading) >>>>> >>>>> >>>>> >>>>> -ash >>>>> >>>>> >>>>> >>>>> On Fri, Aug 12 2022 at 14:58:44 +02:00:00, Jarek Potiuk < >>>>> [email protected]> wrote: >>>>> >>>>> >>>>> >>>>> Hello everyone, >>>>> >>>>> >>>>> >>>>> TL;DR; I propose to extend our PythonVirtualenvOperator with >>>>> "use existing venv" feature and make it a viable way of handling some >>>>> multi-dependency sets using multiple pre-installed venvs. >>>>> >>>>> >>>>> >>>>> More context: >>>>> >>>>> >>>>> >>>>> I had this idea coming after a discussion in our Slack: >>>>> https://apache-airflow.slack.com/archives/CCV3FV9KL/p1660233834355179 >>>>> >>>>> >>>>> >>>>> My thoughts were - why don't we add support for "use existing >>>>> venv" in PythonVirtualenvOperator as first-class-citizen ? >>>>> >>>>> >>>>> >>>>> Currently (unless there are some tricks I am not aware of) or >>>>> extend PVO, the PVO will always attempt to create a virtualenv based on >>>>> extra requirements. And while it gives the users a possibility of having >>>>> some tasks use different dependencies, the drawback is that the venv is >>>>> created dynamically when tasks starts - potentially a lot of overhead for >>>>> startup time and some unpleasant failure scenarios - like networking >>>>> problems, PyPI or local repoi not available, automated (and unnoticed) >>>>> upgrade of dependencies. >>>>> >>>>> >>>>> >>>>> Those are basically the same problems that caused us to >>>>> strongly discourage our users in our Helm Chart to use >>>>> _PIP_ADDITIONAL_DEPENDENCIES in production and criticize the Community >>>>> Helm Chart for dynamic dependency installation they promote as a "valid" >>>>> approach. Yet our PVO currently does exactly this. >>>>> >>>>> >>>>> >>>>> We had some past discussions how this can be improved - with >>>>> caching, or using different images for different dependencies and similar >>>>> - >>>>> and even we have >>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-46+Runtime+isolation+for+airflow+tasks+and+dag+parsing >>>>> proposal to use different images for different sets of requirements. >>>>> >>>>> >>>>> >>>>> Proposal: >>>>> >>>>> >>>>> >>>>> During the discussion yesterday I started to think a simpler >>>>> solution is possible and rather simple to implement by us and for users to >>>>> use. >>>>> >>>>> >>>>> >>>>> Why not have different venvs preinstalled and let the PVO >>>>> choose the one that should be used? >>>>> >>>>> >>>>> >>>>> It does not invalidate AIP-46. AIP-46 serves a bit different >>>>> purpose and some cases cannot be handled this way - when you need >>>>> different >>>>> "system level" dependencies for example) but it might be much simpler from >>>>> deployment point of view and allow it to handle "multi-dependency sets" >>>>> for >>>>> Python libraries only with minimal deployment overhead (which AIP-46 >>>>> necessarily has). And I think it will be enough for a vast number of the >>>>> "multi-dependency-sets" cases. >>>>> >>>>> >>>>> >>>>> Why don't we allow the users to prepare those venvs upfront and >>>>> simply enable PVE to use them rather than create them dynamically ? >>>>> >>>>> >>>>> >>>>> Advantages: >>>>> >>>>> >>>>> >>>>> * it nicely handles cases where some of your tasks need a >>>>> different set of dependencies than others (for execution, not necessarily >>>>> parsing at least initially). >>>>> >>>>> >>>>> >>>>> * no startup time overhead needed as with current PVO >>>>> >>>>> >>>>> >>>>> * possible to run in both cases - "venv installation" and >>>>> "docker image" installation >>>>> >>>>> >>>>> >>>>> * it has finer granularity level than AIP-46 - unlike in AIP-46 >>>>> you could use different sets of dependencies >>>>> >>>>> >>>>> >>>>> * very easy to pull off for the users without modifying their >>>>> deployments,For local venv, you just create the venvs, For Docker image >>>>> case, your custom image needs to add several lines similar to: >>>>> >>>>> >>>>> >>>>> RUN python -m venv --system-site-packages PACKAGE1==NN >>>>> PACKAGE2==NN /opt/venv1 >>>>> >>>>> RUN python -m venv --system-site-packages PACKAGE1==NN >>>>> PACKAGE2==NN /opt/venv2 >>>>> >>>>> >>>>> >>>>> and PythonVenvOperator should have extra >>>>> "use_existing_venv=/opt/venv2") parameter >>>>> >>>>> >>>>> >>>>> * we only need to manage ONE image (!) even if you have >>>>> multiple sets of dependencies (this has the advantage that it is actually >>>>> LOWER overhead than having separate images for each env -when it comes to >>>>> various resources overhead (same workers could handle multiple dependency >>>>> sets for examples, same image is reused by multiple PODs in K8S etc. ). >>>>> >>>>> >>>>> >>>>> * later (when AIP-43 (separate dag processor with ability to >>>>> use different processors for different subdirectories) is completed and >>>>> AIP-46 is approved/implemented, we could also extend DAG Parsing to be >>>>> able >>>>> to use those predefined venvs for parsing. That would eliminate the need >>>>> for local imports and add support to even use different sets of libraries >>>>> in top-level code (per DAG, not per task). It would not solve different >>>>> "system" level dependencies - and for that AiP-46 is still a very valid >>>>> case. >>>>> >>>>> >>>>> >>>>> Disadvantages: >>>>> >>>>> >>>>> >>>>> I thought very hard about this one and I actually could not >>>>> find any disadvantages :) >>>>> >>>>> >>>>> >>>>> It's simple to implement, use and explain, it can be >>>>> implemented very quickly (like - in a few hours with tests and >>>>> documentation I think) and performance-wise it is better for any other >>>>> solution (including AIP-46) providing that the case is limited to >>>>> different >>>>> Python dependencies. >>>>> >>>>> >>>>> >>>>> But possibly there are things that I missed. It all looks too >>>>> good to be true, and I wonder why we do not have it already today - once I >>>>> thought about it, it seems very obvious. So I probably missed something. >>>>> >>>>> >>>>> >>>>> WDYT? >>>>> >>>>> >>>>> >>>>> J. >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>> >>>>> >>>> -- >>>> >>>> Collin McNulty >>>> Lead Airflow Engineer >>>> >>>> Email: [email protected] <[email protected]> >>>> Time zone: US Central (CST UTC-6 / CDT UTC-5) >>>> >>>> >>>> <https://www.astronomer.io/> >>>> >>>
