Fine :). Let it be then :)

On Thu, Sep 1, 2022 at 6:40 AM Abhishek Bhakat <[email protected]> wrote:
> I would like to vote for ExternalPythonOperator, because a virtualenv usually has symbolic links to the Python binaries unless --copies is used to make it fully portable.
> Additionally, there is the option to use a differently compiled Python altogether (for example PyPy <https://www.pypy.org/index.html> or Jython <https://www.jython.org/>). Naming these "External Pythons" makes more sense to me.
>
> Thanks,
> Abhishek
>
> On 31-Aug-2022 at 9:30:42 PM, Ash Berlin-Taylor <[email protected]> wrote:
>
>> Personally, of those two I greatly prefer ExternalPythonOperator. (I didn't vote for either of those)
>>
>> (Also I think PythonExternalEnvOperator would be the "correct" casing: Virtualenv is a thing in Python, Externalenv isn't.)
>>
>> -ash
>>
>> On 31 August 2022 21:28:20 BST, Jarek Potiuk <[email protected]> wrote:
>>>
>>> We've got 56 votes (wow!)
>>>
>>> ExternalPythonOperator won. It got 41%, followed by PythonExternalenvOperator with 30% and PythonRunenvOperator with 26%.
>>>
>>> I am fine with either of those. But - despite slightly lower support - I think PythonExternalenvOperator better reflects the resemblance to PythonVirtualenvOperator, which I think is important.
>>>
>>> Asking those who were very strong on ExternalPythonOperator - is PythonExternalenvOperator "good enough" for you as well?
>>>
>>> The poll had only one option to choose from, but if that is an acceptable option for those who favoured "ExternalPythonOperator" - I personally have a slight preference for that one.
>>>
>>> J.
>>>
>>> On Wed, Aug 31, 2022 at 3:10 PM Jarek Potiuk <[email protected]> wrote:
>>>
>>>> Just 5 hours left to change the world!
>>>>
>>>> You can become one of the people who influenced the decision on naming the new operator :D
>>>>
>>>> https://twitter.com/jarekpotiuk/status/1563602012100767746
>>>>
>>>> (Right, maybe changing the world just a little, but still)
>>>>
>>>> J.
>>>>
>>>> On Sat, Aug 27, 2022 at 9:01 PM Jarek Potiuk <[email protected]> wrote:
>>>>
>>>>> Seems we are now at the stage where we only need to choose the best name for the operator
>>>>>
>>>>> I started a name poll on Twitter :)
>>>>>
>>>>> https://twitter.com/jarekpotiuk/status/1563602012100767746
>>>>>
>>>>> PR here: https://github.com/apache/airflow/pull/25780
>>>>>
>>>>> J.
>>>>>
>>>>> On Thu, Aug 18, 2022 at 1:53 AM Jarek Potiuk <[email protected]> wrote:
>>>>>
>>>>>> Draft PR - needs some more tests and review with typing changes - in https://github.com/apache/airflow/pull/25780
>>>>>> Eventually PythonExternalOperator seems like a good name.
>>>>>>
>>>>>> J.
>>>>>>
>>>>>> On Wed, Aug 17, 2022 at 10:37 PM Jeambrun Pierre <[email protected]> wrote:
>>>>>>
>>>>>>> I also like the ability to use a specific interpreter.
>>>>>>>
>>>>>>> Maybe we could leave everything that is env related to the PVO (even using an existing one) and let another one handle the interpreter.
>>>>>>>
>>>>>>> As Ash mentioned, I also feel like an additional parameter (python/interpreter etc.) to the PO would make sense and is more intuitive than a completely new operator, but it might be harder to implement.
>>>>>>>
>>>>>>> Best
>>>>>>> Pierre Jeambrun
>>>>>>>
>>>>>>> On Wed, Aug 17, 2022 at 8:46 PM Collin McNulty <[email protected]> wrote:
>>>>>>>
>>>>>>>> I concur that this would be very useful. I can see a common pattern being to have a task to create an environment if it does not already exist and then subsequent tasks use that environment.
>>>>>>>>
>>>>>>>> On Wed, Aug 17, 2022 at 12:30 PM Jarek Potiuk <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Sounds like this is really in the middle between PVO and PO :).
>>>>>>>>>
>>>>>>>>> BTW. I spoke with a customer of mine today and they said they would ABSOLUTELY love it.
>>>>>>>>> They were actually blocked from migrating to 2.3.3 because one of their teams needed a DBT environment while the other team needed some other dependency, and the two conflict with each other. They are using Nomad + Docker already, and while extending the image with another venv is super-easy for them, they were considering building several Docker images to serve their users. That is an order of magnitude more complex problem for them, because they would have to make a whole new pipeline to build and distribute multiple images and implement a queue-based split between the teams, or switch to using DockerOperator.
>>>>>>>>>
>>>>>>>>> This one will allow them to do a limited version of multi-tenancy for their teams - without the actual separation but with even more fine-grained separation of envs - because they would be able to use different deps even for different tasks in the same DAG.
>>>>>>>>>
>>>>>>>>> J,
>>>>>>>>>
>>>>>>>>> On Wed, Aug 17, 2022 at 6:21 PM Ash Berlin-Taylor <[email protected]> wrote:
>>>>>>>>> >
>>>>>>>>> > Another option would be to change the PythonOperator/@task to take a `python` argument (which also does change the behaviour of _that_ operator a lot with or without that argument if we did that.)
>>>>>>>>> >
>>>>>>>>> > On 17 August 2022 15:46:52 BST, Jarek Potiuk <[email protected]> wrote:
>>>>>>>>> >>
>>>>>>>>> >> Yeah. TP - I like that explicit separation. It's much cleaner. I still have to think about the name though. While I see where ExternalPythonOperator comes from, it sounds a bit less than obvious.
>>>>>>>>> >> I think the name should somehow contain "Environment" because very few people realise that running Python from a virtualenv actually implicitly "activates" the venv.
>>>>>>>>> >> Maybe we could deprecate the old PythonVirtualenvOperator and introduce two new operators: PythonInCreatedVirtualEnvOperator and PythonInExistingVirtualEnvOperator? Not exactly those names - they are too long - but something like that. Maybe we should get rid of Python in the name altogether?
>>>>>>>>> >>
>>>>>>>>> >> BTW. I think we should generally do more of the discussions here and express our thoughts about Airflow here. Even if there are no answers or interest immediately, I think that it makes sense to do a bit of a melting pot that sometimes might produce some cool (or rather hot) stuff as a result.
>>>>>>>>> >>
>>>>>>>>> >> On Wed, Aug 17, 2022 at 8:45 AM Tzu-ping Chung <[email protected]> wrote:
>>>>>>>>> >>>
>>>>>>>>> >>> One thing I thought of (but never bothered to write about) is to introduce a separate operator instead, say ExternalPythonOperator (bike shedding on name is welcomed), that explicitly takes a path to the interpreter (say in a virtual environment) and just uses that to run the code. This also enables users to create a virtual environment upfront, but avoids needing to overload PythonVirtualenvOperator for the purpose. This also opens an extra use case: you can use any Python installation to run the code (say a custom-compiled interpreter), although nobody asked about that.
>>>>>>>>> >>>
>>>>>>>>> >>> TP
>>>>>>>>> >>>
>>>>>>>>> >>> On 13 Aug 2022, at 02:52, Jeambrun Pierre <[email protected]> wrote:
>>>>>>>>> >>>
>>>>>>>>> >>> I feel like this is a great alternative at the price of a very moderate effort. (I'd be glad to help with it).
>>>>>>>>> >>>
>>>>>>>>> >>> Mutually exclusive sounds good to me as well.
>>>>>>>>> >>>
>>>>>>>>> >>> Best,
>>>>>>>>> >>> Pierre
>>>>>>>>> >>>
>>>>>>>>> >>> On Fri, Aug 12, 2022 at 3:23 PM Jarek Potiuk <[email protected]> wrote:
>>>>>>>>> >>>>
>>>>>>>>> >>>> Mutually exclusive. I think that has the nice property of forcing people to prepare immutable venvs upfront.
>>>>>>>>> >>>>
>>>>>>>>> >>>> On Fri, Aug 12, 2022 at 3:15 PM Ash Berlin-Taylor <[email protected]> wrote:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Yes, this has been on my background idea list for an age -- I'd love to see it happen!
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Have you thought about how it would behave when you specify an existing virtualenv and include requirements in the operator that are not already installed there? Or would they be mutually exclusive? (I don't mind either way, just wondering which way you are heading)
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> -ash
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> On Fri, Aug 12 2022 at 14:58:44 +02:00:00, Jarek Potiuk <[email protected]> wrote:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Hello everyone,
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> TL;DR: I propose to extend our PythonVirtualenvOperator with a "use existing venv" feature and make it a viable way of handling some multi-dependency sets using multiple pre-installed venvs.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> More context:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> I had this idea after a discussion in our Slack: https://apache-airflow.slack.com/archives/CCV3FV9KL/p1660233834355179
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> My thoughts were - why don't we add support for "use existing venv" in PythonVirtualenvOperator as a first-class citizen?
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Currently (unless there are some tricks I am not aware of, or you extend the PVO), the PVO will always attempt to create a virtualenv based on extra requirements. And while it gives the users a possibility of having some tasks use different dependencies, the drawback is that the venv is created dynamically when the task starts - potentially a lot of overhead for startup time and some unpleasant failure scenarios - like networking problems, PyPI or a local repo not being available, or automated (and unnoticed) upgrades of dependencies.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Those are basically the same problems that caused us to strongly discourage our users in our Helm Chart from using _PIP_ADDITIONAL_REQUIREMENTS in production and to criticize the Community Helm Chart for the dynamic dependency installation they promote as a "valid" approach. Yet our PVO currently does exactly this.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> We had some past discussions on how this can be improved - with caching, or using different images for different dependencies and similar - and we even have the https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-46+Runtime+isolation+for+airflow+tasks+and+dag+parsing proposal to use different images for different sets of requirements.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Proposal:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> During the discussion yesterday I started to think that a simpler solution is possible, one that is rather simple for us to implement and for users to use.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Why not have different venvs preinstalled and let the PVO choose the one that should be used?
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> It does not invalidate AIP-46. AIP-46 serves a slightly different purpose, and some cases cannot be handled this way (when you need different "system level" dependencies, for example), but it might be much simpler from a deployment point of view and allow us to handle "multi-dependency sets" for Python libraries only, with minimal deployment overhead (which AIP-46 necessarily has). And I think it will be enough for a vast number of the "multi-dependency-sets" cases.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Why don't we allow the users to prepare those venvs upfront and simply enable PVO to use them rather than create them dynamically?
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Advantages:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> * it nicely handles cases where some of your tasks need a different set of dependencies than others (for execution, not necessarily parsing, at least initially).
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> * no startup time overhead, unlike with the current PVO
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> * possible to run in both cases - "venv installation" and "docker image" installation
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> * it has a finer granularity level than AIP-46 - unlike in AIP-46 you could use different sets of dependencies
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> * very easy to pull off for the users without modifying their deployments. For a local venv, you just create the venvs. For the Docker image case, your custom image needs to add several lines similar to:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> RUN python -m venv --system-site-packages /opt/venv1 && /opt/venv1/bin/pip install PACKAGE1==NN PACKAGE2==NN
>>>>>>>>> >>>>> RUN python -m venv --system-site-packages /opt/venv2 && /opt/venv2/bin/pip install PACKAGE1==NN PACKAGE2==NN
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> and PythonVirtualenvOperator should have an extra parameter (use_existing_venv="/opt/venv2")
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> * we only need to manage ONE image (!) even if you have multiple sets of dependencies. This has the advantage that it is actually LOWER overhead than having separate images for each env when it comes to various resources (the same workers could handle multiple dependency sets, for example, and the same image is reused by multiple Pods in K8S, etc.).
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> * later, when AIP-43 (separate DAG processor with the ability to use different processors for different subdirectories) is completed and AIP-46 is approved/implemented, we could also extend DAG parsing to be able to use those predefined venvs for parsing. That would eliminate the need for local imports and add support to even use different sets of libraries in top-level code (per DAG, not per task).
>>>>>>>>> >>>>> It would not solve different "system" level dependencies - and for that AIP-46 is still a very valid case.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Disadvantages:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> I thought very hard about this one and I actually could not find any disadvantages :)
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> It's simple to implement, use and explain, it can be implemented very quickly (like - in a few hours with tests and documentation I think) and performance-wise it is better than any other solution (including AIP-46), provided that the case is limited to different Python dependencies.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> But possibly there are things that I missed. It all looks too good to be true, and I wonder why we do not have it already today - once I thought about it, it seems very obvious. So I probably missed something.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> WDYT?
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> J.
>>>>>>>>> >>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Collin McNulty
>>>>>>>> Lead Airflow Engineer
>>>>>>>>
>>>>>>>> Email: [email protected] <[email protected]>
>>>>>>>> Time zone: US Central (CST UTC-6 / CDT UTC-5)
>>>>>>>>
>>>>>>>> <https://www.astronomer.io/>
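
For readers following the thread, the core idea TP describes (an operator that "explicitly takes a path to the interpreter ... and just uses that to run the code") can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the implementation in the PR: the names `run_in_external_python`, `TASK_SOURCE` and `interpreter_info` are made up for this example, arguments and results are assumed to be JSON-serializable, and `sys.executable` stands in for a pre-built interpreter such as /opt/venv1/bin/python.

```python
import json
import subprocess
import sys
import textwrap


def run_in_external_python(python, func_source, func_name, *args):
    """Run a function with the interpreter at `python` (hypothetical helper).

    The function is shipped as source text, its arguments go in as JSON on
    stdin, and its return value comes back as JSON on stdout.
    """
    # Child script: define the function, call it, print the JSON result.
    # Assumption: the function itself prints nothing to stdout.
    script = textwrap.dedent(func_source) + textwrap.dedent(
        f"""
        import json, sys
        print(json.dumps({func_name}(*json.load(sys.stdin))))
        """
    )
    completed = subprocess.run(
        [python, "-c", script],
        input=json.dumps(args),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child exits non-zero
    )
    return json.loads(completed.stdout)


# Source of the "task". Imports live inside the function because it runs
# in another interpreter, possibly with a different set of packages.
TASK_SOURCE = """
def interpreter_info(x, y):
    import sys
    return {"sum": x + y, "executable": sys.executable}
"""

# sys.executable is a stand-in; in the proposal this would be the python
# binary of a venv prepared upfront.
result = run_in_external_python(sys.executable, TASK_SOURCE, "interpreter_info", 2, 3)
print(result)
```

Because nothing is installed at task start, the failure modes listed in the original proposal (network problems, an unavailable package index, unnoticed dependency upgrades) move to image or venv build time rather than run time.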
