I meant Python packages broadly: third-party dependencies of your pipeline, or the package that contains the modules that make up your pipeline.
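For illustration, here is a minimal sketch of what such a pipeline package could declare in its setup.py. The package name `my_package`, the version, and the dependency list are hypothetical placeholders, not taken from this thread:

# setup.py -- hypothetical packaging of the modules that make up the pipeline.
# Installing the built package in the worker image (preferred), or passing this
# file via --setup_file, is what makes `import my_package` work on the workers.
import setuptools

setuptools.setup(
    name="my_package",        # hypothetical package name
    version="0.1.0",
    packages=setuptools.find_packages(),
    install_requires=[
        # Third-party dependencies of the pipeline, e.g.:
        # "google-cloud-secret-manager",
    ],
)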
> What happens if I don’t set pipeline_options.view_as(SetupOptions).save_main_session to true?

Then we don't save and load the content of the main session on the worker. See:
https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#pickling-and-managing-the-main-session

Saving the main session might be necessary when your main entrypoint code imports functions/modules used in your pipeline. When your pipeline is in a separate package, your main entrypoint can be minimalistic, like so:
https://github.com/GoogleCloudPlatform/python-docs-samples/blob/4fe5495e7bfc7d7bc3c3dd41411e84af833a282e/dataflow/flex-templates/pipeline_with_dependencies/main.py#L35
and there is no need to save the main session. We only need to make sure that the relevant package, such as `my_package`, is installed in the worker runtime environment.

On Mon, Nov 4, 2024 at 10:49 AM Henry Tremblay <[email protected]> wrote:

> *From:* Valentyn Tymofieiev <[email protected]>
> *Sent:* Monday, November 4, 2024 10:36 AM
> *To:* [email protected]
> *Cc:* Henry Tremblay <[email protected]>
> *Subject:* Re: Solution to import problem
>
> What matters is the package must be installed on the image running on the
> worker. You can manually install the package during image building
> (preferred), or you can tell Beam to install the package for you, if you
> provide the --extra_package or --setup_file options.
>
> When you say package, this can mean different things: for example, the
> secretmanager library, your own libraries, and the functions in the main.py
> file (where the pipeline is created). What happens if I don’t set
> pipeline_options.view_as(SetupOptions).save_main_session to true?
>
> On Mon, Nov 4, 2024 at 8:23 AM Henry Tremblay via user <[email protected]> wrote:
>
> My launcher image has:
>
> ENTRYPOINT ["/opt/google/dataflow/python_template_launcher"]
>
> If I use this as my sdk_harness_image, as I did before, the job will hang,
> probably because the worker is trying to launch a job. So I need a separate
> image with:
>
> ENTRYPOINT ["/opt/apache/beam/boot"]
>
> I think Google suggests that you actually include the ENTRYPOINT as an arg
> in the Dockerfile, so you can use the same image and then build it twice
> with a different argument.
>
> I also don’t need:
>
> ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${BASE}/${PY_FILE}"
> ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE=${BASE}/$SETUP
>
> in my worker image.
>
> What is not completely clear to me is why you need the setup.py to run on
> the launcher image, and not the worker. Also, whether you need:
>
> pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
>
> That line is supposed to save the code in the main session and transfer it
> to the worker (via pickle), but I am wondering if you need this at all.
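For reference, a minimal flex-template entrypoint along the lines of the main.py linked above might look like the sketch below. The package `my_package`, its `transforms` module, `MyDoFn`, and the --input/--output arguments are hypothetical stand-ins; save_main_session is simply left at its default because everything the workers need comes from the installed package:

# main.py -- minimal sketch of a flex-template entrypoint whose pipeline code
# lives in an installed package (the hypothetical `my_package`), so the main
# session does not need to be pickled and shipped to the workers.
import argparse
import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

from my_package import transforms  # hypothetical module holding the DoFns


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)   # hypothetical arguments
    parser.add_argument("--output", required=True)
    known_args, pipeline_args = parser.parse_known_args(argv)

    # save_main_session is left at its default (off): workers import
    # `my_package` directly because it is installed in the worker container
    # image (or staged via --setup_file / --extra_package).
    options = PipelineOptions(pipeline_args)

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromText(known_args.input)
            | "Process" >> beam.ParDo(transforms.MyDoFn())  # hypothetical DoFn
            | "Write" >> beam.io.WriteToText(known_args.output)
        )


if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)
    run()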
> *From:* XQ Hu via user <[email protected]>
> *Sent:* Monday, November 4, 2024 6:31 AM
> *To:* [email protected]
> *Cc:* XQ Hu <[email protected]>
> *Subject:* Re: Solution to import problem
>
> For ENTRYPOINT, as long as your image copies the launcher file (like
> https://github.com/GoogleCloudPlatform/python-docs-samples/blob/main/dataflow/flex-templates/pipeline_with_dependencies/Dockerfile#L38),
> you can just do `ENTRYPOINT ["/opt/apache/beam/boot"]`.
>
> Again, using one container image is more convenient if you start managing
> more Python package dependencies.
>
> On Mon, Nov 4, 2024 at 1:16 AM Henry Tremblay <[email protected]> wrote:
>
> Sorry, yes, you are correct, though Google does not document this.
>
> 1. Formerly I could import pyscog and requests because they are in the
> worker image you linked to.
> 2. secretmanager cannot be imported because it is not in the worker image.
> 3. Passing the parameter --parameters sdk_container_image=$IMAGE_URL_WORKER
> causes the worker to use the pre-built image.
> 4. I cannot use the same Docker image for both launcher and worker because
> of the ENTRYPOINT.
>
> On Sun, Nov 3, 2024 at 1:53 PM XQ Hu via user <[email protected]> wrote:
>
> I think the problem is you do not specify sdk_container_image when running
> your template.
>
> https://github.com/GoogleCloudPlatform/python-docs-samples/tree/main/dataflow/flex-templates/pipeline_with_dependencies#run-the-template
> has more details.
>
> Basically, you do not need
> https://github.com/paulhtremblay/data-engineering/blob/main/dataflow_/flex_proj_with_secret_manager/Dockerfile
> for your template launcher.
>
> You can use the same image for both launcher and Dataflow workers. You only
> need to copy python_template_launcher to your image, like
> https://github.com/GoogleCloudPlatform/python-docs-samples/blob/main/dataflow/flex-templates/pipeline_with_dependencies/Dockerfile#L38.
> Then you can use this image for both launcher and Dataflow workers. When
> running the template job, you need to add --parameters
> sdk_container_image=$SDK_CONTAINER_IMAGE.
>
> On Sun, Nov 3, 2024 at 4:18 PM Henry Tremblay <[email protected]> wrote:
>
> A few weeks ago I posted a problem I had with importing the Google Cloud
> Secret Manager library in Python.
>
> Here is the problem and solution:
>
> https://github.com/paulhtremblay/data-engineering/tree/main/dataflow_/flex_proj_with_secret_manager
>
> --
> Henry Tremblay
> Data Engineer, Best Buy
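To tie the flags from this thread together on the Python side, here is a rough sketch of the options a flex-template launch ends up handing to the pipeline. The project, region, bucket, and image values are placeholders, and in practice sdk_container_image is usually supplied at run time via --parameters sdk_container_image=... rather than hard-coded:

# Sketch only: placeholder project, bucket, and image values.
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",                # placeholder
    "--region=us-central1",                # placeholder
    "--temp_location=gs://my-bucket/tmp",  # placeholder
    # The custom image the workers run; selected with sdk_container_image
    # (forwarded from --parameters sdk_container_image=... at template run time).
    "--sdk_container_image=us-central1-docker.pkg.dev/my-project/repo/my-image:latest",
    # Alternative to baking your package into the image: let Beam install it.
    "--setup_file=./setup.py",
])

# save_main_session is only needed when workers depend on imports/objects from
# the launcher's __main__ module instead of an installed package.
options.view_as(SetupOptions).save_main_session = False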
