I meant Python packages broadly: third-party dependencies of your
pipeline, or the package that contains the modules comprising your pipeline.

> What happens if I don’t set
> pipeline_options.view_as(SetupOptions).save_main_session to true?

Then we don't save the content of the main session and restore it on the
workers.
See:
https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#pickling-and-managing-the-main-session
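
For illustration, here is a minimal sketch of the scenario that option guards
against (the DoFn and imports below are made up for the example, not taken
from your pipeline):

    import json  # imported in the __main__ session

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

    def parse(line):
        # `parse` is defined in __main__ and refers to the module-level `json`.
        # Without save_main_session, workers unpickle `parse` but `json` is not
        # re-imported there, which typically surfaces as a NameError.
        return json.loads(line)

    def run():
        options = PipelineOptions()
        # Pickle the __main__ session (its imports and globals) and restore it
        # on each worker.
        options.view_as(SetupOptions).save_main_session = True
        with beam.Pipeline(options=options) as p:
            p | beam.Create(['{"a": 1}']) | beam.Map(parse)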

Saving the main session might be necessary when your main entrypoint code
imports functions/modules used in your pipeline. When your pipeline lives in
a separate package, your main entrypoint can be minimalistic, like so:
https://github.com/GoogleCloudPlatform/python-docs-samples/blob/4fe5495e7bfc7d7bc3c3dd41411e84af833a282e/dataflow/flex-templates/pipeline_with_dependencies/main.py#L35
and there is no need to save the main session. We only need to make sure
that the relevant package, such as `my_package`, is installed in the worker
runtime environment.
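
A sketch of what such a minimal entrypoint can look like (the module and
function names here are illustrative, not taken from the linked sample):

    # main.py: thin launcher; all pipeline logic lives in my_package,
    # which is installed in both the launcher and worker images.
    from my_package import pipeline  # hypothetical module

    if __name__ == "__main__":
        pipeline.run()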

On Mon, Nov 4, 2024 at 10:49 AM Henry Tremblay <[email protected]>
wrote:

>
>
>
>
> *From:* Valentyn Tymofieiev <[email protected]>
> *Sent:* Monday, November 4, 2024 10:36 AM
> *To:* [email protected]
> *Cc:* Henry Tremblay <[email protected]>
> *Subject:* Re: Solution to import problem
>
>
>
> What matters is the package must be installed on the image running on the
> worker. You can manually install the package during image building
> (preferred), or you can tell Beam to install the package for you, if you
> provide --extra_package or --setup_file options.
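>
> For illustration, a rough sketch of the second approach (the archive path
> below is a hypothetical example):
>
>     from apache_beam.options.pipeline_options import PipelineOptions
>
>     # Point Beam at a built distribution of your package ...
>     options = PipelineOptions(["--extra_package=dist/my_package-0.1.0.tar.gz"])
>     # ... or at a setup.py that Beam builds and stages for you:
>     # options = PipelineOptions(["--setup_file=./setup.py"])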
>
>
>
> When you say package, this can mean different things: for example, the
> secretmanager library, your own libraries, and the functions in the main.py
> file (where the pipeline is created). What happens if I don’t set
> pipeline_options.view_as(SetupOptions).save_main_session to true?
>
>
>
> On Mon, Nov 4, 2024 at 8:23 AM Henry Tremblay via user <
> [email protected]> wrote:
>
> My launcher image has:
>
>
>
> ENTRYPOINT ["/opt/google/dataflow/python_template_launcher"]
>
>
>
> If I use this as my sdk_harness_image, as I did before, the job will hang,
> probably because the worker is trying to launch a job. So I need a separate
> image with:
>
>
>
> ENTRYPOINT ["/opt/apache/beam/boot"]
>
>
>
> I think Google suggests that you actually pass the ENTRYPOINT as a build arg
> in the Dockerfile, so you can use the same Dockerfile and build it twice with
> a different argument.
>
>
>
> I also don’t need:
>
>
>
> ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${BASE}/${PY_FILE}"
>
> ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE=${BASE}/$SETUP
>
> In my worker image.
>
>
>
> What is not completely clear to me is why you need the setup.py to run on
> the launcher image, and not the worker. Also, if you need:
>
>
>
> pipeline_options.view_as(SetupOptions).save_main_session =
> save_main_session
>
>
>
> That line is supposed to save the code in the main session and transfer it
> to the worker (via pickle), but I am wondering if you need this at all.
>
>
>
>
>
> *From:* XQ Hu via user <[email protected]>
> *Sent:* Monday, November 4, 2024 6:31 AM
> *To:* [email protected]
> *Cc:* XQ Hu <[email protected]>
> *Subject:* Re: Solution to import problem
>
>
>
> For ENTRYPOINT, as long as your image copies the launcher file (like
> https://github.com/GoogleCloudPlatform/python-docs-samples/blob/main/dataflow/flex-templates/pipeline_with_dependencies/Dockerfile#L38),
> you can just do `ENTRYPOINT ["/opt/apache/beam/boot"]`.
>
> Again, using one container image is more convenient if you start managing
> more Python package dependencies.
>
>
>
> On Mon, Nov 4, 2024 at 1:16 AM Henry Tremblay <[email protected]>
> wrote:
>
> Sorry, yes, you are correct, though Google does not document this.
>
>
>
> 1. Formerly I could import pyscog and requests because they are in the
> worker image you linked to.
>
> 2. secretmanager cannot be imported because it is not in the worker image.
>
> 3. passing the parameter --parameters
> sdk_container_image=$IMAGE_URL_WORKER causes the worker to use the
> pre-built image
>
> 4. I cannot use the same Docker image for both launcher and worker because
> of the ENTRYPOINT
>
>
>
> On Sun, Nov 3, 2024 at 1:53 PM XQ Hu via user <[email protected]>
> wrote:
>
> I think the problem is you do not specify sdk_container_image when running
> your template.
>
>
>
>
> https://github.com/GoogleCloudPlatform/python-docs-samples/tree/main/dataflow/flex-templates/pipeline_with_dependencies#run-the-template
> has more details.
>
>
>
> Basically, you do not need to use
> https://github.com/paulhtremblay/data-engineering/blob/main/dataflow_/flex_proj_with_secret_manager/Dockerfile
> for your template launcher.
>
>
>
> You can use the same image for both the launcher and the Dataflow workers.
> You only need to copy python_template_launcher into your image, as in
> https://github.com/GoogleCloudPlatform/python-docs-samples/blob/main/dataflow/flex-templates/pipeline_with_dependencies/Dockerfile#L38.
> When running the template job, you need to add --parameters
> sdk_container_image=$SDK_CONTAINER_IMAGE.
>
>
>
>
>
> On Sun, Nov 3, 2024 at 4:18 PM Henry Tremblay <[email protected]>
> wrote:
>
> A few weeks ago I had posted a problem I had with importing the Google
> Cloud Secret Manager library in Python.
>
>
>
> Here is the problem and solution:
>
>
>
>
> https://github.com/paulhtremblay/data-engineering/tree/main/dataflow_/flex_proj_with_secret_manager
>
>
>
> --
>
> Henry Tremblay
>
> Data Engineer, Best Buy
>
>
>
>
> --
>
> Henry Tremblay
>
> Data Engineer, Best Buy
>
>
