Hi,

I have a PyFlink job that consists of:

   - Multiple Python files.
   - Multiple third-party Python dependencies, specified in a
   `requirements.txt` file.
   - A few Java dependencies, mainly for external connectors.
   - An overall job config YAML file.

Here's a simplified view of the code layout:

flink/
├── deps
│   ├── jar
│   │   ├── flink-connector-kafka_2.11-1.12.2.jar
│   │   └── kafka-clients-2.4.1.jar
│   └── pip
│       └── requirements.txt
├── conf
│   └── job.yaml
└── job
    ├── some_file_x.py
    ├── some_file_y.py
    └── main.py

I'm able to execute this job locally by invoking something like:

python main.py --config <path_to_job_yaml>

I'm loading the jars inside the Python code, using env.add_jars(...).
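
For reference, that part looks roughly like this (the file:// URLs below are
placeholders for the actual absolute paths to the jars under deps/jar):

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
# Placeholder paths; in the real job these are absolute file:// URLs
# pointing at the jars under deps/jar.
env.add_jars(
    "file:///path/to/flink/deps/jar/flink-connector-kafka_2.11-1.12.2.jar",
    "file:///path/to/flink/deps/jar/kafka-clients-2.4.1.jar")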

Now, the next step is to submit this job to a Flink cluster running on K8S.
I'm looking for any best practices people tend to follow for packaging and
specifying dependencies. As per the documentation here [1], the various
Python files, including the conf YAML, can be specified using the --pyFiles
option, and the Java dependencies can be specified using the --jarfile
option.
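
This is just my guess at what the run command would look like based on [1]
(only one of the jars is shown, and I've left out the Kubernetes-specific
--target / -D options):

./bin/flink run \
    --python job/main.py \
    --pyFiles job/,conf/job.yaml \
    --jarfile deps/jar/flink-connector-kafka_2.11-1.12.2.jar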

So, how can I specify the third-party Python package dependencies? According
to another piece of documentation here [2], I should be able to specify the
requirements.txt directly inside the code and submit it via the --pyFiles
option. Is that right?
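
If I'm reading [2] correctly, "inside the code" would mean something along
these lines (a sketch only, assuming the Table API as described in [2]; I'd
expect an equivalent call on the environment I'm actually using):

from pyflink.table import EnvironmentSettings, TableEnvironment

# Sketch based on my reading of [2]: point the environment at the
# requirements file so the third-party packages are installed for the
# Python workers. The path is the one from my layout above.
table_env = TableEnvironment.create(
    EnvironmentSettings.new_instance().in_streaming_mode().build())
table_env.set_python_requirements("deps/pip/requirements.txt")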

Are there any other best practices folks use to package/submit jobs?

Thanks,
Sumeet

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/cli.html#submitting-pyflink-jobs
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/python/table-api-users-guide/dependency_management.html#python-dependency-in-python-program
