Biao Geng created FLINK-39194:
---------------------------------
Summary: Improve PyFlink's support on specifying local resource files
Key: FLINK-39194
URL: https://issues.apache.org/jira/browse/FLINK-39194
Project: Flink
Issue Type: Improvement
Reporter: Biao Geng
Attachments: image-2026-03-03-11-08-20-309.png
When running a PyFlink job, we may need to ship some resource files (e.g. the
model weights of a PyTorch model) to the workers so that the Python program can
run. Currently, two options are relevant:
# {*}--pyFiles{*}: this option adds files such as .py/.egg/.zip/.whl, or a
directory, to PYTHONPATH. It looks like a good fit; however, in our current
[implementation|https://github.com/apache/flink/blob/4f85d3074eccfe628e2926269ec7e943c61d2a9c/flink-python/src/main/java/org/apache/flink/python/env/AbstractPythonEnvironmentManager.java#L328],
for an ordinary resource (e.g. resnet18-f37072fd.pth), we build a path like
`/mnt/disk1/yarn/nm-local-dir/usercache/root/appcache/application_xxx_0007/python-dist-xxx-xx-xx-xx-xx/python-files/resnet18-f37072fd.pth/resnet18-f37072fd.pth`.
The filename is repeated, and because the file is added to PYTHONPATH rather
than to the working directory, the Python script needs some extra code to build
the real path before it can use the file. This also implies that we may need to
improve the documentation here:
!image-2026-03-03-11-08-20-309.png!
# {*}--pyArchives{*}: this option only accepts archive files, and it works
once we build a dedicated archive (e.g. --pyArchives
hdfs:///envs/gpu_test_env.tar.gz#gpu_env,hdfs:///models/resnet18-f37072fd.pth,hdfs:///models/imagenet_classes.txt).
The only drawback is that users have to build the tar or zip themselves.
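For illustration, the manual packing step mentioned for --pyArchives could look roughly like the sketch below. It uses only Python's standard zipfile module; the helper name pack_resources and the archive name resources.zip are hypothetical and not part of PyFlink.

```python
import os
import zipfile
from typing import Iterable


def pack_resources(files: Iterable[str], archive_path: str) -> str:
    """Pack plain resource files into one zip archive (hypothetical helper).

    Each file is stored under its base name, so the archive has a flat layout.
    """
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in files:
            # Drop the local directory part; keep only the file name.
            zf.write(path, arcname=os.path.basename(path))
    return archive_path


# Example call (paths are placeholders):
# pack_resources(["resnet18-f37072fd.pth", "imagenet_classes.txt"], "resources.zip")
```

The archive could then be passed as, e.g., --pyArchives resources.zip#resources, following the `#` target-directory form used in the example above. We assume here that the files then become readable under the relative directory resources/ in the Python worker.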
We could correct the description of --pyFiles in the documentation and avoid
repeating the filename for ordinary files passed via --pyFiles.
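Until that is changed, the "extra code" a user script needs today to locate a file shipped via --pyFiles might look like the following sketch. It is only an assumption about a workable lookup, not PyFlink API: the helper name find_shipped_file is hypothetical, and it assumes that either the directory holding the file or the file path itself shows up in sys.path.

```python
import os
import sys
from typing import Iterable, Optional


def find_shipped_file(name: str,
                      search_path: Optional[Iterable[str]] = None) -> Optional[str]:
    """Return the absolute path of a shipped resource, or None if not found.

    Hypothetical helper: scans sys.path (or the given entries) for `name`.
    """
    entries = sys.path if search_path is None else search_path
    for entry in entries:
        if not entry:
            continue
        # Case 1 (assumption): the path entry is the shipped file itself.
        if os.path.basename(entry) == name and os.path.isfile(entry):
            return os.path.abspath(entry)
        # Case 2 (assumption): the path entry is the directory holding the file.
        candidate = os.path.join(entry, name)
        if os.path.isfile(candidate):
            return os.path.abspath(candidate)
    return None


# Example call inside a UDF (file name taken from the example above):
# weights_path = find_shipped_file("resnet18-f37072fd.pth")
```

If the file were placed in the working directory without the repeated filename, this lookup would not be needed and a plain relative path would do.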