Biao Geng created FLINK-39194:
---------------------------------

             Summary: Improve PyFlink's support for specifying local resource 
files
                 Key: FLINK-39194
                 URL: https://issues.apache.org/jira/browse/FLINK-39194
             Project: Flink
          Issue Type: Improvement
            Reporter: Biao Geng
         Attachments: image-2026-03-03-11-08-20-309.png

When running a PyFlink job, we may need to ship resource files (e.g. the 
weights of a PyTorch model) to the workers so that the Python program can run.

 

Currently, we have some relevant options:
 # {*}--pyFiles{*}: this option adds files such as .py/.egg/.zip/.whl, or a 
directory, to PYTHONPATH. It looks like a good fit. However, in our current 
[implementation|https://github.com/apache/flink/blob/4f85d3074eccfe628e2926269ec7e943c61d2a9c/flink-python/src/main/java/org/apache/flink/python/env/AbstractPythonEnvironmentManager.java#L328],
 for an ordinary resource file (e.g. resnet18-f37072fd.pth) we build a path like 
`/mnt/disk1/yarn/nm-local-dir/usercache/root/appcache/application_xxx_0007/python-dist-xxx-xx-xx-xx-xx/python-files/resnet18-f37072fd.pth/resnet18-f37072fd.pth`.
 The filename is repeated, and because the file is added to PYTHONPATH rather 
than to the working directory, the Python script needs extra code to build the 
real path before it can use the file. This also implies that we may need to 
improve the documentation here:
!image-2026-03-03-11-08-20-309.png!
2. {*}--pyArchives{*}: this option is only for archive files, and it can work 
when we build a specific tar (e.g. --pyArchives 
hdfs:///envs/gpu_test_env.tar.gz#gpu_env,hdfs:///models/resnet18-f37072fd.pth,hdfs:///models/imagenet_classes.txt).
 The only drawback is that users have to build the tar or zip archive themselves.
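
To illustrate the manual step that --pyArchives currently requires, here is a minimal sketch of bundling resource files into one archive with Python's standard `tarfile` module. The helper name `bundle_resources` and the file names in the usage note are hypothetical and not part of PyFlink.

```python
import os
import tarfile


def bundle_resources(resource_paths, archive_path):
    """Pack the given files into one gzip-compressed tar archive.

    Hypothetical helper, not part of PyFlink. Each file is stored under its
    base name, so it ends up at the top level of the extracted directory.
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in resource_paths:
            # arcname drops the local directory prefix from the stored name.
            tar.add(path, arcname=os.path.basename(path))
    return archive_path
```

The resulting archive could then be passed as, for example, `--pyArchives resources.tar.gz#resources` (an assumed name, following the `#gpu_env` pattern above), after which the script would be expected to find the files under a `resources` directory.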

 

We could fix the description of --pyFiles in the documentation and avoid 
repeating the filename in the path built for ordinary files passed via --pyFiles.
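
For reference, the "extra code" a user script needs today might look roughly like the sketch below. It assumes, as described above, that the shipped file itself (or a directory containing it) is listed in PYTHONPATH. The helper name `find_shipped_file` is hypothetical and not part of PyFlink.

```python
import os
import sys


def find_shipped_file(filename, search_paths=None):
    """Return the path of a file shipped via --pyFiles, or None if not found.

    Hypothetical helper, not part of PyFlink. Assumes the file itself, or a
    directory that contains it, is listed in PYTHONPATH or sys.path.
    """
    if search_paths is None:
        env_entries = os.environ.get("PYTHONPATH", "").split(os.pathsep)
        search_paths = [p for p in env_entries if p] + list(sys.path)
    for entry in search_paths:
        # Case 1: the entry is the file itself (.../<name>/<name>).
        if os.path.basename(entry) == filename and os.path.isfile(entry):
            return entry
        # Case 2: the entry is a directory that holds the file.
        candidate = os.path.join(entry, filename)
        if os.path.isfile(candidate):
            return candidate
    return None
```

A script would call `find_shipped_file("resnet18-f37072fd.pth")` and load the weights from the returned path. If the file were placed in the working directory, or the filename were not repeated, this lookup would be unnecessary.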



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
