Hi,

My PySpark app depends on some Python libraries. At first this was not a problem: I packed all the dependencies into a file libs.zip, then called *sc.addPyFile("libs.zip")*, and it worked pretty well for a while.
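For reference, this is roughly what that working setup looks like; the build commands and the app name are only illustrative, and some_pure_python_dep stands for any pure-Python package inside the zip:

    # the archive is built once outside of Spark, roughly like:
    #   pip install -r requirements.txt -t libs/
    #   cd libs && zip -r ../libs.zip .
    from pyspark import SparkContext

    sc = SparkContext(appName="libs-zip-demo")
    sc.addPyFile("libs.zip")        # ships libs.zip to the executors and adds it
                                    # to sys.path on the driver and the executors
    import some_pure_python_dep     # fine as long as the package has no .so files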
Then I encountered a problem: if any of my libraries has a binary dependency (like a .so file), this approach does not work. The main reason is that when a zip file is put on the module search path, Python does not look up the needed binary library (e.g. a .so file) inside the zip file; this is a Python *limitation*. So I came up with a workaround:

1) Do not call sc.addPyFile; instead, extract libs.zip into the current directory.
2) When my Python code starts, manually call *sys.path.insert(0, f"{os.getcwd()}/libs")* to put the extracted libs directory on the module search path.

This workaround works well for me.

Then I ran into another problem: what if my code running in an executor needs a Python library that has binary code? Below is an example:

    def do_something(p):
        ...

    rdd = sc.parallelize([
        {"x": 1, "y": 2},
        {"x": 2, "y": 3},
        {"x": 3, "y": 4},
    ])
    a = rdd.map(do_something)

What if the function "do_something" needs a Python library that has binary code? My current solution is to extract libs.zip onto an NFS share (or an SMB share) and manually call *sys.path.insert(0, f"share_mount_dir/libs")* in my "do_something" function, but adding such code to each function looks ugly. Is there a better/more elegant solution?

Thanks,
Stone
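P.S. To make the executor-side part concrete, here is roughly what the current workaround looks like in code; "/mnt/share/libs", some_binary_dep and process are placeholders for my actual mount point, library and call:

    def do_something(p):
        import sys
        libs = "/mnt/share/libs"            # placeholder for the share mount point
        if libs not in sys.path:            # this block is repeated in every
            sys.path.insert(0, libs)        # executor-side function -- the ugly part
        import some_binary_dep              # placeholder for a package with .so files
        return some_binary_dep.process(p)   # placeholder call

    a = rdd.map(do_something)               # rdd as in the example above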