Hi,

So my PySpark app depends on some Python libraries. That is not a
problem: I pack all the dependencies into a file libs.zip, then call
*sc.addPyFile("libs.zip")*, and it works pretty well for a while.
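For context, a minimal sketch of that setup (assume libs.zip was built
by installing the requirements into a directory and zipping it; the
imported package name below is just a placeholder):

# Ship the dependency bundle; sc.addPyFile adds it to sys.path on the
# driver and distributes it to the executors as well
sc.addPyFile("libs.zip")

# Pure-Python dependencies bundled in libs.zip can now be imported
import six  # placeholder for any pure-Python package inside libs.zip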

Then I encountered a problem: if any of my libraries has a binary
dependency (like .so files), this approach does not work. Mainly because
when a zip file is put on the Python path, Python does not look up the
needed binary library (e.g. a .so file) inside the zip file; this is a
Python *limitation*. So I got a workaround:

1) Do not call sc.addPyFile; instead, extract libs.zip into the current
directory
2) When my Python code starts, manually call *sys.path.insert(0,
f"{os.getcwd()}/libs")* to put the extracted directory on the Python
path (see the sketch below)

This workaround works well for me. Then I got another problem: what if
the code running in the executors needs a Python library that has
binary code? Below is an example:

def do_something(p):
    ...

rdd = sc.parallelize([
    {"x": 1, "y": 2},
    {"x": 2, "y": 3},
    {"x": 3, "y": 4},
])
a = rdd.map(do_something)

What if the function "do_something" needs a Python library that has
binary code? My current solution is to extract libs.zip onto an NFS
share (or an SMB share) mounted on every executor node and manually do
*sys.path.insert(0, f"share_mount_dir/libs")* in my "do_something"
function, but adding such code to each function looks ugly. Is there a
better/more elegant solution?
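For reference, a minimal sketch of what that executor-side workaround
looks like (the mount point "share_mount_dir" and the imported library
are placeholders; assume the share is mounted at the same path on every
executor node):

def do_something(p):
    import sys
    # Ugly part: every executor-side function has to patch sys.path
    # itself before importing anything that ships native (.so) code
    libs_dir = "/share_mount_dir/libs"  # NFS/SMB mount, same on all nodes
    if libs_dir not in sys.path:
        sys.path.insert(0, libs_dir)
    import numpy as np  # placeholder for a library with binary code
    return {"x": p["x"], "y": p["y"], "sum": int(np.add(p["x"], p["y"]))}

a = rdd.map(do_something)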

Thanks,
Stone
