Dear Spark users,

My team is working on a small library that builds on PySpark and is organized like PySpark as well -- it has a JVM component (that runs in the Spark driver and executors) and a Python component (that runs in the PySpark driver and executor processes). What's a good approach for packaging such a library?
Some ideas we've considered:

1. Package the JVM component as a jar and the Python component as a binary egg. This is reasonable, but it means there are two separate artifacts that people have to manage and keep in sync (see the P.S. below for a rough sketch of what this could look like).
2. Include the Python files in the jar and add the jar to the PYTHONPATH. This follows the example of the Spark assembly jar, but deviates from the Python community's packaging standards.

We'd really appreciate hearing about the experiences of other people who have built libraries on top of PySpark.

Punya
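P.S. For concreteness, here is a minimal sketch of how we imagine option 1 looking on the Python side. The project name "mylib", the artifact file names, and the version numbers are all hypothetical placeholders, not an actual implementation:

# Minimal setup.py sketch for option 1 (separate jar + egg).
# The package name "mylib" and the file names below are made up.
from setuptools import setup, find_packages

setup(
    name="mylib",
    version="0.1.0",
    packages=find_packages(),
    # The jar is built separately (e.g. with sbt or Maven) and is NOT
    # bundled here, so users have to keep the two artifacts in sync
    # themselves, roughly like:
    #
    #   spark-submit --jars mylib-assembly-0.1.0.jar \
    #                --py-files mylib-0.1.0-py2.7.egg \
    #                my_app.py
)

Option 2 would instead ship the .py files inside the jar and rely on Python's ability to import from zip archives (a jar is just a zip) once the jar is on the PYTHONPATH, which is essentially what the Spark assembly jar does.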