Re: Submitting job with external dependencies to pyspark

2020-01-28 Thread Chris Teoh
Usually this isn't done, as the data is meant to live on shared/distributed storage, e.g. HDFS, S3, etc. Spark should then read this data into a dataframe, and your code logic applies to the dataframe in a distributed manner.
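For example, a minimal sketch of reading straight from distributed storage into a dataframe (the s3a:// path and the CSV options here are placeholders, not something from this thread):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-shared-storage").getOrCreate()

    # Read directly from shared storage; each executor pulls its own
    # partitions, so no data needs to be shipped from the driver machine.
    df = spark.read.csv("s3a://my-bucket/input/data.csv",
                        header=True, inferSchema=True)
    df.show()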

Re: Submitting job with external dependencies to pyspark

2020-01-28 Thread Tharindu Mathew
That was really helpful. Thanks! I actually solved my problem by creating a venv and using the venv flags. Wondering now how to submit the data as an archive? Any ideas?
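One way to do this is spark-submit's --archives flag, which unpacks each archive into the executors' working directories under the alias given after the '#'. A minimal sketch, assuming a YARN deployment and a venv packed with the venv-pack tool (the file names are hypothetical):

    # Pack the virtualenv into a relocatable archive (requires venv-pack):
    venv-pack -o pyspark_venv.tar.gz

    # Ship the environment and the data as archives; executors see them
    # unpacked as ./environment and ./data in their working directories.
    export PYSPARK_PYTHON=./environment/bin/python
    spark-submit \
      --master yarn \
      --archives pyspark_venv.tar.gz#environment,data.zip#data \
      my_job.py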

Re: Submitting job with external dependencies to pyspark

2020-01-27 Thread Chris Teoh
Use --py-files. See
https://spark.apache.org/docs/latest/submitting-applications.html#bundling-your-applications-dependencies

I hope that helps.

On Tue, 28 Jan 2020, 9:46 am Tharindu Mathew wrote:
> Hi,
>
> Newbie to pyspark/spark here.
>
> I'm trying to submit a job to pyspark with a dependency ...
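In concrete terms, the --py-files route looks something like this (a sketch only; deps.zip, mypackage, and my_job.py are hypothetical names):

    # Bundle the pure-Python dependencies into a zip archive:
    zip -r deps.zip mypackage/

    # Ship the zip alongside the application; Spark adds it to the
    # PYTHONPATH on the driver and every executor:
    spark-submit --py-files deps.zip my_job.py

    # Inside my_job.py the bundled code is then importable as usual:
    #   import mypackage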