Usually this isn't done, as the data is meant to live on shared/distributed
storage, e.g. HDFS, S3, etc. Spark then reads that data into a dataframe,
and your code logic is applied to the dataframe in a distributed manner.
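For example, a minimal sketch of that pattern (the s3a path and column
name below are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

# Read directly from shared storage; each executor reads its own split.
df = spark.read.parquet("s3a://your-bucket/path/to/data")

# Transformations on the dataframe run distributed across the cluster.
df.groupBy("some_column").count().show()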
On Wed, 29 Jan 2020 at 09:37, Tharindu Mathew wrote:
That was really helpful. Thanks! I actually solved my problem by creating
a venv and using the venv flags. Wondering now how to submit the data as
an archive? Any idea?
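For reference, the venv setup I mean follows roughly the pattern below
(a rough sketch assuming a YARN cluster; pyspark_venv.tar.gz and app.py
are placeholder names, and the venv is packed beforehand with a tool such
as venv-pack):

# Point executors at the Python inside the unpacked archive.
# (Don't set PYSPARK_DRIVER_PYTHON this way in cluster deploy mode.)
export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./environment/bin/python

# --archives unpacks the tarball on each executor under the alias given
# after the '#', so ./environment/bin/python resolves there.
spark-submit --master yarn --archives pyspark_venv.tar.gz#environment app.py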
On Mon, Jan 27, 2020, 9:25 PM Chris Teoh wrote:
Use --py-files.
See https://spark.apache.org/docs/latest/submitting-applications.html#bundling-your-applications-dependencies
I hope that helps.
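A minimal example of that flag (deps.zip and main.py are placeholder
names; --py-files accepts a comma-separated list of .zip, .egg, and .py
files to place on the PYTHONPATH):

spark-submit --py-files deps.zip,helper.py main.py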
On Tue, 28 Jan 2020, 9:46 am Tharindu Mathew wrote:
> Hi,
>
> Newbie to pyspark/spark here.
>
> I'm trying to submit a job to pyspark with a dependency