Try to use DataFrames instead of RDDs.
Here's an introduction to DataFrames:
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
2016-05-06 21:52 GMT+07:00 pratik gawande :
> Thanks Shao for the quick reply. I will look into how pyspark jobs are
> executed. Any suggestions or references to docs on how to tune pyspark jobs?
Thanks Shao for the quick reply. I will look into how pyspark jobs are executed.
Any suggestions or references to docs on how to tune pyspark jobs?
On Thu, May 5, 2016 at 10:12 PM -0700, "Saisai Shao"
<sai.sai.s...@gmail.com> wrote:
Writing an RDD-based application with pyspark brings in additional
overhead: Spark runs on the JVM whereas your Python code runs in the Python
runtime, so data has to be communicated between the JVM world and the Python
world, and that requires extra serialization/deserialization and IPC.
Also oth
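The serialization cost described above can be illustrated without Spark at
all: PySpark pickles records as they cross the JVM/Python boundary, and a
plain pickle round-trip over comparable data shows the kind of per-record
work involved (a standalone sketch, not Spark code; the data is invented):

```python
import pickle
import time

# Illustrative records; PySpark pickles tuples like these partition by partition.
records = [(i, "x" * 20) for i in range(100_000)]

start = time.perf_counter()
blobs = [pickle.dumps(r) for r in records]    # analogous to JVM -> Python worker transfer
restored = [pickle.loads(b) for b in blobs]   # analogous to Python worker -> JVM transfer
elapsed = time.perf_counter() - start

# DataFrame operations avoid this round-trip entirely by executing in the JVM.
```

This per-record cost is exactly what the DataFrame API sidesteps, which is
why it closes most of the gap between pyspark and Scala.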
Hello,
I am new to Spark. For one of my jobs I am seeing a significant performance
difference when it is run in pyspark vs. scala. Could you please let me know if
this is known and whether scala is preferred over python for writing Spark jobs?
Also, the DAG visualization shows completely different DAGs for scala and python.