My guess is that the UI serialization times show the Java side only. To get
a feeling for the Python pickling/unpickling overhead, use the
show_profiles() method of the SparkContext instance:
http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.show_profiles

That will show you how much of the execution time is used up by
cPickle.load() and .dump() methods.
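For example, here is a minimal sketch (note that profiling has to be
enabled via the spark.python.profile setting before the SparkContext is
created; the map() workload below is just a placeholder):

from pyspark import SparkConf, SparkContext

# Profiling must be enabled before the SparkContext is created.
conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext(conf=conf)

# Run any Python-side computation; this map is just a placeholder.
rdd = sc.parallelize(range(1000000)).map(lambda x: x * 2)
rdd.count()

# Dump the accumulated cProfile stats per RDD; look for the cumulative
# time spent in the pickle load()/dump() calls.
sc.show_profiles()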

Hope that helps,

Rok

On Wed, Mar 8, 2017 at 3:18 AM, Yeoul Na [via Apache Spark User List] <
ml-node+s1001560n28468...@n3.nabble.com> wrote:

>
> Hi all,
>
> I am trying to analyze PySpark's performance overhead. People often say
> PySpark is slower than Scala due to serialization/deserialization
> overhead. I tried the example in this post:
> https://0x0fff.com/spark-dataframes-are-faster-arent-they/. This and many
> other articles say the straightforward Python implementation is the
> slowest because of the serialization/deserialization overhead.
>
> However, when I actually looked at the logs in the Web UI, the
> serialization and deserialization times for PySpark did not seem to be any
> bigger than those for Scala. The main contributor was "Executor Computing
> Time". Thus, we cannot be sure whether the slowdown is due to
> serialization or because Python code is simply slower than Scala code.
>
> So my question is: does "Task Deserialization Time" in the Spark Web UI
> actually include the serialization/deserialization time spent in PySpark?
> If it does not, how can I actually measure the
> serialization/deserialization overhead?
>
> Thanks,
> Yeoul
>



