My guess is that the UI serialization times show the JVM side only. To get a feeling for the Python pickling/unpickling cost, use the show_profiles() method of the SparkContext instance: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.show_profiles
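For a rough, Spark-independent feel for what that pickling time represents, here is a minimal sketch that times pickle dumps/loads on an arbitrary payload (the payload shape and size are made-up illustrations, not anything from your job):

```python
import pickle
import time

# An arbitrary payload: a list of small dicts, roughly the shape of
# rows an RDD of Python objects might carry.
payload = [{"id": i, "value": float(i) * 0.5} for i in range(100000)]

start = time.perf_counter()
blob = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
dump_s = time.perf_counter() - start

start = time.perf_counter()
restored = pickle.loads(blob)
load_s = time.perf_counter() - start

print("dumps: %.3fs  loads: %.3fs  size: %d bytes" % (dump_s, load_s, len(blob)))
```

In PySpark this dump/load cycle happens at every Python-to-JVM boundary for each partition, which is why the profiler output attributes time to those calls.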
That will show you how much of the execution time is used up by the cPickle.load() and .dump() methods.

Hope that helps,

Rok

On Wed, Mar 8, 2017 at 3:18 AM, Yeoul Na [via Apache Spark User List] <ml-node+s1001560n28468...@n3.nabble.com> wrote:

> Hi all,
>
> I am trying to analyze PySpark performance overhead. People just say
> PySpark is slower than Scala due to the serialization/deserialization
> overhead. I tried the example in this post:
> https://0x0fff.com/spark-dataframes-are-faster-arent-they/. This and many
> articles say the straightforward Python implementation is the slowest due
> to the serialization/deserialization overhead.
>
> However, when I actually looked at the log in the Web UI, the
> serialization and deserialization times of PySpark do not seem to be any
> bigger than those of Scala. The main contributor was "Executor Computing
> Time". Thus, we cannot be sure whether this is due to serialization or
> because the Python code itself is slower than the Scala code.
>
> So my question is: does "Task Deserialization Time" in the Spark Web UI
> actually include the serialization/deserialization times in PySpark? If
> not, how can I actually measure the serialization/deserialization
> overhead?
>
> Thanks,
> Yeoul
Sent from the Apache Spark User List mailing list archive at Nabble.com.