Re: Does feature parity exist between Spark and PySpark

2015-10-07 Thread Michael Armbrust
> At my company we use Avro heavily and it's not been fun when I've tried to work with complex avro schemas and python. This may not be relevant to you however... otherwise I found Python to be a great fit for Spark :)

Have you tried using https://github.com/databricks/spark-avro ? It should
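For reference, a minimal sketch of what reading Avro via the spark-avro package looks like from PySpark. This is an illustration, not a definitive recipe: the package coordinates and file paths are placeholders, and the package must be supplied at launch (e.g. via `--packages`), so it will not run without that dependency on the classpath.

```python
# Launch with the spark-avro package available, e.g.:
#   pyspark --packages com.databricks:spark-avro_2.10:2.0.1
# (version coordinates are illustrative; check the project README)

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[1]", "avro-demo")
sqlContext = SQLContext(sc)

# Read an Avro file into a DataFrame; the Avro schema is mapped to a
# Spark SQL schema, so complex records surface as nested struct columns
# that can be queried from Python without touching Java classes.
df = sqlContext.read.format("com.databricks.spark.avro").load("/path/to/data.avro")
df.printSchema()

# Writing back out as Avro uses the same data source name.
df.write.format("com.databricks.spark.avro").save("/path/to/output")
```

Because the reads and writes go through the DataFrame API, the complex-schema handling stays on the JVM side rather than in Python.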

Re: Does feature parity exist between Spark and PySpark

2015-10-07 Thread sethah
Regarding features, the general workflow for the Spark community when adding new features is to first add them in Scala (since Spark is written in Scala). Once this is done, a Jira ticket will be created requesting that the feature be added to the Python API (example - SPARK-9773)

Re: Does feature parity exist between Spark and PySpark

2015-10-07 Thread Sean Owen
These are true, but it's not because Spark is written in Scala; it's because it executes in the JVM. So, Scala/Java-based apps have an advantage in that they don't have to serialize data back and forth to a Python process, which also brings a new set of things that can go wrong. Python is also inherently

Re: Does feature parity exist between Spark and PySpark

2015-10-06 Thread Siegfried Bilstein
Python APIs are sometimes a little behind Scala APIs. Another issue that arises sometimes is when you have dependencies on Java or Scala classes for serializing and deserializing data. Working with non-trivial Avro schemas has been a bit of a pain for me in Python due to the difficulty in dealing

Re: Does feature parity exist between Spark and PySpark

2015-10-06 Thread Don Drake
If you are using DataFrames in PySpark, then the performance will be the same as Scala. However, if you need to implement your own UDF, or run a map() against a DataFrame in Python, then you will pay a performance penalty when executing those functions, since all of your data has to go through

Re: Does feature parity exist between Spark and PySpark

2015-10-06 Thread Richard Eggert
That should have read "a lot of neat tricks", not "a lot of nest tricks". That's what I get for sending emails on my phone.

On Oct 6, 2015 8:32 PM, "Richard Eggert" wrote:
> Since the Python API is built on top of the Scala implementation, its performance can be at best roughly the same as

Re: Does feature parity exist between Spark and PySpark

2015-10-06 Thread Richard Eggert
Since the Python API is built on top of the Scala implementation, its performance can be at best roughly the same as that of the Scala API (as in the case of DataFrames and SQL) and at worst several orders of magnitude slower. Likewise, since a Scala implementation of new features necessarily

Re: Does feature parity exist between Spark and PySpark

2015-10-06 Thread ayan guha
Hi,

2 cents:

1. It should not be true anymore if data frames are used. The reason is that, regardless of the language, DataFrames use the same optimization engine behind the scenes.
2. This is generally true, in the sense that the Python APIs are typically a little behind the Scala/Java ones.

Best
Ayan

On Wed, Oct 7, 2015 at 9