Hi Reynold,

In my project I want to use the Python API too. When you mention DFs, are we talking about pandas, or is this something internal to the Spark Python API? If you could elaborate a bit on this or point me to alternate documentation, that would be great.

Thanks much,
--sasha
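For context, the DFs in question are Spark's own DataFrames, introduced in Spark 1.3 under pyspark.sql; they are not pandas DataFrames, although the name is borrowed. A minimal sketch of the API, assuming the 1.3-era SQLContext entry point and a hypothetical people.json input file:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="df-example")
    sqlContext = SQLContext(sc)

    # people.json is a hypothetical input; jsonFile was the 1.3-era loader
    df = sqlContext.jsonFile("people.json")

    # projection, filter, and aggregation are expressed as Column
    # expressions, planned by Catalyst and executed inside the JVM
    df.select(df.name, df.age) \
      .filter(df.age > 21) \
      .groupBy(df.name) \
      .count() \
      .show()

Because these operations run in the JVM rather than in a Python worker process, the Python version should perform roughly on par with the Scala equivalent, which is the point Reynold makes below.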
On Thu, Jan 29, 2015 at 4:12 PM, Reynold Xin <r...@databricks.com> wrote:

> Once the data frame API is released for 1.3, you can write your thing in
> Python and get the same performance. It can't express everything, but for
> basic things like projection, filter, join, aggregate, and simple numeric
> computation, it should work pretty well.
>
> On Thu, Jan 29, 2015 at 12:45 PM, rtshadow <pastuszka.przemys...@gmail.com>
> wrote:
>
> > Hi,
> >
> > In my company, we've been trying to use PySpark to run ETLs on our data.
> > Alas, it turned out to be terribly slow compared to the Java or Scala API
> > (which we ended up using to meet our performance criteria).
> >
> > To be more quantitative, let's consider a simple case.
> > I've generated a test file (848MB): seq 1 100000000 > /tmp/test
> > and tried to run a simple computation on it, which includes three steps:
> > read -> multiply each row by 2 -> take max
> > Code in Python: sc.textFile("/tmp/test").map(lambda x: x * 2).max()
> > Code in Scala: sc.textFile("/tmp/test").map(x => x * 2).max()
> >
> > Here are the results of this simple benchmark:
> > CPython - 59s
> > PyPy - 26s
> > Scala version - 7s
> >
> > I didn't dig into what exactly contributes to the execution times of
> > CPython / PyPy, but it seems that serialization / deserialization when
> > sending data to the worker may be the issue.
> > I know some people have already asked about using Jython
> > (http://apache-spark-developers-list.1001551.n3.nabble.com/Jython-importing-pyspark-td8654.html#a8658,
> > http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-Driver-from-Jython-td7142.html),
> > but it seems that no one has really done this with Spark.
> > The performance gain from using Jython could be huge - you wouldn't need
> > to spawn PythonWorkers; all the code would just be executed inside the
> > SparkExecutor JVM, using Python code compiled to Java bytecode. Do you
> > think that's possible to achieve? Do you see any obvious obstacles? Of
> > course, Jython doesn't have C extensions, but if one doesn't need them,
> > it should fit here nicely.
> >
> > I'm willing to try to marry Spark with Jython and see how it goes.
> >
> > What do you think about this?

--
Aleksandar Kacanski
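For anyone wanting to reproduce the numbers above, a minimal self-contained version of the benchmark is sketched below. One caveat: textFile yields strings, so the original one-liners actually repeat each string rather than doubling a number (in both Python and Scala); the int(x) cast below is an assumption that numeric multiplication was intended. The file path and app name are illustrative.

    # benchmark.py - read -> multiply each row by 2 -> take max
    from pyspark import SparkContext

    sc = SparkContext(appName="pyspark-benchmark")

    # int(x) is added here: textFile yields strings, and "x * 2" on a
    # string repeats it instead of doubling the number
    result = sc.textFile("/tmp/test").map(lambda x: int(x) * 2).max()
    print(result)
    sc.stop()

The same script can then be timed under both interpreters; PYSPARK_PYTHON selects the Python executable used by the workers:

    spark-submit benchmark.py                      # CPython
    PYSPARK_PYTHON=pypy spark-submit benchmark.py  # PyPy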