Re: Python vs. Scala

2017-09-06 Thread Conconscious
Just run this test yourself and check the results. During the run, also check a worker with top. Python: import random def inside(p): x, y = random.random(), random.random() return x * x + y * y < 1 def estimate_pi(num_samples): count = sc.parallelize(xrange(0, num_samples)).filte
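
The preview above is cut off, so the original Spark job is not reconstructed here. As a rough local baseline, the same Monte Carlo technique can be sketched in plain Python with no Spark at all. The name `estimate_pi_local`, the `seed` parameter, and the sample count are assumptions made for illustration and do not come from the post.

```python
import random


def inside(_):
    # Draw a random point in the unit square and test whether it falls
    # inside the quarter circle of radius 1. The argument is unused,
    # mirroring the shape of a filter predicate.
    x, y = random.random(), random.random()
    return x * x + y * y < 1


def estimate_pi_local(num_samples, seed=None):
    # Hypothetical single-process stand-in for a distributed filter/count:
    # the fraction of points inside the quarter circle approximates pi / 4.
    if seed is not None:
        random.seed(seed)
    count = sum(1 for i in range(num_samples) if inside(i))
    return 4.0 * count / num_samples


estimate = estimate_pi_local(200000, seed=42)
print(estimate)
```

Timing this single-process version next to the cluster run gives a sense of how much of the wall-clock time is the per-sample Python work itself.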

Re: Python vs. Scala

2017-09-05 Thread ayan guha
And I have just the opposite experience, i.e. I know Python but I see Scala demands more :) I think there are a few fair points on both sides, and Scala wins: 1. Feature parity: Scala definitely wins. Not only new Spark features, but if you intend to use 3rd party connectors (such as Azure services).

Re: Python vs Scala performance

2014-10-22 Thread Davies Liu
Sorry, there is not. You can try cloning it from GitHub and building it from scratch, see [1] [1] https://github.com/apache/spark Davies On Wed, Oct 22, 2014 at 2:31 PM, Marius Soutier wrote: > Can’t install that on our cluster, but I can try locally. Is there a > pre-built binary available? > > On 22

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
Can’t install that on our cluster, but I can try locally. Is there a pre-built binary available? On 22.10.2014, at 19:01, Davies Liu wrote: > In the master, you can easily profile your job, find the bottlenecks, > see https://github.com/apache/spark/pull/2556 > > Could you try it and show the s

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
Yeah we’re using Python 2.7.3. On 22.10.2014, at 20:06, Nicholas Chammas wrote: > On Wed, Oct 22, 2014 at 11:34 AM, Eustache DIEMERT > wrote: > > > > Wild guess maybe, but do you decode the json records in Python ? it could be > much slower as the default lib is quite slow. > > > Oh yea

Re: Python vs Scala performance

2014-10-22 Thread Nicholas Chammas
On Wed, Oct 22, 2014 at 11:34 AM, Eustache DIEMERT wrote: Wild guess maybe, but do you decode the json records in Python ? it could > be much slower as the default lib is quite slow. > Oh yeah, this is a good place to look. Also, just upgrading to Python 2.7 may be enough performance improvement

Re: Python vs Scala performance

2014-10-22 Thread Davies Liu
In the master, you can easily profile your job, find the bottlenecks, see https://github.com/apache/spark/pull/2556 Could you try it and show the stats? Davies On Wed, Oct 22, 2014 at 7:51 AM, Marius Soutier wrote: > It’s an AWS cluster that is rather small at the moment, 4 worker nodes @ 28 > G
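
The profiler from the linked pull request is not reproduced here. As a local stand-in when a master build cannot be installed, the per-record Python function can be profiled with the standard library's `cProfile` before it goes anywhere near the cluster. `parse_record`, `run_job`, and the sample data below are hypothetical and only illustrate the approach.

```python
import cProfile
import io
import json
import pstats


def parse_record(line):
    # Hypothetical per-record work: decode JSON and pull out two fields.
    record = json.loads(line)
    return record["user"], len(record["events"])


def run_job(lines):
    # Stand-in for the function that would be mapped over the data set.
    return [parse_record(line) for line in lines]


# Made-up sample input, one JSON object per line.
sample_lines = [
    json.dumps({"user": "u%d" % i, "events": list(range(i % 5))})
    for i in range(5000)
]

profiler = cProfile.Profile()
profiler.enable()
results = run_job(sample_lines)
profiler.disable()

# Collect the ten most expensive calls by cumulative time into a string.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

If JSON decoding dominates the report, that points at the decoder rather than at Spark itself.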

Re: Python vs Scala performance

2014-10-22 Thread Eustache DIEMERT
Wild guess maybe, but do you decode the JSON records in Python? It could be much slower as the default lib is quite slow. If so, try ujson [1] - a C implementation that is at least an order of magnitude faster. HTH [1] https://pypi.python.org/pypi/ujson 2014-10-22 16:51 GMT+02:00 Marius Soutier
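
A minimal sketch of the suggestion, assuming the job decodes one JSON object per line: import ujson when it is installed, fall back to the standard `json` module otherwise, and time both on the same input. The helper `decode_records` and the sample records are made up for illustration. Any speedup depends on the data and the Python version, so it is worth measuring rather than assuming.

```python
import json
import timeit

try:
    import ujson as fast_json  # optional third-party C implementation
except ImportError:
    fast_json = json  # fall back to the standard library decoder


def decode_records(lines, loads=fast_json.loads):
    # Decode one JSON object per line, skipping blank lines.
    return [loads(line) for line in lines if line.strip()]


# Made-up sample input, one JSON object per line.
sample_lines = [
    '{"id": %d, "tags": ["a", "b"], "score": %d.5}' % (i, i)
    for i in range(1000)
]

records = decode_records(sample_lines)

# Time both decoders on identical input. If ujson is missing, both timings
# use the standard library and should be roughly equal.
stdlib_seconds = timeit.timeit(
    lambda: decode_records(sample_lines, json.loads), number=20)
fast_seconds = timeit.timeit(
    lambda: decode_records(sample_lines, fast_json.loads), number=20)
print("json: %.4fs, %s: %.4fs" % (stdlib_seconds, fast_json.__name__, fast_seconds))
```

Passing the decoder in as a parameter keeps the swap to a single line if the faster library turns out not to help.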

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
It’s an AWS cluster that is rather small at the moment, 4 worker nodes @ 28 GB RAM and 4 cores, but fast enough for the current 40 GB a day. Data is on HDFS in EBS volumes. Each file is a Gzip-compressed collection of JSON objects, each one between 115-120 MB to be near the HDFS block size. O

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
Didn’t seem to help: conf = SparkConf().set("spark.shuffle.spill", "false").set("spark.default.parallelism", "12") sc = SparkContext(appName='app_name', conf=conf) but it's still taking as much time. On 22.10.2014, at 14:17, Nicholas Chammas wrote: > Total guess without knowing anything about you

Re: Python vs Scala performance

2014-10-22 Thread Arian Pasquali
Interesting thread, Marius. Btw, I'm curious about your cluster size. How small is it in terms of RAM and cores? Arian 2014-10-22 13:17 GMT+01:00 Nicholas Chammas : > Total guess without knowing anything about your code: Do either of these > two notes from the 1.1.0 release notes >

Re: Python vs Scala performance

2014-10-22 Thread Nicholas Chammas
Total guess without knowing anything about your code: Do either of these two notes from the 1.1.0 release notes affect things at all? - PySpark now performs external spilling during aggregations. Old behavior can be restored by set

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
We’re using 1.1.0. Yes I expected Scala to be maybe twice as fast, but not that... On 22.10.2014, at 13:02, Nicholas Chammas wrote: > What version of Spark are you running? Some recent changes to how PySpark > works relative to Scala Spark may explain things. > > PySpark should not be that mu

Re: Python vs Scala performance

2014-10-22 Thread Nicholas Chammas
What version of Spark are you running? Some recent changes to how PySpark works relative to Scala Spark may explain things. PySpark should not be that much slower, not by a stretch. On Wed, Oct 22, 2014 at 6:11 AM, Ashic Mahtab wrote:

RE: Python vs Scala performance

2014-10-22 Thread Ashic Mahtab
I'm no expert, but looked into how the python bits work a while back (was trying to assess what it would take to add F# support). It seems python hosts a jvm inside of it, and talks to "scala spark" in that jvm. The python server bit "translates" the python calls to those in the jvm. The python