The slowness in PySpark may be related to the search path entries that PySpark adds.
Could you show us your sys.path?
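For example, one way to capture this on both the driver and an executor (a minimal sketch; the one-element RDD is just a probe to run code inside a worker):
```
import pprint
import sys

import pyspark

sc = pyspark.SparkContext()

# Search path as seen by the driver process.
pprint.pprint(sys.path)

# Search path as seen inside a worker task.
pprint.pprint(sc.parallelize([0], 1).map(lambda _: sys.path).first())
```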
On Thu, Sep 3, 2015 at 1:38 PM, Priedhorsky, Reid wrote:
On Sep 3, 2015, at 12:39 PM, Davies Liu <dav...@databricks.com> wrote:
I think this is not a problem with PySpark; you will also see this if you
profile this script:
```
list(map(map_, range(sc.defaultParallelism)))
```
81777/80874    0.086    0.000    0.360    0.000 <frozen importlib._bootstrap>:2264(_handle_fromlist)
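For reference, one way to obtain such a profile without Spark (a minimal sketch: the body of map_ is an assumed stand-in for the real work, and 8 substitutes for sc.defaultParallelism):
```
import cProfile

def map_(i):
    import pandas as pd  # re-imported on every call, exercising importlib
    return i

cProfile.run("list(map(map_, range(8)))", sort="cumtime")
```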
On Thu, Sep 3, 2015 at 11:16 AM, Priedhorsky, Reid wrote:
On Sep 2, 2015, at 11:31 PM, Davies Liu <dav...@databricks.com> wrote:
Could you put together a short script to reproduce this?
Good point. Here you go. This is Python 3.4.3 on Ubuntu 15.04.
```
import pandas as pd  # must be in default path for interpreter
import pyspark

LEN = 260
ITER_CT = 1
```
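The script is truncated above. A minimal sketch of how it might continue (the body of map_ and the driver setup are assumptions on my part, not Reid's original code; map_ itself is the function referenced in Davies' snippet earlier in the thread):
```
def map_(i):
    # Assumed workload: a little Pandas work per inner-loop iteration.
    df = pd.DataFrame({"x": range(LEN)})
    for _ in range(ITER_CT):
        df["x"].sum()
    return i

sc = pyspark.SparkContext()
print(sc.parallelize(range(sc.defaultParallelism)).map(map_).collect())
```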
Could you put together a short script to reproduce this?
On Wed, Sep 2, 2015 at 2:10 PM, Priedhorsky, Reid wrote:
Hello,
I have a PySpark computation that relies on Pandas and NumPy. Currently, my
inner loop iterates 2,000 times. I’m seeing the following show up in my
profiling:
74804/29102    0.204    0.000    2.173    0.000 <frozen importlib._bootstrap>:2234(_find_and_load)
74804/29102    0.145    0.000    1.867    0.000 <frozen importlib._bootstrap>:2207(_find_and_load_unlocked)
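As an aside, one way to gather this kind of per-worker profile in PySpark itself (a sketch assuming an RDD job; the spark.python.profile setting and sc.show_profiles() are available in Spark 1.2 and later):
```
import pyspark

conf = pyspark.SparkConf().set("spark.python.profile", "true")
sc = pyspark.SparkContext(conf=conf)

sc.parallelize(range(1000)).map(lambda x: x * 2).count()
sc.show_profiles()  # prints cProfile-style stats for the Python workers
```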