Hi,

I am seeing some weird behaviour while running a Python script. Here is
roughly what the code looks like:

from pyspark.sql.functions import udf

def fn1(ip):
    # some code ...
    ...

def fn2(row):
    # some operations ...
    ...
    return row1


udf_fn1 = udf(fn1)
cdf = spark.read.table("xxxx")  # hive table is > 500 GB with ~4500 partitions
ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
    .drop("colz")

edf = ddf \
    .filter(ddf.colp == 'some_value') \
    .rdd.map(lambda row: fn2(row)) \
    .toDF()

print(edf.count())  # simple way for the performance test on both platforms

Now, when I run this code in a brand new Jupyter notebook, it runs about 6x
faster than when I run the same script with spark-submit. The configurations
were printed and compared on both platforms and they are exactly the same. I
even tried running the whole script in a single Jupyter notebook cell and
still saw the same performance gap. I need to understand whether I am missing
something in the spark-submit invocation that is causing this. I have trimmed
the script down as far as possible while still reproducing the problem.
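
For reference, the configuration on each side was dumped roughly like this
(a minimal sketch; spark is the active SparkSession in both cases):

# Print the effective Spark configuration, sorted so the output from the
# notebook run and the spark-submit run can be diffed line by line.
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(key, value)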

Both runs use client mode on a YARN-based Spark cluster. They are launched
from the same machine and under the same user.
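
For completeness, the spark-submit invocation looks roughly like this (the
script name here is a placeholder, and no extra flags are passed):

spark-submit --master yarn --deploy-mode client my_script.py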

What I found is that the median task duration (from the task summary
quantiles) was about 1.3 minutes for the Jupyter run and ~8.5 minutes for
the spark-submit run. I am not able to figure out why this is happening.

Has anyone faced this kind of issue before, or does anyone know how to
resolve it?

Regards,
Dhrub
