Hi, I am facing some weird behaviour while running a Python script. Here is roughly what the code looks like:

    from pyspark.sql.functions import udf

    def fn1(ip):
        # some code ...

    def fn2(row):
        # ... some operations ...
        return row1

    udf_fn1 = udf(fn1)

    # Hive table is > 500 GB with ~4500 partitions
    cdf = spark.read.table("xxxx")

    ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
             .drop("colz")

    edf = ddf \
        .filter(ddf.colp == 'some_value') \
        .rdd.map(lambda row: fn2(row)) \
        .toDF()

    # simple action used as the performance test on both platforms
    print(edf.count())

When I run this code in a brand-new Jupyter notebook, it runs about 6x faster than when I run the same script with spark-submit. I printed and compared the configurations from both platforms and they are exactly the same. I even tried running the whole script in a single cell of the Jupyter notebook and still saw the same performance gap. Both are run in client mode on a YARN-based Spark cluster, launched from the same machine and under the same user. I have trimmed the script down to the minimum needed to reproduce the difference.

Comparing the task statistics, the median task time was about 1.3 minutes for the Jupyter run versus about 8.5 minutes for the spark-submit run. I need to understand whether I am missing something in the spark-submit invocation that is causing this, because I cannot figure out why it is happening. Has anyone faced this kind of issue before, or does anyone know how to resolve it?

Regards,
Dhrub
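
P.S. For anyone who wants to check the claim that the configs match: the comparison on both platforms was done with something along the lines of the sketch below (`spark` is the active SparkSession; the output of the two runs can then be diffed):

    # Dump the effective Spark configuration, sorted so the two outputs line up
    for key, value in sorted(spark.sparkContext.getConf().getAll()):
        print(key + " = " + value)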
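
P.P.S. The launch command looks roughly like this (the script name is a placeholder; as mentioned above, it is YARN client mode, with no extra flags beyond what the notebook side also gets):

    spark-submit \
        --master yarn \
        --deploy-mode client \
        my_script.py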
def fn1(ip): some code... ... def fn2(row): ... some operations ... return row1 udf_fn1 = udf(fn1) cdf = spark.read.table("xxxx") //hive table is of size > 500 Gigs with ~4500 partitions ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \ .drop("colz") \ .withColumnRenamed("colz", "coly") edf = ddf \ .filter(ddf.colp == 'some_value') \ .rdd.map(lambda row: fn2(row)) \ .toDF() print edf.count() // simple way for the performance test in both platforms Now when I run the same code in a brand new jupyter notebook it runs 6x faster than when I run this python script using spark-submit. The configurations are printed and compared from both the platforms and they are exact same. I even tried to run this script in a single cell of jupyter notebook and still have the same performance. I need to understand if I am missing something in the spark-submit which is causing the issue. I tried to minimise the script to reproduce the same error without much code. Both are run in client mode on a yarn based spark cluster. The machines from which both are executed are also the same and from same user. What i found is the the quantile values for median for one ran with jupyter was 1.3 mins and one ran with spark-submit was ~8.5 mins. I am not able to figure out why this is happening. Any one faced this kind of issue before or know how to resolve this? *Regards,* *Dhrub*