Hi,

I just looked through your code:


   1. I assume you are testing this against Spark 3.1.1?
   2. You are running this set-up in local mode in a single JVM, so it is
   not really distributed. I doubt a meaningful performance conclusion can
   be drawn here.
   3. It is accepted that newer versions of a product offer more
   capabilities and are therefore expected to be more resource hungry, but
   I agree that obviously does not explain being 5 times slower.
   4. You are reading a gzipped CSV. A gz file is not splittable, so Spark
   has to read the whole file with a single core, which slows things down
   (CPU intensive). Once the read is done, the data can be shuffled (e.g.
   with repartition) to increase parallelism; see the sketch after this
   list.
   5. IntelliJ, PyCharm, etc. run in local mode anyway.
   6. Have you tried $SPARK_HOME/bin/spark-submit --master local[something]
   xyz.py?
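
To illustrate point 4, a minimal sketch (the file name and partition count
are illustrative, and it assumes a spark session as in spark-shell):

// gzip is not splittable, so this read runs as a single task on one core
val df = spark.read
  .option("header", "true")
  .csv("data.csv.gz")

// after the read, shuffle to spread the data over more partitions
val dfPar = df.repartition(12)
println(dfPar.rdd.getNumPartitions)  // 12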


FYI, I have used both Spark 3.0.1 and 3.1.1, both in local mode (spark-submit
--master local) and in distributed mode (spark-submit --master yarn
--deploy-mode client ..), and I do not see this behaviour.

HTH



View my LinkedIn profile:
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



On Thu, 8 Apr 2021 at 12:07, maziyar <maziyar.pan...@iscpif.fr> wrote:

> Hi,
>
> I have a simple piece of code that does a groupBy, agg count, sort, etc.
> It finishes within 5 minutes on Spark 3.1.x. However, the same code, the
> same dataset, and the same SparkSession (configs) on Spark 3.0.2 finish
> within a minute. That is over a 5x difference.
>
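> A minimal sketch of the kind of pipeline meant here (column names are
> hypothetical; the actual code is in the issue linked below):
>
> import org.apache.spark.sql.functions.{count, desc}
>
> val result = df
>     .groupBy("category")
>     .agg(count("*").as("cnt"))
>     .orderBy(desc("cnt"))
> result.show()
>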
> My SparkSession (the same when passed via --conf):
>
> val spark: SparkSession = SparkSession
>     .builder()
>     .appName("test")
>     .master("local[*]")
>     .config("spark.driver.memory", "16G")
>     .config("spark.driver.maxResultSize", "0")
>     .config("spark.kryoserializer.buffer.max", "200M")
>     .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>     .getOrCreate()
>
> Environments in which I tested both 3.1.1 and 3.0.2:
> - Intellij
> - spark-shell
> - pyspark shell
> - pure Python with PyPI pyspark
>
> The code, dataset, and initial report for reproducibility:
>
> https://github.com/JohnSnowLabs/spark-nlp/issues/2739#issuecomment-815635930
>
> I have observed that in Spark 3.1.1 only 2 tasks are doing the majority of
> the processing, and the work is not evenly distributed as one would expect
> in a 12-partition DataFrame:
>
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009725-af969e00-9863-11eb-8e5b-07ce53e8f5f3.png>
>
>
> However, without any change to the code or the environment, Spark 3.0.2
> distributes the tasks evenly at the same point and everything runs in
> parallel:
>
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009712-ac9bad80-9863-11eb-9e55-c797833bdbba.png>
>
>
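> One way to inspect how rows are spread across partitions (a diagnostic
> sketch, not part of the original report):
>
> import org.apache.spark.sql.functions.spark_partition_id
>
> df.groupBy(spark_partition_id().as("partition"))
>     .count()
>     .orderBy("partition")
>     .show()
>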
> Is there a new feature in Spark 3.1.1, a new config, something that causes
> this unbalanced task execution that wasn't there in Spark 2.4.x and 3.0.x?
> (I have read the migration guide but could not find anything relevant:
> https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-30-to-31
> )
>