[Spark SQL]: Can Spark SQL have better performance?

2021-04-28 Thread Amin Borjian
Hi. We use Spark 3.0.1 on an HDFS cluster, and we store our files as Parquet with Snappy compression and dictionary encoding enabled. We are trying to run a simple query: parquetFile = spark.read.parquet("path/to/hadf") parquetFile.createOrReplaceTempView("parquetFile") spark.sql("SELECT * FROM parquetFile WHER

Re: Handling skew in window functions

2021-04-28 Thread Mich Talebzadeh
Hi Michael, I guess, as ever, your mileage may vary. My suggestion is that you try salting and see whether it retains the ordering. The most significant column will be step_id, so I guess it will be OK. HTH, Mich

Re: Handling skew in window functions

2021-04-28 Thread Michael Doo
Hi Mich, Thank you for the suggestions. I took a look at the other thread you mentioned. One feature of my code that I'm not sure would be affected by salting is the use of collect_list(). My understanding is that collect_list() will retain the row ordering of values. You can see in my Window defi
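
The ordering concern raised in this exchange can be illustrated with a small pure-Python sketch. It is not the poster's code and does not use the Spark API; the function name `salted_ordered_collect`, the `(key, step_id, value)` row shape, and the round-robin salt are all assumptions made for illustration. The idea: once a key is split across salted buckets, each bucket is only ordered within itself, so the partial lists have to be merged on the ordering column (here `step_id`) to recover the full order.

```python
import heapq
from collections import defaultdict


def salted_ordered_collect(rows, num_salts=4):
    """Collect values per key, ordered by step_id, via salted buckets.

    rows: iterable of (key, step_id, value) tuples (hypothetical shape).
    Returns a dict mapping key -> list of values ordered by step_id.
    """
    # Stage 1: spread each key over several salted buckets. A round-robin
    # salt keeps this sketch deterministic; a real job would more likely
    # use a random or hash-based salt column.
    buckets = defaultdict(list)
    for i, (key, step_id, value) in enumerate(rows):
        buckets[(key, i % num_salts)].append((step_id, value))

    # Each salted bucket is sorted independently, so ordering holds only
    # inside a bucket, not across buckets of the same key.
    for bucket in buckets.values():
        bucket.sort(key=lambda pair: pair[0])

    # Stage 2: group the partial lists by original key and merge them on
    # step_id to restore the order across all salts.
    partials = defaultdict(list)
    for (key, _salt), bucket in buckets.items():
        partials[key].append(bucket)

    return {
        key: [value for _, value in heapq.merge(*lists, key=lambda pair: pair[0])]
        for key, lists in partials.items()
    }
```

In this sketch the second stage sees only one sorted list per salt rather than every row of a hot key at once, which is the usual motivation for salting. Whether the same holds for a given Spark window definition depends on how the salted partitions are recombined.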

[Announcement] Analytics Zoo 0.10.0 release

2021-04-28 Thread Jason Dai
Hi Everyone, I’m happy to announce the 0.10.0 release for Analytics Zoo (distributed TensorFlow and PyTorch on Apache Spark/Flink & Ray); the highlights of this release include: - A re-designed document website