Re: AQE effectiveness

2020-08-21 Thread Maryann Xue
It would break the CachedTableSuite test "A cached table preserves the partitioning
and ordering of its cached SparkPlan" if AQE were turned on.

Anyway, the chance of this outputPartitioning being useful is rather low
and should not justify turning off AQE for SQL cache.

On Thu, Aug 20, 2020 at 10:54 PM Koert Kuipers  wrote:

> In our in-house Spark version I changed this without trouble, and it didn't
> even break any tests;
> just some minor changes in CacheManager, it seems.
>
> On Thu, Aug 20, 2020 at 1:12 PM Maryann Xue 
> wrote:
>
>> No. The worst case of enabling AQE in cached data is not losing the
>> opportunity of using/reusing the cache, but rather just an extra shuffle if
>> the outputPartitioning happens to match without AQE and not match after
>> AQE. The chance of this happening is rather low.
>>
>> On Thu, Aug 20, 2020 at 12:09 PM Koert Kuipers  wrote:
>>
>>> I see. It makes sense to maximize re-use of cached data. I didn't
>>> realize we have two potentially conflicting goals here.
>>>
>>>
>>> On Thu, Aug 20, 2020 at 12:41 PM Maryann Xue 
>>> wrote:
>>>
 AQE has been turned off deliberately so that the `outputPartitioning`
 of the cached relation won't be changed by AQE partition coalescing or skew
 join optimization, and so that the outputPartitioning can potentially be used
 by relations built on top of the cache.
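 For illustration, a minimal sketch of the kind of reuse this preserves, using
 the fruits.csv example from the thread below (whether the shuffle is actually
 skipped depends on the Spark version and on the plans lining up):

 scala> val df = spark.read.format("csv").option("header", true).load("fruits.csv")
 scala> val keyed = df.repartition($"fruit").persist()
 scala> keyed.count()   // materialize the cache
 scala> keyed.groupBy("fruit").count().explain()
 // The cached plan reports hash partitioning on "fruit", so the aggregation can
 // reuse it and skip its exchange. If AQE coalesced the cached plan's partitions,
 // that outputPartitioning would no longer match and an extra shuffle appears.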

 On second thought, we should probably add a config there and enable
 AQE by default.
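 Purely as a sketch of what such a config might look like, here is a
 hypothetical entry in SQLConf.scala following the existing buildConf pattern
 (the name and default are made up for illustration, not an actual Spark config):

 // hypothetical: gate AQE compilation of cached plans behind a flag
 val CACHE_PLAN_WITH_AQE =
   buildConf("spark.sql.cache.adaptive.enabled")   // illustrative name only
     .doc("When true, compile the plan of a cached relation with adaptive query " +
       "execution enabled, at the cost of a less predictable outputPartitioning.")
     .booleanConf
     .createWithDefault(true)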


 Thanks,
 Maryann

 On Thu, Aug 20, 2020 at 11:12 AM Koert Kuipers 
 wrote:

> We tend to have spark.sql.shuffle.partitions set very high by default,
> simply because some jobs need it to be high and it's easier to set the
> default high than to have people tune it manually per job. The main
> downsides are lots of part files, which puts pressure on the driver, and
> dynamic allocation becoming troublesome if every aggregation requires
> thousands of tasks: even the simplest aggregation on tiny data will
> demand all resources on the cluster.
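>
> To make that setup concrete: instead of a high fixed spark.sql.shuffle.partitions,
> the AQE route described next lets the reducer count adapt (config names are
> Spark 3.x; the values here are illustrative, not our exact settings):
>
> scala> spark.conf.set("spark.sql.adaptive.enabled", "true")
> scala> spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
> scala> // start high enough for the biggest jobs...
> scala> spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "4096")
> scala> // ...and let AQE shrink small aggregations down to a handful of tasks
> scala> spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")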
>
> Because of these issues AQE appeals a lot to me: by automatically scaling
> the reducer partitions we avoid them, so we have AQE turned on by default.
> Every once in a while I scan through our Spark AMs and logs to see how it's
> doing. I mostly look for stages that have a number of tasks equal to
> spark.sql.shuffle.partitions, a sign to me that AQE isn't being effective.
> Unfortunately this seems to be the majority. I suspect it has to do with
> caching/persisting, which we use frequently. A simple reproduction is below.
>
> Any idea why caching/persisting would interfere with AQE?
>
> Best, Koert
>
> $ hadoop fs -text fruits.csv
> fruit,color,quantity
> apple,red,5
> grape,blue,50
> pear,green,3
>
> # Works well using AQE, uses 1 to 3 tasks per job.
> $ spark-3.1.0-SNAPSHOT/bin/spark-shell --conf spark.sql.adaptive.enabled=true
> scala> val data = spark.read.format("csv").option("header", true).load("fruits.csv").persist()
> scala> data.groupBy("fruit").count().write.format("csv").save("out")
>
> # Does not work well using AQE, uses 200 tasks (i.e. spark.sql.shuffle.partitions)
> # for certain jobs. The only difference is when persist is called.
> $ spark-3.1.0-SNAPSHOT/bin/spark-shell --conf spark.sql.adaptive.enabled=true
> scala> val data = spark.read.format("csv").option("header", true).load("fruits.csv").groupBy("fruit").count().persist()
> scala> data.write.format("csv").save("out")
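>
> One way to see what AQE is (or is not) doing in the second case is to look at
> the physical plan (illustrative; the exact explain output varies by Spark
> version). Since the plan compiled for the cache is built with AQE disabled, it
> has no AdaptiveSparkPlan wrapper and its shuffle keeps the full
> spark.sql.shuffle.partitions:
>
> scala> data.explain()
> // the aggregation sits inside InMemoryRelation without an AdaptiveSparkPlan
> // node, so its shuffle still uses 200 partitions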
>
>


Re: [VOTE] Release Spark 2.4.7 (RC1)

2020-08-21 Thread Tom Graves
There is a correctness issue with caching that should go into this release if
possible: https://github.com/apache/spark/pull/29506
Tom
On Wednesday, August 19, 2020, 11:18:37 AM CDT, Wenchen Fan 
 wrote:  
 
 I think so. I don't see other bug reports for 2.4.
On Thu, Aug 20, 2020 at 12:11 AM Nicholas Marion  wrote:


It appears all 3 issues slated for Spark 2.4.7 have been merged. Should we be 
looking at getting RC2 ready?




Regards,

Nicholas T. Marion
IBM Open Data Analytics for z/OS - CPO and Service Team Lead





From: Xiao Li 
To: Prashant Sharma 
Cc: Takeshi Yamamuro , dev 
Date: 08/17/2020 11:33 AM
Subject: [EXTERNAL] Re: [VOTE] Release Spark 2.4.7 (RC1)





https://issues.apache.org/jira/browse/SPARK-32609 got merged. This is to fix a 
correctness bug in DSV2 of Spark 2.4. Please include it in the upcoming Spark 
2.4.7 release. 

Thanks,

Xiao

On Sun, Aug 9, 2020 at 10:26 PM Prashant Sharma  wrote:   
Thanks for letting us know. So this vote is cancelled in favor of RC2.



On Sun, Aug 9, 2020 at 8:31 AM Takeshi Yamamuro  wrote:
Thanks for letting us know about the two issues above, Dongjoon.


I've checked the release materials (signatures, tag, ...) and it looks fine, 
too.
Also, I ran the tests on my local Mac (Java 1.8.0) with the options
`-Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pkubernetes -Psparkr`
and they passed.

Bests,
Takeshi



On Sun, Aug 9, 2020 at 11:06 AM Dongjoon Hyun  wrote:  
 
Another instance is SPARK-31703, which was filed on May 13th; the PR arrived two 
days ago.

    [SPARK-31703][SQL] Parquet RLE float/double are read incorrectly on big 
endian platforms
    https://github.com/apache/spark/pull/29383

It seems that the patch is already ready in this case.
I raised the priority of SPARK-31703 to `Blocker` for both Apache Spark 2.4.7 
and 3.0.1.

Bests,
Dongjoon.


On Sat, Aug 8, 2020 at 6:10 AM Holden Karau  wrote:
I'm going to go ahead and vote -0 based on that, then.

On Fri, Aug 7, 2020 at 11:36 PM Dongjoon Hyun  wrote:  
 
Hi, All.

Unfortunately, there is an ongoing discussion about the new decimal 
correctness issue.

Although we fixed one correctness issue on master and backported it partially 
to 3.0/2.4, it turns out that more patches are needed to make it complete.

Please see https://github.com/apache/spark/pull/29125 for the ongoing discussion 
covering both 3.0/2.4.

    [SPARK-32018][SQL][3.0] UnsafeRow.setDecimal should set null with 
overflowed value

I also confirmed that 2.4.7 RC1 is affected.
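
For anyone unfamiliar with this issue class, a minimal sketch of the semantics
involved (illustrative values only, not the reproduction from the PR): a decimal
sum whose total exceeds the result precision should return null rather than a
silently wrong value when ANSI mode is off.

    scala> val wide = spark.range(0, 1000).selectExpr("cast(repeat('9', 38) as decimal(38,0)) as d")
    scala> wide.selectExpr("sum(d)").show()
    // expected: null (overflow), not a truncated or incorrect number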

Bests,
Dongjoon.


On Thu, Aug 6, 2020 at 2:48 PM Sean Owen  wrote:
+1 from me. The same as usual. Licenses and sigs look OK, builds and
passes tests on a standard selection of profiles.

On Thu, Aug 6, 2020 at 7:07 AM Prashant Sharma  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version 
> 2.4.7.
>
> The vote is open until Aug 9th at 9AM PST and passes if a majority +1 PMC 
> votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.7
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> There are currently no issues targeting 2.4.7 (try project = SPARK AND 
> "Target Version/s" = "2.4.7" AND status in (Open, Reopened, "In Progress"))
>
> The tag to be voted on is v2.4.7-rc1 (commit 
> dc04bf53fe821b7a07f817966c6c173f3b3788c6):
> https://github.com/apache/spark/tree/v2.4.7-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1352/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc1-docs/
>
> The list of bug fixes going into 2.4.7 can be found at the following URL:
> https://s.apache.org/spark-v2.4.7-rc1
>
> This release is using the release script of the tag v2.4.7-rc1.
>
> FAQ
>
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up buildi