Testing with Spark 2.3, I see a difference in how the SQL coalesce behaves when talking to Hive, compared with Spark 2.2. It seems Spark 2.3 ignores the coalesce. Query:

    spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >= '20170301' AND dt <= '20170331' AND something IS NOT NULL").coalesce(160000).show()

In Spark 2.2 the coalesce works here, but in Spark 2.3 it doesn't. Does anyone know about this issue, or are there some weird config changes that would explain it? Otherwise I'll file a JIRA.

Note that I also see a performance difference when reading cached data in Spark 2.3. On a small query over 19GB of cached data, Spark 2.3 is 30% worse, though this is only 13 seconds on Spark 2.2 vs 17 seconds on Spark 2.3. Reading straight from Hive (ORC) seems better, though.

Tom

On Thursday, February 1, 2018, 11:23:45 AM CST, Michael Heuer <heue...@gmail.com> wrote:

We found two classes new to Spark 2.3.0 that must be registered in Kryo for our tests to pass on RC2:

org.apache.spark.sql.execution.datasources.BasicWriteTaskStats
org.apache.spark.sql.execution.datasources.ExecutedWriteSummary

https://github.com/bigdatagenomics/adam/pull/1897

Perhaps a mention in release notes?

michael

On Thu, Feb 1, 2018 at 3:29 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

All MLlib QA JIRAs resolved. Looks like SparkR too, so from the ML side that should be everything outstanding.

On Thu, 1 Feb 2018 at 06:21 Yin Huai <yh...@databricks.com> wrote:

It seems we are not running the pandas-related tests in the PySpark test runs (see my email "python tests related to pandas are skipped in jenkins"). I think we should fix this test issue and make sure all tests are good before cutting RC3.

On Wed, Jan 31, 2018 at 10:12 AM, Sameer Agarwal <samee...@apache.org> wrote:

Just a quick status update on RC3 -- SPARK-23274 was resolved yesterday and tests have been quite healthy throughout this week and the last. I'll cut the new RC as soon as the remaining blocker (SPARK-23202) is resolved.

On 30 January 2018 at 10:12, Andrew Ash <and...@andrewash.com> wrote:

I'd like to nominate SPARK-23274 as a potential blocker for the 2.3.0 release as well, due to being a regression from 2.2.0. The ticket has a simple repro included, showing a query that works in prior releases but now fails with an exception in the catalyst optimizer.

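As a footnote to Michael Heuer's Kryo report above: one way to register extra classes without writing a registrator is Spark's `spark.kryo.classesToRegister` property, which takes a comma-separated list of class names. The sketch below only builds the matching `spark-submit` arguments for the two classes he lists. The helper name `kryo_conf_args` is hypothetical, and whether configuration-based registration is the right fit for a given project is an assumption, not something this thread settles.

```python
# Sketch: build spark-submit --conf arguments that register extra classes with Kryo.
# Assumes registration through the spark.kryo.classesToRegister property;
# kryo_conf_args is a hypothetical helper, not part of Spark.

# The two classes reported in this thread as new in Spark 2.3.0.
NEW_IN_2_3 = [
    "org.apache.spark.sql.execution.datasources.BasicWriteTaskStats",
    "org.apache.spark.sql.execution.datasources.ExecutedWriteSummary",
]


def kryo_conf_args(class_names):
    """Return ['--conf', 'spark.kryo.classesToRegister=a,b'] for the given classes."""
    if not class_names:
        return []
    # Drop duplicates while keeping the original order.
    unique = list(dict.fromkeys(class_names))
    return ["--conf", "spark.kryo.classesToRegister=" + ",".join(unique)]


if __name__ == "__main__":
    # Print the arguments so they can be pasted into a spark-submit command line.
    print(" ".join(kryo_conf_args(NEW_IN_2_3)))
```

The returned list can be spliced into an existing `spark-submit` invocation; projects that already ship their own Kryo registrator would more likely add the classes there instead.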
On Fri, Jan 26, 2018 at 10:41 AM, Sameer Agarwal <sameer.a...@gmail.com> wrote:

This vote has failed due to a number of aforementioned blockers. I'll follow up with RC3 as soon as the 2 remaining (non-QA) blockers are resolved: https://s.apache.org/oXKi

On 25 January 2018 at 12:59, Sameer Agarwal <sameer.a...@gmail.com> wrote:

> Most tests pass on RC2, except I'm still seeing the timeout caused by https://issues.apache.org/jira/browse/SPARK-23055 ; the tests never finish. I followed the thread a bit further and wasn't clear whether it was subsequently re-fixed for 2.3.0 or not. It says it's resolved along with https://issues.apache.org/jira/browse/SPARK-22908 for 2.3.0 though I am still seeing these tests fail or hang:
>
> - subscribing topic by name from earliest offsets (failOnDataLoss: false)
> - subscribing topic by name from earliest offsets (failOnDataLoss: true)

Sean, while some of these tests were timing out on RC1, we're not aware of any known issues in RC2. Both maven (https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.6/146/testReport/org.apache.spark.sql.kafka010/history/) and sbt (https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/123/testReport/org.apache.spark.sql.kafka010/history/) historical builds on jenkins for org.apache.spark.sql.kafka010 look fairly healthy. If you're still seeing timeouts in RC2, can you create a JIRA with any applicable build/env info?

On Tue, Jan 23, 2018 at 9:01 AM Sean Owen <so...@cloudera.com> wrote:

I'm not seeing that same problem on OS X and /usr/bin/tar. I tried unpacking it with 'xvzf' and also unzipping it first, and it untarred without warnings in either case. I am encountering errors while running the tests, different ones each time, so am still figuring out whether there is a real problem or just flaky tests.

These issues look like blockers, as they are inherently to be completed before the 2.3 release. They are mostly not done. I suppose I'd -1 on behalf of those who say this needs to be done first, though, we can keep testing.

SPARK-23105 Spark MLlib, GraphX 2.3 QA umbrella
SPARK-23114 Spark R 2.3 QA umbrella

Here are the remaining items targeted for 2.3:

SPARK-15689 Data source API v2
SPARK-20928 SPIP: Continuous Processing Mode for Structured Streaming
SPARK-21646 Add new type coercion rules to compatible with Hive
SPARK-22386 Data Source V2 improvements
SPARK-22731 Add a test for ROWID type to OracleIntegrationSuite
SPARK-22735 Add VectorSizeHint to ML features documentation
SPARK-22739 Additional Expression Support for Objects
SPARK-22809 pyspark is sensitive to imports with dots
SPARK-22820 Spark 2.3 SQL API audit

On Mon, Jan 22, 2018 at 7:09 PM Marcelo Vanzin <van...@cloudera.com> wrote:

+0

Signatures check out. Code compiles, although I see the errors in [1] when untarring the source archive; perhaps we should add "use GNU tar" to the RM checklist?

Also ran our internal tests and they seem happy.

My concern is the list of open bugs targeted at 2.3.0 (ignoring the documentation ones). It is not long, but it seems some of those need to be looked at. It would be nice for the committers who are involved in those bugs to take a look.

[1] https://superuser.com/questions/318809/linux-os-x-tar-incompatibility-tarballs-created-on-os-x-give-errors-when-unt

On Mon, Jan 22, 2018 at 1:36 PM, Sameer Agarwal <samee...@apache.org> wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 2.3.0. The vote is open until Friday January 26, 2018 at 8:00:00 am UTC and
> passes if a majority of at least 3 PMC +1 votes are cast.
>
>
> [ ] +1 Release this package as Apache Spark 2.3.0
>
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v2.3.0-rc2:
> https://github.com/apache/spark/tree/v2.3.0-rc2
> (489ecb0ef23e5d9b705e5e5bae4fa3d871bdac91)
>
> List of JIRA tickets resolved in this release can be found here:
> https://issues.apache.org/jira/projects/SPARK/versions/12339551
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1262/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc2-docs/_site/index.html
>
>
> FAQ
>
> =======================================
> What are the unresolved issues targeted for 2.3.0?
> =======================================
>
> Please see https://s.apache.org/oXKi. At the time of writing, there are
> currently no known release blockers.
>
> =========================
> How can I help test this release?
> =========================
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark, you can set up a virtual env, install the
> current RC, and see if anything important breaks. In Java/Scala, you can
> add the staging repository to your project's resolvers and test with the RC
> (make sure to clean up the artifact cache before/after so you don't end up
> building with an out-of-date RC going forward).
>
> ===========================================
> What should happen to JIRA tickets still targeting 2.3.0?
> ===========================================
>
> Committers should look at those and triage.
> Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.3.1 or 2.4.0 as
> appropriate.
>
> ===================
> Why is my bug not fixed?
> ===================
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.2.0. That being said, if
> there is something which is a regression from 2.2.0 and has not been
> correctly targeted please ping me or a committer to help target the issue
> (you can see the open issues listed as impacting Spark 2.3.0 at
> https://s.apache.org/WmoI).
>
>
> Regards,
> Sameer

--
Marcelo

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

--
Sameer Agarwal
Computer Science | UC Berkeley
http://cs.berkeley.edu/~sameerag
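
For anyone following the "How can I help test this release?" advice, a first step before running workloads is checking a downloaded artifact against its published digest. The sketch below is a minimal helper for that, assuming the digest is SHA-512 and has already been copied out as a hex string; the layout of the digest files in the release directory is not described in this thread, and the function names are hypothetical. It does not cover the GPG signature check against the KEYS file.

```python
# Sketch: compare a downloaded file's SHA-512 digest with an expected hex string.
# Assumes the published digest is SHA-512 and available as hex text.
import hashlib


def sha512_hex(path, chunk_size=1 << 20):
    """Return the SHA-512 digest of the file at `path` as lowercase hex."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # Read in chunks so large release tarballs do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digest_matches(path, expected_hex):
    """True if the file's digest equals `expected_hex`, ignoring case and whitespace."""
    normalized = "".join(expected_hex.split()).lower()
    return sha512_hex(path) == normalized
```

Whitespace and case are normalized because published digests are often printed in grouped, upper-case blocks; a mismatch after normalization means the download should not be trusted.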