Hi all,

This is the bi-weekly Apache Spark digest from the Databricks OSS team.
For each API/configuration/behavior change, an *[API]* tag is added to the
title.

CORE
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#70spark-30623core-spark-external-shuffle-allow-disable-of-separate-event-loop-group-66--33>[3.0][SPARK-30623][CORE]
Spark external shuffle allow disable of separate event loop group (+66, -33)
>
<https://github.com/apache/spark/commit/0fe203e7032a326ff0c78f6660ab1b09409fe09d>

PR#22173 <https://github.com/apache/spark/pull/22173> introduced a
performance regression in shuffle that occurs even when the feature flag
spark.shuffle.server.chunkFetchHandlerThreadsPercent is disabled. To fix the
regression, this PR refactors the related code so that the feature is
completely disabled by default.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#70spark-31314core-revert-spark-29285-to-fix-shuffle-regression-caused-by-creating-temporary-file-eagerly-10--71>[3.0][SPARK-31314][CORE]
Revert SPARK-29285 to fix shuffle regression caused by creating temporary
file eagerly (+10, -71)>
<https://github.com/apache/spark/commit/07c50784d34e10bbfafac7498c0b70c4ec08048a>

PR#25962 <https://github.com/apache/spark/pull/25962> introduced a
performance regression in shuffle, as it may create empty temporary files
unnecessarily. This PR reverts it.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#api80spark-29154core-update-spark-scheduler-for-stage-level-scheduling-704--218>[API][3.1][SPARK-29154][CORE]
Update Spark scheduler for stage level scheduling (+704, -218)>
<https://github.com/apache/spark/commit/474b1bb5c2bce2f83c4dd8e19b9b7c5b3aebd6c4>

This PR updates the DAG scheduler to schedule tasks according to the
resource profile attached to a stage. It is part of the stage-level
scheduling work.
[API][3.1][SPARK-29153][CORE] Add ability to merge resource profiles within
a stage with Stage Level Scheduling (+304, -15)>
<https://github.com/apache/spark/commit/55dea9be62019d64d5d76619e1551956c8bb64d0>

Add the ability to optionally merge resource profiles when they are
specified on multiple RDDs within a stage. The feature is part of stage-level
scheduling. A new config, spark.scheduler.resource.profileMergeConflicts,
enables the merging; it is off by default.

spark.scheduler.resource.profileMergeConflicts (Default: false)

   - If set to true, Spark will merge ResourceProfiles when different
   profiles are specified in RDDs that get combined into a single stage. When
   they are merged, Spark chooses the maximum of each resource and creates a
   new ResourceProfile. The default of false results in Spark throwing an
   exception if multiple different ResourceProfiles are found in RDDs going
   into the same stage.
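
A minimal sketch of how merging might come into play, using the stage-level
scheduling APIs (ResourceProfileBuilder, RDD.withResources); the profile
values and input paths are hypothetical, and an active SparkContext sc is
assumed:

  import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

  // Two RDDs carry different (hypothetical) resource profiles but are joined,
  // so they end up in the same stage.
  val smallProfile = new ResourceProfileBuilder()
    .require(new ExecutorResourceRequests().cores(2).memory("4g"))
    .require(new TaskResourceRequests().cpus(1))
    .build

  val bigProfile = new ResourceProfileBuilder()
    .require(new ExecutorResourceRequests().cores(4).memory("8g"))
    .require(new TaskResourceRequests().cpus(2))
    .build

  val left = sc.textFile("/data/a").map(l => (l, 1)).withResources(smallProfile)
  val right = sc.textFile("/data/b").map(l => (l, 1)).withResources(bigProfile)

  // With spark.scheduler.resource.profileMergeConflicts=true, the join stage
  // uses the per-resource maximum (4 cores, 8g executor memory, 2 cpus per
  // task); with the default false, Spark throws an exception instead.
  val joined = left.join(right)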

<https://github.com/databricks/runtime/wiki/OSS-Digest-Apr-1-~-Apr-7,-2020#api80spark-31208core-add-an-experimental-api-cleanshuffledependencies-158--71>[API][3.1][SPARK-31208][CORE]
Add an experimental API: cleanShuffleDependencies (+158, -71)>
<https://github.com/apache/spark/commit/8f010bd0a81aa9c7a8d2b8037b6160ab3277cc74>

Add a new experimental developer API RDD.cleanShuffleDependencies(blocking:
Boolean) to allow explicitly cleaning up shuffle files. This could help
dynamic scaling on the Kubernetes backend, since that backend can only
recycle executors that hold no shuffle files.

  /**
   * :: Experimental ::
   * Removes an RDD's shuffles and it's non-persisted ancestors.
   * When running without a shuffle service, cleaning up shuffle files enables downscaling.
   * If you use the RDD after this call, you should checkpoint and materialize it first.
   * If you are uncertain of what you are doing, please do not use this feature.
   * Additional techniques for mitigating orphaned shuffle files:
   *   * Tuning the driver GC to be more aggressive, so the regular context cleaner is triggered
   *   * Setting an appropriate TTL for shuffle files to be auto cleaned
   */
  @Experimental
  @DeveloperApi
  @Since("3.1.0")
  def cleanShuffleDependencies(blocking: Boolean = false): Unit
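
A minimal usage sketch under the caveats in the doc comment above; the
checkpoint directory and RDD are hypothetical, and an active SparkContext sc
with dynamic allocation (no external shuffle service) is assumed:

  sc.setCheckpointDir("/tmp/checkpoints")

  val aggregated = sc.parallelize(1 to 1000000)
    .map(i => (i % 100, 1L))
    .reduceByKey(_ + _)

  aggregated.checkpoint()   // keep the data reachable after its shuffle files are gone
  aggregated.count()        // materialize the RDD (and its checkpoint)
  aggregated.cleanShuffleDependencies(blocking = true)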

<https://github.com/databricks/runtime/wiki/OSS-Digest-Apr-1-~-Apr-7,-2020#80spark-31179-fast-fail-the-connection-while-last-connection-failed-in-fast-fail-time-window-68--12>[3.1][SPARK-31179]
Fast fail the connection while last connection failed in fast fail time
window (+68, -12)>
<https://github.com/apache/spark/commit/ec289252368127f8261eb6a2270362ba0b65db36>

In TransportFactory, if a connection to a destination address fails, new
connection requests to that address created within a time window fail fast
to avoid too many retries. The window size is set to 95% of the IO retry
wait time (spark.shuffle.io.retryWait, whose default is 5 seconds), i.e.
4.75 seconds with the default setting.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#sql>
SQL
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#api70spark-25556spark-17636spark-31026spark-31060sql-nested-column-predicate-pushdown-for-parquet-852--480>[API][3.0][SPARK-25556][SPARK-17636][SPARK-31026][SPARK-31060][SQL]
Nested Column Predicate Pushdown for Parquet (+852, -480)>
<https://github.com/apache/spark/commit/cb0db213736de5c5c02b09a2d5c3e17254708ce1>

This PR extends the data source framework to support filter pushdown with
nested fields, and implements it in the Parquet data source.

spark.sql.optimizer.nestedPredicatePushdown.enabled (Default: true)

   - When true, Spark tries to push down predicates for nested columns
   and/or names containing dots to data sources. Currently, Parquet implements
   both optimizations while ORC only supports predicates for names containing
   dots. The other data sources don't support this feature yet.
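
A minimal sketch of a filter that can now be pushed down; the Parquet path
and the struct column person.name are hypothetical, and an active
SparkSession spark is assumed:

  import spark.implicits._

  // With spark.sql.optimizer.nestedPredicatePushdown.enabled=true (the default),
  // a predicate on a nested field reaches the Parquet reader, so row groups
  // whose statistics rule out a match can be skipped.
  val people = spark.read.parquet("/tmp/people")
  people.filter($"person.name" === "Alice").count()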

<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#70spark-30822sql-remove-semicolon-at-the-end-of-a-sql-query-14--7>[3.0][SPARK-30822][SQL]
Remove semicolon at the end of a sql query (+14, -7)>
<https://github.com/apache/spark/commit/44431d4b1a22c3db87d7e4a24df517d6d45905a8>

This PR updates the SQL parser to ignore trailing semicolons in SQL
queries. Now sql("select 1;") works.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#api70spark-31086sql-add-back-the-deprecated-sqlcontext-methods-403--0>[API][3.0][SPARK-31086][SQL]
Add Back the Deprecated SQLContext methods (+403, -0)>
<https://github.com/apache/spark/commit/b7e4cc775b7eac68606d1f385911613f5139db1b>

This PR adds back the following APIs whose maintenance costs are relatively
small.

   - SQLContext.applySchema
   - SQLContext.parquetFile
   - SQLContext.jsonFile
   - SQLContext.jsonRDD
   - SQLContext.load
   - SQLContext.jdbc

<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#api70spark-31087sql-add-back-multiple-removed-apis-458--15>[API][3.0][SPARK-31087][SQL]
Add Back Multiple Removed APIs (+458, -15)>
<https://github.com/apache/spark/commit/3884455780a214c620f309e00d5a083039746755>

This PR adds back the following APIs whose maintenance costs are relatively
small.

   - functions.toDegrees/toRadians
   - functions.approxCountDistinct
   - functions.monotonicallyIncreasingId
   - Column.!==
   - Dataset.explode
   - Dataset.registerTempTable
   - SQLContext.getOrCreate, setActive, clearActive, constructors

<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#api70spark-31088sql-add-back-hivecontext-and-createexternaltable-532--11>[API][3.0][SPARK-31088][SQL]
Add back HiveContext and createExternalTable (+532, -11)>
<https://github.com/apache/spark/commit/b9eafcb52658b7f5ec60bb4ebcc9da0fde94e105>

This PR adds back the following APIs whose maintenance costs are relatively
small.

   - HiveContext
   - createExternalTable APIs

<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#api70spark-31147sql-forbid-char-type-in-non-hive-serde-tables-134--20>[API][3.0][SPARK-31147][SQL]
Forbid CHAR type in non-Hive-Serde tables (+134, -20)>
<https://github.com/apache/spark/commit/4f274a4de96def5c6f2c5c273a1c22c4f4d68376>

Spark introduced the CHAR type for Hive compatibility, but it only works for
Hive tables. The CHAR type is never documented and is treated as a STRING
type for non-Hive tables, which violates the SQL standard and is very
confusing. This PR forbids the CHAR type in non-Hive tables as it's not
supported correctly.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#70spark-31170sql-spark-sql-cli-should-respect-hive-sitexml-and-sparksqlwarehousedir-159--66>[3.0][SPARK-31170][SQL]
Spark SQL Cli should respect hive-site.xml and spark.sql.warehouse.dir
(+159, -66)>
<https://github.com/apache/spark/commit/8be16907c261657f83f5d5934bcd978d8dacf7ff>

This PR fixes the Spark SQL CLI to respect the warehouse config
(spark.sql.warehouse.dir or hive.metastore.warehouse.dir) and
hive-site.xml.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#api70spark-31201sql-add-an-individual-config-for-skewed-partition-threshold-21--4>[API][3.0][SPARK-31201][SQL]
Add an individual config for skewed partition threshold (+21, -4)>
<https://github.com/apache/spark/commit/05498af72e19b058b210815e1053f3fa9b0157d9>

Skew join handling comes with an overhead: we need to read some data
repeatedly. We should treat a partition as skewed only if it is large enough
that handling it this way is beneficial.

Currently the size threshold is the advisory partition size, which is 64 MB
by default. This is not large enough to serve as the skewed partition size
threshold, so a new config is added.

spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes (Default: 256
MB)

   - A partition is considered as skewed if its size in bytes is larger
   than this threshold and also larger than
   'spark.sql.adaptive.skewJoin.skewedPartitionFactor' multiplying the median
   partition size. Ideally this config should be set larger than
   'spark.sql.adaptive.advisoryPartitionSizeInBytes'
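
For example, one might raise the threshold at runtime; the values below are
illustrative, and an active SparkSession spark is assumed:

  // Only partitions larger than 512 MB (and larger than the skew factor times
  // the median partition size) are treated as skewed.
  spark.conf.set("spark.sql.adaptive.enabled", "true")
  spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
  spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "512m")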

<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#70spark-31227sql-non-nullable-null-type-in-complex-types-should-not-coerce-to-nullable-type-73--29>[3.0][SPARK-31227][SQL]
Non-nullable null type in complex types should not coerce to nullable type
(+73, -29)>
<https://github.com/apache/spark/commit/3bd10ce007832522e38583592b6f358e185cdb7d>

With this PR, a non-nullable null type inside complex types no longer
coerces to a nullable type, which fixes queries like concat(array(),
array(1)).
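
A minimal sketch of the kind of query the fix targets; an active
SparkSession spark is assumed, and the result is shown as expected after the
fix:

  // The empty array()'s NullType element no longer forces an unnecessary
  // nullability coercion, so the expression resolves cleanly.
  spark.sql("SELECT concat(array(), array(1)) AS result").show()
  // +------+
  // |result|
  // +------+
  // |   [1]|
  // +------+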
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#api66spark-31234sql-resetcommand-should-not-affect-static-sql-configuration-24--4>[API][2.4][SPARK-31234][SQL]
ResetCommand should not affect static SQL Configuration (+24, -4)>
<https://github.com/apache/spark/commit/44bd36ad7b315f4c7592cdc1edf04356fcd23645>

Currently, ResetCommand clears all configurations, including runtime SQL
configs, static SQL configs, and Spark context level configs. This PR fixes
it to clear only the runtime SQL configs, as the other configs are not
supposed to change at runtime.
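
A short sketch of the intended behavior, assuming an active SparkSession
spark:

  spark.sql("SET spark.sql.shuffle.partitions=10")   // runtime SQL config
  spark.sql("RESET")
  // spark.sql.shuffle.partitions is back to its default, while static SQL
  // configs (e.g. spark.sql.warehouse.dir) and Spark context configs are
  // left untouched.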
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#70spark-31238sql-rebase-dates-tofrom-julian-calendar-in-writeread-for-orc-datasource-153--9>[3.0][SPARK-31238][SQL]
Rebase dates to/from Julian calendar in write/read for ORC datasource
(+153, -9)>
<https://github.com/apache/spark/commit/d72ec8574113f9a7e87f3d7ec56c8447267b0506>

Spark 3.0 switches to the Proleptic Gregorian calendar, which differs from
the hybrid calendar used by the ORC format. This PR rebases datetime values
when reading/writing ORC files to adjust for the calendar change.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#70spark-31297sql-speed-up-dates-rebasing-174--91>[3.0][SPARK-31297][SQL]
Speed up dates rebasing (+174, -91)>
<https://github.com/apache/spark/commit/bb0b416f0b3a2747a420b17d1bf659891bae3274>

This PR replaces the current implementation of the
rebaseGregorianToJulianDays() and rebaseJulianToGregorianDays() functions
in DateTimeUtils with a new one based on the fact that the difference
between the Proleptic Gregorian and the hybrid (Julian + Gregorian)
calendars changed only 14 times over the entire supported range of valid
dates [0001-01-01, 9999-12-31]. This brings about a 3x speedup, mitigating
the performance regression caused by the calendar switch.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#70spark-31312sql-cache-class-instance-for-the-udf-instance-in-hivefunctionwrapper-203--52>[3.0][SPARK-31312][SQL]
Cache Class instance for the UDF instance in HiveFunctionWrapper (+203, -52)
>
<https://github.com/apache/spark/commit/2a6aa8e87bec39f6bfec67e151ef8566b75caecd>

This PR caches the Class instance for the UDF instance in
HiveFunctionWrapper to fix the case where a Hive simple UDF is transformed
(its expression is copied) and evaluated later with another classloader
(i.e. the current thread context classloader has changed). Previously,
Spark threw a ClassNotFoundException in this case.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#80spark-29721sql-prune-unnecessary-nested-fields-from-generate-without-project-172--18>[3.1][SPARK-29721][SQL]
Prune unnecessary nested fields from Generate without Project (+172, -18)>
<https://github.com/apache/spark/commit/aa8776bb5912f695b5269a3d856111aa49419d1b>

This patch prunes unnecessary nested fields from a Generate operator that
has no Project on top of it.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#80spark-31253sql-add-metrics-to-aqe-shuffle-reader-229--106>[3.1][SPARK-31253][SQL]
Add metrics to AQE shuffle reader (+229, -106)>
<https://github.com/apache/spark/commit/34c7ec8e0cb395da50e5cbeee67463414dacd776>

This PR adds SQL metrics to the AQE shuffle reader (CustomShuffleReaderExec),
so that it's easier to debug AQE with SQL web UI.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#70spark-30532-dataframestatfunctions-to-work-with-tablecolumn-syntax-31--8>[3.0][SPARK-30532]
DataFrameStatFunctions to work with TABLE.COLUMN syntax (+31, -8)>
<https://github.com/apache/spark/commit/22bb6b0fddb3ecd3ac0ad2b41a5024c86b8a6fc7>

This PR makes approxQuantile, freqItems, cov and corr in
DataFrameStatFunctions support fully qualified column names. For
example, df2.as("t2").crossJoin(df1.as("t1")).stat.approxQuantile("t1.num",
Array(0.1), 0.0) now works.
[API][3.0][SPARK-31113][SQL] Add SHOW VIEWS command (+605, -10)>
<https://github.com/apache/spark/commit/a28ed86a387b286745b30cd4d90b3d558205a5a7>

Add the SHOW VIEWS command to list the metadata for views. The SHOW VIEWS
statement returns all the views for an optionally specified database.
Additionally, the output of this statement may be filtered by an optional
matching pattern. If no database is specified, the views are returned from
the current database. If the specified database is the global temporary
view database, global temporary views are listed. Note that the command
also lists local temporary views regardless of the given database.

For example:

-- List all views in default database
SHOW VIEWS;
  +-------------+------------+--------------+
  | namespace   | viewName   | isTemporary  |
  +-------------+------------+--------------+
  | default     | sam        | false        |
  | default     | sam1       | false        |
  | default     | suj        | false        |
  |             | temp2      | true         |
  +-------------+------------+--------------+

-- List all views from userdb database
SHOW VIEWS FROM userdb;
  +-------------+------------+--------------+
  | namespace   | viewName   | isTemporary  |
  +-------------+------------+--------------+
  | userdb      | user1      | false        |
  | userdb      | user2      | false        |
  |             | temp2      | true         |
  +-------------+------------+--------------+

<https://github.com/databricks/runtime/wiki/OSS-Digest-Apr-1-~-Apr-7,-2020#70spark-31224sql-add-view-support-to-show-create-table-151--133>[3.0][SPARK-31224][SQL]
Add view support to SHOW CREATE TABLE (+151, -133)>
<https://github.com/apache/spark/commit/d782a1c4565c401a02531db3b8aa3cb6fc698fb1>

This PR fixes a regression introduced in Spark 3.0. Now, both SHOW CREATE
TABLE and SHOW CREATE TABLE AS SERDE support views.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Apr-1-~-Apr-7,-2020#api70spark-31318sql-split-parquetavro-configs-for-rebasing-datestimestamps-in-read-and-in-write-65--33>[API][3.0][SPARK-31318][SQL]
Split Parquet/Avro configs for rebasing dates/timestamps in read and in
write (+65, -33)>
<https://github.com/apache/spark/commit/c5323d2e8d74edddefcb0bb7e965d12de45abb18>

The PR replaces the configs spark.sql.legacy.parquet.rebaseDateTime.enabled
and spark.sql.legacy.avro.rebaseDateTime.enabled with the new configs
spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled,
spark.sql.legacy.parquet.rebaseDateTimeInRead.enabled,
spark.sql.legacy.avro.rebaseDateTimeInWrite.enabled and
spark.sql.legacy.avro.rebaseDateTimeInRead.enabled. Thus, users can load
dates/timestamps saved by Spark 2.4/3.0 and save Parquet/Avro files
compatible with Spark 3.0/2.4, controlling rebasing in read and in write
independently. A short usage sketch follows the config descriptions below.

spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled (Default: false)

   - When true, rebase dates/timestamps from Proleptic Gregorian calendar
   to the hybrid calendar (Julian + Gregorian) in writing. The rebasing is
   performed by converting micros/millis/days to a local date/timestamp in the
   source calendar, interpreting the resulted date/timestamp in the target
   calendar, and getting the number of micros/millis/days since the epoch
   1970-01-01 00:00:00Z.

spark.sql.legacy.parquet.rebaseDateTimeInRead.enabled (Default: false)

   - When true, rebase dates/timestamps from the hybrid calendar to the
   Proleptic Gregorian calendar in read. The rebasing is performed by
   converting micros/millis/days to a local date/timestamp in the source
   calendar, interpreting the resulted date/timestamp in the target calendar,
   and getting the number of micros/millis/days since the epoch 1970-01-01
   00:00:00Z.

spark.sql.legacy.avro.rebaseDateTimeInWrite.enabled (Default: false)

   - When true, rebase dates/timestamps from Proleptic Gregorian calendar
   to the hybrid calendar (Julian + Gregorian) in writing. The rebasing is
   performed by converting micros/millis/days to a local date/timestamp in the
   source calendar, interpreting the resulted date/timestamp in the target
   calendar, and getting the number of micros/millis/days since the epoch
   1970-01-01 00:00:00Z.

spark.sql.legacy.avro.rebaseDateTimeInRead.enabled (Default: false)

   - When true, rebase dates/timestamps from the hybrid calendar to the
   Proleptic Gregorian calendar in read. The rebasing is performed by
   converting micros/millis/days to a local date/timestamp in the source
   calendar, interpreting the resulted date/timestamp in the target calendar,
   and getting the number of micros/millis/days since the epoch 1970-01-01
   00:00:00Z.
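
A short sketch of how the split configs might be used; the paths and the
source table are hypothetical, and an active SparkSession spark is assumed:

  // Write Parquet whose dates/timestamps are rebased to the hybrid calendar,
  // so Spark 2.4 can read them without surprises.
  spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", "true")
  spark.table("events").write.parquet("/tmp/for-spark-2.4")

  // Read Parquet written by Spark 2.4, rebasing to the Proleptic Gregorian calendar.
  spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInRead.enabled", "true")
  val legacy = spark.read.parquet("/tmp/written-by-spark-2.4")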

<https://github.com/databricks/runtime/wiki/OSS-Digest-Apr-1-~-Apr-7,-2020#70spark-31322sql-rename-queryplancollectinplanandsubqueries-to-collectwithsubqueries-5--5>[3.0][SPARK-31322][SQL]
rename QueryPlan.collectInPlanAndSubqueries to collectWithSubqueries (+5,
-5)>
<https://github.com/apache/spark/commit/09f036a14cee4825edc73b463e1eebe85ff1c915>

Rename QueryPlan.collectInPlanAndSubqueries [which was introduced in the
unreleased Spark 3.0] to collectWithSubqueries. QueryPlan's APIs are
internal, but they are the core of Catalyst, so it is better to make the
API name clearer before it is released.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Apr-1-~-Apr-7,-2020#70spark-31327sql-write-spark-version-into-avro-file-metadata-116--5>[3.0][SPARK-31327][SQL]
Write Spark version into Avro file metadata (+116, -5)>
<https://github.com/apache/spark/commit/6b1ca886c0066f4e10534336f3fce64cdebc79a5>

Write Spark version into Avro file metadata. The version info is very
useful for backward compatibility.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Apr-1-~-Apr-7,-2020#70spark-31328sql-fix-rebasing-of-overlapped-local-timestamps-during-daylight-saving-time-120--90>[3.0][SPARK-31328][SQL]
Fix rebasing of overlapped local timestamps during daylight saving time
(+120, -90)>
<https://github.com/apache/spark/commit/820bb9985a76a567b79ffb01bbd2d32788e1dba0>

The PR fixes the bug of losing the DST (Daylight Saving Time) offset info
when rebasing timestamps via local date-time. This bug could lead to
DateTimeUtils.rebaseGregorianToJulianMicros() and
DateTimeUtils.rebaseJulianToGregorianMicros() returning wrong results when
handling the daylight saving offset.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#ml>
ML
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#80spark-31222ml-make-anovatest-sparsity-aware-173--112>[3.1][SPARK-31222][ML]
Make ANOVATest Sparsity-Aware (+173, -112)>
<https://github.com/apache/spark/commit/1dce6c1fd45de3d9cf911842f52494051692c48b>
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#80spark-31223ml-set-seed-in-nprandom-to-regenerate-test-data-282--256>[3.1][SPARK-31223][ML]
Set seed in np.random to regenerate test data (+282, -256)>
<https://github.com/apache/spark/commit/d81df56f2dbd9757c87101fa32c28cf0cd96f278>
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#80spark-31283ml-simplify-chisq-by-adding-a-common-method-68--93>[3.1][SPARK-31283][ML]
Simplify ChiSq by adding a common method (+68, -93)>
<https://github.com/apache/spark/commit/34d6b90449be43b865d06ecd3b4a7e363249ba9c>
[API][3.1][SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR (+484, -36)>
<https://github.com/apache/spark/commit/0d37f794ef9451199e1b757c1015fc7a8b3931a5>

Add SparkR wrapper for FMClassifier.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#python>
PYTHON
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-25-~Mar-31,-2020#66spark-31186spark-31441pysparksql-topandas-should-not-fail-on-duplicate-column-names-56--10->[2.4][SPARK-31186][SPARK-31441][PYSPARK][SQL]
toPandas should not fail on duplicate column names (+56, -10)>
<https://github.com/apache/spark/commit/559d3e4051500d5c49e9a7f3ac33aac3de19c9c6>
 >
<https://github.com/apache/spark/commit/87be3641eb3517862dd5073903d5b37275852066>

This PR fixes the toPandas API so that it works on DataFrames with
duplicate column names.
[3.0][SPARK-30921][PYSPARK] Predicates on python udf should not be pushdown
through Aggregate (+20, -2)>
<https://github.com/apache/spark/commit/1f0287148977adb416001cb0988e919a2698c8e0>

This PR fixes a regression in Spark 3.0 in which predicates using
PythonUDFs were pushed down through Aggregate. Since these PythonUDFs can't
be evaluated on the Filter, the "predicate pushdown through Aggregate" rule
should skip predicates that contain PythonUDFs.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Apr-1-~-Apr-7,-2020#80spark-31308pyspark-merging-pyfiles-to-files-argument-for-non-pyspark-applications-6--4>[3.1][SPARK-31308][PYSPARK]
Merging pyFiles to files argument for Non-PySpark applications (+6, -4)>
<https://github.com/apache/spark/commit/20fc6fa8398b9dc47b9ae7df52133a306f89b25f>

Add Python dependencies in SparkSubmit even if the application is not a
Python application. Some Spark applications (e.g. the Livy remote
SparkContext application, which is actually an embedded REPL for
Scala/Python/R) require not only jar dependencies but also Python
dependencies.
UI
<https://github.com/databricks/runtime/wiki/OSS-Digest-Apr-1-~-Apr-7,-2020#api80spark-31325sqlweb-ui-control-the-plan-explain-mode-in-the-events-of-sql-listeners-via-sqlconf-100--3>[API][3.1][SPARK-31325][SQL][WEB
UI] Control the plan explain mode in the events of SQL listeners via
SQLConf (+100, -3)>
<https://github.com/apache/spark/commit/d98df7626b8a88cb9a1fee4f454b19333a9f3ced>

Add a new config, spark.sql.ui.explainMode, to control the query explain
mode used in the Spark SQL UI. The default value is formatted, which
displays the query plan content in a formatted way.

spark.sql.ui.explainMode (Default: formatted)

   - Configures the query explain mode used in the Spark SQL UI. The value
   can be 'simple', 'extended', 'codegen', 'cost', or 'formatted'.
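
A short sketch of switching the mode at runtime, assuming an active
SparkSession spark:

  // Use the more verbose plan text in the SQL tab instead of the default "formatted".
  spark.conf.set("spark.sql.ui.explainMode", "extended")
  spark.range(100).selectExpr("sum(id)").collect()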

<https://github.com/databricks/runtime/wiki/OSS-Digest-Apr-1-~-Apr-7,-2020#r>
R
<https://github.com/databricks/runtime/wiki/OSS-Digest-Apr-1-~-Apr-7,-2020#api70spark-31290r-add-back-the-deprecated-r-apis-200--4>[API][3.0][SPARK-31290][R]
Add back the deprecated R APIs (+200, -4)>
<https://github.com/apache/spark/commit/fd0b2281272daba590c6bb277688087d0b26053f>

This PR adds back the following R APIs, which were removed in the
unreleased Spark 3.0:

   - sparkR.init
   - sparkRSQL.init
   - sparkRHive.init
   - registerTempTable
   - createExternalTable
   - dropTempTable


