Re: [DISCUSS] SPIP: ShuffleManager short name registration via SparkPlugin

2023-11-05 Thread Alessandro Bellina
Thanks for the comments, Reynold. This is an ease-of-use change, and it is
not strictly required (just as other ease-of-use changes are not). That
said, do we not want to invest in making Spark easier to configure for the
average user, or even for a user who is trying out Spark?

Here are my thoughts:

- Why can we use a short name for SortShuffleManager ("sort"), while other
implementations cannot register one? If spark.shuffle.manager is meant to be
a pluggable API, it seems this mapping should be pluggable as well.

- Plugin developers (such as my project) would like to produce a simple plugin
jar that can be used for all versions of Spark we support, but
ShuffleManager APIs can change in binary-incompatible ways (it's a
private API). As a result, we document setting spark.shuffle.manager to a
fully qualified class that is built for each version of Spark we bundle,
guaranteeing a binary-compatible implementation. The ability to register a
short name for a fully qualified shuffle manager would remove the need for
users to look up this mapping.

- ShuffleManager is very flexible (for good reasons) and it can be used to
move shuffle data in several ways, such as RDMA, caching, external stores,
etc. With this flexibility comes working with other open source projects
(such as UCX) that have their own configuration systems. In this specific
example, environment variables are needed to set up UCX for use from the
JVM, with defaults that are particular to our shuffle usage. Today, users
need to look up these configurations and apply them to their application;
a way to set up defaults would greatly improve the user experience.
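
To make the three points above concrete, here is a minimal sketch of the
resolution logic being proposed. It is plain Python that models the idea, not
Spark code and not the SPIP's actual design. The ShuffleRegistration record,
the resolve_shuffle_manager function, the "example" short name, the
com.example class name, and the EXAMPLE_TRANSPORT variable are all
hypothetical. The only parts taken from Spark itself are the built-in "sort"
and "tungsten-sort" short names and the spark.executorEnv.* config prefix.

```python
from dataclasses import dataclass, field
from typing import Dict

# Spark's built-in short names; both resolve to SortShuffleManager.
BUILTIN_SHORT_NAMES = {
    "sort": "org.apache.spark.shuffle.sort.SortShuffleManager",
    "tungsten-sort": "org.apache.spark.shuffle.sort.SortShuffleManager",
}


@dataclass(frozen=True)
class ShuffleRegistration:
    """Hypothetical record a plugin would export for one short name."""
    class_name: str  # fully qualified class built for the running Spark version
    default_conf: Dict[str, str] = field(default_factory=dict)


def resolve_shuffle_manager(
    user_conf: Dict[str, str],
    registrations: Dict[str, ShuffleRegistration],
) -> Dict[str, str]:
    """Return a copy of user_conf with the short name expanded.

    Assumption: registration keys are lowercase, and explicit user settings
    always win over plugin-provided defaults.
    """
    requested = user_conf.get("spark.shuffle.manager", "sort")
    key = requested.lower()
    effective = dict(user_conf)

    if key in BUILTIN_SHORT_NAMES:
        effective["spark.shuffle.manager"] = BUILTIN_SHORT_NAMES[key]
    elif key in registrations:
        reg = registrations[key]
        effective["spark.shuffle.manager"] = reg.class_name
        for conf_key, value in reg.default_conf.items():
            # Only fill in defaults the user has not set explicitly.
            effective.setdefault(conf_key, value)
    else:
        # Anything else is treated as a fully qualified class name.
        effective["spark.shuffle.manager"] = requested
    return effective


# Hypothetical plugin registration: one short name, one version-specific
# class, and one environment-variable default passed to executors.
registrations = {
    "example": ShuffleRegistration(
        class_name="com.example.shuffle.spark350.ExampleShuffleManager",
        default_conf={"spark.executorEnv.EXAMPLE_TRANSPORT": "rdma"},
    )
}

effective = resolve_shuffle_manager(
    {"spark.shuffle.manager": "example"}, registrations
)
```

With something like this, a user would set only spark.shuffle.manager=example.
The plugin would supply the class built for the running Spark version and the
transport defaults, and any value the user sets explicitly would still take
precedence.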

Thanks again for your feedback!

Alessandro

On Sat, Nov 4, 2023 at 6:04 PM Reynold Xin wrote:

> Why do we need this? The reason data source APIs need it is because it
> will be used by very unsophisticated end users and used all the time (for
> each connection / query). Shuffle is something you set up once, presumably
> by fairly sophisticated admins / engineers.
>
>
>
> On Sat, Nov 04, 2023 at 2:42 PM, Alessandro Bellina 
> wrote:
>
>> Hello devs,
>>
>> I would like to start discussion on the SPIP "ShuffleManager short name
>> registration via SparkPlugin"
>>
>> The idea behind this change is to allow a driver plugin (spark.plugins)
>> to export ShuffleManagers via short names, along with sensible default
>> configurations. Users can then use this short name to enable this
>> ShuffleManager + configs using spark.shuffle.manager.
>>
>> SPIP:
>> https://docs.google.com/document/d/1flijDjMMAAGh2C2k-vg1u651RItaRquLGB_sVudxf6I/edit#heading=h.vqpecs4nrsto
>> JIRA: https://issues.apache.org/jira/browse/SPARK-45792
>>
>> I look forward to hearing your feedback.
>>
>> Thanks
>>
>> Alessandro
>>
>
>


ASF board report draft for Nov 2023

2023-11-05 Thread Matei Zaharia
It’s time to send our project’s quarterly report to the ASF board on Wednesday 
November 8th. Here’s what I wrote as a draft; let me know any suggested changes.

=

Issues for the board:

- None

Project status:

- We released Apache Spark 3.5 on September 13, a feature release with over
1300 patches. This release brought more Spark Connect scenarios to general
availability, such as the Scala and Go clients, distributed training and
inference support, and improved compatibility for Structured Streaming. It
also introduced new PySpark and SQL functionality, including the SQL
IDENTIFIER clause, named argument support for SQL function calls, SQL
function support for HyperLogLog approximate aggregations, and Python
user-defined table functions; simplified distributed training with
DeepSpeed; introduced watermark propagation among operators; and added the
dropDuplicatesWithinWatermark operation in Structured Streaming.
- We made a patch release, Spark 3.3.3, on August 21, 2023.
- Apache Spark 4.0.0-SNAPSHOT is now ready for Java 21. [SPARK-43831]
- The vote on "Updating documentation hosted for EOL and maintenance releases" 
has passed.
- The vote on the Spark Project Improvement Proposal (SPIP) for "State Data
Source - Reader" has passed.
- The PMC has voted to add two new PMC members, Yuanjian Li and Yikun Jiang, 
and one new committer, Jiaan Geng, to the project.

Trademarks:

- No changes since the last report.

Latest releases:

- Spark 3.5.0 was released on September 13, 2023
- Spark 3.3.3 was released on August 21, 2023
- Spark 3.4.1 was released on June 23, 2023

Committers and PMC:

- The latest committer was added on Oct 2nd, 2023 (Jiaan Geng).
- The latest PMC members were added on Oct 2nd, 2023 (Yuanjian Li and Yikun 
Jiang).

=
