Re: Spark Materialized Views: Improve Query Performance and Data Management

Jungtaek Lim Fri, 03 May 2024 19:07:05 -0700

(removing user@ as the topic is not aimed at the user group)

I would like to clarify what an SPIP is, as there have been multiple
improper proposals, and the ticket also mentions SPIP without fulfilling
the actual requirements.
An SPIP is only effective when there is a dedicated individual or group to
work on the project, with a concrete plan for design and implementation.
Here "proposal" does not mean "feature request", but a proposal about
development.


https://spark.apache.org/improvement-proposals.html
I'm quoting a couple of relevant sentences here to explain what the
requirements for an SPIP are.

> The purpose of an SPIP is to inform and involve the user community in major
> improvements to the Spark codebase *throughout the development process*,
> to increase the likelihood that user needs are met.


> SPIP Author is any community member who authors a SPIP and *is committed to
> pushing the change through the entire process*. SPIP authorship can be
> transferred.


The SPIP author is the one who needs to lead the effort on both "design" and
"code work". The SPIP doc format does not strictly require a design, but
most SPIPs include at least a high-level design, and in many cases there is
a separate doc for the detailed design (this is optional, but people tend to
provide one for projects with a non-trivial design).

Hope this clarifies the meaning of SPIP.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Sat, May 4, 2024 at 5:11 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi,
>
> I have raised a ticket SPARK-48117
> <https://issues.apache.org/jira/browse/SPARK-48117> for enhancing Spark
> capabilities with Materialised Views (MV). Currently both Hive and
> Databricks support this. I have added these potential benefits to the
> ticket:
>
> - *Improved Query Performance (especially for Streaming Data):*
> Materialized Views can significantly improve query performance,
> particularly for use cases involving Spark Structured Streaming. When
> dealing with continuous data streams, materialized views can pre-compute
> and store frequently accessed aggregations or transformations. Subsequent
> queries on the materialized view can retrieve the results much faster
> compared to continuously processing the entire streaming data. This is
> crucial for real-time analytics where low latency is essential.
> - *Enhancing Data Management:* They offer a way to pre-aggregate or
> transform data, making complex queries more efficient.
> - *Reduced Data Movement*: Materialized Views can be materialized on
> specific clusters or storage locations closer to where the data will be
> consumed. This minimizes data movement across the network, further
> improving query performance and reducing overall processing time.
> - *Simplified Workflows:* Developers and analysts can leverage
> pre-defined Materialized Views that represent specific business logic or
> data subsets. This simplifies data access, reduces development time for
> queries that rely on these views, and fosters code reuse.
>
> Please have a look at the ticket and add your comments.
>
> Thanks
>
> Mich Talebzadeh,
>
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>
> London
> United Kingdom
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, "one test result is worth one-thousand
> expert opinions" (Wernher von Braun).
>
