[ https://issues.apache.org/jira/browse/SPARK-55163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
vaquar khan updated SPARK-55163:
--------------------------------
Description:
{panel}
*This SPIP proposes adding a client-side schema cache for Spark Connect
DataFrames.*
Currently, every call to {{df.columns}} or {{df.schema}} triggers a synchronous
gRPC analysis request to the server. While these are local and near-instant in
Spark Classic, in Connect they average 277 ms on standard cloud setups (like
AWS t3.medium). This makes iterative work extremely slow; we've measured a
13-second lag for 50 metadata calls in a typical ETL pipeline.
This delay is forcing developers to use a "Shadow Schema" pattern, where they
manually track column names in local lists to avoid the RPC overhead. Since
Spark DataFrames are immutable, we can fix this by caching the resolved schema
on the client after the first request. Our POC shows this reduces the 13-second
lag to about 250 ms (a 51× speedup) without breaking the core Spark Connect
model.
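
A minimal sketch of the caching idea described above, not the POC's actual code. {{CachedSchemaFrame}} and {{FakeServer}} are hypothetical stand-ins invented for illustration (they are not Spark Connect classes), and the 277 ms figure is only tallied, not actually waited on. It assumes that a DataFrame's plan never changes after creation, so the first resolved schema can be reused for the lifetime of that object.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Tuple


class FakeServer:
    """Hypothetical stand-in for the Connect server; counts analysis RPCs."""

    def __init__(self, latency_ms: float = 277.0) -> None:
        self.latency_ms = latency_ms  # average latency quoted in the description
        self.rpc_count = 0

    def analyze_plan(self) -> Tuple[str, ...]:
        # Each call models one synchronous gRPC analysis round trip.
        self.rpc_count += 1
        return ("id", "name", "amount")

    @property
    def simulated_ms(self) -> float:
        # Total latency the client would have paid, without really sleeping.
        return self.rpc_count * self.latency_ms


@dataclass
class CachedSchemaFrame:
    """Hypothetical immutable DataFrame handle with a client-side schema cache."""

    analyze: Callable[[], Tuple[str, ...]]
    _schema: Optional[Tuple[str, ...]] = field(default=None, init=False, repr=False)

    @property
    def columns(self) -> Tuple[str, ...]:
        # Only the first access goes to the server; later accesses are local.
        if self._schema is None:
            self._schema = self.analyze()
        return self._schema


# Without a cache: 50 metadata calls mean 50 round trips.
uncached = FakeServer()
for _ in range(50):
    uncached.analyze_plan()

# With a cache: 50 metadata calls mean 1 round trip.
cached_server = FakeServer()
df = CachedSchemaFrame(analyze=cached_server.analyze_plan)
for _ in range(50):
    df.columns

print(uncached.rpc_count, uncached.simulated_ms)            # 50 13850.0
print(cached_server.rpc_count, cached_server.simulated_ms)  # 1 277.0
```

In this sketch, a transformation would return a new handle with its own empty cache, so a cached schema is never shared between different plans.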
I have followed the official SPIP template for the detailed breakdown below.
*SPIP* -
[https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0]
*Benchmark* -
[https://docs.google.com/document/d/1ebX8CtTHN3Yf3AWxg7uttzaylxBLhEv-T94svhZg_uE/edit?tab=t.0]
*Note for GSoC -*
To set clear expectations for your GSoC timeline, and as a heads-up to the
broader Spark developer community:
Because the underlying SPIP (SPARK-55163) is still actively being discussed and
has not yet received formal PMC approval, your GSoC project will function
purely as an experimental prototype.
Your open Pull Requests will be used by mentors to evaluate your GSoC
deliverables and milestones. However, please be aware that your code will not
be merged into the mainline Apache Spark repository during the GSoC program.
Successfully completing your GSoC project and passing the evaluations is tied
to the quality of your prototype and testing, not to getting the code merged.
Your prototype will be incredibly valuable in helping the community benchmark
the latency improvements for Spark Connect. I look forward to reviewing your
finalized proposal!
{panel}
was:
{panel}
*This SPIP proposes adding a client-side schema cache for Spark Connect
DataFrames.*
Currently, every call to {{df.columns}} or {{df.schema}} triggers a synchronous
gRPC analysis request to the server. While these are local and near-instant in
Spark Classic, in Connect they average 277 ms on standard cloud setups (like
AWS t3.medium). This makes iterative work extremely slow; we've measured a
13-second lag for 50 metadata calls in a typical ETL pipeline.
This delay is forcing developers to use a "Shadow Schema" pattern, where they
manually track column names in local lists to avoid the RPC overhead. Since
Spark DataFrames are immutable, we can fix this by caching the resolved schema
on the client after the first request. Our POC shows this reduces the 13-second
lag to about 250 ms (a 51× speedup) without breaking the core Spark Connect
model.
I have followed the official SPIP template for the detailed breakdown below.
*SIP*
[https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0|https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0]
*Benchmark* -
[https://docs.google.com/document/d/1ebX8CtTHN3Yf3AWxg7uttzaylxBLhEv-T94svhZg_uE/edit?tab=t.0]
*Note -*
To set clear expectations for your GSoC timeline, and as a heads-up to the
broader Spark developer community:
Because the underlying SPIP (SPARK-55163) is still actively being discussed and
has not yet received formal PMC approval, your GSoC project will function
purely as an experimental prototype.
Your open Pull Requests will be used by mentors to evaluate your GSoC
deliverables and milestones. However, please be aware that your code will not
be merged into the mainline Apache Spark repository during the GSoC program.
Successfully completing your GSoC project and passing the evaluations is tied
to the quality of your prototype and testing, not to getting the code merged.
Your prototype will be incredibly valuable in helping the community benchmark
the latency improvements for Spark Connect. I look forward to reviewing your
finalized proposal!
{panel}
> SPIP: Client-Side Metadata Caching for Spark Connect
> ----------------------------------------------------
>
> Key: SPARK-55163
> URL: https://issues.apache.org/jira/browse/SPARK-55163
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: vaquar khan
> Priority: Major
> Labels: connect, gsoc2026, mentor, pull-request-available, spark
>