Hi Wenchen,
Apologies... I thought it was Santosh asking for the PR and the race
condition...
Yes, the race condition causes only some partitions to be retried.
Maybe a checksum can be used as a fallback mechanism?
I suppose the checksum would be implemented on the executor side? Then there
might be condit…
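
To make the checksum idea concrete, here is a minimal, hypothetical Python
sketch of computing a per-partition checksum on the executor side. Nothing
here is the actual mechanism under discussion; all names are made up.

import hashlib

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def checksum_partition(index, rows):
    # Hash a canonical encoding of every row in this partition.
    h = hashlib.sha256()
    for row in rows:
        h.update(repr(row).encode("utf-8"))
    yield index, h.hexdigest()

rdd = spark.sparkContext.parallelize(range(1000), numSlices=8)
checksums = dict(rdd.mapPartitionsWithIndex(checksum_partition).collect())
# Comparing these digests across attempts would reveal whether a retried
# partition reproduced its original output.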
+1, the proposal will unify constraint management in DSv2 and reduce
redundant work across connectors.
On Thu, Feb 13, 2025 at 9:20 PM Anton Okolnychyi wrote:
> Hi folks,
>
> I'd like to start a discussion on SPARK-51207 that aims to extend the DSv2
> API to let users define, modify, and enforce table constraints in
> connectors that support them.
Hi folks,
I'd like to start a discussion on SPARK-51207 that aims to extend the DSv2
API to let users define, modify, and enforce table constraints in
connectors that support them.
SPIP [1] contains proposed API changes and parser extensions. Any feedback
is more than welcome!
Wenchen was kind enough to…
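
For readers skimming the thread, here is a hedged sketch of the kind of
constraint DDL the SPIP describes. The catalog, table, and constraint names
are made up, and the exact syntax is defined by the SPIP [1], not here.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Add a CHECK constraint to a DSv2 table whose connector supports constraints.
spark.sql("""
    ALTER TABLE catalog.db.orders
    ADD CONSTRAINT positive_amount CHECK (amount > 0)
""")

# Remove the constraint again.
spark.sql("ALTER TABLE catalog.db.orders DROP CONSTRAINT positive_amount")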
Hi,
Following up on this issue.
The PR opened is:
1) https://github.com/apache/spark/pull/49708
IMO, the PR fixes the three issues I have described below.
The race condition fix requires the use of a read-write lock (a minimal
sketch of that locking pattern appears after this message).
2) The JIRA is:
https://issues.apache.org/jira/browse/SPARK-51016
A checksum approach woul…
The bug-repro patch, when applied to the current master, will show failures
with incorrect results, while on the PR branch the test will pass.
The test runs 100 iterations.
Regards
Asif
On Thu, Feb 13, 2025 at 7:35 PM Asif Shahid wrote:
> Hi,
> Following up on this issue.
> The PR opened…
> Have identified a race condition in the current DAGScheduler code base
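
Since the fix above hinges on a read-write lock, here is a minimal Python
sketch of that locking pattern. The actual fix is in Scala inside
DAGScheduler, so this is illustration only.

import threading

class ReadWriteLock:
    # Many readers may hold the lock concurrently; a writer gets exclusive
    # access. Readers are favored: new readers may enter while a writer waits.

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0

    def acquire_read(self):
        with self._cond:
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        self._cond.acquire()          # block out new readers and writers
        while self._readers > 0:      # wait for in-flight readers to drain
            self._cond.wait()

    def release_write(self):
        self._cond.release()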
Does it lead to the failure of full-stage retry? Can you also share your PR
so that we can take a closer look?
BTW, I'm working with my colleagues on this runtime checksum idea. We have
something similar in our internal i…
Hm, I tried the attached code. This code tries to simulate handling TIME
data in Spark using Parquet files. Since Spark does not support a direct
TIME data type, it follows these steps (a short sketch follows the list):
- Stores time as a STRING in a Parquet file using PyArrow.
- Reads the Parquet file using PyArrow, Pandas,…
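
A minimal sketch of the first two steps, assuming made-up paths and values:

import pyarrow as pa
import pyarrow.parquet as pq
from pyspark.sql import SparkSession

# Write: encode times as plain strings, since there is no TIME type to map to.
table = pa.table({"event_time": pa.array(["09:30:00", "17:45:12"],
                                         type=pa.string())})
pq.write_table(table, "/tmp/times.parquet")

# Read in Spark: the column arrives as STRING and must be parsed downstream.
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/tmp/times.parquet")
df.printSchema()  # event_time: string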
There's also PolarSpark, which offers a PySpark API on top of the Polars
engine:
https://github.com/khalidmammadov/polarspark
PS: It's also on PyPI.
On Wed, 12 Feb 2025, 19:56 Sem wrote:
> DuckDB provides "PySpark syntax" on top of a fast single-node engine:
>
> https://duckdb.org/docs/clients/python/spark_ap
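
A short, hedged sketch of DuckDB's experimental PySpark-compatible API; this
assumes the API mirrors PySpark for these calls, per the docs linked above.

# DuckDB ships a PySpark-style API under an experimental namespace.
from duckdb.experimental.spark.sql import SparkSession
from duckdb.experimental.spark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.filter(col("id") > 1).show()  # executed by DuckDB's single-node engine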