Re: Behaviour of operators like Outer Join when using indeterministic joining keys seems to be full of contradictions

2025-02-13 Thread Asif Shahid
Hi Wenchen, Apologies... I thought it was Santosh asking for the PR. And race condition... Yes, the race condition causes only some partitions to be retried. Maybe a checksum can be used as a fallback mechanism? I suppose the checksum would be implemented on the executor side? Then there might be condit

Re: [DISCUSS] SPIP: Constraints in DSv2

2025-02-13 Thread Gengliang Wang
+1, the proposal will unify constraint management in DSv2 and reduce redundant work across connectors. On Thu, Feb 13, 2025 at 9:20 PM Anton Okolnychyi wrote: > Hi folks, > > I'd like to start a discussion on SPARK-51207 that aims to extend the DSv2 > API to let users define, modify, and enforce

[DISCUSS] SPIP: Constraints in DSv2

2025-02-13 Thread Anton Okolnychyi
Hi folks, I'd like to start a discussion on SPARK-51207 that aims to extend the DSv2 API to let users define, modify, and enforce table constraints in connectors that support them. SPIP [1] contains proposed API changes and parser extensions. Any feedback is more than welcome! Wenchen was kind e

Re: Behaviour of operators like Outer Join when using indeterministic joining keys seems to be full of contradictions

2025-02-13 Thread Asif Shahid
Hi, Following up on this issue. The PR opened is: 1) https://github.com/apache/spark/pull/49708 The PR, IMO, fixes the 3 issues which I have described below. The race condition fix requires the use of a read-write lock. 2) Jira is: https://issues.apache.org/jira/browse/SPARK-51016 A checksum approach woul

Re: Behaviour of operators like Outer Join when using indeterministic joining keys seems to be full of contradictions

2025-02-13 Thread Asif Shahid
The bug repro patch, when applied on current master, will show failure with incorrect results, while on the PR branch it will pass. The number of iterations in the test is 100. Regards Asif On Thu, Feb 13, 2025 at 7:35 PM Asif Shahid wrote: > Hi, > Following up on this issue. > The PR opened

Re: Behaviour of operators like Outer Join when using indeterministic joining keys seems to be full of contradictions

2025-02-13 Thread Wenchen Fan
> Have identified a race condition in the current DAGScheduler code base. Does it lead to the failure of full-stage retry? Can you also share your PR so that we can take a closer look? BTW, I'm working with my colleagues on this runtime checksum idea. We have something similar in our internal i

Re: [DISCUSS] SPIP: Add the TIME data type

2025-02-13 Thread Mich Talebzadeh
hm, I tried the attached code. This code tries to simulate handling TIME data in Spark using Parquet files. Since Spark does not support a direct TIME datatype, it follows these steps: - Stores time as a STRING in a Parquet file using PyArrow. - Reads the Parquet file using PyArrow, Pandas,

Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

2025-02-13 Thread Khalid Mammadov
There's also PolarSpark, which offers the PySpark API with the Polars engine: https://github.com/khalidmammadov/polarspark PS. It's also on PyPI. On Wed, 12 Feb 2025, 19:56 Sem, wrote: > DuckDB provides "PySpark syntax" on top of fast single node engine: > > https://duckdb.org/docs/clients/python/spark_ap