Re: Behaviour of operators like Outer Join when using indeterministic joining keys seems to be full of contradictions

2025-02-12 Thread Santosh Pingale
> Maybe we should do it at runtime: if Spark retries a shuffle stage but the data becomes different (e.g. use checksum to check it), then Spark should retry all the partitions of this stage. Having been bitten hard recently by this behavior in Spark and after having gone down the rabbit hole to investi
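
For readers of this archive, the runtime check suggested in the quoted line can be sketched roughly as follows. This is plain Python and not Spark code: `partition_checksum`, `run_map_task` and `needs_full_stage_retry` are hypothetical names made up for illustration, and the idea is only that a digest of a map task's output is compared between the first attempt and a retry.

```python
import hashlib
import random


def partition_checksum(rows):
    """Order-sensitive SHA-256 digest over the rows of one map output."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def run_map_task(keys, key_fn):
    """Hypothetical map task: tag every input value with a join key."""
    return [(key_fn(k), k) for k in keys]


def needs_full_stage_retry(first_attempt, retry_attempt):
    """If a retried task produced different data, the whole stage is suspect."""
    return partition_checksum(first_attempt) != partition_checksum(retry_attempt)


keys = list(range(5))

# Deterministic key: a retry reproduces the same output.
deterministic = lambda k: k % 2

# Indeterminate key (stand-in for something like rand()): a retry differs.
rng = random.Random(0)
indeterminate = lambda k: rng.random()

stable_changed = needs_full_stage_retry(
    run_map_task(keys, deterministic), run_map_task(keys, deterministic)
)
unstable_changed = needs_full_stage_retry(
    run_map_task(keys, indeterminate), run_map_task(keys, indeterminate)
)
```

With the deterministic key the two attempts hash identically, so only the failed partition would need to be recomputed; with the indeterminate key the digests differ, which is the signal the quoted proposal would use to retry every partition of the stage.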

Re: Behaviour of operators like Outer Join when using indeterministic joining keys seems to be full of contradictions

2025-02-12 Thread Asif Shahid
Hi, following up on the issue with some information: 1) I have opened a PR to fix the isInDeterminate method in Stage, RDD, etc. 2) After tweaking the source code, I have been able to reproduce the data loss issue reliably, but cannot productize the test, as it requires a lot of inline interce

Re: ASF board report draft for February 2025

2025-02-12 Thread Mich Talebzadeh
✅ Thanks, Matei. ✅ Looks like a plan! 📌 We resurrected the old thread! https://lists.apache.org/thread/wwjyp1bhryvx7ytooj1lqtd8kgzxb6vq 🔗 Hopefully, there will be more traction this round. HTH Dr Mich Talebzadeh, Archit

Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

2025-02-12 Thread Sem
DuckDB provides "PySpark syntax" on top of fast single node engine: https://duckdb.org/docs/clients/python/spark_api.html As I remember, DuckDB is much faster than pandas on a single node and it already provides a spark-compatible API. On 2/10/25 1:02 PM, José Müller wrote: Hi all, I'm new

Re: [DISCUSS] SPIP: Add the TIME data type

2025-02-12 Thread Max Gekk
Hello Mich, > However, if you only need to work with time, you can do like below 1. Let's say a Spark SQL user would like to load TIME values stored in files in the parquet format which supports the TIME logical type https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#time. None

Re: ASF board report draft for February 2025

2025-02-12 Thread Matei Zaharia
I posted the report, but thanks for the feedback. Hopefully we can get enough coverage for DRA and the UI issues. > On Feb 11, 2025, at 2:33 AM, Mich Talebzadeh > wrote: > > Let us carry on on that thread. > > Need to catch-up > > HTH > > Dr Mich Talebzadeh, > Architect | Data Science | Fi

Re: [DISCUSS] SPIP: Add the TIME data type

2025-02-12 Thread Mich Talebzadeh
Not entirely convinced we need it! For example, Oracle does not have it. Oracle treats date and time as a single entity, as they are often used together in real-world applications. This approach simplifies many operations, such as sorting, filtering, and calculations involving both date and time. H

Re: [DISCUSS] SPIP: Add the TIME data type

2025-02-12 Thread Sakthi
Thanks for the proposal, Max. This looks very promising. I'd also be happy to contribute if it helps with task completion! Regards, Sakthi On Wed, Feb 12, 2025 at 10:36 AM Max Gekk wrote: > Hi Dongjoon, > > > According to SPIP, is this targeting Apache Spark 4.2.0? > > Some tasks could be done

Re: [DISCUSS] SPIP: Add the TIME data type

2025-02-12 Thread Max Gekk
Hi Dongjoon, > According to SPIP, is this targeting Apache Spark 4.2.0? Some tasks could be done in parallel, but if only one person will work on this sequentially, in the worst case it might be finished close to 4.2.0. Best regards, Max Gekk On Wed, Feb 12, 2025 at 5:48 PM Dongjoon Hyun wrote

Re: [DISCUSS] SPIP: Add the TIME data type

2025-02-12 Thread Dongjoon Hyun
According to SPIP, is this targeting Apache Spark 4.2.0? > Q7. How long will it take? > In total it might take around 9 months. Dongjoon. On 2025/02/12 09:38:56 Max Gekk wrote: > Hi All, > > I would like to propose a new data type TIME which represents only time > values without the date part

Re: SPARK-51166: Prepare Apache Spark 4.1.0

2025-02-12 Thread Dongjoon Hyun
Here is an update. The following are moved to 4.1.0 (SPARK-51166) from 4.0.0 (SPARK-44111) for now because those issues are not ready with proper PRs. SPARK-47110 Reenble AmmoniteTest tests in Maven builds SPARK-48139 Re-enable `SparkSessionE2ESuite.interrupt tag` SPARK-48163 Fix Flak

[DISCUSS] SPIP: Add the TIME data type

2025-02-12 Thread Max Gekk
Hi All, I would like to propose a new data type TIME which represents only time values without the date part, compared to TIMESTAMP_NTZ. The new type should improve: - migrations of SQL code from other DBMS where such type is supported - read/write it from/to data sources such as parquet - conform to
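
As a rough illustration of what a date-less time value looks like on disk, here is a minimal Python sketch. It assumes the Parquet TIME logical type with MICROS precision, which stores an int64 count of microseconds after midnight; the helper names `time_to_micros` and `micros_to_time` are hypothetical, and nothing here is taken from the SPIP itself.

```python
from datetime import time

MICROS_PER_SECOND = 1_000_000


def time_to_micros(t):
    """Encode a time-of-day as microseconds after midnight."""
    seconds = t.hour * 3600 + t.minute * 60 + t.second
    return seconds * MICROS_PER_SECOND + t.microsecond


def micros_to_time(micros):
    """Decode microseconds after midnight back into a time-of-day."""
    seconds, microsecond = divmod(micros, MICROS_PER_SECOND)
    minutes, second = divmod(seconds, 60)
    hour, minute = divmod(minutes, 60)
    return time(hour, minute, second, microsecond)


value = time(12, 30, 15, 250000)   # 12:30:15.25, no date attached
encoded = time_to_micros(value)    # 45_015_250_000
decoded = micros_to_time(encoded)  # round-trips to the same time-of-day
```

The point of the sketch is only that such a value carries no date or time zone, so reading it into a timestamp type would require inventing a date part.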