[DISCUSS][Spark SQL] Update API

2024-09-23 Thread Szehon Ho
Hi all,

In https://github.com/apache/spark/pull/47233, we are looking to add a
Spark DataFrame API that is functionally equivalent to Spark SQL's UPDATE
statement.

There are open discussions on the PR about the location/format of the API,
and we wanted to ask on the dev list to get more opinions.

One consideration is that UPDATE SQL is an isolated, terminal operation on
DSv2 tables that cannot be chained to other operations.
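For concreteness, a minimal sketch contrasting today's SQL path with one
hypothetical DataFrame shape, assuming a DSv2 table exists at
catalog.db.target (all names illustrative; the DataFrame method name and
placement are exactly what the PR debates):

  // Today the operation is only expressible as SQL against a DSv2 table:
  spark.sql("""
    UPDATE catalog.db.target
    SET last_seen = current_date()
    WHERE status = 'inactive'
  """)

  // One hypothetical shape for the DataFrame equivalent (not a shipped API):
  // spark.table("catalog.db.target")
  //   .update(Map("last_seen" -> current_date()), $"status" === "inactive")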

I made a quick write-up of the background and the options discussed in
https://docs.google.com/document/d/1AjkxOU06pFEzFmSbepfxdHoUGtvNAk6X1WY3zHGTW_o/edit.
It is my first one, so please let me know if I missed anything.

Looking forward to hearing thoughts from more Spark devs, either in the PR,
the document, or a reply to this email.

Thank you,
Szehon


Re: Re: [Discuss] SPIP: Support NanoSecond Timestamps

2025-03-14 Thread Szehon Ho
+1 to the idea as well, as Iceberg V3 is coming with nanosecond timestamps,
and Spark would not be able to read this type without it.

Thanks
Szehon

On Fri, Mar 14, 2025 at 3:34 PM Wenchen Fan  wrote:

> In general, I think it's good for Spark to support the common data types
> in the ecosystem, as it's the only way to fully integrate with the
> ecosystem. So +1.
>
> On Fri, Mar 14, 2025 at 8:56 AM 谭琦  wrote:
>
>> Updated. Thanks.
>>
>> On 2025/03/13 23:56:20 Jungtaek Lim wrote:
>> > Hi, would you mind allowing comments on the doc? Thanks!
>> >
>> > On Fri, Mar 14, 2025 at 8:50 AM Qi Tan  wrote:
>> >
>> > > Hello everybody,
>> > >
>> > > I would like to start a discussion on SPARK-50532
>> > >  to enable Spark
>> to
>> > > support nanoseconds. Attached is the SPIP doc
>> > > <
>> https://docs.google.com/document/d/1wjFsBdlV2YK75x7UOk2HhDOqWVA0yC7iEiqOMnNnxlA/edit?usp=sharing
>> >
>> > > . Huaxin was kind enough to shepherd this effort.
>> > >
>> > > Thanks for your attention. Any feedback is more than welcome!
>> > >
>> > > Qi Tan
>> > >
>> >
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE] SPIP: Constraints in DSv2

2025-04-05 Thread Szehon Ho
+1 (non binding)

Agree with Anton: data sources like the open table formats define the
requirement, and they definitely need engines to write accordingly.
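Anton's point below about AssertNotNull can already be seen today: writing a
null into a required column of a DSv2 table fails at runtime. A minimal
sketch, assuming an Iceberg-backed catalog named cat is configured (names
and the exact error text are illustrative):

  // Spark inserts AssertNotNull for required columns on write, so the
  // engine, not each connector, rejects the bad row at runtime.
  spark.sql("CREATE TABLE cat.db.t (id BIGINT NOT NULL, v STRING) USING iceberg")
  spark.sql("INSERT INTO cat.db.t VALUES (NULL, 'x')")
  // => runtime error: NULL value appeared in non-nullable field
  //    (exact exception and message vary by Spark version)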

Thanks,
Szehon

On Fri, Mar 21, 2025 at 1:31 PM Anton Okolnychyi 
wrote:

> -1 (non-binding): Breaks the Chain of Responsibility. Constraints should
>> be defined and enforced by the data sources themselves, not Spark. Spark is
>> a processing engine, and enforcing constraints at this level blurs
>> architectural boundaries, making Spark responsible for something it does
>> not control.
>>
>
> I disagree that this breaks the chain of responsibility. It may be quite
> the opposite, in fact. Spark is already responsible for enforcing NOT NULL
> constraints by adding AssertNotNull for required columns today. Connectors
> like Iceberg and Delta store constraint definitions but rely on engines
> like Spark to enforce them during INSERT, DELETE, UPDATE, and MERGE
> operations. Without this API, each connector would need to reimplement the
> same logic, creating duplication.
>
> The proposal is aligned with the SQL standard and other relational
> databases. In my view, it simply makes Spark a better engine, facilitates
> data accuracy and consistency, and enables performance optimizations.
>
> - Anton
>
> пт, 21 бер. 2025 р. о 12:59 Ángel Álvarez Pascua <
> angel.alvarez.pas...@gmail.com> пише:
>
>> -1 (non-binding): Breaks the Chain of Responsibility. Constraints should
>> be defined and enforced by the data sources themselves, not Spark. Spark is
>> a processing engine, and enforcing constraints at this level blurs
>> architectural boundaries, making Spark responsible for something it does
>> not control.
>>
>> El vie, 21 mar 2025 a las 20:18, L. C. Hsieh ()
>> escribió:
>>
>>> +1
>>>
>>> On Fri, Mar 21, 2025 at 12:13 PM huaxin gao 
>>> wrote:
>>> >
>>> > +1
>>> >
>>> > On Fri, Mar 21, 2025 at 12:08 PM Denny Lee 
>>> wrote:
>>> >>
>>> >> +1 (non-binding)
>>> >>
>>> >> On Fri, Mar 21, 2025 at 11:52 Gengliang Wang 
>>> wrote:
>>> >>>
>>> >>> +1
>>> >>>
>>> >>> On Fri, Mar 21, 2025 at 11:46 AM Anton Okolnychyi <
>>> aokolnyc...@gmail.com> wrote:
>>> 
>>>  Hi all,
>>> 
>>>  I would like to start a vote on adding support for constraints to
>>> DSv2.
>>> 
>>>  Discussion thread:
>>> https://lists.apache.org/thread/njqjcryq0lot9rkbf10mtvf7d1t602bj
>>>  SPIP:
>>> https://docs.google.com/document/d/1EHjB4W1LjiXxsK_G7067j9pPX0y15LUF1Z5DlUPoPIo
>>>  PR with the API changes: https://github.com/apache/spark/pull/50253
>>>  JIRA: https://issues.apache.org/jira/browse/SPARK-51207
>>> 
>>>  Please vote on the SPIP for the next 72 hours:
>>> 
>>>  [ ] +1: Accept the proposal as an official SPIP
>>>  [ ] +0
>>>  [ ] -1: I don’t think this is a good idea because …
>>> 
>>>  - Anton
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: [VOTE] SPIP: Support NanoSecond Timestamps

2025-04-05 Thread Szehon Ho
Trying to catch up on this. Serge's suggestion in the doc seems the best way
forward:
https://docs.google.com/document/d/1wjFsBdlV2YK75x7UOk2HhDOqWVA0yC7iEiqOMnNnxlA/edit?disco=AAABe5AUnWU.
Spark would support the full ANSI SQL timestamp range, and Iceberg /
Parquet / other data sources will throw a runtime error when trying to write
a value outside their supported range, until we get a wider timestamp type in
Parquet (Iceberg's V3 timestamp_ns type is just built on top of that)
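A quick back-of-the-envelope check of why the int64-nanoseconds encoding
cannot cover the ANSI range (plain JVM arithmetic, not Spark code):

  import java.time.Instant

  // Long.MaxValue nanoseconds is only ~292 years, so an int64 epoch-nanos
  // value spans roughly 1677-09-21 to 2262-04-11, nowhere near 0001..9999.
  val maxNs = Long.MaxValue
  println(Instant.ofEpochSecond(maxNs / 1000000000L, maxNs % 1000000000L))
  // 2262-04-11T23:47:16.854775807Z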

Thanks,
Szehon

On Thu, Mar 27, 2025 at 9:45 PM Micah Kornfield 
wrote:

> I think the key issue is the format. The proposed 10-byte format doesn't
>> seem like a standard and the one in Iceberg/Parquet does not support the
>> required range by ANSI SQL: year 0001 to year 9999. We should address this
>> issue first. Note that Parquet has an INT96 timestamp that supports
>> nanosecond precision, but it's deprecated. Shall we work with the Parquet
>> community to revive it?
>
>
> It would be great to discuss a plan for this in parquet.  This has come up
> in passing in some of the recent parquet syncs.  I don't think resurrecting
> int96 is necessarily a great idea since it is defined in terms of Julian
> days [1], and most systems these days are standardizing on
> proleptic-Gregorian.
>
> A fair number of OSS implementations that do interact with int96 I've seen
> do conversion assuming all timestamps are post Unix epoch timestamps and
> therefore have errors/idiosyncrasies when translating dates prior to the
> Gregorian cutover.
>
> Cheers,
> Micah
>
> [1] https://github.com/apache/parquet-format/pull/49
>
> On Thu, Mar 27, 2025 at 7:02 PM Wenchen Fan  wrote:
>
>> Maybe we should discuss the key issues on the dev list as it's easy to
>> lose track of Google Doc comments.
>>
>> I think all the proposals for adding new data types need to prove that
>> the new data type is common/standard in the ecosystem. This means 3 things:
>> - it has common/standard semantic. TIMESTAMP with nanosecond precision is
>> definitely a standard data type, in both ANSI SQL and mainstream databases.
>> - it has common/standard storage format. Parquet/Iceberg supports
>> nanosecond timestamp using int64, which is different from what is proposed
>> here.
>> - it has common/standard processing methods. The java datetime library
>> Spark is using now already support nanosecond, so we are fine here.
>>
>> I think the key issue is the format. The proposed 10-byte format doesn't
>> seem like a standard and the one in Iceberg/Parquet does not support the
>> required range by ANSI SQL: year 0001 to year 9999. We should address this
>> issue first. Note that Parquet has an INT96 timestamp that supports
>> nanosecond precision, but it's deprecated. Shall we work with the Parquet
>> community to revive it?
>>
>> On Fri, Mar 28, 2025 at 7:03 AM DB Tsai  wrote:
>>
>>> Thanks!!!
>>>
>>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>>>
>>> On Mar 27, 2025, at 3:56 PM, Qi Tan  wrote:
>>>
>>> Thanks DB,
>>>
>>> I just noticed a few more comments came in after I initiated the vote.
>>> I'm going to postpone the voting process and address those outstanding
>>> comments.
>>>
>>> Qi Tan
>>>
>>> DB Tsai  于2025年3月27日周四 15:12写道:
>>>
 Hello Qi,

 I'm supportive of the NanoSecond Timestamps proposal; however, before
 we initiate the vote, there are a few outstanding comments in the SPIP
 document that haven't been addressed yet. Since the vote is on the document
 itself, could we resolve these items beforehand?

 For example:

 - The default precision of TimestampNsNTZType is set to 6, which overlaps
   with the existing TimestampNTZ.
 - The specified range exceeds the capacity of an int64, but the document
   doesn't clarify how this type will be represented in memory or serialized
   in data sources.
 - Schema inference details for data sources are missing.

 These points still need discussion.

 I appreciate your efforts in putting the doc together and look forward
 to your contribution!

 Thanks,
 DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1

 On Mar 27, 2025, at 1:24 PM, huaxin gao  wrote:

 +1

 On Thu, Mar 27, 2025 at 1:22 PM Qi Tan  wrote:

> Hi all,
>
> I would like to start a vote on adding support for nanoseconds
> timestamps.
>
> *Discussion thread: *
> https://lists.apache.org/thread/y2vzrjl1499j5dvbpg3m81jxdhf4b6of
> *SPIP:*
> https://docs.google.com/document/d/1wjFsBdlV2YK75x7UOk2HhDOqWVA0yC7iEiqOMnNnxlA/edit?usp=sharing
> *JIRA:*  https://issues.apache.org/jira/browse/SPARK-50532
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because
>


>>>


Re: [VOTE] SPIP: Declarative Pipelines

2025-04-09 Thread Szehon Ho
+1 (non-binding)

Thanks
Szehon

On Wed, Apr 9, 2025 at 3:42 PM Hyukjin Kwon  wrote:

> I will shepherd.
>
> On Thu, 10 Apr 2025 at 07:28, Anton Okolnychyi 
> wrote:
>
>> +1 (non-binding)
>>
>> - Anton
>>
>> ср, 9 квіт. 2025 р. о 15:01 Jungtaek Lim 
>> пише:
>>
>>> Btw who is going to shepherd this SPIP? I don't see this in the
>>> doc/JIRA/discussion thread. I understand there are PMC members in the
>>> author list, but probably good to be explicit about "who" is
>>> shepherding this SPIP.
>>>
>>> On Wed, Apr 9, 2025 at 11:22 PM Sandy Ryza  wrote:
>>>
 We started to get some votes on the discussion thread, so I'd like to
 move to a formal vote on adding support for declarative pipelines.

 *Discussion thread: *
 https://lists.apache.org/thread/lsv8f829ps0bog41fjoqc45xk7m574ly
 *SPIP:*
 https://docs.google.com/document/d/1PsSTngFuRVEOvUGzp_25CQL1yfzFHFr02XdMfQ7jOM4
 *JIRA:* https://issues.apache.org/jira/browse/SPARK-51727

 Please vote on the SPIP for the next 72 hours:

 [ ] +1: Accept the proposal as an official SPIP
 [ ] +0
 [ ] -1: I don’t think this is a good idea because …

 -Sandy




Re: [DISCUSS] SPIP: Declarative Pipelines

2025-04-09 Thread Szehon Ho
+1, really excited to see Materialized Views finally make their way to
Spark, as many other ecosystem projects (Trino, StarRocks, soon Iceberg)
already support them.
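For readers catching up, a hedged sketch of the declarative style the SPIP
describes, wrapped in spark.sql for illustration (the exact syntax is
defined in the SPIP doc, not here):

  // Syntax per the SPIP's proposal; not yet a shipped Spark statement.
  // Datasets are declared rather than orchestrated; the engine infers that
  // daily_sales depends on orders and keeps it up to date.
  spark.sql("""
    CREATE MATERIALIZED VIEW daily_sales AS
    SELECT order_date, sum(amount) AS total
    FROM orders
    GROUP BY order_date
  """)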

Thanks
Szehon

On Wed, Apr 9, 2025 at 2:33 AM Martin Grund 
wrote:

> +1
>
> On Wed, Apr 9, 2025 at 9:37 AM Mich Talebzadeh 
> wrote:
>
>> +1
>>
>> Dr Mich Talebzadeh,
>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>>
>>
>> On Wed, 9 Apr 2025 at 08:07, Peter Toth  wrote:
>>
>>> +1
>>>
>>> On Wed, Apr 9, 2025 at 8:51 AM Cheng Pan  wrote:
>>>
 +1 (non-binding)

 Glad to see Spark SQL extended to streaming use cases.

 Thanks,
 Cheng Pan



 On Apr 9, 2025, at 14:43, Anton Okolnychyi 
 wrote:

 +1

 вт, 8 квіт. 2025 р. о 23:36 Jacky Lee  пише:

> +1 I'm delighted that it will be open-sourced, enabling greater
> integration with Iceberg/Delta to unlock more value.
>
> Jungtaek Lim  于2025年4月9日周三 10:47写道:
> >
> > +1 looking forward to seeing this make progress!
> >
> > On Wed, Apr 9, 2025 at 11:32 AM Yang Jie 
> wrote:
> >>
> >> +1
> >>
> >> On 2025/04/09 01:07:57 Hyukjin Kwon wrote:
> >> > +1
> >> >
> >> > I am actually pretty excited to have this. Happy to see this
> being proposed.
> >> >
> >> > On Wed, 9 Apr 2025 at 01:55, Chao Sun  wrote:
> >> >
> >> > > +1. Super excited about this effort!
> >> > >
> >> > > On Tue, Apr 8, 2025 at 9:47 AM huaxin gao <
> huaxin.ga...@gmail.com> wrote:
> >> > >
> >> > >> +1 I support this SPIP because it simplifies data pipeline
> management and
> >> > >> enhances error detection.
> >> > >>
> >> > >>
> >> > >> On Tue, Apr 8, 2025 at 9:33 AM Dilip Biswal <
> dkbis...@gmail.com> wrote:
> >> > >>
> >> > >>> Excited to see this heading toward open source — materialized
> views and
> >> > >>> other features will bring a lot of value.
> >> > >>> +1 (non-binding)
> >> > >>>
> >> > >>> On Mon, Apr 7, 2025 at 10:37 AM Sandy Ryza 
> wrote:
> >> > >>>
> >> >  Hi Khalid – the CLI in the current proposal will need to be
> built on
> >> >  top of internal APIs for constructing and launching pipeline
> executions.
> >> >  We'll have the option to expose these in the future.
> >> > 
> >> >  It would be worthwhile to understand the use cases in more
> depth before
> >> >  exposing these, because APIs are one-way doors and can be
> costly to
> >> >  maintain.
> >> > 
> >> >  On Sat, Apr 5, 2025 at 11:59 PM Khalid Mammadov <
> >> >  khalidmammad...@gmail.com> wrote:
> >> > 
> >> > > Looks great!
> >> > > QQ: will user able to run this pipeline from normal code?
> I.e. can I
> >> > > trigger a pipeline from *driver* code based on some
> condition etc. or
> >> > > it must be executed via separate shell command ?
> >> > > As a background Databricks imposes similar limitation where
> as you
> >> > > cannot run normal Spark code and DLT on the same cluster
> for some reason
> >> > > and forces to use two clusters increasing the cost and
> latency.
> >> > >
> >> > > On Sat, 5 Apr 2025 at 23:03, Sandy Ryza 
> wrote:
> >> > >
> >> > >> Hi all – starting a discussion thread for a SPIP that I've
> been
> >> > >> working on with Chao Sun, Kent Yao, Yuming Wang, and Jie
> Yang: [JIRA
> >> > >> ] [Doc
> >> > >> <
> https://docs.google.com/document/d/1PsSTngFuRVEOvUGzp_25CQL1yfzFHFr02XdMfQ7jOM4/edit?tab=t.0
> >
> >> > >> ].
> >> > >>
> >> > >> The SPIP proposes extending Spark's lazy, declarative
> execution model
> >> > >> beyond single queries, to pipelines that keep multiple
> datasets up to date.
> >> > >> It introduces the ability to compose multiple
> transformations into a single
> >> > >> declarative dataflow graph.
> >> > >>
> >> > >> Declarative pipelines aim to simplify the development and
> management
> >> > >> of data pipelines, by  removing the need for manual
> orchestration of
> >> > >> dependencies and making it possible to catch many errors
> before any
> >> > >> execution steps are launched.
> >> > >>
> >> > >> Declarative pipelines can include both batch and streaming
> >> > >> computations, leveraging Structured Streaming for stream
> processing and new
> >> > >> materialized view syntax for batch processing. Tight
> integration with Spark
> >> > >> SQL's analyzer enables deeper analysis and earlier error
>>>

Re: [DISCUSS] SPIP: Add geospatial types to Spark

2025-03-30 Thread Szehon Ho
>>> ** Why Spark Should Avoid Hardcoding Domain-Specific Types like geo types **
>>> 
>>>1.  Domain types evolve quickly.
>>> 
>>> In geospatial, we already have geometry, geography, raster, trajectory, 
>>> point clouds — and the list keeps growing. In AI/ML, we’re seeing tensors, 
>>> vectors, and multi-dimensional arrays. Spark’s strength has always been in 
>>> its general-purpose architecture and extensibility. Introducing hardcoded 
>>> support for fast-changing domain-specific types risks long-term maintenance 
>>> issues and eventual incompatibility with emerging standards.
>>> 
>>>2.  Geospatial in Java and Python is a dependency hell.
>>> 
>>> There are multiple competing geometry libraries with incompatible APIs. No 
>>> widely adopted Java library supports geography types. The most 
>>> authoritative CRS dataset (EPSG) is not Apache-compatible. The json format 
>>> for CRS definitions (projjson) is only fully supported in PROJ, a C++ 
>>> library without a Java equivalent and no formal OGC standard status. On the 
>>> Python side, this might involve Shapely and GeoPandas dependencies.
>>> 
>>>3.  Sedona already supports Geo fully in (Geo)Parquet.
>>> 
>>> Sedona has supported reading, writing, metadata preservation, and data 
>>> skipping for GeoParquet (predecessor of Parquet Geo) for over two years 
>>> [2][3]. These features are production-tested and widely used.
>>> 
>>> ** Proposed Path Forward: Geo Support via Spark Extensions **
>>> 
>>> To enable seamless Parquet integration without burdening Spark core, here 
>>> are two options:
>>> 
>>> Option 1:
>>> Sedona offers a dedicated `parquet-geo` DataSource that handles type 
>>> encoding, metadata, and data skipping. No changes to Spark are required. 
>>> This is already underway and will be maintained by the Sedona community to 
>>> keep up with the evolving Geo standards.
>>> 
>>> Option 2:
>>> Spark provides hooks to inject:
>>> - custom logical types / user-defined types (UDTs)
>>> - custom statistics and filter pushdowns
>>> Sedona can then extend the built-in `parquet` DataSource to integrate geo 
>>> type metadata, predicate pushdown, and serialization seamlessly.
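A minimal sketch of how Option 1 above would look from the user's side,
assuming Sedona ships the parquet-geo format named in the email (the format
string comes from the email; everything else is illustrative):

  // Geo-typed Parquet read through a Sedona-maintained source, with no
  // changes to Spark core; geometry columns surface as Sedona UDTs.
  val geo = spark.read.format("parquet-geo").load("/data/geoparquet/")
  geo.printSchema()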
>>> 
>>> For Iceberg, we’ve already published a proof-of-concept connector [4] 
>>> showing Sedona, Spark, and Iceberg working together without any Spark core 
>>> changes [5].
>>> 
>>> ** On the Bigger Picture **
>>> 
>>> I also agree with your long-term vision. I believe Spark is on the path to 
>>> becoming a foundational compute engine — much like Postgres or Pandas — 
>>> where the core remains focused and stable, while powerful domain-specific 
>>> capabilities emerge from its ecosystem.
>>> 
>>> To support this future, Spark could prioritize flexible extension hooks so 
>>> that third-party libraries can thrive — just like we’ve seen with PostGIS, 
>>> pgvector, TimescaleDB in the Postgres ecosystem, and GeoPandas in the 
>>> Pandas ecosystem.
>>> 
>>> Sedona is following this model by building geospatial support around Spark 
>>> — not inside it — and we’d love to continue collaborating in this spirit.
>>> 
>>> Happy to work together on providing Geo support in Parquet!
>>> 
>>> Best,
>>> Jia
>>> 
>>> References
>>> 
>>> [1] GeoParquet project:
>>> https://github.com/opengeospatial/geoparquet
>>> 
>>> [2] Sedona’s GeoParquet DataSource implementation:
>>> https://github.com/apache/sedona/tree/master/spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet
>>> 
>>> [3] Sedona’s GeoParquet documentation:
>>> https://sedona.apache.org/latest/tutorial/files/geoparquet-sedona-spark/
>>> 
>>> [4] Sedona-Iceberg connector (PoC):
>>> https://github.com/wherobots/sedona-iceberg-connector
>>> 
>>> [5] Spark-Sedona-Iceberg working example:
>>> https://github.com/wherobots/sedona-iceberg-connector/blob/main/src/test/scala/com/wherobots/sedona/TestGeospatial.scala#L53
>>> 
>>> 
>>> On 2025/03/29 19:27:08 Menelaos Karavelas wrote:
>>>> To continue along the line of thought of Szehon:
>>>> 
>>>> I am really excited that the Parquet and Iceberg communities have adopted 
>>>> geospatial logical types and of cou

Re: [VOTE] Release Spark 4.0.0 (RC4)

2025-04-23 Thread Szehon Ho
One more small fix (on another topic) for the next RC:
https://github.com/apache/spark/pull/50685

Thanks!
Szehon

On Tue, Apr 22, 2025 at 10:07 AM Rozov, Vlad 
wrote:

> Correct, to me it looks like a Spark bug
> https://issues.apache.org/jira/browse/SPARK-51821 that may be hard to
> trigger and is reproducible using the test case provided in
> https://github.com/apache/spark/pull/50594:
>
> 1. The Spark UninterruptibleThread “task” is interrupted by the “test” thread
> while the “task” thread is blocked in an NIO operation.
> 2. The NIO operation is interruptible (the channel is an InterruptibleChannel).
> In the case of Parquet, it is a WritableByteChannel.
> 3. As part of handling the InterruptedException, the channel interrupts the
> “task” thread (
> https://github.com/apache/hadoop/blob/5770647dc73d552819963ba33f50be518058ee03/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1029
> )
>
> Thank you,
>
> Vlad
>
>
> On Apr 22, 2025, at 1:53 AM, Wenchen Fan  wrote:
>
> Correct me if I'm wrong: this is a long-standing Spark bug that is very
> hard to trigger, but the new Parquet version happens to hit the trigger
> condition and exposes the bug. If this is the case, I'm +1 to fix the Spark
> bug instead of downgrading the Parquet version.
>
> Let's move the technical discussions to
> https://github.com/apache/spark/pull/50594.
>
> On Tue, Apr 22, 2025 at 11:20 AM Manu Zhang 
> wrote:
>
>> I don't think PARQUET-2432 has any issue itself. It looks to have
>> triggered a deadlock case like https://github.com/apache/spark/pull/50594.
>>
>> I'd suggest that we fix forward if possible.
>>
>> Thanks,
>> Manu
>>
>> On Mon, Apr 21, 2025 at 11:19 PM Rozov, Vlad 
>> wrote:
>>
>>> The deadlock is reproducible without Parquet. Please see
>>> https://github.com/apache/spark/pull/50594.
>>>
>>> Thank you,
>>>
>>> Vlad
>>>
>>> On Apr 21, 2025, at 1:59 AM, Cheng Pan  wrote:
>>>
>>> The deadlock is introduced by PARQUET-2432(1.14.0), if we decide
>>> downgrade, the latest workable version is Parquet 1.13.1.
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>>
>>>
>>> On Apr 21, 2025, at 16:53, Wenchen Fan  wrote:
>>>
>>> +1 to downgrade to Parquet 1.15.0 for Spark 4.0. According to
>>> https://github.com/apache/spark/pull/50583#issuecomment-2815243571 ,
>>> the Parquet CVE does not affect Spark.
>>>
>>> On Mon, Apr 21, 2025 at 2:45 PM Hyukjin Kwon 
>>> wrote:
>>>
 That's nice but we need to wait for them to release, and upgrade right?
 Let's revert the parquet upgrade out of 4.0 branch since we're not directly
 affected by the CVE anyway.

 On Mon, 21 Apr 2025 at 15:42, Yuming Wang  wrote:

> It seems this patch(https://github.com/apache/parquet-java/pull/3196)
> can avoid deadlock issue if using Parquet 1.15.1.
>
> On Wed, Apr 16, 2025 at 5:39 PM Niranjan Jayakar
>  wrote:
>
>> I found another bug introduced in 4.0 that breaks Spark connect
>> client x server compatibility:
>> https://github.com/apache/spark/pull/50604.
>>
>> Once merged, this should be included in the next RC.
>>
>> On Thu, Apr 10, 2025 at 5:21 PM Wenchen Fan 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 4.0.0.
>>>
>>> The vote is open until April 15 (PST) and passes if a majority +1
>>> PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 4.0.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see
>>> https://spark.apache.org/
>>>
>>> The tag to be voted on is v4.0.0-rc4 (commit
>>> e0801d9d8e33cd8835f3e3beed99a3588c16b776)
>>> https://github.com/apache/spark/tree/v4.0.0-rc4
>>>
>>> The release files, including signatures, digests, etc. can be found
>>> at:
>>> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-rc4-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>>
>>> https://repository.apache.org/content/repositories/orgapachespark-1480/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-rc4-docs/
>>>
>>> The list of bug fixes going into 4.0.0 can be found at the following
>>> URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12353359
>>>
>>> This release is using the release script of the tag v4.0.0-rc4.
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate,
>>> then
>>> reporting any regressions.
>>>
>>> If you

Re: [VOTE] SPIP: Add geospatial types to Spark

2025-05-05 Thread Szehon Ho
+1 (non binding)

Thanks
Szehon

On Mon, May 5, 2025 at 11:17 AM DB Tsai  wrote:

> +1, geospatial types will be a great feature for Spark. Thanks for working
> on it.
>
> On May 5, 2025, at 11:04 AM, Menelaos Karavelas <
> menelaos.karave...@gmail.com> wrote:
>
> I started the discussion on adding geospatial types to Spark on March
> 28th.
> Since then there has been some discussion in the dev mailing list, as well
> as in the SPIP doc.
>
> At this point I would like to move to a formal vote on adding support for
> geospatial types to Spark.
>
> *Discussion thread:*
> https://lists.apache.org/thread/07ozv8ccfddd5tnfl8t74dr4tvnjdkpg
>
> *SPIP:*
>
> https://docs.google.com/document/d/1cYSNPGh95OjnpS0k_KDHGM9Ae3j-_0Wnc_eGBZL4D3w/edit?tab=t.0
>
> *JIRA:*
> https://issues.apache.org/jira/browse/SPARK-51658
>
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because
>
> Menelaos Karavelas
>
>


Re: 4.0.0 RC1 is coming

2025-02-21 Thread Szehon Ho
Hi

Sorry for the late reply. We identified another serious issue with the newly
added Call Procedure; can we add it to the list?

SPARK-51273: Spark Connect Call Procedure runs the procedure twice. I have a
PR to fix it.

I know it's new functionality that Iceberg (and other DSv2 data sources) are
waiting for in Spark 4.0 to implement their Spark procedures, and it would be
great to fix it before the release. Running twice can lead to correctness
issues.
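To make the stakes concrete, a hedged example of the kind of statement
affected, assuming an Iceberg catalog with its usual stored procedures (the
procedure name is illustrative of Iceberg's, not part of the bug report):

  // If SPARK-51273 runs this twice, snapshots would be expired twice;
  // side-effecting procedures are exactly where double execution bites.
  spark.sql("CALL cat.system.expire_snapshots(table => 'db.t')")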

Thanks
Szehon



On Sun, Feb 16, 2025 at 10:36 PM Jungtaek Lim 
wrote:

> I'm working on SPARK-51187
> , to gracefully rename
> the improper config we introduced in SPARK-49699
> . Unfortunately, the
> config was released in Apache Spark 3.5.4, hence we need a graceful way on
> this rather than blindly renaming.
>
> Also, on my radar of reviews, it'd be ideal to include SPARK-50655
>  into Apache Spark
> 4.0.0, otherwise we will need to deal with additional work on storage
> format change.
>
> Thanks for driving the huge release!
>
>
>
>
> On Mon, Feb 17, 2025 at 2:05 PM Wenchen Fan  wrote:
>
>> Hi all,
>>
>> RC1 was scheduled for Feb 15, but I'll cut it on Feb 18 to have 3 working
>> days during the vote period, due to Feb 15 and 16 being the weekend, and
>> Feb 17 being a holiday in the US.
>>
>> The RC1 vote likely won't pass because of some ongoing work but I think
>> it's better to kick off the release process as scheduled.
>>
>> The ongoing work that I'm aware of:
>>
>>- SPARK-38388, SPARK-51016: correctness issue caused by
>>indeterministic query
>>- SPARK-50992: OOM issue caused by AQE UI
>>- SPARK-46057: SQL UDF. SQL table function is still WIP but we can
>>probably re-target it for 4.1. cc @Allison Wang
>>
>>- SPARK-48918: Unified Scala interface for Classic and Connect. A few
>>sub-tasks are still open, do we need to complete them in 4.0? @Herman
>>van Hövell tot Westerflier  @Paddy Xu
>>- SPARK-46815: Arbitrary state API v2. A few sub-tasks are still
>>open, do we need to complete them in 4.0? @Anish Shrigondekar
>>
>>- SPARK-24497: Recursive CTE. The performance issue is hard to fix,
>>we will likely retarget it for 4.1.
>>- from_json performance regression: we should either support CSE for
>>Filter in whole stage codegen (PR
>>) or revert the codegen
>>support of from_json.
>>
>> Please reply to this email if you have other ongoing work to add to this
>> list.
>>
>> Thanks,
>> Wenchen
>>
>>


Re: [DISCUSS] SPIP: Add geospatial types to Spark

2025-03-28 Thread Szehon Ho
Thanks Menelaos, this is exciting! Is there a Google doc we can comment on,
or just the JIRA?

Thanks
Szehon

On Fri, Mar 28, 2025 at 1:41 PM Ángel Álvarez Pascua <
angel.alvarez.pas...@gmail.com> wrote:

> Sorry, I only had a quick look at the proposal, looked for WKT and didn't
> find anything.
>
> It's been years since I worked on geospatial projects and I'm not an
> expert (at all). Maybe starting with something simple but useful like
> conversion WKT<=>WKB?
>
>
> El vie, 28 mar 2025, 21:27, Menelaos Karavelas <
> menelaos.karave...@gmail.com> escribió:
>
>> In the SPIP Jira the proposal is to add the expressions ST_AsBinary,
>> ST_GeomFromWKB, and ST_GeogFromWKB.
>> Is there anything else that you think should be added?
>>
>> Regarding WKT, what do you think should be added?
>>
>> - Menelaos
>>
>>
>> On Mar 28, 2025, at 1:02 PM, Ángel Álvarez Pascua <
>> angel.alvarez.pas...@gmail.com> wrote:
>>
>> What about adding support for WKT
>> 
>> /WKB
>> 
>> ?
>>
>> El vie, 28 mar 2025 a las 20:50, Ángel Álvarez Pascua (<
>> angel.alvarez.pas...@gmail.com>) escribió:
>>
>>> +1 (non-binding)
>>>
>>> El vie, 28 mar 2025, 18:48, Menelaos Karavelas <
>>> menelaos.karave...@gmail.com> escribió:
>>>
 Dear Spark community,

 I would like to propose the addition of new geospatial data types
 (GEOMETRY and GEOGRAPHY), which represent geospatial values and were
 recently added as new logical types in the Parquet specification.

 The new types should improve Spark’s ability to read the new Parquet
 logical types and perform some minimal meaningful operations on them.

 SPIP: https://issues.apache.org/jira/browse/SPARK-51658

 Looking forward to your comments and feedback.


 Best regards,

 Menelaos Karavelas


>>


Re: [DISCUSS] SPIP: Add geospatial types to Spark

2025-03-29 Thread Szehon Ho
Thank you Menelaos, will do!

To give a little background: Jia and the Sedona community, the GeoParquet
community, and others put much effort into defining the Parquet and Iceberg
geo types, which couldn't have been done without their experience and help!

But I do agree with Wenchen. Now that the types are in the most common data
sources in the ecosystem, I think Apache Spark as a common platform needs
this type definition for inter-op; otherwise users of vanilla Spark cannot
work with the geospatial data stored in those data sources. (IMO a similar
rationale applies to adding the nanosecond timestamp in the other ongoing
SPIP.) And like Wenchen said, the SPIP's goal doesn't seem to be to fragment
the ecosystem by implementing Sedona's advanced geospatial analytics in
Spark itself, which you may be right belongs in pluggable frameworks.
Menelaos may explain more about the SPIP's goal.

I do hope there can be more collaboration across communities (like the
Iceberg/Parquet collaboration) to draw on the Sedona community's experience
in making sure these type definitions are optimal and compatible with Sedona.

Thanks!
Szehon

On Mar 29, 2025, at 8:04 AM, Menelaos Karavelas wrote:

Hello Szehon,

I just created a Google doc and also linked it in the JIRA: SPIP: Add
geospatial types in Spark. Please feel free to comment on it.

Best,
Menelaos
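A small sketch of the interop the SPIP targets, using the expressions named
in it (ST_GeomFromWKB / ST_AsBinary); hedged, since none of this is a shipped
Spark API yet and the table/column names are made up:

  // Round-trip a WKB geometry column through the proposed expressions.
  spark.sql("""
    SELECT ST_AsBinary(ST_GeomFromWKB(wkb)) AS wkb_roundtrip
    FROM geo_table
  """).show()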





Re: [VOTE] Release Spark 4.1.0-preview1 (RC1)

2025-07-11 Thread Szehon Ho
+1 (non-binding)

Checked signature, checksum, basic functionality of
spark-4.1.0-preview1-bin-hadoop3

Thanks for setting this up !
Szehon

On Thu, Jul 10, 2025 at 11:18 PM Yang Jie  wrote:

> +1
>
> On 2025/07/11 04:23:27 Ángel Álvarez Pascua wrote:
> > +1 (non-binding)
> >
> > El jue, 10 jul 2025, 21:07, Jules Damji 
> escribió:
> >
> > > +1 (non-binding)
> > > —
> > > Sent from my iPhone
> > > Pardon the dumb thumb typos :)
> > >
> > > On Jul 10, 2025, at 8:04 AM, Peter Toth  wrote:
> > >
> > > 
> > > +1
> > >
> > > On Thu, Jul 10, 2025 at 9:12 AM Kent Yao  wrote:
> > >
> > >> Thank you all for the verification, +1
> > >>
> > >> Kent
> > >>
> > >> 在 2025年7月10日星期四,Hyukjin Kwon  写道:
> > >>
> > >>> It was a mistake in the email. The artifact shouldn't have a problem.
> > >>>
> > >>> On Thu, 10 Jul 2025 at 16:00, Hyukjin Kwon 
> wrote:
> > >>>
> >  oh yeah. I think I should change the email contents.
> > 
> >  On Thu, 10 Jul 2025 at 15:02, Saruta, Kousuke
> >   wrote:
> > 
> > > Using dev1 rather than preview1 seems intended.
> > >
> > >
> > >
> https://github.com/apache/spark/blob/v4.1.0-preview1-rc1/dev/create-release/release-build.sh#L127
> > >
> > >
> > >
> > > *送信元**: *Jungtaek Lim 
> > > *日付**: *2025年7月10日 木曜日 14:30
> > > *宛先**: *Kent Yao 
> > > *Cc: *Anton Okolnychyi , Max Gekk <
> > > max.g...@gmail.com>, Sandy Ryza ,
> > > Wenchen Fan , Kousuke Saruta <
> saru...@apache.org>,
> > > "dev@spark.apache.org" 
> > > *件名**: *RE: [EXTERNAL] [VOTE] Release Spark 4.1.0-preview1 (RC1)
> > >
> > >
> > >
> > > *CAUTION*: This email originated from outside of the organization.
> Do
> > > not click links or open attachments unless you can confirm the
> sender and
> > > know the content is safe.
> > >
> > >
> > >
> > > I think we used "dev1" for 4.0.0 "preview1" as well. I guess this
> is
> > > based on the naming convention in Python?
> > >
> > >
> > >
> > > On Thu, Jul 10, 2025 at 1:07 PM Kent Yao  wrote:
> > >
> > > -1,
> > >
> > >
> > >
> > > There is a 404 for
> > >
> https://dist.apache.org/repos/dist/dev/spark/v4.1.0-preview1-rc1-bin/pyspark-4.1.0-preview1.tar.gz
> > >
> > >
> > >
> > > pip install
> > >
> https://dist.apache.org/repos/dist/dev/spark/v4.1.0-preview1-rc1-bin/pyspark-4.1.0-preview1.tar.gz
> > > Collecting
> > >
> https://dist.apache.org/repos/dist/dev/spark/v4.1.0-preview1-rc1-bin/pyspark-4.1.0-preview1.tar.gz
> > >   ERROR: HTTP error 404 while getting
> > >
> https://dist.apache.org/repos/dist/dev/spark/v4.1.0-preview1-rc1-bin/pyspark-4.1.0-preview1.tar.gz
> > > ERROR: Could not install requirement
> > >
> https://dist.apache.org/repos/dist/dev/spark/v4.1.0-preview1-rc1-bin/pyspark-4.1.0-preview1.tar.gz
> > > because of HTTP error 404 Client Error: Not Found for url:
> > >
> https://dist.apache.org/repos/dist/dev/spark/v4.1.0-preview1-rc1-bin/pyspark-4.1.0-preview1.tar.gz
> > > for URL
> > >
> https://dist.apache.org/repos/dist/dev/spark/v4.1.0-preview1-rc1-bin/pyspark-4.1.0-preview1.tar.gz
> > >
> > >
> > >
> > > Can you check?
> > >
> > >
> > >
> > >
> > >
> > > Kent
> > >
> > >
> > >
> > >
> > >
> > > Jungtaek Lim  于2025年7月10日周四 07:58写道:
> > >
> > > +1 (non-binding) Let's give it a try!
> > >
> > >
> > >
> > > On Thu, Jul 10, 2025 at 12:24 AM Anton Okolnychyi <
> > > aokolnyc...@gmail.com> wrote:
> > >
> > > +1 (non-binding)
> > >
> > >
> > >
> > > On Wed, Jul 9, 2025 at 8:07 AM Max Gekk 
> wrote:
> > >
> > > +1
> > >
> > >
> > >
> > > On Wed, Jul 9, 2025 at 4:04 PM Sandy Ryza
> > >  wrote:
> > >
> > > +1 (non-binding)
> > >
> > >
> > >
> > > On Wed, Jul 9, 2025 at 6:57 AM Wenchen Fan 
> > > wrote:
> > >
> > > +1
> > >
> > >
> > >
> > > On Wed, Jul 9, 2025 at 1:16 AM Kousuke Saruta 
> > > wrote:
> > >
> > > +1
> > >
> > >
> > >
> > > 2025年7月9日(水) 2:12 Rozov, Vlad :
> > >
> > > +1 (non-binding)
> > >
> > >
> > >
> > > Thank you,
> > >
> > >
> > >
> > > Vlad
> > >
> > >
> > >
> > > *From: *Dongjoon Hyun 
> > > *Date: *Tuesday, July 8, 2025 at 8:09 AM
> > > *To: *Hyukjin Kwon 
> > > *Cc: *"dev@spark.apache.org" 
> > > *Subject: *RE: [EXTERNAL] [VOTE] Release Spark 4.1.0-preview1 (RC1)
> > >
> > >
> > >
> > > +1
> > >
> > >
> > >
> > > Dongjoon
> > >
> > >
> > >
> > > On Tue, Jul 8, 2025 at 05:41 Hyukjin Kwon 
> > > wrote:
> > >
> > > Alright. +1 from myself :-).
> > >
> > >
> > >
> > > On Tue, Jul 8, 2025 at 9:39 PM  wrote:
> > >
> > > Please vote on releasing the following candidate as Apache

Re: [VOTE] SPIP: Monthly preview release

2025-07-03 Thread Szehon Ho
+1 (non-binding)

Thanks for the proposal; hope we get faster releases in Spark one day.

Thanks
Szehon

On Thu, Jul 3, 2025 at 6:58 AM Sandy Ryza 
wrote:

> +1 (non-binding)
>
> On Thu, Jul 3, 2025 at 6:47 AM Jules Damji  wrote:
>
>> +1 (non-binding)
>> —
>> Sent from my iPhone
>> Pardon the dumb thumb typos :)
>>
>> > On Jul 2, 2025, at 11:44 PM, L. C. Hsieh  wrote:
>> >
>> > +1
>> >
>> >> On Wed, Jul 2, 2025 at 9:38 PM Hyukjin Kwon 
>> wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I would like to start a vote on the monthly preview releases.
>> >>
>> >> Discussion thread:
>> https://lists.apache.org/thread/1hmsb3g7lm5k2f9xnp6x2hmss8yrd5h8
>> >> SPIP:
>> https://docs.google.com/document/d/1ysJ16z_NUfIdsYqq1Qq7k8htmMWFpo8kXqX-8lGzCGc/edit?tab=t.0#heading=h.89yty49abp67
>> >> JIRA: https://issues.apache.org/jira/browse/SPARK-52625
>> >>
>> >> Please vote on the SPIP for the next 72 hours:
>> >>
>> >> [ ] +1: Accept the proposal as an official SPIP
>> >> [ ] +0
>> >> [ ] -1: I don’t think this is a good idea because …
>> >>
>> >>
>> >> Starting with my own +1.
>> >
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE] Release Spark 4.0.0 (RC5)

2025-05-12 Thread Szehon Ho
+1 (non binding)

Checked license, signature, checksum, ran basic test on
spark-4.0.0-bin-hadoop3.

Thanks
Szehon

On Mon, May 12, 2025 at 9:02 PM Sakthi  wrote:

> +1 (non-binding)
>
> On Mon, May 12, 2025 at 7:38 PM Jungtaek Lim 
> wrote:
>
>> +1 (non-binding)
>>
>> Thanks Wenchen for driving the release!
>>
>> On Tue, May 13, 2025 at 11:35 AM Yang Jie  wrote:
>>
>>> +1, thank you Wenchen
>>>
>>> On 2025/05/13 02:11:02 "Rozov, Vlad" wrote:
>>> > +1 (non-binding)
>>> >
>>> > Thank you,
>>> >
>>> > Vlad
>>> >
>>> > On May 12, 2025, at 5:44 PM, huaxin gao 
>>> wrote:
>>> >
>>> > +1
>>> >
>>> > On Mon, May 12, 2025 at 5:34 PM Hyukjin Kwon >> > wrote:
>>> > +1
>>> >
>>> > On Tue, 13 May 2025 at 03:24, Xinrong Meng >> xinr...@apache.org>> wrote:
>>> > +1
>>> >
>>> > Thank you Wenchen!
>>> >
>>> > On Mon, May 12, 2025 at 10:03 AM Yuming Wang >> > wrote:
>>> > +1
>>> >
>>> > On Tue, May 13, 2025 at 12:07 AM Gengliang Wang >> > wrote:
>>> > +1
>>> >
>>> > On Mon, May 12, 2025 at 6:52 AM Wenchen Fan >> > wrote:
>>> > I'll start with my own +1.
>>> >
>>> > All the known blockers are fixed, and I verified that the new Spark
>>> Connect distribution works as expected.
>>> >
>>> > On Fri, May 9, 2025 at 8:16 PM Wenchen Fan >> > wrote:
>>> > Please vote on releasing the following candidate as Apache Spark
>>> version 4.0.0.
>>> >
>>> > The vote is open until May 15 (PST) and passes if a majority +1 PMC
>>> votes are cast, with a minimum of 3 +1 votes.
>>> >
>>> > [ ] +1 Release this package as Apache Spark 4.0.0
>>> > [ ] -1 Do not release this package because ...
>>> >
>>> > To learn more about Apache Spark, please see https://spark.apache.org/
>>> >
>>> > The tag to be voted on is v4.0.0-rc5 (commit
>>> f35a2ee6dc7833ea0cff757147132c9fdc26c113)
>>> > https://github.com/apache/spark/tree/v4.0.0-rc5
>>> >
>>> > The release files, including signatures, digests, etc. can be found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v4.0.0-rc5-bin/
>>> >
>>> > Signatures used for Spark RCs can be found in this file:
>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> >
>>> > The staging repository for this release can be found at:
>>> >
>>> https://repository.apache.org/content/repositories/orgapachespark-1483/
>>> >
>>> > The documentation corresponding to this release can be found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v4.0.0-rc5-docs/
>>> >
>>> > The list of bug fixes going into 4.0.0 can be found at the following
>>> URL:
>>> > https://issues.apache.org/jira/projects/SPARK/versions/12353359
>>> >
>>> > This release is using the release script of the tag v4.0.0-rc5.
>>> >
>>> > FAQ
>>> >
>>> > =
>>> > How can I help test this release?
>>> > =
>>> >
>>> > If you are a Spark user, you can help us test this release by taking
>>> > an existing Spark workload and running on this release candidate, then
>>> > reporting any regressions.
>>> >
>>> > If you're working in PySpark you can set up a virtual env and install
>>> > the current RC and see if anything important breaks, in the Java/Scala
>>> > you can add the staging repository to your projects resolvers and test
>>> > with the RC (make sure to clean up the artifact cache before/after so
>>> > you don't end up building with a out of date RC going forward).
>>> >
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: [VOTE] Release Spark 4.0.0 (RC6)

2025-05-13 Thread Szehon Ho
+1 (non-binding)

Checked signature, checksum, basic test on spark-4.0.0-bin-hadoop3

Thanks
Szehon

On Tue, May 13, 2025 at 8:44 PM Yang Jie  wrote:

> +1
>
> On 2025/05/14 00:21:11 Ruifeng Zheng wrote:
> > +1
> >
> > On Wed, May 14, 2025 at 7:01 AM Gengliang Wang  wrote:
> >
> > > +1
> > >
> > > On Tue, May 13, 2025 at 3:57 PM Hyukjin Kwon 
> wrote:
> > >
> > >> +1
> > >>
> > >> On Wed, 14 May 2025 at 07:29, Wenchen Fan 
> wrote:
> > >>
> > >>> Same as before, I'll start with my own +1.
> > >>>
> > >>> On Wed, May 14, 2025 at 12:28 AM Wenchen Fan 
> > >>> wrote:
> > >>>
> >  Please vote on releasing the following candidate as Apache Spark
> >  version 4.0.0.
> > 
> >  The vote is open until May 16 (PST) and passes if a majority +1 PMC
> >  votes are cast, with a minimum of 3 +1 votes.
> > 
> >  [ ] +1 Release this package as Apache Spark 4.0.0
> >  [ ] -1 Do not release this package because ...
> > 
> >  To learn more about Apache Spark, please see
> https://spark.apache.org/
> > 
> >  The tag to be voted on is v4.0.0-rc6 (commit
> >  9a99ecb03a2d35f5f38decd686b55511a5c7c535)
> >  https://github.com/apache/spark/tree/v4.0.0-rc6
> > 
> >  The release files, including signatures, digests, etc. can be found
> at:
> >  https://dist.apache.org/repos/dist/dev/spark/v4.0.0-rc6-bin/
> > 
> >  Signatures used for Spark RCs can be found in this file:
> >  https://dist.apache.org/repos/dist/dev/spark/KEYS
> > 
> >  The staging repository for this release can be found at:
> > 
> https://repository.apache.org/content/repositories/orgapachespark-1484/
> > 
> >  The documentation corresponding to this release can be found at:
> >  https://dist.apache.org/repos/dist/dev/spark/v4.0.0-rc6-docs/
> > 
> >  The list of bug fixes going into 4.0.0 can be found at the following
> >  URL:
> >  https://issues.apache.org/jira/projects/SPARK/versions/12353359
> > 
> >  This release is using the release script of the tag v4.0.0-rc6.
> > 
> >  FAQ
> > 
> >  =
> >  How can I help test this release?
> >  =
> > 
> >  If you are a Spark user, you can help us test this release by taking
> >  an existing Spark workload and running on this release candidate,
> then
> >  reporting any regressions.
> > 
> >  If you're working in PySpark you can set up a virtual env and
> install
> >  the current RC and see if anything important breaks, in the
> Java/Scala
> >  you can add the staging repository to your projects resolvers and
> test
> >  with the RC (make sure to clean up the artifact cache before/after
> so
> >  you don't end up building with a out of date RC going forward).
> > 
> > >>>
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Spark 4.0.0 (RC7)

2025-05-19 Thread Szehon Ho
+1 (non-binding)

Checked signature, checksum, ran basic tests on spark-4.0.0-bin-hadoop3
Thanks
Szehon



On Mon, May 19, 2025 at 9:07 PM Denny Lee  wrote:

> +1 (non-binding)
>
> On Mon, May 19, 2025 at 9:02 PM Rozov, Vlad 
> wrote:
>
>> +1 (non-binding)
>>
>> Vlad
>>
>> On May 19, 2025, at 8:56 PM, Jules Damji  wrote:
>>
>> + 1 (non-binding)
>> —
>> Sent from my iPhone
>> Pardon the dumb thumb typos :)
>>
>> On May 19, 2025, at 5:26 PM, Gengliang Wang  wrote:
>>
>> 
>> +1
>>
>> On Mon, May 19, 2025 at 5:21 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> +1 (non-binding)
>>>
>>> On Tue, May 20, 2025 at 8:47 AM Ruifeng Zheng 
>>> wrote:
>>>
 +1

 On Tue, May 20, 2025 at 7:04 AM Hyukjin Kwon 
 wrote:

> +1
>
> On Mon, 19 May 2025 at 21:27, Wenchen Fan  wrote:
>
>> Same as before, I'll start with my own +1.
>>
>> On Mon, May 19, 2025 at 8:25 PM Wenchen Fan 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 4.0.0.
>>>
>>> The vote is open until May 22 (PST) and passes if a majority +1 PMC
>>> votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 4.0.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see
>>> https://spark.apache.org/
>>>
>>> The tag to be voted on is v4.0.0-rc7 (commit
>>> fa33ea000a0bda9e5a3fa1af98e8e85b8cc5e4d4)
>>> https://github.com/apache/spark/tree/v4.0.0-rc7
>>>
>>> The release files, including signatures, digests, etc. can be found
>>> at:
>>> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-rc7-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>>
>>> https://repository.apache.org/content/repositories/orgapachespark-1485/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-rc7-docs/
>>>
>>> The list of bug fixes going into 4.0.0 can be found at the following
>>> URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12353359
>>>
>>> This release is using the release script of the tag v4.0.0-rc7.
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate,
>>> then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the
>>> Java/Scala
>>> you can add the staging repository to your projects resolvers and
>>> test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with a out of date RC going forward).
>>>
>>
>>


Re: [VOTE] Release Spark 4.0.1 (RC1)

2025-09-04 Thread Szehon Ho
+1 (non binding)

Checked signature, checksum, and ran basic test on spark-4.0.1-bin-hadoop3.

Thanks Dongjoon
Szehon

On Tue, Sep 2, 2025 at 11:50 PM Jungtaek Lim 
wrote:

> +1 (non-binding)
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Wed, Sep 3, 2025 at 9:16 AM kazuyuki tanimura
>  wrote:
>
>>
>> +1 (non-binding)
>>
>> Thanks
>> Kazu
>>
>>
>> On Sep 2, 2025, at 2:17 PM, Holden Karau  wrote:
>>
>> +1
>>
>> On Tue, Sep 2, 2025 at 11:56 AM Rozov, Vlad 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>>
>>>
>>> Thank you,
>>>
>>>
>>>
>>> Vlad
>>>
>>>
>>>
>>> *From: *Zhou Jiang 
>>> *Date: *Tuesday, September 2, 2025 at 10:10 AM
>>> *To: *Anish Shrigondekar 
>>> *Cc: *huaxin gao , Dongjoon Hyun <
>>> dongj...@apache.org>, "dev@spark.apache.org" 
>>> *Subject: *RE: [EXTERNAL] [VOTE] Release Spark 4.0.1 (RC1)
>>>
>>>
>>>
>>> +1 (non-binding)
>>>
>>>
>>>
>>> On Tue, Sep 2, 2025 at 10:07 AM Anish Shrigondekar
>>>  wrote:
>>>
>>> +1
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Anish
>>>
>>>
>>>
>>> On Tue, Sep 2, 2025 at 8:42 AM huaxin gao 
>>> wrote:
>>>
>>> +1
>>>
>>>
>>>
>>> On Tue, Sep 2, 2025 at 8:38 AM Dongjoon Hyun 
>>> wrote:
>>>
>>> +1
>>>
>>> Dongjoon
>>>
>>> On 2025/09/02 15:23:55 "L. C. Hsieh" wrote:
>>> > +1
>>> >
>>> > On Tue, Sep 2, 2025 at 6:08 AM Wenchen Fan 
>>> wrote:
>>> > >
>>> > > +1
>>> > >
>>> > > On Tue, Sep 2, 2025 at 1:48 PM  wrote:
>>> > >>
>>> > >> Please vote on releasing the following candidate as Apache Spark
>>> version 4.0.1.
>>> > >>
>>> > >> The vote is open until Fri, 05 Sep 2025 22:47:52 PDT and passes if
>>> a majority +1 PMC votes are cast, with
>>> > >> a minimum of 3 +1 votes.
>>> > >>
>>> > >> [ ] +1 Release this package as Apache Spark 4.0.1
>>> > >> [ ] -1 Do not release this package because ...
>>> > >>
>>> > >> To learn more about Apache Spark, please see
>>> https://spark.apache.org/
>>> > >>
>>> > >> The tag to be voted on is v4.0.1-rc1 (commit 29434ea766b):
>>> > >> https://github.com/apache/spark/tree/v4.0.1-rc1
>>> > >>
>>> > >> The release files, including signatures, digests, etc. can be found
>>> at:
>>> > >> https://dist.apache.org/repos/dist/dev/spark/v4.0.1-rc1-bin/
>>> > >>
>>> > >> Signatures used for Spark RCs can be found in this file:
>>> > >> https://downloads.apache.org/spark/KEYS
>>> > >>
>>> > >> The staging repository for this release can be found at:
>>> > >>
>>> https://repository.apache.org/content/repositories/orgapachespark-1501/
>>> > >>
>>> > >> The documentation corresponding to this release can be found at:
>>> > >> https://dist.apache.org/repos/dist/dev/spark/v4.0.1-rc1-docs/
>>> > >>
>>> > >> The list of bug fixes going into 4.0.1 can be found at the
>>> following URL:
>>> > >> https://issues.apache.org/jira/projects/SPARK/versions/12355941
>>> > >>
>>> > >> FAQ
>>> > >>
>>> > >> =
>>> > >> How can I help test this release?
>>> > >> =
>>> > >>
>>> > >> If you are a Spark user, you can help us test this release by taking
>>> > >> an existing Spark workload and running on this release candidate,
>>> then
>>> > >> reporting any regressions.
>>> > >>
>>> > >> If you're working in PySpark you can set up a virtual env and
>>> install
>>> > >> the current RC via "pip install
>>> https://dist.apache.org/repos/dist/dev/spark/v4.0.1-rc1-bin/pyspark-4.0.1.tar.gz
>>> "
>>> > >> and see if anything important breaks.
>>> > >> In the Java/Scala, you can add the staging repository to your
>>> project's resolvers and test
>>> > >> with the RC (make sure to clean up the artifact cache before/after
>>> so
>>> > >> you don't end up building with an out of date RC going forward).
>>> > >>
>>> > >>
>>> -
>>> > >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> > >>
>>> >
>>> > -
>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>>
>>>
>>> --
>>>
>>> *Zhou JIANG*
>>>
>>>
>>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>> 
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> Pronouns: she/her
>>
>>
>>


Re: [VOTE] SPIP: Add llms.txt files to Spark Documentation

2025-09-18 Thread Szehon Ho
+1 (non-binding)

Thanks!
Szehon

On Thu, Sep 18, 2025 at 4:46 PM Jungtaek Lim 
wrote:

> (I missed clarifying that my +1 is non-binding, just to make it easier to count)
>
> On Fri, Sep 19, 2025 at 8:44 AM Jungtaek Lim 
> wrote:
>
>> +1, this sounds promising given the trend of reliance on AI.
>>
>> On Wed, Sep 17, 2025 at 4:04 PM Kousuke Saruta 
>> wrote:
>>
>>> +1
>>>
>>> 2025年9月17日(水) 15:17 Dongjoon Hyun :
>>>
 +1

 Dongjoon

 On 2025/09/16 02:30:33 Jules Damji wrote:
 > + 1 (non-binding)
 > —
 > Sent from my iPhone
 > Pardon the dumb thumb typos :)
 >
 > > On Sep 15, 2025, at 3:26 PM, Allison Wang 
 wrote:
 > >
 > > 
 > > Hi all,
 > >
 > > I would like to start a vote on the SPIP: Add llms.txt files to
 Spark Documentation
 > >
 > > Discussion thread:
 https://lists.apache.org/thread/7rnhn9xfl4bgfg0p6mlwo55y5vmpb9f6
 > > SPIP:
 https://docs.google.com/document/d/1tRYdNTrIs8-JTgDthQ-7kcxEG7S91mNUVmUOfevW-cE/edit?tab=t.0#heading=h.wq8o4rl94dvr
 > > JIRA: https://issues.apache.org/jira/browse/SPARK-53528
 > >
 > > Please vote on the SPIP for the next 72 hours:
 > >
 > > [ ] +1: Accept the proposal as an official SPIP
 > > [ ] +0
 > > [ ] -1: I don’t think this is a good idea because …
 >

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org




Re: [VOTE] Release Spark 4.1.0-preview2 (RC1)

2025-09-25 Thread Szehon Ho
+1 (non-binding).

Checked signature, checksum, basic sql query.

Thanks
Szehon

On Thu, Sep 25, 2025 at 8:31 AM Rozov, Vlad 
wrote:

> +1 (non-binding)
>
>
>
> Thank you,
>
>
>
> Vlad
>
>
>
> *From: *Peter Toth 
> *Date: *Thursday, September 25, 2025 at 7:14 AM
> *To: *"dev@spark.apache.org" 
> *Subject: *RE: [VOTE] Release Spark 4.1.0-preview2 (RC1)
>
>
>
> +1 (non-binding)
>
>
>
> On Thu, Sep 25, 2025 at 7:37 AM Yuming Wang  wrote:
>
> +1
>
>
>
> On Thu, Sep 25, 2025 at 10:34 AM Yang Jie  wrote:
>
> +1
>
> Thank you Hyukjin.
>
> On 2025/09/25 02:17:10 Jungtaek Lim wrote:
> > +1 (non-binding)
> >
> > Thanks Hyukjin!
> >
> > On Thu, Sep 25, 2025 at 3:36 AM John Zhuge  wrote:
> >
> > > +1 Thanks Hyukjin!
> > >
> > > On Wed, Sep 24, 2025 at 10:53 AM huaxin gao 
> > > wrote:
> > >
> > >> +1 Thanks Hyukjin for driving the release!
> > >>
> > >> On Wed, Sep 24, 2025 at 9:45 AM L. C. Hsieh  wrote:
> > >>
> > >>> +1
> > >>>
> > >>> Thanks Hyukjin.
> > >>>
> > >>> On Wed, Sep 24, 2025 at 9:18 AM Dongjoon Hyun 
> > >>> wrote:
> > >>> >
> > >>> > +1
> > >>> >
> > >>> > Thank you, Hyukjin.
> > >>> >
> > >>> > Dongjoon
> > >>> >
> > >>> > On 2025/09/24 12:48:48 Wenchen Fan wrote:
> > >>> > > +1
> > >>> > >
> > >>> > > On Wed, Sep 24, 2025 at 7:29 PM  wrote:
> > >>> > >
> > >>> > > > Please vote on releasing the following candidate as Apache
> Spark
> > >>> version
> > >>> > > > 4.1.0-preview2.
> > >>> > > >
> > >>> > > > The vote is open until Sat, 27 Sep 2025 05:26:22 PDT and
> passes if
> > >>> a
> > >>> > > > majority +1 PMC votes are cast, with
> > >>> > > > a minimum of 3 +1 votes.
> > >>> > > >
> > >>> > > > [ ] +1 Release this package as Apache Spark 4.1.0-preview2
> > >>> > > > [ ] -1 Do not release this package because ...
> > >>> > > >
> > >>> > > > To learn more about Apache Spark, please see
> > >>> https://spark.apache.org/
> > >>> > > >
> > >>> > > > The tag to be voted on is v4.1.0-preview2-rc1 (commit
> c5ff48cc2b2):
> > >>> > > > https://github.com/apache/spark/tree/v4.1.0-preview2-rc1
> > >>> > > >
> > >>> > > > The release files, including signatures, digests, etc. can be
> > >>> found at:
> > >>> > > >
> > >>>
> https://dist.apache.org/repos/dist/dev/spark/v4.1.0-preview2-rc1-bin/
> > >>> > > >
> > >>> > > > Signatures used for Spark RCs can be found in this file:
> > >>> > > > https://downloads.apache.org/spark/KEYS
> > >>> > > >
> > >>> > > > The staging repository for this release can be found at:
> > >>> > > >
> > >>>
> https://repository.apache.org/content/repositories/orgapachespark-1503/
> > >>> > > >
> > >>> > > > The documentation corresponding to this release can be found
> at:
> > >>> > > >
> > >>>
> https://dist.apache.org/repos/dist/dev/spark/v4.1.0-preview2-rc1-docs/
> > >>> > > >
> > >>> > > > The list of bug fixes going into 4.1.0-preview2 can be found
> at the
> > >>> > > > following URL:
> > >>> > > >
> https://issues.apache.org/jira/projects/SPARK/versions/12355581
> > >>> > > >
> > >>> > > > FAQ
> > >>> > > >
> > >>> > > > =
> > >>> > > > How can I help test this release?
> > >>> > > > =
> > >>> > > >
> > >>> > > > If you are a Spark user, you can help us test this release by
> > >>> taking
> > >>> > > > an existing Spark workload and running on this release
> candidate,
> > >>> then
> > >>> > > > reporting any regressions.
> > >>> > > >
> > >>> > > > If you're working in PySpark you can set up a virtual env and
> > >>> install
> > >>> > > > the current RC via "pip install
> > >>> > > >
> > >>>
> https://dist.apache.org/repos/dist/dev/spark/v4.1.0-preview2-rc1-bin/pyspark-4.1.0.dev2.tar.gz
> > >>> > > > "
> > >>> > > > and see if anything important breaks.
> > >>> > > > In the Java/Scala, you can add the staging repository to your
> > >>> project's
> > >>> > > > resolvers and test
> > >>> > > > with the RC (make sure to clean up the artifact cache
> before/after
> > >>> so
> > >>> > > > you don't end up building with an out of date RC going
> forward).
> > >>> > > >
> > >>> > > >
> > >>> -
> > >>> > > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >>> > > >
> > >>> > > >
> > >>> > >
> > >>> >
> > >>> >
> -
> > >>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >>> >
> > >>>
> > >>> -
> > >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >>>
> > >>>
> > >
> > > --
> > > John Zhuge
> > >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>