Re: Security model update

2025-04-10 Thread Sean Owen
Sure, how about here though? https://github.com/apache/spark-website/pull/602 On Mon, Apr 7, 2025 at 9:30 AM Arnout Engelen wrote: > On Mon, Apr 7, 2025 at 4:16 PM Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> But I will note that that person’s reply to the ASF Security Team’s >>

Re: setuptools 78.0.0 does not work with pyspark 3.x releases

2025-04-05 Thread Sean Owen
<https://github.com/apache/spark/pull/50369> > > On Mon, 24 Mar 2025 at 17:32, Sean Owen wrote: > >> I think we're about to hear about this: >> >> setuptools 78.0.0, released yesterday, no longer allows dashes in keys in >> setup.cfg: >> https://setuptools

Re: [DISCUSS] SPARK-51318: Remove `jar` files from Apache Spark repository and disable affected tests

2025-03-25 Thread Sean Owen
>> ASF may request to pull out release that does not meet ASF policy >>>>>>>>>>>> (and >>>>>>>>>>>> having tests is not ASF policy). IMO, SPARK-51318 should be a >>>>>>>>>>>> blocker

setuptools 78.0.0 does not work with pyspark 3.x releases

2025-03-24 Thread Sean Owen
I think we're about to hear about this: setuptools 78.0.0, released yesterday, no longer allows dashes in keys in setup.cfg: https://setuptools.pypa.io/en/stable/history.html#v78-0-0 The pyspark packaging has 'description-file' instead of 'description_file' in its setup.cfg, and so will not insta
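The failure mode described above can be checked for ahead of time. Below is a minimal sketch, assuming only what the message says (dashed keys such as 'description-file' in setup.cfg are rejected by setuptools 78.0.0). The helper name `find_dashed_keys` and the sample file contents are hypothetical, and the check flags any dashed key in any section, which may be broader than what setuptools itself rejects.

```python
import configparser


def find_dashed_keys(cfg_text):
    """Return (section, key) pairs whose key contains a dash.

    Hypothetical helper for illustration; not part of setuptools or pyspark.
    """
    # interpolation=None so '%' characters in values are left alone
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    dashed = []
    for section in parser.sections():
        for key in parser[section]:
            if "-" in key:
                dashed.append((section, key))
    return dashed


# Sample setup.cfg contents (an assumption, modeled on the key named in the message)
SAMPLE = """\
[metadata]
description-file = README.md
"""

print(find_dashed_keys(SAMPLE))  # [('metadata', 'description-file')]
```

Renaming the reported key to its underscore form (here, 'description_file') makes the check return an empty list.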

Re: [VOTE][RESULT] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

2025-03-15 Thread Sean Owen
Mark et al - this thread has gone on way too long. Everyone has expressed their opinion. The result stands. Anyone who is really upset about it, please escalate to the board or something, but, this thread and decision point has now concluded. On Sat, Mar 15, 2025 at 1:16 PM Mark Hamstra wrote:

Re: [VOTE][RESULT] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

2025-03-13 Thread Sean Owen
This has been ongoing for a week, the vote has been open for 3 days, Dongjoon has replied today (not sure if you saw it), and I think this is all around in circles; I don't see any basis for waiting 24 hours (? where is this from?) I don't know if this is a code change vote - there is no code chang

Re: [VOTE] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

2025-03-13 Thread Sean Owen
>>>>> > it a big deal this time when vendor names are already present >>> > >>>>> elsewhere? If >>> > >>>>> > we’ve failed to follow a policy, let’s correct it, but can >>> someone >>> > >>>

Re: [VOTE] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

2025-03-11 Thread Sean Owen
+1 to retain, to avoid problems for users at ~0 cost. On Mon, Mar 10, 2025 at 7:45 AM Jungtaek Lim wrote: > Hi dev, > > Please vote to retain migration logic of incorrect `spark.databricks.*` > configuration in Spark 4.0.x. > > - DISCUSSION: > https://lists.apache.org/thread/xzk9729lsmo397crdtk1

Re: [VOTE] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

2025-03-10 Thread Sean Owen
Doesn't the migration code 'clear' the debt? The proposal is not to continue to support the config. I feel like people are not quite understanding the change, and objecting to something that doesn't exist. It's a shame, as this seems like something not even worth discussing. I don't know why this t

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-07 Thread Sean Owen
What is the problem with the existence of the migration logic? I understand not keeping the misnamed config. But the migration logic does no harm other than taking up a couple lines in the code, no? Unless someone offers any reason this is an issue... what are we even talking about. Is the idea th

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-07 Thread Sean Owen
I don't understand the problem with keeping migration logic in for a long time, just in case. Who cares, it's some bit of check buried somewhere in the streaming code, much like deprecation warnings. There is not somehow an ASF policy compelling the removal of such logic; you are not _required_ to

Re: [DISCUSS] SPARK-51318: Remove `jar` files from Apache Spark repository and disable affected tests

2025-02-26 Thread Sean Owen
The gist of the initial 2018 thread was: These are not source .jar files that users use, but .jar files used to test loading of from .jar files. These are test resources only. I don't think this is what the spirit of the rule is speaking to, that the end-user code should always have source code, wh

Re: [DISCUSS] SPARK-51318: Remove `jar` files from Apache Spark repository and disable affected tests

2025-02-26 Thread Sean Owen
the following link to provide the XZ Utils history > explicitly. > > > https://www.akamai.com/blog/security-research/critical-linux-backdoor-xz-utils-discovered-what-to-know > > Although I agree that those test coverages are important, I don't think > that's wort

Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

2025-02-10 Thread Sean Owen
I don't think this makes sense, or lacks motivation. You want teams to convert pandas code to Pyspark syntax, only to run it on pandas? why? just run the pandas code in a larger job that also uses Spark if you like, or within UDFs. If you remove this assumption that people need to convert to Pyspa

Re: Spark 4.0 vulnerable with hive-metastore-2.3.x.jar versions

2025-01-28 Thread Sean Owen
If you use vulnerable code in your application, sure, you might be exposed to its vulnerability. That's a problem for the application rather than Spark. Here I am asking if you know of a reason this CVE affects Spark usage, because you're asking about mitigating it. I'm first establishing whether

Re: Spark 4.0 vulnerable with hive-metastore-2.3.x.jar versions

2025-01-27 Thread Sean Owen
> > > > Thanks, > > Balaji > > > > > *From:* Mich Talebzadeh > *Sent:* 27 January 2025 20:41 > *To:* Sean Owen > *Cc:* Balaji Sudharsanam V ; > dev@spark.apache.org > *Subject:* [EXTERNAL] Re: Spark 4.0 vulnerable with > hive-metastore-2.3.x.jar

Re: Spark 4.0 vulnerable with hive-metastore-2.3.x.jar versions

2025-01-27 Thread Sean Owen
Crime | Forensic Analysis | GDPR > >view my Linkedin profile > <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> > > > > > > On Mon, 27 Jan 2025 at 13:37, Sean Owen wrote: > >> It looks like that affects Hive, and not the metastore. I do not see tha

Re: Spark 4.0 vulnerable with hive-metastore-2.3.x.jar versions

2025-01-27 Thread Sean Owen
It looks like that affects Hive, and not the metastore. I do not see that it is relevant to Spark at first glance. On Mon, Jan 27, 2025 at 1:21 AM Balaji Sudharsanam V wrote: > Hi All, > > There is a vulnerability with ‘High’ severity found in the *Apache Spark > 3.x and 4.0.0 preview (2) relea

Re: [ACTION REQUIRED] Removal of v3 artifact actions on December 5th

2024-11-25 Thread Sean Owen
FWIW I do not see any use of v3 in Spark's main branch https://github.com/search?q=repo%3Aapache%2Fspark+upload-artifact&type=code On Mon, Nov 25, 2024 at 12:36 PM Jacob Wujciak wrote: > Hello Everyone! > > I am writing to inform you of the imminent removal of the v3 artifact > actions that was

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-11-18 Thread Sean Owen
This should be taken offline - this is cross-posting to thousands of people. On Mon, Nov 18, 2024 at 12:37 PM Russell Jurney wrote: > I think we need a unit test that shows inconsistent results with the old > code and not with the new. I have one working, if I can just get the tests > to run wit

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-11-13 Thread Sean Owen
I do; I do not work on Spark there. I do not see why it would affect support as the code is still part of Spark, but, it is off-topic for this list. On Wed, Nov 13, 2024 at 9:19 AM Ángel wrote: > Btw, Sean, you work for Databricks ... deprecating GraphX would mean ... > Databricks won't give sup

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-11-12 Thread Sean Owen
the only reasons I've read for deprecating GraphX were > about unfixed bugs and its lack of maintenance—and that's exactly what > we're aiming to address in this 100+ message discussion and through the > hackathon that Russell has organized. > > On Wed, 13 Nov 2024

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-11-12 Thread Sean Owen
I think people are still reading "deprecated" as "removed". It 100% does not mean that. Wouldn't it be more likely that 'old' things are deprecated than new? What is light about this 100+ message discussion? I myself did not see any strong arguments against deprecation, which seemed to amount to "m

Re: [Question] Why driver doesn't shutdown executors gracefully on k8s?

2024-10-09 Thread Sean Owen
Mich: you can set any key-value pair you want in Spark config. It doesn't mean it is a real flag that code reads.

spark.conf.set("ham", "sandwich")
print(spark.conf.get("ham"))

prints "sandwich"

forceKillTimeout is a real config: https://github.com/apache/spark/blob/fed9a8da3d4187794161e0be325aa
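The point about arbitrary keys can be illustrated without a Spark session. This is a toy sketch, not Spark's implementation: `ToyConf`, `shutdown_timeout_seconds`, and the key `toy.shutdown.timeout` are all hypothetical names invented for the example. It shows that a key-value store accepts any key, while only the keys that code actually reads have any effect.

```python
class ToyConf:
    """Toy stand-in for a key-value config store (hypothetical, not Spark's)."""

    def __init__(self):
        self._values = {}

    def set(self, key, value):
        # Any key is accepted; nothing validates that code will ever read it.
        self._values[key] = value

    def get(self, key, default=None):
        return self._values.get(key, default)


def shutdown_timeout_seconds(conf):
    # The only key this function reads; every other key is ignored.
    return int(conf.get("toy.shutdown.timeout", "30"))


conf = ToyConf()
conf.set("ham", "sandwich")           # stored, but no code reads it
conf.set("toy.shutdown.timout", "5")  # misspelled key: stored, never read

print(conf.get("ham"))                 # sandwich
print(shutdown_timeout_seconds(conf))  # 30, the default
```

Setting a made-up or misspelled key succeeds silently, which is why a successful `set` and `get` round trip says nothing about whether the key is a real flag.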

Re: Dev list policy on posting genAI hallucinations

2024-10-09 Thread Sean Owen
Agree, I can't really explain this post except as AI hallucination, because:
- those configs don't exist and it's not a simple typo away from a real one
- they are kind of like unrelated real Spark config names and the kind of thing it seems an AI would 'infer'
- no claim it was a typo with plausi

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-10-04 Thread Sean Owen
Deprecation doesn't stop any of that though, if you want to encourage people to do something with GraphX. We can un-deprecate things. We don't have to remove deprecated things. But, why would we not encourage people to work on GraphFrames if interested in this domain? Nobody has been willing to c

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-10-04 Thread Sean Owen
e can > and should deprecate. > > On Fri, Oct 4, 2024 at 3:10 PM Sean Owen wrote: > > > > I could flip this argument around. More strongly, not being deprecated > means "won't be removed" and likewise implies support and development. I > don't think e

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-10-04 Thread Sean Owen
I could flip this argument around. More strongly, *not* being deprecated means "won't be removed" and likewise implies support and development. I don't think either of the latter have been true for years. What suggests this will change? A todo list is not going to do anything, IMHO. I'm also conce

Re: [VOTE] Officialy Deprecate GraphX in Spark 4

2024-09-30 Thread Sean Owen
For reasons in the previous thread, yes +1 to deprecation On Mon, Sep 30, 2024 at 1:02 PM Holden Karau wrote: > I think it has been de-facto deprecated, we haven’t updated it > meaningfully in several years. I think removing the API would be excessive > but deprecating it would give us the flexi

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-09-30 Thread Sean Owen
I support deprecating GraphX because:
- GraphFrames supersedes it, really
- No maintainers and no reason to believe there will be - we can take the last 5+ years as thorough evidence
- Low (but not trivial) docs hits compared to other modules: https://analytics.apache.org/index.php

Re: [VOTE] Deprecate SparkR

2024-08-21 Thread Sean Owen
+1 On Wed, Aug 21, 2024, 11:40 AM Shivaram Venkataraman < shivaram.venkatara...@gmail.com> wrote: > Hi all > > Based on the previous discussion thread [1], I hereby call a vote to > deprecate the SparkR module in Apache Spark with the upcoming Spark 4 > release and remove it in the next major rel

Re: [VOTE] Archive Spark Documentations in Apache Archives

2024-08-12 Thread Sean Owen
He did already; see the preceding thread here on dev@. You can figure the size that moves out of the repo from the docs sizes:

9.9M ./0.6.0
10M ./0.6.1
10M ./0.6.2
15M ./0.7.0
16M ./0.7.2
16M ./0.7.3
20M ./0.8.0
20M ./0.8.1
38M ./0.9.0
38M ./0.9.1
38M ./0.9.2
36M ./1.0.0
38M ./1.0.1

Re: [VOTE] Archive Spark Documentations in Apache Archives

2024-08-12 Thread Sean Owen
+1 with the following clarifications, for my benefit: Once we upload to release, and it's copied by archive, we delete from release right? I know we are meant to keep the files in release minimal as they're mirrored to all ASF mirrors. But if we're uploading some batches and deleting them after, t

Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Sean Owen
t, as with any advice, quote "one test result is worth one-thousand > expert opinions (Werner <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von > Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)". > > > On Thu, 8 Aug 2024 at 22:02, Sean Owen

Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Sean Owen
That seems a little bit too much to me. I could see people still on a recent version that just want to see docs or compare/contrast docs for changes. Removing the versions that seem to have ~0 traffic would remove, it seems, like 80% of the .html files (and replace them with a compressed archive

Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Sean Owen
joon staff 103M Aug 8 10:22 3.5.1.tgz > > Specifically, shall we keep HTML files for only the latest version of live > releases, e.g. 3.4.3, 3.5.1, and 4.0.0-preview1? > > In other words, all 0.x ~ 3.4.2 and 3.5.1 will be tarball files in the > current status. > > Dongjoon. >

Re: [DISCUSS] Using Github Issues for Spark-Connect-Go _only_ issues.

2024-08-08 Thread Sean Owen
the self-serve > instructions - https://infra.apache.org/request-bug-tracker.html > > Please keep the feedback coming. > > On Thu, Aug 8, 2024 at 2:43 PM Sean Owen wrote: > >> This is still part of the Apache Spark project, conceptually? >> IIRC Apache projects still need to

Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Sean Owen
I agree with 'archiving', but what does that mean? delete from the repo and site? While I really doubt people are looking for docs for, say, 0.5.0, it'd be a big jump to totally remove it. What if we made a compressed tarball of old docs and put that in the repo, linked to it, and removed the docs

Re: [DISCUSS] Using Github Issues for Spark-Connect-Go _only_ issues.

2024-08-08 Thread Sean Owen
This is still part of the Apache Spark project, conceptually? IIRC Apache projects still need to use JIRA, so we can't do this. On Thu, Aug 8, 2024 at 5:08 AM Mich Talebzadeh wrote: > Hi Martin, > > If I understood it correctly, your proposal suggests centralizing issue > tracking for the Spark

Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-29 Thread Sean Owen
Also from ASF community perspective - I think all are agreed this was merged too fast. But, I'm missing where this is somehow due to the needs of a single vendor. Where is this related to file systems or keys? did I miss it from another discussion or PR, or is this actually about a different issue

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-16 Thread Sean Owen
related to the order in which Maven executes the test cases in > the `connect` module. > > > > I have submitted a backport PR > <https://github.com/apache/spark/pull/45141> to branch-3.5, and if > necessary, we can merge it to fix this test issue. > > > > Jie Yan

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-15 Thread Sean Owen
Is anyone seeing this Spark Connect test failure? then again, I have some weird issue with this env that always fails 1 or 2 tests that nobody else can replicate. - Test observe *** FAILED *** == FAIL: Plans do not match === !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_

Re: Removing Kinesis in Spark 4

2024-01-20 Thread Sean Owen
I'm not aware of much usage. but that doesn't mean a lot. FWIW, in the past month or so, the Kinesis docs page got about 700 views, compared to about 1400 for Kafka https://analytics.apache.org/index.php?module=CoreHome&action=index&date=yesterday&period=day&idSite=40#?idSite=40&period=range&date=

Re: Regression? - UIUtils::formatBatchTime - [SPARK-46611][CORE] Remove ThreadLocal by replace SimpleDateFormat with DateTimeFormatter

2024-01-08 Thread Sean Owen
Agreed, that looks wrong. From the code, it seems that "timezone" is only used for testing, though apparently no test caught this. I'll submit a PR to patch it in any event: https://github.com/apache/spark/pull/44619 On Mon, Jan 8, 2024 at 1:33 AM Janda Martin wrote: > I think that > [SPARK-466

Re: Should Spark 4.x use Java modules (those you define with module-info.java sources)?

2023-12-04 Thread Sean Owen
It already does. I think that's not the same idea? On Mon, Dec 4, 2023, 8:12 PM Almog Tavor wrote: > I think Spark should start shading it’s problematic deps similar to how > it’s done in Flink > > On Mon, 4 Dec 2023 at 2:57 Sean Owen wrote: > >> I am not sure we ca

Re: Should Spark 4.x use Java modules (those you define with module-info.java sources)?

2023-12-03 Thread Sean Owen
I am not sure we can control that - the Scala _x.y suffix has particular meaning in the Scala ecosystem for artifacts and thus the naming of .jar files. And we need to work with the Scala ecosystem. What can't handle these files, Spring Boot? does it somehow assume the .jar file name relates to Ja

Re: Are DataFrame rows ordered without an explicit ordering clause?

2023-09-18 Thread Sean Owen
I think it's the same, and always has been - yes you don't have a guaranteed ordering unless an operation produces a specific ordering. Could be the result of order by, yes; I believe you would be guaranteed that reading input files results in data in the order they appear in the file, etc. 1:1 ope

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-31 Thread Sean Owen
I think you're talking past Hyukjin here. I think the response is: none of that is managed by Pyspark now, and this proposal does not change that. Your current interpreter and environment is used to execute the stored procedure, which is just Python code. It's on you to bring an environment that r

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-30 Thread Sean Owen
to verify? > > > > Thanks, > > Jie Yang > > > > *From**: *Dipayan Dev > *Date**: *Wednesday, August 30, 2023 17:01 > *To**: *Sean Owen > *Cc**: *Yuanjian Li , Spark dev list < > dev@spark.apache.org> > *Subject**: *Re: [VOTE] Release Apache Spark 3.5.0 (RC3) > >

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-29 Thread Sean Owen
It looks good except that I'm getting errors running the Spark Connect tests at the end (Java 17, Scala 2.13) It looks like I missed something necessary to build; is anyone getting this? [ERROR] [Error] /tmp/spark-3.5.0/connector/connect/server/target/generated-test-sources/protobuf/java/org/apach

Re: [VOTE] Release Apache Spark 3.5.0 (RC2)

2023-08-19 Thread Sean Owen
+1 this looks better to me. Works with Scala 2.13 / Java 17 for me. On Sat, Aug 19, 2023 at 3:23 AM Yuanjian Li wrote: > Please vote on releasing the following candidate(RC2) as Apache Spark > version 3.5.0. > > The vote is open until 11:59pm Pacific time Aug 23th and passes if a > majority +1 P

Re: Question about ARRAY_INSERT between Spark and Databricks

2023-08-13 Thread Sean Owen
There shouldn't be any difference here. In fact, I get the results you list for 'spark' from Databricks. It's possible the difference is a bug fix along the way that is in the Spark version you are using locally but not in the DBR you are using. But, yeah seems to work as you say. If you're askin

What else could be removed in Spark 4?

2023-08-07 Thread Sean Owen
While we're noodling on the topic, what else might be worth removing in Spark 4? For example, looks like we're finally hitting problems supporting Java 8 through 21 all at once, related to Scala 2.13.x updates. It would be reasonable to require Java 11, or even 17, as a baseline for the multi-year

Re: [VOTE] Release Apache Spark 3.5.0 (RC1)

2023-08-06 Thread Sean Owen
Sat, Aug 5, 2023 at 5:42 PM Sean Owen wrote: > I'm still testing other combinations, but it looks like tests fail on Java > 17 after building with Java 8, which should be a normal supported > configuration. > This is described at https://github.com/apache/spark/pull/41943 and lo

Re: [VOTE] Release Apache Spark 3.5.0 (RC1)

2023-08-05 Thread Sean Owen
I'm still testing other combinations, but it looks like tests fail on Java 17 after building with Java 8, which should be a normal supported configuration. This is described at https://github.com/apache/spark/pull/41943 and looks like it is resolved by moving back to Scala 2.13.8 for now. Unless I'

Re: [VOTE] SPIP: XML data source support

2023-07-28 Thread Sean Owen
+1 I think that porting the package 'as is' into Spark is probably worthwhile. That's relatively easy; the code is already pretty battle-tested and not that big and even originally came from Spark code, so is more or less similar already. One thing it never got was DSv2 support, which means XML re

Re: Spark 3.0.0 EOL

2023-07-26 Thread Sean Owen
There aren't "LTS" releases, though you might expect the last 3.x release will see maintenance releases longer. See end of https://spark.apache.org/versioning-policy.html On Wed, Jul 26, 2023 at 3:56 AM Manu Zhang wrote: > Will Apache Spark 3.5 be a LTS version? > > Thanks, > Manu > > On Mon, Ju

Re: [VOTE] Apache Spark PMC asks Databricks to differentiate its Spark version string

2023-06-16 Thread Sean Owen
On Fri, Jun 16, 2023 at 3:58 PM Dongjoon Hyun wrote: > I started the thread about already publicly visible version issues > according to the ASF PMC communication guideline. It's no confidential, > personal, or security-related stuff. Are you insisting this is confidential? > Discussion about a

Re: [VOTE] Apache Spark PMC asks Databricks to differentiate its Spark version string

2023-06-16 Thread Sean Owen
As we noted in the last thread, this discussion should have been on private@ to begin with, but, the ship has sailed. You are suggesting that non-PMC members vote on whether the PMC has to do something? No, that's not how anything works here. It's certainly the PMC that decides what to put in the

Re: [VOTE] Apache Spark PMC asks Databricks to differentiate its Spark version string

2023-06-16 Thread Sean Owen
What does a vote on dev@ mean? did you mean this for the PMC list? Dongjoon - this offers no rationale about "why". The more relevant thread begins here: https://lists.apache.org/thread/k7gr65wt0fwtldc7hp7bd0vkg1k93rrb but it likewise never got to connecting a specific observation to policy. Could

Re: JDK version support policy?

2023-06-08 Thread Sean Owen
ava 11 should be dropped in Spark 4, just > thought I'd bring this issue to your attention. > > Best Regards, Martin > -- > *From:* Jungtaek Lim > *Sent:* Wednesday, June 7, 2023 23:19 > *To:* Sean Owen > *Cc:* Dongjoon Hyun ; Holden Karau

Re: JDK version support policy?

2023-06-07 Thread Sean Owen
> >> On 2023/06/07 02:42:19 yangjie01 wrote: >> > +1 on dropping Java 8 in Spark 4.0, and I even hope Spark 4.0 can only >> support Java 17 and the upcoming Java 21. >> > >> > From: Denny Lee >> > Date: Wednesday, June 7, 2023 07:10 >> > To: Sean

Re: ASF policy violation and Scala version issues

2023-06-07 Thread Sean Owen
Hi Dongjoon, I think this conversation is not advancing anymore. I personally consider the matter closed unless you can find other support or respond with more specifics. While this perhaps should be on private@, I think it's not wrong as an instructive discussion on dev@. I don't believe you've m

Re: ASF policy violation and Scala version issues

2023-06-07 Thread Sean Owen
(With consent, shall we move this to the PMC list?) No, I don't think that's what this policy says. First, could you please be more specific here? why do you think a certain release is at odds with this? Because so far you've mentioned, I think, not taking a Scala maintenance release update. But

Re: JDK version support policy?

2023-06-06 Thread Sean Owen
I haven't followed this discussion closely, but I think we could/should drop Java 8 in Spark 4.0, which is up next after 3.5? On Tue, Jun 6, 2023 at 2:44 PM David Li wrote: > Hello Spark developers, > > I'm from the Apache Arrow project. We've discussed Java version support > [1], and crucially,

Re: ASF policy violation and Scala version issues

2023-06-05 Thread Sean Owen
I think the issue is whether a distribution of Spark is so materially different from OSS that it causes problems for the larger community of users. There's a legitimate question of whether such a thing can be called "Apache Spark + changes", as describing it that way becomes meaningfully inaccurate

Re: ASF policy violation and Scala version issues

2023-06-05 Thread Sean Owen
On Mon, Jun 5, 2023 at 12:01 PM Dongjoon Hyun wrote: > 1. For the naming, yes, but the company should use different version > numbers instead of the exact "3.4.0". As I shared the screenshot in my > previous email, the company exposes "Apache Spark 3.4.0" exactly because > they build their distri

Re: ASF policy violation and Scala version issues

2023-06-05 Thread Sean Owen
1/ Regarding naming - I believe releasing "Apache Foo X.Y + patches" is acceptable, if it is substantially Apache Foo X.Y. This is common practice for downstream vendors. It's fair nominative use. The principle here is consumer confusion. Is anyone substantially misled? Here I don't think so. I kno

Re: Apache Spark 3.5.0 Expectations (?)

2023-05-29 Thread Sean Owen
It does seem risky; there are still likely libs out there that don't cross compile for 2.13. I would make it the default at 4.0, myself. On Mon, May 29, 2023 at 7:16 PM Hyukjin Kwon wrote: > While I support going forward with a higher version, actually using Scala > 2.13 by default is a big deal

Re: Spark 3.4.0 with Hadoop2.7 cannot be downloaded

2023-04-20 Thread Sean Owen
We just removed it now, yes. On Thu, Apr 20, 2023 at 9:08 AM Emil Ejbyfeldt wrote: > Hi, > > I think this is expected as it was dropped from the release process in > https://issues.apache.org/jira/browse/SPARK-40651 > > Also I don't see a Hadoop2.7 option when selecting Spark 3.4.0 on > https://

Re: [VOTE] Release Apache Spark 3.2.4 (RC1)

2023-04-10 Thread Sean Owen
+1 from me On Sun, Apr 9, 2023 at 7:19 PM Dongjoon Hyun wrote: > I'll start with my +1. > > I verified the checksum, signatures of the artifacts, and documentations. > Also, ran the tests with YARN and K8s modules. > > Dongjoon. > > On 2023/04/09 23:46:10 Dongjoon Hyun wrote: > > Please vote on

Re: [VOTE] Release Apache Spark 3.4.0 (RC7)

2023-04-08 Thread Sean Owen
+1 from me, same result as last time. On Fri, Apr 7, 2023 at 6:30 PM Xinrong Meng wrote: > Please vote on releasing the following candidate(RC7) as Apache Spark > version 3.4.0. > > The vote is open until 11:59pm Pacific time *April 12th* and passes if a > majority +1 PMC votes are cast, with a

Re: [VOTE] Release Apache Spark 3.4.0 (RC5)

2023-03-30 Thread Sean Owen
+1 same result from me as last time. On Thu, Mar 30, 2023 at 3:21 AM Xinrong Meng wrote: > Please vote on releasing the following candidate(RC5) as Apache Spark > version 3.4.0. > > The vote is open until 11:59pm Pacific time *April 4th* and passes if a > majority +1 PMC votes are cast, with a m

Re: [VOTE] Release Apache Spark 3.4.0 (RC3)

2023-03-09 Thread Sean Owen
We cannot in the AS-IS commit log status because it's screwed already > as Emil wrote. > Did you check the branch-3.2 commit log, Sean? > > Dongjoon. > > > On Thu, Mar 9, 2023 at 11:42 AM Sean Owen wrote: > >> We can just push the tags onto the branches as needed r

Re: [VOTE] Release Apache Spark 3.4.0 (RC3)

2023-03-09 Thread Sean Owen
We can just push the tags onto the branches as needed right? No need to roll a new release On Thu, Mar 9, 2023, 1:36 PM Dongjoon Hyun wrote: > Yes, I also confirmed that the v3.4.0-rc3 tag is invalid. > > I guess we need RC4. > > Dongjoon. > > On Thu, Mar 9, 2023 at 7:13 AM Emil Ejbyfeldt > wro

Re: [VOTE] Release Apache Spark 3.4.0 (RC2)

2023-03-03 Thread Sean Owen
path get set up differently when running via > SBT vs. Maven? > > On Thu, Mar 2, 2023 at 5:37 PM Sean Owen wrote: > >> Thanks, that's good to know. The workaround (deleting the thriftserver >> target dir) works for me. Who knows? >> >> But I

Re: [VOTE] Release Apache Spark 3.4.0 (RC2)

2023-03-02 Thread Sean Owen
//github.com/sbt/sbt/issues/6183>. > > One thing that I did find to help was to > delete sql/hive-thriftserver/target between building Spark and running the > tests. This helps in my builds where the issue only occurs during the > testing phase and not during the initial build

Re: [VOTE] Release Apache Spark 3.4.0 (RC2)

2023-03-02 Thread Sean Owen
Has anyone seen this behavior -- I've never seen it before. The Hive thriftserver module for me just goes into an infinite loop when running tests: ... [INFO] done compiling [INFO] compiling 22 Scala sources and 24 Java sources to /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/target/scala-2.

Re: [Question] LimitedInputStream license issue in Spark source.

2023-03-01 Thread Sean Owen
Right, it contains ALv2 licensed code attributed to two authors - some is from Guava, some is from Apache Spark contributors. I thought this is how we should handle this. It's not feasible to go line by line and say what came from where. On Wed, Mar 1, 2023 at 1:33 AM Dongjoon Hyun wrote: > May

Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-22 Thread Sean Owen
FWIW I agree with this. On Wed, Feb 22, 2023 at 2:59 PM Allan Folting wrote: > Hi all, > > I would like to propose that we show Python code examples first in the > Spark documentation where we have multiple programming language examples. > An example is on the Quick Start page: > https://spark.a

Re: [DISCUSS] Make release cadence predictable

2023-02-15 Thread Sean Owen
can persuade > the incomplete features to wait for next releases more easily. > > In addition, I want to add the first RC1 date requirement because RC1 > always did a great job for us. > > I guess `branch-cut + 1M (no later than 1month)` could be the reasonable > deadline. >

Re: [DISCUSS] Make release cadence predictable

2023-02-14 Thread Sean Owen
I'm fine with shifting to a stricter cadence-based schedule. Sometimes, it'll mean some significant change misses a release rather than delays it. If people are OK with that discipline, sure. A hard 6-month cycle would mean the minor releases are more frequent and have less change in them. That's p

Re: [VOTE] Release Spark 3.3.2 (RC1)

2023-02-13 Thread Sean Owen
Agree, just, if it's such a tiny change, and it actually fixes the issue, maybe worth getting that into 3.3.x. I don't feel strongly. On Mon, Feb 13, 2023 at 11:19 AM L. C. Hsieh wrote: > If it is not supported in Spark 3.3.x, it looks like an improvement at > Spark 3.4. > For such cases we usua

Re: [VOTE] Release Spark 3.3.2 (RC1)

2023-02-13 Thread Sean Owen
use for testing? When I use the latest >>> Python 3.11, I can reproduce similar test failures (43 tests of sql module >>> fail), but when I use python 3.10, they will succeed >>> >>> >>> >>> YangJie >>> >>> >>> >>

Re: [VOTE] Release Spark 3.3.2 (RC1)

2023-02-11 Thread Sean Owen
+1 The tests and all results were the same as ever for me (Java 11, Scala 2.13, Ubuntu 22.04) I also didn't see that issue ... maybe somehow locale related? which could still be a bug. On Sat, Feb 11, 2023 at 8:49 PM L. C. Hsieh wrote: > Thank you for testing it. > > I was going to run it again

Re: Building Spark to run PySpark Tests?

2023-01-19 Thread Sean Owen
0.17+0) > > OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode) > > > > OS > > Ventura 13.1 (22C65) > > > Best, > > > Adam Chhina > > On Jan 18, 2023, at 6:50 PM, Sean Owen wrote: > > Release _branches_ are tested as commits arrive to th

Re: Can you create an apache jira account for me? Thanks very much!

2023-01-19 Thread Sean Owen
I can help offline. Send me your preferred JIRA user name. On Thu, Jan 19, 2023 at 7:12 AM Wei Yan wrote: > When I tried to sign up through this site: > https://issues.apache.org/jira/secure/Signup!default.jspa > I got an error message:"Sorry, you can't sign up to this Jira site at the > moment

Re: Building Spark to run PySpark Tests?

2023-01-18 Thread Sean Owen
onnect_to_java_server > self.socket.connect((self.java_address, self.java_port)) > ConnectionRefusedError: [Errno 61] Connection refused > > -- > Ran 7 tests in 12.950s > > FAILED (errors=7) > sys:1: ResourceWarning:

Re: Building Spark to run PySpark Tests?

2023-01-18 Thread Sean Owen
out -b spark-321 v3.2.1 > > with > git clone --branch branch-3.2 https://github.com/apache/spark.git > This will give you branch 3.2 as today, what I suppose you call upstream > > https://github.com/apache/spark/commits/branch-3.2 > and right now all tests in github action are p

Re: Building Spark to run PySpark Tests?

2023-01-18 Thread Sean Owen
Never seen those, but it's probably a difference in pandas, numpy versions. You can see the current CICD test results in GitHub Actions. But, you want to use release versions, not an RC. 3.2.1 is not the latest version, and it's possible the tests were actually failing in the RC. On Wed, Jan 18, 2

Re: [VOTE] Release Spark 3.2.3 (RC1)

2022-11-15 Thread Sean Owen
+1 from me, at least from my testing. Java 8 + Scala 2.12 and Java 8 + Scala 2.13 worked for me, and I didn't see a test hang. I am testing with Python 3.10 FWIW. On Tue, Nov 15, 2022 at 6:37 AM Yang,Jie(INF) wrote: > Hi, all > > > > I test v3.2.3 with following command: > > > > ``` > > dev/chan

Re: CVE-2022-42889

2022-10-27 Thread Sean Owen
n official statement about this from > Spark? > > We weren’t able to find references to 2022-42889 here: > https://spark.apache.org/security.html (likely because Spark determined > it is not affected?) > > > > *From:* Sean Owen > *Sent:* Thursday, October 27

Re: CVE-2022-42889

2022-10-27 Thread Sean Owen
Probably a few months between maintenance releases. It does not appear to affect Spark, however. On Thu, Oct 27, 2022 at 9:24 AM Pastrana, Rodrigo (RIS-BCT) wrote: > Hello, > > This issue (SPARK-40801) > which addresses > CVE-2022-42889 doesn’t

Re: Apache Spark 3.2.3 Release?

2022-10-18 Thread Sean Owen
OK by me, if someone is willing to drive it. On Tue, Oct 18, 2022 at 11:47 AM Chao Sun wrote: > Hi All, > > It's been more than 3 months since 3.2.2 (tagged at Jul 11) was > released. There are now 66 patches accumulated in branch-3.2, including > 2 correctness issues. > > Is it a good time to st

Re: [VOTE] Release Spark 3.3.1 (RC4)

2022-10-17 Thread Sean Owen
+1 from me, same as last time. On Sun, Oct 16, 2022 at 9:14 PM Yuming Wang wrote: > Please vote on releasing the following candidate as Apache Spark version > 3.3.1. > > The vote is open until 11:59pm Pacific time October 21st and passes if a > majority +1 PMC votes are cast, with a minimum of

Re: [VOTE] Release Spark 3.3.1 (RC2)

2022-10-11 Thread Sean Owen
l.jar >>> 4. org.apache.hive#hive-exec;2.3.7!hive-exec.jar >>> >>> I worked around it by adding them locally explicitly - we should >>> probably add them as test dependency ? >>> Not sure if this changed in this release though (I had cleaned my local

Re: Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-05 Thread Sean Owen
I'm OK with this. It simplifies maintenance a bit, and specifically may allow us to finally move off of the ancient version of Guava (?) On Mon, Oct 3, 2022 at 10:16 PM Dongjoon Hyun wrote: > Hi, All. > > I'm wondering if the following Apache Spark Hadoop2 Binary Distribution > is still used by

Re: [VOTE] Release Spark 3.3.1 (RC2)

2022-09-28 Thread Sean Owen
+1 from me, same result as last RC. On Wed, Sep 28, 2022 at 12:21 AM Yuming Wang wrote: > Please vote on releasing the following candidate as Apache Spark version > 3.3.1. > > The vote is open until 11:59pm Pacific time October 3rd and passes if a > majority +1 PMC votes are cast, with a minim

Re: Why are hash functions seeded with 42?

2022-09-26 Thread Sean Owen
hunch that perhaps it’s a > nod to Douglas Adams (author of The Hitchhiker’s Guide to the Galaxy). > > > https://news.mit.edu/2019/answer-life-universe-and-everything-sum-three-cubes-mathematics-0910 > > On Sep 26, 2022, at 16:59, Sean Owen wrote: > >  > OK, it came

Why are hash functions seeded with 42?

2022-09-26 Thread Sean Owen
OK, it came to my attention today that hash functions in Spark, like xxhash64, actually always seed with 42: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L655 This is an issue if you want the hash of some value in Spar
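
To illustrate why a fixed seed matters when comparing hashes across systems, here is a minimal sketch in Python. It assumes nothing about Spark's own code: it uses the standard 32-bit MurmurHash3 purely for brevity (not xxhash64, and not Spark's implementation), and the function name `murmur3_32` is hypothetical. Its outputs are not expected to match anything Spark produces. The only point is that the same input hashed with seed 0 and with seed 42 gives different values, so a hash computed elsewhere with a different seed will not line up.

```python
def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Standard MurmurHash3 (x86, 32-bit) of `data` with the given seed.

    Illustrative only: this is NOT Spark's hash code.
    """
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF

    def mix_k(k: int) -> int:
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        return (k * c2) & 0xFFFFFFFF

    # Body: process the input four bytes at a time (little-endian).
    n_blocks = len(data) // 4
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        h ^= mix_k(k)
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF

    # Tail: the remaining 1-3 bytes, if any.
    tail = data[4 * n_blocks:]
    if tail:
        h ^= mix_k(int.from_bytes(tail, "little"))

    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


if __name__ == "__main__":
    value = b"spark"
    print(hex(murmur3_32(value, seed=0)))   # hash with the common default seed
    print(hex(murmur3_32(value, seed=42)))  # hash with a fixed seed of 42
```

For a fixed input every step above is invertible in `h`, so two different seeds always yield two different results; matching hashes across systems therefore requires agreeing on the seed as well as the algorithm.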
