Re: Security model update

2025-04-07 Thread Nicholas Chammas
I’m not a Spark security expert, and adding some extra prose may indeed be helpful. But I will note that that person’s reply to the ASF Security Team’s initial comment smells like LLM output. Perhaps I am being unfair to them, but I have read reports

Re: [DISCUSS] SPIP: Declarative Pipelines

2025-04-06 Thread Nicholas Chammas
There are many projects in the Spark ecosystem — like Deequ and Great Expectations — that are focused on expressing and enforcing data quality checks. In the more complex cases, these checks do not fit the scope of the checks that a typical data source may support (i.e. PK, FK, CHECK), so these

Re: Revert of [SPARK-51229][BUILD][CONNECT] Fix dependency:analyze goal on connect common

2025-03-26 Thread Nicholas Chammas
On Thu, 27 Mar 2025 at 00:13, Rozov, Vlad wrote: > Every graduated from incubating Apache project has guards against what you > name “chaotic” and what other name breaking best development practices. Such > guards include JIRA, unit tests and PR review. Instead of reverting commit, I > would ex

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-10 Thread Nicholas Chammas
> On Mar 10, 2025, at 10:14 PM, Andrew Melo wrote: >> >> This config was released to "Apache" Spark 3.5.4, so this is NO LONGER just >> a problem with vendor distribution. The breakage will happen even if someone >> does not even know about Databricks Runtime at all and keeps using Apache >>

Re: [VOTE] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

2025-03-10 Thread Nicholas Chammas
I agree with Sean that this proposal does not seem to me as controversial as it has turned out to be so far. Jungtaek’s detailed breakdown on the other thread explains that this proposed change is mainly to benefit open source users o

Re: Docs look weird; can't build locally

2025-02-06 Thread Nicholas Chammas
> On Feb 5, 2025, at 1:59 PM, Nicholas Chammas > wrote: > > I’m trying to build the docs locally to reproduce this (and also to make a > change to some Python API docs), but I’m getting consistent Java heap memory > errors when I get to the Python part of the build. >

Re: [DISCUSS] Spark - How to improve our release processes

2025-02-06 Thread Nicholas Chammas
bricks' Photon and Velox lib does- > just directly within Java and not using C++, Ahead-of-Time Class Loading & > Linking <https://openjdk.org/jeps/483> - for faster startup times, Value > Objects <https://openjdk.org/jeps/8277163>, FFM > <http

Re: [DISCUSS] Spark - How to improve our release processes

2025-02-04 Thread Nicholas Chammas
> >> Thanks, >> Nimrod >> >> >> On Mon, May 13, 2024 at 2:31 PM Wenchen Fan > <mailto:cloud0...@gmail.com>> wrote: >>> Hi Nicholas, >>> >>> Thanks for your help! I'm definitely interested in participating in this >

Re: A documentation change is a user-facing change

2025-01-16 Thread Nicholas Chammas
since the template was short to be concise, it could be interpreted > in more ways than we thought. > > Dongjoon. > > On Thu, Jan 16, 2025 at 11:16 AM Nicholas Chammas <mailto:nicholas.cham...@gmail.com>> wrote: >> I didn’t write the pull request template and I a

Re: A documentation change is a user-facing change

2025-01-16 Thread Nicholas Chammas
line link addition) > [SPARK-48426][DOCS][FOLLOWUP] Add `Operators` page to `sql-ref.md` > > Given your definition, even a word typo fix inside an HTML page becomes a > user-facing change. > Did I understand your definition correctly? Or, is it something else? > > Dong

A documentation change is a user-facing change

2025-01-16 Thread Nicholas Chammas
This is not a big deal at all, but I figure it’s worth bringing up briefly because the pull request template does emphasize : > ### Does this PR introduce _any_ user-facing change

Re: Dev list policy on posting genAI hallucinations

2024-10-09 Thread Nicholas Chammas
Thanks to the committers and PMC members who chimed in on this thread. > On Oct 10, 2024, at 3:27 AM, Jungtaek Lim > wrote: > > I'd ask people to quote the part you got from GPT and explicitly call out the > part as "GPT-generated" so that people would consider that there is > hallucination.

Dev list policy on posting genAI hallucinations

2024-10-09 Thread Nicholas Chammas
Von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)". > > > On Wed, 9 Oct 2024 at 16:43, Nicholas Chammas <mailto:nicholas.cham...@gmail.com>> wrote: >> Mich, >> >> Can you please share

Re: [Question] Why driver doesn't shutdown executors gracefully on k8s?

2024-10-09 Thread Nicholas Chammas
Mich, Can you please share with the list where exactly you are citing these configs from? As far as I can tell, these two configs don’t exist and have never existed in the Spark codebase: spark.executor.decommission.enabled=true spark.executor.decommission.gracefulShutdown=true Where exactly

Re: [DISCUSS] Deprecating SparkR

2024-08-12 Thread Nicholas Chammas
that impact those numbers. So it’s by no means a solid, objective measure. But I thought it was an interesting signal nonetheless. > On Aug 12, 2024, at 5:50 PM, Nicholas Chammas > wrote: > > Not an R user myself, but +1. > > I first wondered about the future

Re: [DISCUSS] Deprecating SparkR

2024-08-12 Thread Nicholas Chammas
Not an R user myself, but +1. I first wondered about the future of SparkR after noticing how low the visit stats were for the R API docs as compared to Python and Scala. (I can’t seem to find those visit stats

Re: [DISCUSS] Communicating over Slack instead of e-mails

2024-08-12 Thread Nicholas Chammas
If you’re proposing that Slack replace this dev list, then chat is not an appropriate substitute for emails. Conversations fly by on a conveyor belt, context is easily fractured across multiple threads and short messages, information is poorly indexed by search engines, and the lower bar of entr

Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Nicholas Chammas
How big of a change would it be to have the repo only contain the Markdown source and not the rendered HTML (which should perhaps be moved to an object store)? > On Aug 8, 2024, at 8:06 AM, Kent Yao wrote: > > Hi dev, > > The current size of the spark-website repository is approximately 16GB

Re: [DISCUSS] Versionless Spark Programming Guide Proposal

2024-06-10 Thread Nicholas Chammas
I will let Neil and Matt clarify the details because I believe they understand the overall picture better. However, I would like to emphasize something that motivated this effort and which may be getting lost in the concerns about versioned vs. versionless docs. The main problem is that some of

Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread Nicholas Chammas
[dev list to bcc] This is a question for the user list or for Stack Overflow. The dev list is for discussions related to the development of Spark itself. Nick > On May 21, 2024, at 6:58 AM, Pr

Re: [DISCUSS] Spark - How to improve our release processes

2024-05-12 Thread Nicholas Chammas
Re: unification We also have a long-standing problem with how we manage Python dependencies, something I’ve tried (unsuccessfully) to fix in the past. Consider, for example, how many separate places this numpy dependency is installed: 1. https://g

Re: [DISCUSS] SPARK-44444: Use ANSI SQL mode by default

2024-04-12 Thread Nicholas Chammas
This is a side issue, but I’d like to bring people’s attention to SPARK-28024. Cases 2, 3, and 4 described in that ticket are still problems today on master (I just rechecked) even with ANSI mode enabled. Well, maybe not problems, but I’m flagging this since Spark’s behavior differs in these c

Re: Generating config docs automatically

2024-02-22 Thread Nicholas Chammas
dea); but that’s just my > opinion. I'd be happy to help with reviews though. > > On Wed, Feb 21, 2024 at 6:37 AM Nicholas Chammas <mailto:nicholas.cham...@gmail.com>> wrote: >> I know config documentation is not the most exciting thing. If there is >> anything I

Re: Generating config docs automatically

2024-02-21 Thread Nicholas Chammas
I know config documentation is not the most exciting thing. If there is anything I can do to make this as easy as possible for a committer to shepherd, I’m all ears! > On Feb 14, 2024, at 8:53 PM, Nicholas Chammas > wrote: > > I’m interested in automating our config documentat

Generating config docs automatically

2024-02-14 Thread Nicholas Chammas
I’m interested in automating our config documentation and need input from a committer who is interested in shepherding this work. We have around 60 tables of configs across our documentation. Here’s a typical example.

Re: How do you debug a code-generated aggregate?

2024-02-12 Thread Nicholas Chammas
') on the DataFrame. That helped me to find similar issues in > most cases. > > HTH > > On Sun, Feb 11, 2024 at 11:26 PM Nicholas Chammas <mailto:nicholas.cham...@gmail.com>> wrote: >> Consider this example: >> >>> from pyspark.sql.functions import

How do you debug a code-generated aggregate?

2024-02-11 Thread Nicholas Chammas
Consider this example:

>>> from pyspark.sql.functions import sum
>>> spark.range(4).repartition(2).select(sum("id")).show()
+-------+
|sum(id)|
+-------+
|      6|
+-------+

I’m trying to understand how this works because I’m investigating a bug in this kind of aggregate. I see that doProduceWith

Re: Removing Kinesis in Spark 4

2024-01-20 Thread Nicholas Chammas
tens of thousands of views. > > I do feel like it's unmaintained, and do feel like it might be a stretch to > leave it lying around until Spark 5. > It's not exactly unused though. > > I would not object to removing it unless there is some voice of support here. >

Removing Kinesis in Spark 4

2024-01-20 Thread Nicholas Chammas
From the dev thread: What else could be removed in Spark 4? > On Aug 17, 2023, at 1:44 AM, Yang Jie wrote: > > I would like to know how we should handle the two Kinesis-related modules in > Spark 4.0. They have a very low freque

Install Ruby 3 to build the docs

2024-01-10 Thread Nicholas Chammas
Just a quick heads up that, while Ruby 2.7 will continue to work, you should plan to install Ruby 3 in the near future in order to build the docs. (I recommend using rbenv to manage multiple Ruby versions.) Ruby 2 reached EOL in March 2023

Re: Validate spark sql

2023-12-24 Thread Nicholas Chammas
This is a user-list question, not a dev-list question. Moving this conversation to the user list and BCC-ing the dev list. Also, this statement > We are not validating against table or column existence. is not correct. When you call spark.sql(…), Spark will look up the table references and fail
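The analysis-time behavior described above — a query against a missing table fails as soon as it is analyzed — can be sketched with stdlib sqlite3 standing in for Spark. This is an illustrative analogue only: `validate_sql` is a hypothetical helper, not a Spark API, and in PySpark the corresponding failure is an `AnalysisException` raised by `spark.sql(...)`.

```python
# Sketch (assumed names, not from the thread): validate a SQL statement by
# running it and catching the engine's analysis error.
import sqlite3

def validate_sql(conn, query):
    """Return None if the query is valid against the catalog, else the error text."""
    try:
        conn.execute(query)
        return None
    except sqlite3.Error as e:
        return str(e)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")

ok = validate_sql(conn, "SELECT id FROM t")         # table exists -> None
bad = validate_sql(conn, "SELECT id FROM missing")  # unknown table -> error text
```

The same pattern — execute (or analyze) and catch — is how table and column existence gets checked in practice, since the catalog lookup happens during analysis rather than parsing.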

Guidance for filling out "Affects Version" on Jira

2023-12-17 Thread Nicholas Chammas
The Contributing guide only mentions what to fill in for “Affects Version” for bugs. How about for improvements? This question once caused some problems when I set “Affects Version” to the last released version, and that was interpreted as a request

Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Nicholas Chammas
> relying on this email's technical content is explicitly disclaimed. The > author will in no case be liable for any monetary damages arising from such > loss, damage or destruction. > > > > On Mon, 11 Dec 2023 at 17:11, Nicholas Chammas <mailto:nicholas.cham...@gma

Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Nicholas Chammas
> On Dec 11, 2023, at 6:40 AM, Mich Talebzadeh > wrote: > spark.sql.cbo.strategy: Set to AUTO to use the CBO as the default optimizer, > or NONE to disable it completely. > Hmm, I’ve also never heard of this setting before and can’t seem to find it in the Spark docs or source code.

Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Nicholas Chammas
> On Dec 11, 2023, at 6:40 AM, Mich Talebzadeh > wrote: > > By default, the CBO is enabled in Spark. Note that this is not correct. AQE is enabled

Re: When and how does Spark use metastore statistics?

2023-12-10 Thread Nicholas Chammas
in(mode="cost")) what the cost-based optimizer does and how to enable it Would this be a welcome addition to the project’s documentation? I’m happy to work on this. > On Dec 5, 2023, at 12:12 PM, Nicholas Chammas > wrote: > > I’m interested in improving some of t

Re: Algolia search on website is broken

2023-12-10 Thread Nicholas Chammas
onsole. > On Dec 5, 2023, at 11:28 AM, Nicholas Chammas > wrote: > > Should I report this instead on Jira? Apologies if the dev list is not the > right place. > > Search on the website appears to be broken. For example, here is a search for > “analyze”: >  > >

Re: SSH Tunneling issue with Apache Spark

2023-12-06 Thread Nicholas Chammas
This is not a question for the dev list. Moving dev to bcc. One thing I would try is to connect to this database using JDBC + SSH tunnel, but without Spark. That way you can focus on getting the JDBC connection to work without Spark complicating the picture for you. > On Dec 5, 2023, at 8:12 P

When and how does Spark use metastore statistics?

2023-12-05 Thread Nicholas Chammas
I’m interested in improving some of the documentation relating to the table and column statistics that get stored in the metastore, and how Spark uses them. But I’m not clear on a few things, so I’m writing to you with some questions. 1. The documentation for spark.sql.autoBroadcastJoinThreshold

Algolia search on website is broken

2023-12-05 Thread Nicholas Chammas
Should I report this instead on Jira? Apologies if the dev list is not the right place. Search on the website appears to be broken. For example, here is a search for “analyze”:  And here is the same search using DDG

Are DataFrame rows ordered without an explicit ordering clause?

2023-09-18 Thread Nicholas Chammas
I’ve always considered DataFrames to be logically equivalent to SQL tables or queries. In SQL, the result order of any query is implementation-dependent without an explicit ORDER BY clause. Technically, you could run `SELECT * FROM table;` 10 times in a row and get 10 different orderings. I th
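The point above can be illustrated with any SQL engine; here is a minimal sketch using stdlib sqlite3 (the principle — no ORDER BY means implementation-dependent order, explicit ORDER BY means guaranteed order — applies equally to Spark DataFrames, though this code does not use Spark):

```python
# Without ORDER BY the row order is whatever the engine happens to produce;
# with ORDER BY it is guaranteed.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("CREATE TABLE t (id INTEGER); INSERT INTO t VALUES (3),(1),(2);")

unordered = [r[0] for r in conn.execute("SELECT id FROM t")]            # some order
ordered = [r[0] for r in conn.execute("SELECT id FROM t ORDER BY id")]  # ascending
```

Relying on `unordered` having any particular sequence would be the same mistake as assuming a DataFrame has a stable row order without an explicit sort.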

Allowing all Reader or Writer settings to be provided as options

2022-08-09 Thread Nicholas Chammas
Hello people, I want to bring some attention to SPARK-39630 and ask if there are any design objections to the idea proposed there. The gist of the proposal is that there are some reader or writer directives that cannot be supplied as options,

Re: Deluge of GitBox emails

2022-04-04 Thread Nicholas Chammas
the > normal Github emails - that is if we turn them off do we have anything? > > On Mon, Apr 4, 2022 at 8:44 AM Nicholas Chammas <mailto:nicholas.cham...@gmail.com>> wrote: > I assume I’m not the only one getting these new emails from GitBox. Is there > a story behind

Deluge of GitBox emails

2022-04-04 Thread Nicholas Chammas
I assume I’m not the only one getting these new emails from GitBox. Is there a story behind that that I missed? I’d rather not get these emails on the dev list. I assume most of the list would agree with me. GitHub has a good set of options for following activity on the repo. People who want t

Re: [DISCUSS] Rename 'SQL' to 'SQL / DataFrame', and 'Query' to 'Execution' in SQL UI page

2022-03-28 Thread Nicholas Chammas
+1 Understanding the close relationship between SQL and DataFrames in Spark was a key learning moment for me, but I agree that using the terms interchangeably can be confusing. > On Mar 27, 2022, at 9:27 PM, Hyukjin Kwon wrote: > > *for some reason, the image looks broken (to me). I am attac

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-15 Thread Nicholas Chammas
which are another way of computing aggregations through > composition of other Expressions. > > Simeon > > > > > > On Thu, Dec 9, 2021 at 9:26 PM Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> I'm trying to create a new aggregate function.

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-13 Thread Nicholas Chammas
's relatively cheap. > > > > On Mon, Dec 13, 2021 at 6:43 PM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> No takers here? :) >> >> I can see now why a median function is not available in most data >> processing systems.

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-13 Thread Nicholas Chammas
No takers here? :) I can see now why a median function is not available in most data processing systems. It's pretty annoying to implement! On Thu, Dec 9, 2021 at 9:25 PM Nicholas Chammas wrote: > I'm trying to create a new aggregate function. It's my first time working

Creating a memory-efficient AggregateFunction to calculate Median

2021-12-09 Thread Nicholas Chammas
I'm trying to create a new aggregate function. It's my first time working with Catalyst, so it's exciting---but I'm also in a bit over my head. My goal is to create a function to calculate the median. As a very simple solution, I could just defi
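For context on why an exact median is memory-hungry as an aggregate: even the classic two-heap streaming algorithm must retain every value it has seen, unlike sum or count, which need constant state. A plain-Python sketch (illustrative only, not the Catalyst implementation discussed in this thread):

```python
# Two-heap exact streaming median: lo is a max-heap (stored negated) holding
# the smaller half, hi is a min-heap holding the larger half. Note the state
# grows with every value added -- the core memory problem for an aggregate.
import heapq

class StreamingMedian:
    def __init__(self):
        self.lo = []  # max-heap via negation: smaller half
        self.hi = []  # min-heap: larger half

    def add(self, x):
        heapq.heappush(self.lo, -x)
        # Move the largest of the small half over, then rebalance.
        heapq.heappush(self.hi, -heapq.heappop(self.lo))
        if len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):
            return -self.lo[0]
        return (-self.lo[0] + self.hi[0]) / 2

m = StreamingMedian()
for v in [5, 1, 3, 2, 4]:
    m.add(v)
# m.median() == 3
```

This is why engines typically offer an approximate alternative (in Spark SQL, the `approx_percentile` / `percentile_approx` aggregate), which bounds memory at the cost of exactness.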

Re: [Apache Spark Jenkins] build system shutting down Dec 23th, 2021

2021-12-06 Thread Nicholas Chammas
Farewell to Jenkins and its classic weather forecast build status icons: [image: health-80plus.png][image: health-60to79.png][image: health-40to59.png][image: health-20to39.png][image: health-00to19.png] And thank you Shane for all the help over these years. Will you be nuking all the Jenkins-re

Re: Supports Dynamic Table Options for Spark SQL

2021-11-15 Thread Nicholas Chammas
Side note about time travel: There is a PR to add VERSION/TIMESTAMP AS OF syntax to Spark SQL. On Mon, Nov 15, 2021 at 2:23 PM Ryan Blue wrote: > I want to note that I wouldn't recommend time traveling this way by using > the hint for `snapshot-id`. I

Jira components cleanup

2021-11-15 Thread Nicholas Chammas
https://issues.apache.org/jira/projects/SPARK?selectedItem=com.atlassian.jira.jira-projects-plugin:components-page I think the "docs" component should be merged into "Documentation". Likewise, the "k8" component should be merged into "Kubernetes". I think anyone can technically update tags, but

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-17 Thread Nicholas Chammas
On Tue, Mar 16, 2021 at 9:15 PM Hyukjin Kwon wrote: > I am currently thinking we will have to convert the Koalas tests to use > unittests to match with PySpark for now. > Keep in mind that pytest supports unittest-based tests out of the box, so

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-15 Thread Nicholas Chammas
On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin wrote: > I don't think we should deprecate existing APIs. > +1 I strongly prefer Spark's immutable DataFrame API to the Pandas API. I could be wrong, but I wager most people who have worked with both Spark and Pandas feel the same way. For the large

Re: Shutdown cleanup of disk-based resources that Spark creates

2021-03-11 Thread Nicholas Chammas
create a > reference within a scope which is closed. For example within the body of a > function (without return value) and store it only in a local > variable. After the scope is closed in case of our function when the caller > gets the control back you have chance to see t

Re: Shutdown cleanup of disk-based resources that Spark creates

2021-03-10 Thread Nicholas Chammas
n > unexpected error (in this case you should keep the checkpoint data). > > This way even after an unexpected exit the next run of the same app should > be able to pick up the checkpointed data. > > Best Regards, > Attila > > > > > On Wed, Mar 10, 2021 at 8:

Shutdown cleanup of disk-based resources that Spark creates

2021-03-10 Thread Nicholas Chammas
Hello people, I'm working on a fix for SPARK-33000. Spark does not clean up checkpointed RDDs/DataFrames on shutdown, even if the appropriate configs are set. In the course of developing a fix, another contributor pointed out

Re: Auto-closing PRs or How to get reviewers' attention

2021-02-18 Thread Nicholas Chammas
On Thu, Feb 18, 2021 at 10:34 AM Sean Owen wrote: > There is no way to force people to review or commit something of course. > And keep in mind we get a lot of, shall we say, unuseful pull requests. > There is occasionally some blowback to closing someone's PR, so the path of > least resistance i

Re: Auto-closing PRs or How to get reviewers' attention

2021-02-18 Thread Nicholas Chammas
On Thu, Feb 18, 2021 at 9:58 AM Enrico Minack wrote: > *What is the approved way to ...* > > *... prevent it from being auto-closed?* Committing and commenting to the > PR does not prevent it from being closed the next day. > Committing and commenting should prevent the PR from being closed. It m

Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-28 Thread Nicholas Chammas
On Thu, Jan 28, 2021 at 3:40 PM Sean Owen wrote: > It isn't that regexp_extract_all (for example) is useless outside SQL, > just, where do you draw the line? Supporting 10s of random SQL functions > across 3 other languages has a cost, which has to be weighed against > benefit, which we can never

Re: [DISCUSS][SPIP] Standardize Spark Exception Messages

2020-10-25 Thread Nicholas Chammas
Just want to call out that this SPIP should probably account somehow for PySpark and the work being done in SPARK-32082 to improve PySpark exceptions. On Sun, Oct 25, 2020 at 8:05 PM Xinyi Yu wrote: > Hi all, > > We like to post a SPIP of Stand

Re: get method guid prefix for file parts for write

2020-09-25 Thread Nicholas Chammas
I think what George is looking for is a way to determine ahead of time the partition IDs that Spark will use when writing output. George, I believe this is an example of what you're looking for: https://github.com/databricks/spark-redshift/blob/184b4428c1505dff7b4365963dc344197a92baa9/src/main/sc

PySpark: Un-deprecating inferring DataFrame schema from list of dictionaries

2020-08-24 Thread Nicholas Chammas
https://github.com/apache/spark/pull/29510 I don't think this is a big deal, but since we're removing a deprecation that has been around for ~6 years, I figured it would be good to bring everyone's attention to this change. Hopefully, we are not breaking any hidden assumptions about the direction
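For readers unfamiliar with what the un-deprecated behavior entails, here is a rough sketch of inferring a schema from a list of dictionaries. This is illustrative only — Spark's actual inference in `createDataFrame` samples rows, handles nested structures, and maps to Catalyst types — and the names here are assumptions:

```python
# Naive schema inference from a list of dicts: union the keys across rows and
# map each Python value type to a SQL-ish type name. First type seen wins.
TYPE_NAMES = {bool: "boolean", int: "bigint", float: "double", str: "string"}

def infer_schema(rows):
    schema = {}
    for row in rows:
        for key, value in row.items():
            schema.setdefault(key, TYPE_NAMES.get(type(value), "string"))
    return schema

rows = [{"name": "a", "age": 1}, {"name": "b", "score": 2.5}]
# infer_schema(rows) == {"name": "string", "age": "bigint", "score": "double"}
```

The sketch also shows why inference from dicts is convenient but lossy: keys missing from some rows simply become nullable columns, and conflicting types across rows need a resolution rule.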

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Nicholas Chammas
To rephrase my earlier email, PyPI users would care about the bundled Hadoop version if they have a workflow that, in effect, looks something like this:

```
pip install pyspark
pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7
spark.read.parquet('s3a://...')
```

I agree that Hadoop 3 would be

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Nicholas Chammas
The team I'm on currently uses pip-installed PySpark for local development, and we regularly access S3 directly from our laptops/workstations. One of the benefits of having Spark built against Hadoop 3.2 vs. 2.7 is being able to use a recent version of hadoop-aws that has mature support for s3a. W

Re: [VOTE] Release Spark 2.4.6 (RC8)

2020-06-03 Thread Nicholas Chammas
I believe that was fixed in 3.0 and there was a decision not to backport the fix: SPARK-31170 On Wed, Jun 3, 2020 at 1:04 PM Xiao Li wrote: > Just downloaded it in my local macbook. Trying to create a table using the > pre-built PySpark. It sou

Re: In Apache Spark JIRA, spark/dev/github_jira_sync.py not running properly

2020-04-29 Thread Nicholas Chammas
ukjin Kwon wrote: > Maybe it's time to switch. Do you know if we can still link the JIRA > against Github? > The script used to change the status of JIRA too but it stopped working > for a long time - I suspect this isn't a big deal. > > On Sat, Apr 25, 2020 at 10:31 AM, Nichol

Re: In Apache Spark JIRA, spark/dev/github_jira_sync.py not running properly

2020-04-24 Thread Nicholas Chammas
Have we asked Infra recently about enabling the native Jira-GitHub integration? Maybe we can deprecate the part of this script that updates Jira tickets with links to the PR and rely on the native integrat

Beginner PR against the Catalog API

2020-04-02 Thread Nicholas Chammas
I recently submitted my first Scala PR. It's very simple, though I don't know if I've done things correctly since I'm not a regular Scala user. SPARK-31000: Add ability to set table description in the catalog https://github.com/apache/spark/pull

Re: Automatic PR labeling

2020-04-02 Thread Nicholas Chammas
SPARK-31330 <https://issues.apache.org/jira/browse/SPARK-31330>: Automatically label PRs based on the paths they touch On Wed, Apr 1, 2020 at 11:34 PM Hyukjin Kwon wrote: > @Nicholas Chammas Would you be interested in > tacking a look? I would love this to be done. > > 2020

Re: [DISCUSS] filling affected versions on JIRA issue

2020-04-01 Thread Nicholas Chammas
Probably the discussion here about Improvement Jira tickets and the "Affects Version" field: https://github.com/apache/spark/pull/27534#issuecomment-588416416 On Wed, Apr 1, 2020 at 9:59 PM Hyukjin Kwon wrote: > > 2) check with older versions to fill up affects version for bug > I don't agree wi

Re: Release Manager's official `branch-3.0` Assessment?

2020-03-28 Thread Nicholas Chammas
I don't have a dog in this race, but: Would it be OK to ship 3.0 with some release notes and/or prominent documentation calling out this issue, and then fixing it in 3.0.1? On Sat, Mar 28, 2020 at 8:45 PM Jungtaek Lim wrote: > I'd say SPARK-31257 as open blocker, because the change in upcoming S

Automatic PR labeling

2020-03-24 Thread Nicholas Chammas
Public Service Announcement: There is a GitHub action that lets you automatically label PRs based on what paths they modify. https://github.com/actions/labeler If we set this up, perhaps down the line we can update the PR dashboard and PR merge script to use the tags. cc @Dongjoon Hyun, who may

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-20 Thread Nicholas Chammas
On Thu, Mar 19, 2020 at 3:46 AM Wenchen Fan wrote: > 2. PARTITIONED BY colTypeList: I think we can support it in the unified > syntax. Just make sure it doesn't appear together with PARTITIONED BY > transformList. > Another side note: Perhaps as part of (or after) unifying the CREATE TABLE synta

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Nicholas Chammas
Side comment: The current docs for CREATE TABLE add to the confusion by describing the Hive-compatible command as "CREATE TABLE USING HIVE FORMAT", but neither "USING"

Re-triggering failed GitHub workflows

2020-03-16 Thread Nicholas Chammas
Is there any way contributors can retrigger a failed GitHub workflow, like we do with Jenkins? There's supposed to be a "Re-run all checks" button, but I don't see it. Do we need INFRA to grant permissions for that, perhaps? Right now I'm doing it by adding empty commits: ``` git commit --allow-

Re: Running Spark through a debugger

2020-03-12 Thread Nicholas Chammas
' > in some cases. What are you having trouble with, does it build? > > On Mon, Dec 16, 2019 at 11:27 PM Nicholas Chammas > wrote: > > > > I normally stick to the Python parts of Spark, but I am interested in > walking through the DSv2 code and understanding how i

Re: Auto-linking from PRs to Jira tickets

2020-03-10 Thread Nicholas Chammas
Could you point us to the ticket? I'd like to follow along. On Tue, Mar 10, 2020 at 9:13 AM Alex Ott wrote: > For Zeppelin I've created recently the ASF INFRA Jira for that feature... > Although maybe it should be done for all projects. > > Nicholas Chammas at "Mon

Re: Auto-linking from PRs to Jira tickets

2020-03-09 Thread Nicholas Chammas
> > On Mon, Mar 9, 2020 at 2:14 PM Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> This is a feature of GitHub itself and would auto-link directly from the >> PR back to Jira. >> >> I haven't looked at the PR dashboard in a while, but

Re: Auto-linking from PRs to Jira tickets

2020-03-09 Thread Nicholas Chammas
to do this with the same bot that runs the PR dashboard, > is it no longer working? > > On Mon, Mar 9, 2020 at 12:28 PM Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> https://github.blog/2019-10-14-introducing-autolink-references/ >> >> Gi

Auto-linking from PRs to Jira tickets

2020-03-09 Thread Nicholas Chammas
https://github.blog/2019-10-14-introducing-autolink-references/ GitHub has a feature for auto-linking from PRs to external tickets. It's only available for their paid plans, but perhaps Apache has some arrangement with them where we can get that feature. Since we include Jira ticket numbers in ev

Re: [DISCUSSION] Avoiding duplicate work

2020-02-21 Thread Nicholas Chammas
+1 to what Sean said. On Fri, Feb 21, 2020 at 10:14 AM Sean Owen wrote: > We've avoided using Assignee because it implies that someone 'owns' > resolving the issue, when we want to keep it collaborative, and many > times in the past someone would ask to be assigned and then didn't > follow throu

Re: More publicly documenting the options under spark.sql.*

2020-01-27 Thread Nicholas Chammas
>>>>> experimental option that may change, or legacy, or safety valve flag. >>>>> Certainly anything that's marked an internal conf. (That does raise >>>>> the question of who it's for, if you have to read source to find it.) >>>>> >>

Re: Closing stale PRs with a GitHub Action

2020-01-27 Thread Nicholas Chammas
=is%3Apr+label%3AStale+is%3Aclosed> is how many PRs are active with relatively recent activity. It's a testament to how active this project is. On Sun, Dec 15, 2019 at 11:16 AM Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Just an FYI to everyone, we’ve merged in an

Re: Block a user from spark-website who repeatedly open the invalid same PR

2020-01-26 Thread Nicholas Chammas
+1 I think y'all have shown this person more patience than is merited by their behavior. On Sun, Jan 26, 2020 at 5:16 AM Takeshi Yamamuro wrote: > +1 > > Bests, > Takeshi > > On Sun, Jan 26, 2020 at 3:05 PM Hyukjin Kwon wrote: > >> Hi all, >> >> I am thinking about opening an infra ticket to b

Re: More publicly documenting the options under spark.sql.*

2020-01-15 Thread Nicholas Chammas
nal conf. (That does raise >>> the question of who it's for, if you have to read source to find it.) >>> >>> I don't know if we need to overhaul the conf system, but there may >>> indeed be some confs that could legitimately be documented. I don't

More publicly documenting the options under spark.sql.*

2020-01-14 Thread Nicholas Chammas
I filed SPARK-30510 thinking that we had forgotten to document an option, but it turns out that there's a whole bunch of stuff under SQLConf.scala

Running Spark through a debugger

2019-12-16 Thread Nicholas Chammas
I normally stick to the Python parts of Spark, but I am interested in walking through the DSv2 code and understanding how it works. I tried following the "IDE Setup" section of the developer tools page, but quickly hit several problems loading the pro

Re: Closing stale PRs with a GitHub Action

2019-12-15 Thread Nicholas Chammas
n't mind it being automated if the idle time is long and it posts >>> some friendly message about reopening if there is a material change in the >>> proposed PR, the problem, or interest in merging it. >>> >>> On Fri, Dec 6, 2019 at 11:20 AM Nicholas Chammas

R linter is broken

2019-12-13 Thread Nicholas Chammas
The R linter GitHub action seems to be busted. Looks like we need to update some repository references

Re: [DISCUSS] Add close() on DataWriter interface

2019-12-11 Thread Nicholas Chammas
Is this something that would be exposed/relevant to the Python API? Or is this just for people implementing their own Spark data source? On Wed, Dec 11, 2019 at 12:35 AM Jungtaek Lim wrote: > Hi devs, > > I'd like to propose to add close() on DataWriter explicitly, which is the > place for resou

Re: Closing stale PRs with a GitHub Action

2019-12-06 Thread Nicholas Chammas
be 6-12 > months or something. It's standard practice and doesn't mean it can't be > reopened. > Often the related JIRA should be closed as well but we have done that > separately with bulk-close in the past. > > On Thu, Dec 5, 2019 at 3:24 PM Nicholas Chammas < &

Closing stale PRs with a GitHub Action

2019-12-05 Thread Nicholas Chammas
It’s that topic again. 😄 We have almost 500 open PRs. A good chunk of them are more than a year old. The oldest open PR dates to summer 2015. https://github.com/apache/spark/pulls?q=is%3Apr+is%3Aopen+sort%3Acreated-asc GitHub has an Action for closing stale PRs. https://github.com/marketplace/a

Re: Auto-linking Jira tickets to their PRs

2019-12-03 Thread Nicholas Chammas
Hyukjin Kwon wrote: > I think it's broken .. cc Josh Rosen > > On Wed, Dec 4, 2019 at 10:25 AM, Nicholas Chammas wrote: > >> We used to have a bot or something that automatically linked Jira tickets >> to PRs that mentioned them in their title. I don't see that happ

Auto-linking Jira tickets to their PRs

2019-12-03 Thread Nicholas Chammas
We used to have a bot or something that automatically linked Jira tickets to PRs that mentioned them in their title. I don't see that happening anymore. Did we intentionally remove this functionality, or is it temporarily broken for some reason?

Re: Can't build unidoc

2019-11-29 Thread Nicholas Chammas
2019 at 11:48 AM Nicholas Chammas > wrote: > > > > Howdy folks. Running `./build/sbt unidoc` on the latest master is giving > me this trace: > > > > ``` > > [warn] :: > > [warn

Can't build unidoc

2019-11-29 Thread Nicholas Chammas
Howdy folks. Running `./build/sbt unidoc` on the latest master is giving me this trace:

```
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] ::          UNRESOLVED DEPENDENCIES          ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: commons-collections#commons-co

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-19 Thread Nicholas Chammas
> I don't think the default Hadoop version matters except for the spark-hadoop-cloud module, which is only meaningful under the hadoop-3.2 profile. What do you mean by "only meaningful under the hadoop-3.2 profile"? On Tue, Nov 19, 2019 at 5:40 PM Cheng Lian wrote: > Hey Steve, > > In terms of

Re: [ANNOUNCE] Announcing Apache Spark 3.0.0-preview

2019-11-16 Thread Nicholas Chammas
> Data Source API with Catalog Supports Where can we read more about this? The linked Nabble thread doesn't mention the word "Catalog". On Thu, Nov 7, 2019 at 5:53 PM Xingbo Jiang wrote: > Hi all, > > To enable wide-scale community testing of the upcoming Spark 3.0 release, > the Apache Spark c

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-03 Thread Nicholas Chammas
On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran wrote: > It would be really good if the spark distributions shipped with later > versions of the hadoop artifacts. > I second this. If we need to keep a Hadoop 2.x profile around, why not make it Hadoop 2.8 or something newer? Koert Kuipers wrote:
