Re: Auto-closing PRs or How to get reviewers' attention

2021-02-18 Thread Nicholas Chammas
On Thu, Feb 18, 2021 at 9:58 AM Enrico Minack wrote: > *What is the approved way to ...* > > *... prevent it from being auto-closed?* Committing and commenting to the > PR does not prevent it from being closed the next day. > Committing and commenting should prevent the PR from being closed. It m

Re: Auto-closing PRs or How to get reviewers' attention

2021-02-18 Thread Nicholas Chammas
On Thu, Feb 18, 2021 at 10:34 AM Sean Owen wrote: > There is no way to force people to review or commit something of course. > And keep in mind we get a lot of, shall we say, unuseful pull requests. > There is occasionally some blowback to closing someone's PR, so the path of > least resistance i

Shutdown cleanup of disk-based resources that Spark creates

2021-03-10 Thread Nicholas Chammas
Hello people, I'm working on a fix for SPARK-33000. Spark does not clean up checkpointed RDDs/DataFrames on shutdown, even if the appropriate configs are set. In the course of developing a fix, another contributor pointed out

Re: Shutdown cleanup of disk-based resources that Spark creates

2021-03-10 Thread Nicholas Chammas
n > unexpected error (in this case you should keep the checkpoint data). > > This way even after an unexpected exit the next run of the same app should > be able to pick up the checkpointed data. > > Best Regards, > Attila > > > > > On Wed, Mar 10, 2021 at 8:

Re: Shutdown cleanup of disk-based resources that Spark creates

2021-03-11 Thread Nicholas Chammas
create a > reference within a scope which is closed. For example within the body of a > function (without return value) and store it only in a local > variable. After the scope is closed in case of our function when the caller > gets the control back you have chance to see t

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-15 Thread Nicholas Chammas
On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin wrote: > I don't think we should deprecate existing APIs. > +1 I strongly prefer Spark's immutable DataFrame API to the Pandas API. I could be wrong, but I wager most people who have worked with both Spark and Pandas feel the same way. For the large

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-17 Thread Nicholas Chammas
On Tue, Mar 16, 2021 at 9:15 PM Hyukjin Kwon wrote: > I am currently thinking we will have to convert the Koalas tests to use > unittests to match with PySpark for now. > Keep in mind that pytest supports unittest-based tests out of the box , so

Jira components cleanup

2021-11-15 Thread Nicholas Chammas
https://issues.apache.org/jira/projects/SPARK?selectedItem=com.atlassian.jira.jira-projects-plugin:components-page I think the "docs" component should be merged into "Documentation". Likewise, the "k8" component should be merged into "Kubernetes". I think anyone can technically update tags, but

Re: Supports Dynamic Table Options for Spark SQL

2021-11-15 Thread Nicholas Chammas
Side note about time travel: There is a PR to add VERSION/TIMESTAMP AS OF syntax to Spark SQL. On Mon, Nov 15, 2021 at 2:23 PM Ryan Blue wrote: > I want to note that I wouldn't recommend time traveling this way by using > the hint for `snapshot-id`. I

Re: [Apache Spark Jenkins] build system shutting down Dec 23th, 2021

2021-12-06 Thread Nicholas Chammas
Farewell to Jenkins and its classic weather forecast build status icons (health-80plus.png, health-60to79.png, health-40to59.png, health-20to39.png, health-00to19.png). And thank you Shane for all the help over these years. Will you be nuking all the Jenkins-re

Creating a memory-efficient AggregateFunction to calculate Median

2021-12-09 Thread Nicholas Chammas
I'm trying to create a new aggregate function. It's my first time working with Catalyst, so it's exciting---but I'm also in a bit over my head. My goal is to create a function to calculate the median. As a very simple solution, I could just defi
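Not Catalyst code, but a plain-Python sketch of the classic two-heap running median shows why this is hard to make memory-efficient: even the incremental formulation must retain every value it has seen, which is exactly what a fixed-size aggregation buffer cannot afford. (This is why engines typically offer approximate alternatives such as percentile_approx instead.)

```python
import heapq

class RunningMedian:
    """Exact running median via two heaps.

    Note: the two heaps together still hold every value seen, so this
    is O(n) memory -- illustrating the problem, not solving it.
    """

    def __init__(self):
        self.lo = []  # max-heap (stored negated) of the smaller half
        self.hi = []  # min-heap of the larger half

    def add(self, x):
        if self.lo and x > -self.lo[0]:
            heapq.heappush(self.hi, x)
        else:
            heapq.heappush(self.lo, -x)
        # Rebalance so the heap sizes differ by at most one.
        if len(self.lo) > len(self.hi) + 1:
            heapq.heappush(self.hi, -heapq.heappop(self.lo))
        elif len(self.hi) > len(self.lo) + 1:
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):
            return -self.lo[0]
        if len(self.hi) > len(self.lo):
            return self.hi[0]
        return (-self.lo[0] + self.hi[0]) / 2

m = RunningMedian()
for v in [5, 1, 3, 2, 4]:
    m.add(v)
print(m.median())  # 3
```

Each `add` is O(log n) time, but the memory never shrinks — an approximate sketch (t-digest, KLL, etc.) is what makes a bounded-memory median aggregate possible.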

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-13 Thread Nicholas Chammas
No takers here? :) I can see now why a median function is not available in most data processing systems. It's pretty annoying to implement! On Thu, Dec 9, 2021 at 9:25 PM Nicholas Chammas wrote: > I'm trying to create a new aggregate function. It's my first time working &

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-13 Thread Nicholas Chammas
's relatively cheap. > > > > On Mon, Dec 13, 2021 at 6:43 PM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> No takers here? :) >> >> I can see now why a median function is not available in most data >> processing systems.

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-15 Thread Nicholas Chammas
which are another way of computing aggregations through > composition of other Expressions. > > Simeon > > > > > > On Thu, Dec 9, 2021 at 9:26 PM Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> I'm trying to create a new aggregate function.

Re: [DISCUSS] Rename 'SQL' to 'SQL / DataFrame', and 'Query' to 'Execution' in SQL UI page

2022-03-28 Thread Nicholas Chammas
+1 Understanding the close relationship between SQL and DataFrames in Spark was a key learning moment for me, but I agree that using the terms interchangeably can be confusing. > On Mar 27, 2022, at 9:27 PM, Hyukjin Kwon wrote: > > *for some reason, the image looks broken (to me). I am attac

Deluge of GitBox emails

2022-04-04 Thread Nicholas Chammas
I assume I’m not the only one getting these new emails from GitBox. Is there a story behind that that I missed? I’d rather not get these emails on the dev list. I assume most of the list would agree with me. GitHub has a good set of options for following activity on the repo. People who want t

Re: Deluge of GitBox emails

2022-04-04 Thread Nicholas Chammas
the > normal Github emails - that is if we turn them off do we have anything? > > On Mon, Apr 4, 2022 at 8:44 AM Nicholas Chammas <mailto:nicholas.cham...@gmail.com>> wrote: > I assume I’m not the only one getting these new emails from GitBox. Is there > a story behind

Allowing all Reader or Writer settings to be provided as options

2022-08-09 Thread Nicholas Chammas
Hello people, I want to bring some attention to SPARK-39630 and ask if there are any design objections to the idea proposed there. The gist of the proposal is that there are some reader or writer directives that cannot be supplied as options,

Are DataFrame rows ordered without an explicit ordering clause?

2023-09-18 Thread Nicholas Chammas
I’ve always considered DataFrames to be logically equivalent to SQL tables or queries. In SQL, the result order of any query is implementation-dependent without an explicit ORDER BY clause. Technically, you could run `SELECT * FROM table;` 10 times in a row and get 10 different orderings. I th
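The same point can be demonstrated outside Spark with any SQL engine. A small sketch using Python's built-in sqlite3 (an analogy, not Spark itself): only an explicit ORDER BY makes result order part of the query's contract.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(3,), (1,), (2,)])

# No ORDER BY: the result order is implementation-dependent. It may
# look stable run after run, but nothing in SQL guarantees it.
unordered = [row[0] for row in conn.execute("SELECT id FROM t")]

# With ORDER BY: the order is guaranteed by the query itself.
ordered = [row[0] for row in conn.execute("SELECT id FROM t ORDER BY id")]
print(ordered)  # [1, 2, 3]
```

For a Spark DataFrame the analogous contract is an explicit `orderBy()`; anything observed without one is an accident of the current physical plan.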

Algolia search on website is broken

2023-12-05 Thread Nicholas Chammas
Should I report this instead on Jira? Apologies if the dev list is not the right place. Search on the website appears to be broken. For example, here is a search for “analyze”:  And here is the same search using DDG

When and how does Spark use metastore statistics?

2023-12-05 Thread Nicholas Chammas
I’m interested in improving some of the documentation relating to the table and column statistics that get stored in the metastore, and how Spark uses them. But I’m not clear on a few things, so I’m writing to you with some questions. 1. The documentation for spark.sql.autoBroadcastJoinThreshold

Re: SSH Tunneling issue with Apache Spark

2023-12-06 Thread Nicholas Chammas
This is not a question for the dev list. Moving dev to bcc. One thing I would try is to connect to this database using JDBC + SSH tunnel, but without Spark. That way you can focus on getting the JDBC connection to work without Spark complicating the picture for you. > On Dec 5, 2023, at 8:12 P

Re: Algolia search on website is broken

2023-12-10 Thread Nicholas Chammas
onsole. > On Dec 5, 2023, at 11:28 AM, Nicholas Chammas > wrote: > > Should I report this instead on Jira? Apologies if the dev list is not the > right place. > > Search on the website appears to be broken. For example, here is a search for > “analyze”: >  > >

Re: When and how does Spark use metastore statistics?

2023-12-10 Thread Nicholas Chammas
in(mode="cost")) what the cost-based optimizer does and how to enable it. Would this be a welcome addition to the project’s documentation? I’m happy to work on this. > On Dec 5, 2023, at 12:12 PM, Nicholas Chammas > wrote: > > I’m interested in improving some of t

Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Nicholas Chammas
> On Dec 11, 2023, at 6:40 AM, Mich Talebzadeh > wrote: > > By default, the CBO is enabled in Spark. Note that this is not correct. AQE is enabled

Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Nicholas Chammas
> On Dec 11, 2023, at 6:40 AM, Mich Talebzadeh > wrote: > spark.sql.cbo.strategy: Set to AUTO to use the CBO as the default optimizer, > or NONE to disable it completely. > Hmm, I’ve also never heard of this setting before and can’t seem to find it in the Spark docs or source code.

Re: When and how does Spark use metastore statistics?

2023-12-11 Thread Nicholas Chammas
> relying on this email's technical content is explicitly disclaimed. The > author will in no case be liable for any monetary damages arising from such > loss, damage or destruction. > > > > On Mon, 11 Dec 2023 at 17:11, Nicholas Chammas <mailto:nicholas.cham...@gma

Guidance for filling out "Affects Version" on Jira

2023-12-17 Thread Nicholas Chammas
The Contributing guide only mentions what to fill in for “Affects Version” for bugs. How about for improvements? This question once caused some problems when I set “Affects Version” to the last released version, and that was interpreted as a request

Re: Validate spark sql

2023-12-24 Thread Nicholas Chammas
This is a user-list question, not a dev-list question. Moving this conversation to the user list and BCC-ing the dev list. Also, this statement > We are not validating against table or column existence. is not correct. When you call spark.sql(…), Spark will look up the table references and fail

Install Ruby 3 to build the docs

2024-01-10 Thread Nicholas Chammas
Just a quick heads up that, while Ruby 2.7 will continue to work, you should plan to install Ruby 3 in the near future in order to build the docs. (I recommend using rbenv to manage multiple Ruby versions.) Ruby 2 reached EOL in March 2023

Removing Kinesis in Spark 4

2024-01-20 Thread Nicholas Chammas
From the dev thread: What else could be removed in Spark 4? > On Aug 17, 2023, at 1:44 AM, Yang Jie wrote: > > I would like to know how we should handle the two Kinesis-related modules in > Spark 4.0. They have a very low freque

Re: Removing Kinesis in Spark 4

2024-01-20 Thread Nicholas Chammas
tens of thousands of views. > > I do feel like it's unmaintained, and do feel like it might be a stretch to > leave it lying around until Spark 5. > It's not exactly unused though. > > I would not object to removing it unless there is some voice of support here. > &

How do you debug a code-generated aggregate?

2024-02-11 Thread Nicholas Chammas
Consider this example:

>>> from pyspark.sql.functions import sum
>>> spark.range(4).repartition(2).select(sum("id")).show()
+-------+
|sum(id)|
+-------+
|      6|
+-------+

I’m trying to understand how this works because I’m investigating a bug in this kind of aggregate. I see that doProduceWith

Re: How do you debug a code-generated aggregate?

2024-02-12 Thread Nicholas Chammas
') on the DataFrame. That helped me to find similar issues in > most cases. > > HTH > > On Sun, Feb 11, 2024 at 11:26 PM Nicholas Chammas <mailto:nicholas.cham...@gmail.com>> wrote: >> Consider this example: >> >>> from pyspark.sql.functions import

Generating config docs automatically

2024-02-14 Thread Nicholas Chammas
I’m interested in automating our config documentation and need input from a committer who is interested in shepherding this work. We have around 60 tables of configs across our documentation. Here’s a typical example.

Re: Generating config docs automatically

2024-02-21 Thread Nicholas Chammas
I know config documentation is not the most exciting thing. If there is anything I can do to make this as easy as possible for a committer to shepherd, I’m all ears! > On Feb 14, 2024, at 8:53 PM, Nicholas Chammas > wrote: > > I’m interested in automating our config documentat

Re: Generating config docs automatically

2024-02-22 Thread Nicholas Chammas
dea); but that’s just my > opinion. I'd be happy to help with reviews though. > > On Wed, Feb 21, 2024 at 6:37 AM Nicholas Chammas <mailto:nicholas.cham...@gmail.com>> wrote: >> I know config documentation is not the most exciting thing. If there is >> anything I

Re: Suggestion on Join Approach with Spark

2019-05-15 Thread Nicholas Chammas
This kind of question is for the User list, or for something like Stack Overflow. It's not on topic here. The dev list (i.e. this list) is for discussions about the development of Spark itself. On Wed, May 15, 2019 at 1:50 PM Chetan Khatri wrote: > Any one help me, I am confused. :( > > On Wed,

Python API for mapGroupsWithState

2019-08-02 Thread Nicholas Chammas
Can someone succinctly describe the challenge in adding the `mapGroupsWithState()` API to PySpark? I was hoping for some suboptimal but nonetheless working solution to be available in Python, as there are with Python UDFs for example, but that doesn't seem to be the case. The JIRA ticket for arbitrary

Re: Recognizing non-code contributions

2019-08-05 Thread Nicholas Chammas
On Mon, Aug 5, 2019 at 9:55 AM Sean Owen wrote: > On Mon, Aug 5, 2019 at 3:50 AM Myrle Krantz wrote: > > So... events coordinators? I'd still make them committers. I guess I'm > still struggling to understand what problem making people VIP's without > giving them committership is trying to sol

Providing a namespace for third-party configurations

2019-08-30 Thread Nicholas Chammas
I discovered today that EMR provides its own optimizations for Spark . Some of these optimizations are controlled by configuration settings with names like `spark.sql.dynamicPartitionPruning.enabled` or `spark.sql.optim

Re: DSv2 sync - 4 September 2019

2019-09-08 Thread Nicholas Chammas
A quick question about failure modes, as a casual observer of the DSv2 effort: I was considering filing a JIRA ticket about enhancing the DataFrameReader to include the failure *reason* in addition to the corrupt record when the mode is PERMISSIVE. So if you are loading a CSV, for example, and a v

Re: DSv2 sync - 4 September 2019

2019-09-09 Thread Nicholas Chammas
nchen > > On Mon, Sep 9, 2019 at 12:46 AM Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> A quick question about failure modes, as a casual observer of the DSv2 >> effort: >> >> I was considering filing a JIRA ticket about enhancing the >>

Spark 3.0 and S3A

2019-10-28 Thread Nicholas Chammas
Howdy folks, I have a question about what is happening with the 3.0 release in relation to Hadoop and hadoop-aws . Today, among other builds, we release a build of Spark built against Hadoop 2.7 and another one built

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-03 Thread Nicholas Chammas
On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran wrote: > It would be really good if the spark distributions shipped with later > versions of the hadoop artifacts. > I second this. If we need to keep a Hadoop 2.x profile around, why not make it Hadoop 2.8 or something newer? Koert Kuipers wrote:

Re: [ANNOUNCE] Announcing Apache Spark 3.0.0-preview

2019-11-16 Thread Nicholas Chammas
> Data Source API with Catalog Supports Where can we read more about this? The linked Nabble thread doesn't mention the word "Catalog". On Thu, Nov 7, 2019 at 5:53 PM Xingbo Jiang wrote: > Hi all, > > To enable wide-scale community testing of the upcoming Spark 3.0 release, > the Apache Spark c

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-19 Thread Nicholas Chammas
> I don't think the default Hadoop version matters except for the spark-hadoop-cloud module, which is only meaningful under the hadoop-3.2 profile. What do you mean by "only meaningful under the hadoop-3.2 profile"? On Tue, Nov 19, 2019 at 5:40 PM Cheng Lian wrote: > Hey Steve, > > In terms of

Can't build unidoc

2019-11-29 Thread Nicholas Chammas
Howdy folks. Running `./build/sbt unidoc` on the latest master is giving me this trace: ```
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  ::          UNRESOLVED DEPENDENCIES          ::
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  :: commons-collections#commons-co

Re: Can't build unidoc

2019-11-29 Thread Nicholas Chammas
2019 at 11:48 AM Nicholas Chammas > wrote: > > > > Howdy folks. Running `./build/sbt unidoc` on the latest master is giving > me this trace: > > > > ``` > > [warn] :: > > [warn

Auto-linking Jira tickets to their PRs

2019-12-03 Thread Nicholas Chammas
We used to have a bot or something that automatically linked Jira tickets to PRs that mentioned them in their title. I don't see that happening anymore. Did we intentionally remove this functionality, or is it temporarily broken for some reason?

Re: Auto-linking Jira tickets to their PRs

2019-12-03 Thread Nicholas Chammas
Hyukjin Kwon wrote: > I think it's broken .. cc Josh Rosen > > On Wed, Dec 4, 2019 at 10:25 AM, Nicholas Chammas > wrote: > >> We used to have a bot or something that automatically linked Jira tickets >> to PRs that mentioned them in their title. I don't see that happ

Closing stale PRs with a GitHub Action

2019-12-05 Thread Nicholas Chammas
It’s that topic again. 😄 We have almost 500 open PRs. A good chunk of them are more than a year old. The oldest open PR dates to summer 2015. https://github.com/apache/spark/pulls?q=is%3Apr+is%3Aopen+sort%3Acreated-asc GitHub has an Action for closing stale PRs. https://github.com/marketplace/a

Re: Closing stale PRs with a GitHub Action

2019-12-06 Thread Nicholas Chammas
be 6-12 > months or something. It's standard practice and doesn't mean it can't be > reopened. > Often the related JIRA should be closed as well but we have done that > separately with bulk-close in the past. > > On Thu, Dec 5, 2019 at 3:24 PM Nicholas Chammas < &

Re: [DISCUSS] Add close() on DataWriter interface

2019-12-11 Thread Nicholas Chammas
Is this something that would be exposed/relevant to the Python API? Or is this just for people implementing their own Spark data source? On Wed, Dec 11, 2019 at 12:35 AM Jungtaek Lim wrote: > Hi devs, > > I'd like to propose to add close() on DataWriter explicitly, which is the > place for resou

R linter is broken

2019-12-13 Thread Nicholas Chammas
The R linter GitHub action seems to be busted . Looks like we need to update some repository references

Re: Closing stale PRs with a GitHub Action

2019-12-15 Thread Nicholas Chammas
n't mind it being automated if the idle time is long and it posts >>> some friendly message about reopening if there is a material change in the >>> proposed PR, the problem, or interest in merging it. >>> >>> On Fri, Dec 6, 2019 at 11:20 AM Nicholas Chammas

Running Spark through a debugger

2019-12-16 Thread Nicholas Chammas
I normally stick to the Python parts of Spark, but I am interested in walking through the DSv2 code and understanding how it works. I tried following the "IDE Setup" section of the developer tools page, but quickly hit several problems loading the pro

More publicly documenting the options under spark.sql.*

2020-01-14 Thread Nicholas Chammas
I filed SPARK-30510 thinking that we had forgotten to document an option, but it turns out that there's a whole bunch of stuff under SQLConf.scala

Re: More publicly documenting the options under spark.sql.*

2020-01-15 Thread Nicholas Chammas
nal conf. (That does raise >>> the question of who it's for, if you have to read source to find it.) >>> >>> I don't know if we need to overhaul the conf system, but there may >>> indeed be some confs that could legitimately be documented. I don't &

Re: Block a user from spark-website who repeatedly open the invalid same PR

2020-01-26 Thread Nicholas Chammas
+1 I think y'all have shown this person more patience than is merited by their behavior. On Sun, Jan 26, 2020 at 5:16 AM Takeshi Yamamuro wrote: > +1 > > Bests, > Takeshi > > On Sun, Jan 26, 2020 at 3:05 PM Hyukjin Kwon wrote: > >> Hi all, >> >> I am thinking about opening an infra ticket to b

Re: Closing stale PRs with a GitHub Action

2020-01-27 Thread Nicholas Chammas
=is%3Apr+label%3AStale+is%3Aclosed> is how many PRs are active with relatively recent activity. It's a testament to how active this project is. On Sun, Dec 15, 2019 at 11:16 AM Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Just an FYI to everyone, we’ve merged in an

Re: More publicly documenting the options under spark.sql.*

2020-01-27 Thread Nicholas Chammas
>>> experimental option that may change, or legacy, or safety valve flag. >>>>> Certainly anything that's marked an internal conf. (That does raise >>>>> the question of who it's for, if you have to read source to find it.) >>>>> >>

Re: [DISCUSSION] Avoiding duplicate work

2020-02-21 Thread Nicholas Chammas
+1 to what Sean said. On Fri, Feb 21, 2020 at 10:14 AM Sean Owen wrote: > We've avoided using Assignee because it implies that someone 'owns' > resolving the issue, when we want to keep it collaborative, and many > times in the past someone would ask to be assigned and then didn't > follow throu

Auto-linking from PRs to Jira tickets

2020-03-09 Thread Nicholas Chammas
https://github.blog/2019-10-14-introducing-autolink-references/ GitHub has a feature for auto-linking from PRs to external tickets. It's only available for their paid plans, but perhaps Apache has some arrangement with them where we can get that feature. Since we include Jira ticket numbers in ev

Re: Auto-linking from PRs to Jira tickets

2020-03-09 Thread Nicholas Chammas
to do this with the same bot that runs the PR dashboard, > is it no longer working? > > On Mon, Mar 9, 2020 at 12:28 PM Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> https://github.blog/2019-10-14-introducing-autolink-references/ >> >> Gi

Re: Auto-linking from PRs to Jira tickets

2020-03-09 Thread Nicholas Chammas
t; > > On Mon, Mar 9, 2020 at 2:14 PM Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> This is a feature of GitHub itself and would auto-link directly from the >> PR back to Jira. >> >> I haven't looked at the PR dashboard in a while, but

Re: Auto-linking from PRs to Jira tickets

2020-03-10 Thread Nicholas Chammas
Could you point us to the ticket? I'd like to follow along. On Tue, Mar 10, 2020 at 9:13 AM Alex Ott wrote: > For Zeppelin I've created recently the ASF INFRA Jira for that feature... > Although maybe it should be done for all projects. > > Nicholas Chammas at "Mon

Re: Running Spark through a debugger

2020-03-12 Thread Nicholas Chammas
' > in some cases. What are you having trouble with, does it build? > > On Mon, Dec 16, 2019 at 11:27 PM Nicholas Chammas > wrote: > > > > I normally stick to the Python parts of Spark, but I am interested in > walking through the DSv2 code and understanding how i

Re-triggering failed GitHub workflows

2020-03-16 Thread Nicholas Chammas
Is there any way contributors can retrigger a failed GitHub workflow, like we do with Jenkins? There's supposed to be a "Re-run all checks" button, but I don't see it. Do we need INFRA to grant permissions for that, perhaps? Right now I'm doing it by adding empty commits: ``` git commit --allow-
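For reference, the empty-commit trick can be reproduced end to end in a throwaway repository (a sketch; in a real PR branch only the empty commit and a `git push` apply — the identity below is a placeholder):

```shell
# Demo in a throwaway repo so the commands are safe to run anywhere.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"   # placeholder identity
git config user.name "Example User"
git commit -q --allow-empty -m "initial"

# The workaround itself: an empty commit gives CI a fresh SHA to build
# without touching any files. In a real PR branch, follow with: git push
git commit -q --allow-empty -m "Retrigger CI"

git log --oneline
```

The empty commit can later be dropped with an interactive rebase if a clean history matters, though most projects squash-merge anyway.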

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Nicholas Chammas
Side comment: The current docs for CREATE TABLE add to the confusion by describing the Hive-compatible command as "CREATE TABLE USING HIVE FORMAT", but neither "USING"

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-20 Thread Nicholas Chammas
On Thu, Mar 19, 2020 at 3:46 AM Wenchen Fan wrote: > 2. PARTITIONED BY colTypeList: I think we can support it in the unified > syntax. Just make sure it doesn't appear together with PARTITIONED BY > transformList. > Another side note: Perhaps as part of (or after) unifying the CREATE TABLE synta

Automatic PR labeling

2020-03-24 Thread Nicholas Chammas
Public Service Announcement: There is a GitHub action that lets you automatically label PRs based on what paths they modify. https://github.com/actions/labeler If we set this up, perhaps down the line we can update the PR dashboard and PR merge script to use the tags. cc @Dongjoon Hyun , who may
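For anyone curious what the setup looks like, a minimal sketch of the two files involved (the label names and path globs here are hypothetical examples, not Spark's actual mapping):

```yaml
# .github/labeler.yml -- maps each label to the paths that trigger it
DOCS:
  - docs/**/*
PYTHON:
  - python/**/*

# .github/workflows/labeler.yml -- runs the action on each PR
name: Label PRs
on: [pull_request_target]
jobs:
  label:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/labeler@v4
        with:
          repo-token: "${{ secrets.GITHUB_TOKEN }}"
```

Note the `pull_request_target` trigger: the action needs a token with write access to apply labels, which plain `pull_request` runs from forks do not get.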

Re: Release Manager's official `branch-3.0` Assessment?

2020-03-28 Thread Nicholas Chammas
I don't have a dog in this race, but: Would it be OK to ship 3.0 with some release notes and/or prominent documentation calling out this issue, and then fixing it in 3.0.1? On Sat, Mar 28, 2020 at 8:45 PM Jungtaek Lim wrote: > I'd say SPARK-31257 as open blocker, because the change in upcoming S

Re: [DISCUSS] filling affected versions on JIRA issue

2020-04-01 Thread Nicholas Chammas
Probably the discussion here about Improvement Jira tickets and the "Affects Version" field: https://github.com/apache/spark/pull/27534#issuecomment-588416416 On Wed, Apr 1, 2020 at 9:59 PM Hyukjin Kwon wrote: > > 2) check with older versions to fill up affects version for bug > I don't agree wi

Re: Automatic PR labeling

2020-04-02 Thread Nicholas Chammas
SPARK-31330 <https://issues.apache.org/jira/browse/SPARK-31330>: Automatically label PRs based on the paths they touch On Wed, Apr 1, 2020 at 11:34 PM Hyukjin Kwon wrote: > @Nicholas Chammas Would you be interested in > taking a look? I would love this to be done. > > 2020

Beginner PR against the Catalog API

2020-04-02 Thread Nicholas Chammas
I recently submitted my first Scala PR. It's very simple, though I don't know if I've done things correctly since I'm not a regular Scala user. SPARK-31000 : Add ability to set table description in the catalog https://github.com/apache/spark/pull

Re: In Apache Spark JIRA, spark/dev/github_jira_sync.py not running properly

2020-04-24 Thread Nicholas Chammas
Have we asked Infra recently about enabling the native Jira-GitHub integration ? Maybe we can deprecate the part of this script that updates Jira tickets with links to the PR and rely on the native integrat

Re: In Apache Spark JIRA, spark/dev/github_jira_sync.py not running properly

2020-04-29 Thread Nicholas Chammas
ukjin Kwon wrote: > Maybe it's time to switch. Do you know if we can still link the JIRA > against Github? > The script used to change the status of JIRA too but it stopped working > for a long time - I suspect this isn't a big deal. > > On Sat, Apr 25, 2020 at 10:31 AM, Nichol

Re: [VOTE] Release Spark 2.4.6 (RC8)

2020-06-03 Thread Nicholas Chammas
I believe that was fixed in 3.0 and there was a decision not to backport the fix: SPARK-31170 On Wed, Jun 3, 2020 at 1:04 PM Xiao Li wrote: > Just downloaded it in my local macbook. Trying to create a table using the > pre-built PySpark. It sou

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Nicholas Chammas
The team I'm on currently uses pip-installed PySpark for local development, and we regularly access S3 directly from our laptops/workstations. One of the benefits of having Spark built against Hadoop 3.2 vs. 2.7 is being able to use a recent version of hadoop-aws that has mature support for s3a. W

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Nicholas Chammas
To rephrase my earlier email, PyPI users would care about the bundled Hadoop version if they have a workflow that, in effect, looks something like this: ``` pip install pyspark pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7 spark.read.parquet('s3a://...') ``` I agree that Hadoop 3 would be

PySpark: Un-deprecating inferring DataFrame schema from list of dictionaries

2020-08-24 Thread Nicholas Chammas
https://github.com/apache/spark/pull/29510 I don't think this is a big deal, but since we're removing a deprecation that has been around for ~6 years, I figured it would be good to bring everyone's attention to this change. Hopefully, we are not breaking any hidden assumptions about the direction

Re: get method guid prefix for file parts for write

2020-09-25 Thread Nicholas Chammas
I think what George is looking for is a way to determine ahead of time the partition IDs that Spark will use when writing output. George, I believe this is an example of what you're looking for: https://github.com/databricks/spark-redshift/blob/184b4428c1505dff7b4365963dc344197a92baa9/src/main/sc

Re: [DISCUSS][SPIP] Standardize Spark Exception Messages

2020-10-25 Thread Nicholas Chammas
Just want to call out that this SPIP should probably account somehow for PySpark and the work being done in SPARK-32082 to improve PySpark exceptions. On Sun, Oct 25, 2020 at 8:05 PM Xinyi Yu wrote: > Hi all, > > We like to post a SPIP of Stand

Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-28 Thread Nicholas Chammas
On Thu, Jan 28, 2021 at 3:40 PM Sean Owen wrote: > It isn't that regexp_extract_all (for example) is useless outside SQL, > just, where do you draw the line? Supporting 10s of random SQL functions > across 3 other languages has a cost, which has to be weighed against > benefit, which we can never

Re: Disabling Closed -> Reopened transition for non-committers

2017-10-05 Thread Nicholas Chammas
Whoops, didn’t mean to send that out to the list. Apologies. Somehow, an earlier draft of my email got sent out. Nick On Thu, Oct 5, 2017 at 9:20 AM, Nicholas Chammas wrote: > The first sign that that conversation was going to go downhill was when > the user [demanded]( > https://issues.a

Re: Kubernetes: why use init containers?

2018-01-09 Thread Nicholas Chammas
I’d like to point out the output of “git show --stat” for that diff: 29 files changed, 130 insertions(+), 1560 deletions(-) +1 for that and generally for the idea of leveraging spark-submit. You can argue that executors downloading from external servers would be faster than downloading from the dr

Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-23 Thread Nicholas Chammas
Launched a test cluster on EC2 with Flintrock and ran some simple tests. Building Spark took much longer than usual, but that may just be a fluke. Otherwise, all looks good to me. +1 On Fri, Feb 23, 2018 at 10:55 AM Denny Lee wrote: > +1 (non-binding) > >

Please keep s3://spark-related-packages/ alive

2018-02-26 Thread Nicholas Chammas
If you go to the Downloads page and download Spark 2.2.1, you’ll get a link to an Apache mirror. It didn’t use to be this way. As recently as Spark 2.2.0, downloads were served via CloudFront , which was backed by an S3 bu

Re: Please keep s3://spark-related-packages/ alive

2018-02-27 Thread Nicholas Chammas
Spark, FWIW. >> > > To clarify, the apache-spark.rb formula in Homebrew uses the Apache > mirror closer.lua script > > > https://github.com/Homebrew/homebrew-core/blob/master/Formula/apache-spark.rb#L4 > >michael > > > >> On Mon, Feb 26, 2018 at 10:57 PM Nicho

Re: Please keep s3://spark-related-packages/ alive

2018-03-01 Thread Nicholas Chammas
Marton, Thanks for the tip. (Too bad the docs referenced from the issue I opened with INFRA make no mention of mirrors.cgi.) Matei, A Requester Pays bucket is a good idea. I was trying to avoid

Silencing messages from Ivy when calling spark-submit

2018-03-05 Thread Nicholas Chammas
I couldn’t get an answer anywhere else, so I thought I’d ask here. Is there a way to silence the messages that come from Ivy when you call spark-submit with --packages? (For the record, I asked this question on Stack Overflow .) Would it be a good idea

Re: Silencing messages from Ivy when calling spark-submit

2018-03-05 Thread Nicholas Chammas
"spark.jars.ivySettings" to point to your > ivysettings.xml file. Would that work for you to configure it there? > > Bryan > > On Mon, Mar 5, 2018 at 8:20 AM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> I couldn’t get an answer anywhere els

Re: Silencing messages from Ivy when calling spark-submit

2018-03-12 Thread Nicholas Chammas
o understand some settings. If you happen to figure > out the answer, please report back here. I'm sure others would find it > useful too. > > Bryan > > On Mon, Mar 5, 2018 at 3:50 PM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> Oh, I didn

Changing how we compute release hashes

2018-03-15 Thread Nicholas Chammas
To verify that I’ve downloaded a Hadoop release correctly, I can just do this:

$ shasum --check hadoop-2.7.5.tar.gz.sha256
hadoop-2.7.5.tar.gz: OK

However, since we generate Spark release hashes with GPG
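For context, `gpg --print-md` emits an uppercase, space-grouped digest that `shasum --check` cannot consume directly. A sketch of the massaging that format requires — the filename and digest below are fabricated for illustration (real output also wraps across lines, which this one-line example glosses over):

```shell
# A gpg --print-md style line: "filename: XXXX XXXX ..." (uppercase, grouped)
gpg_output="spark-2.3.0-bin-hadoop2.7.tgz: 9A3B C0FF EE00 1234"

# shasum wants: "<lowercase hex digest>  <filename>"
file="${gpg_output%%:*}"
digest=$(printf '%s' "${gpg_output#*: }" | tr -d ' ' | tr 'A-F' 'a-f')
printf '%s  %s\n' "$digest" "$file"
# -> 9a3bc0ffee001234  spark-2.3.0-bin-hadoop2.7.tgz
```

Generating the hashes with `shasum` (or `sha512sum`) in the first place avoids the conversion entirely, which is what the thread proposes.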

Re: Changing how we compute release hashes

2018-03-16 Thread Nicholas Chammas
new format, or do we only want to do this for new releases? On Fri, Mar 16, 2018 at 1:50 PM Felix Cheung wrote: > +1 there > > -- > *From:* Sean Owen > *Sent:* Friday, March 16, 2018 9:51:49 AM > *To:* Felix Cheung > *Cc:* rb...@netflix.com; Ni

Re: Changing how we compute release hashes

2018-03-16 Thread Nicholas Chammas
> +1 there >>> >>> -- >>> *From:* Sean Owen >>> *Sent:* Friday, March 16, 2018 9:51:49 AM >>> *To:* Felix Cheung >>> *Cc:* rb...@netflix.com; Nicholas Chammas; Spark dev list >>> >>> *Subject:* Re:

Re: Changing how we compute release hashes

2018-03-23 Thread Nicholas Chammas
To close the loop here: SPARK-23716 <https://issues.apache.org/jira/browse/SPARK-23716> On Fri, Mar 16, 2018 at 5:00 PM Nicholas Chammas wrote: > OK, will do. > > On Fri, Mar 16, 2018 at 4:41 PM Sean Owen wrote: > >> I think you can file a JIRA and open a PR. All o

Correlated subqueries in the DataFrame API

2018-04-09 Thread Nicholas Chammas
I just submitted SPARK-23945 but wanted to double check here to make sure I didn't miss something fundamental. Correlated subqueries are tracked at a high level in SPARK-18455 , but it's not clea

Re: Correlated subqueries in the DataFrame API

2018-04-27 Thread Nicholas Chammas
from source") >> val df = table.filter($"col".isin(subQ.toSet)) >> >> That also distinguishes between a sub-query and a correlated sub-query >> that uses values from the outer query. We would still need to come up with >> syntax for the correlated case, u
