[DISCUSSION] Avoiding duplicate work

2020-02-21 Thread younggyu Chun
Hi All,

I would like to suggest using the "Assignee" functionality in JIRA when we
are working on the project. When we pick a ticket to work on, we don't know
whether someone else is already working on it.

Recently I spent time solving an issue and made a merge request, but it
turned out to be duplicate work. The ticket I was working on gave no clue
that somebody else was already working on it.

Are there ways to avoid duplicate work that I don't know about yet?

Thank you,
Younggyu


Re: [DISCUSSION] Avoiding duplicate work

2020-02-21 Thread Wenchen Fan
The JIRA ticket will show the linked PR if there is one, which indicates
that someone is working on it if the PR is active. Maybe the bot should
also leave a comment on the JIRA ticket to make it clearer?


Re: [DISCUSSION] Avoiding duplicate work

2020-02-21 Thread younggyu Chun
What if two people are both looking at the code and neither has made a
merge request yet? I guess we still can't see what's going on, because the
JIRA ticket won't show a linked PR.


Re: [DISCUSSION] Avoiding duplicate work

2020-02-21 Thread Sean Owen
We've avoided using Assignee because it implies that someone 'owns'
resolving the issue, when we want to keep it collaborative, and many
times in the past someone would ask to be assigned and then not
follow through.

You can comment on the JIRA to say "I'm working on this" but that has
the same problem. Frequently people see that and don't work on it, and
then the original person doesn't follow through either.

The best practice is probably to write down your analysis of the
problem and solution so far in a comment. That helps everyone and
doesn't suggest others shouldn't work on it; we want them to, we want
them to work together. That also shows some commitment to working on
it.


Re: [DISCUSSION] Avoiding duplicate work

2020-02-21 Thread Nicholas Chammas
+1 to what Sean said.


Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

2020-02-21 Thread Michael Armbrust
This plan for evolving the TRIM function to be more standards compliant
sounds much better to me than the original change to just switch the order.
It pushes users in the right direction and cleans up our tech debt without
silently breaking existing workloads. It means that programs won't return
different results when run on Spark 2.x and Spark 3.x.
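
For illustration, a minimal sketch of the standards-based syntax (assuming
an existing SparkSession named `spark`; this example is mine, not taken from
the proposal itself). Spelling out the trim string and the source string
avoids the argument-order ambiguity of the two-argument form:

// Standard SQL syntax: TRIM([BOTH|LEADING|TRAILING] trimStr FROM str)
spark.sql("SELECT TRIM(BOTH 'x' FROM 'xxhixx') AS trimmed").show()
// prints a single row with the value "hi"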

One caveat:

> If we keep this situation in 3.0.0 release (a major release), it means
> Apache Spark will be forever
I think this ship has already sailed. There is nothing special about 3.0
here. If the API is in a released version of Spark, then the mistake is
already made.

Major releases are an opportunity to break APIs when we *have to*. We
always strive to avoid breaking APIs even if they have not been in an X.0
release.


Re: Breaking API changes in Spark 3.0

2020-02-21 Thread Holden Karau
So here is my view of how removal of common & stable APIs should go (in
general - I want to be clear that exceptions can and do make sense):
1) Deprecate the API
2) Release the replacement API
3) Provide migration guidance (ideally in the deprecation annotation, but
possibly in release notes or elsewhere; see the sketch below)
4) Remove the old API
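
As a concrete sketch (hypothetical names, not actual Spark code) of how
steps 1-3 can be carried in the source itself, so users see the migration
path right at the call site:

object ExampleApi {
  // New API released alongside the deprecation (step 2).
  def createSession(name: String): String = s"session:$name"

  // Old API kept around for at least one more release, with the migration
  // guidance carried in the annotation itself (steps 1 and 3); removal
  // (step 4) then happens in a later release.
  @deprecated("Use createSession(name) instead; see the migration guide", "3.0.0")
  def getOrCreateSession(name: String): String = createSession(name)
}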

I think, ideally, steps 1, 2, and 3 should occur in a release prior to 4. If
this is not possible, I think having a quick discussion on the dev list is
reasonable given the potential impact on our users. I think the preview
release is a good opportunity for us to get an idea of whether something is
going to have a really large impact.

I think we've felt this pain before as developers building on top of Scala,
and knowing how painful that has been in our own experience, I'd like us to
minimize this type of pain for our users.
And it's not like having the conversation has no utility at all: the
discussion will be visible to users searching, so they can see the rationale
and, hopefully, migration suggestions.


On Wed, Feb 19, 2020 at 7:02 PM Jungtaek Lim 
wrote:

> I think I read too hastily and focused only on the first sentence of
> Karen's input. Sorry about that.
>
> As I said, I'm not sure I can agree with the point about deprecation and
> breaking changes of APIs, but the thread has another topic which seems to
> be good input - the practice for proposing new APIs. I feel it should be a
> different thread to discuss, though.
>
> Maybe we can make API deprecation a "heavy-weight" operation to mitigate
> the impact a bit, like requiring a discussion thread to reach consensus
> before going through a PR. For now, you have no idea which API is going to
> be deprecated, and why, if you only subscribe to dev@. Even if you
> subscribe to issue@, you would miss it among the flood of issues.
>
> Personally I feel the root cause is that dev@ is very quiet compared to
> the volume of PRs the community gets and the impact of the changes these
> PRs make. I agree we should strike a balance here to avoid restricting
> ourselves too much, but I feel there's no balance now - most things just
> go through PRs without discussion. It would be ideal if we took the time
> to consider this.
>
>
> On Thu, Feb 20, 2020 at 8:50 AM Jungtaek Lim 
> wrote:
>
>> Apache Spark 2.0 was released in July 2016. Assuming the project has been
>> trying its best to follow semantic versioning, that is "more than three
>> years" of waiting for breaking changes. Any necessary breaking changes the
>> community fails to address now will remain technical debt for another 3+
>> years.
>>
>> As for the PRs removing deprecated APIs that were pointed out first, I'm
>> not sure about the concern. I roughly remember that these PRs target APIs
>> that were deprecated a couple of minor versions ago. If so, what's the
>> matter?
>>
>> If the deprecation messages don't clearly point to alternatives, then
>> that's a major problem the community should be concerned about and try to
>> fix, but that's a separate problem. The community doesn't deprecate an API
>> just for fun. Every deprecation has a reason, and not removing the API
>> doesn't make sense unless the community was mistaken about the reason for
>> deprecating it.
>>
>> If the community really would like to build some (soft) rules/policies on
>> deprecation, I can only imagine 2 items -
>>
>> 1. define a "minimum number of releases to live" (either per deprecated
>> API or globally)
>> 2. never skip describing the reason for the deprecation, and try our best
>> to describe an alternative that works the same or similarly - if the
>> alternative doesn't work exactly the same, also describe the difference
>> (optionally, maybe)
>>
>> I cannot think of any other problems with deprecation.
>>
I think those guidelines seem reasonable to me. I've written a bit more
about what I'd expect us to be doing as a project with as many downstream
consumers as we have.

>
>> On Thu, Feb 20, 2020 at 7:36 AM Dongjoon Hyun 
>> wrote:
>>
>>> Sure. I understand the background of the following requests. So, it's a
>>> good time to decide on the criteria in order to start the discussion.
>>>
>>> 1. "to provide a reasonable migration path we'd want the replacement
>>> of the deprecated API to also exist in 2.4"
>>> 2. "We need to discuss the APIs case by case"
>>>
>>> For now, it's unclear what counts as `necessarily painful`, what counts
>>> as "widely used APIs", or how small "the maintenance costs are small" is.
>>>
I think these are all case by case. For example, to me, in the original
situation that kicked off the thread, SQLContext's getOrCreate probably
doesn't need to keep existing, given that we've had SparkSession builder's
getOrCreate for several releases and the old one has been deprecated.
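
A short sketch of that migration (the appName/master values are just
placeholders for illustration):

import org.apache.spark.sql.SparkSession

// Replacement path: build (or reuse) the session via the builder API,
// available since Spark 2.0, instead of the deprecated SQLContext.getOrCreate.
val spark = SparkSession.builder()
  .appName("example")
  .master("local[*]")  // placeholder for local testing
  .getOrCreate()

// A SQLContext is still reachable for legacy code paths:
val sqlContext = spark.sqlContext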

>
>>> I'm wondering whether the goal of Apache Spark 3.0.0 is to be 100%
>>> backward compatible with Apache Spark 2.4.5, like Apache Kafka?
>>> Are we going to revert all the changes? If there were clear criteria, we
>>> wouldn't have needed to do the clean-up over that long period for 3.0.0.
>>>
>>> BTW, to be c

Re: [DISCUSSION] Avoiding duplicate work

2020-02-21 Thread Takeshi Yamamuro
Yea, +1 to Sean's suggestion.
When we see an "I'm working on this" comment on a JIRA ticket,
I think we need to ask "Are you still working on this?" to avoid duplicate
work there.


-- 
---
Takeshi Yamamuro


Does the DataFrame Spark API write/create a single file instead of a directory as the result of a write operation?

2020-02-21 Thread Kshitij
Hi,

There is no DataFrame Spark API that writes/creates a single file instead
of a directory as the result of a write operation.

Both of the options below will create a directory containing a part file
with a random name:

df.coalesce(1).write.csv("/path/to/output")

df.write.csv("/path/to/output")

Instead of creating a directory with the standard marker files (_SUCCESS,
_committed, _started), I want a single file with the file name I specify.


Thanks
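
For context, a rough sketch of the workaround commonly used for this (not
from the thread itself; the paths are placeholders, `spark` and `df` are
assumed to exist, and error handling is omitted): write a single part file
into a temporary directory, then move it into place with the Hadoop
FileSystem API.

import org.apache.hadoop.fs.{FileSystem, Path}

val tmpDir = "/tmp/output_dir"       // placeholder temporary directory
val targetFile = "/tmp/result.csv"   // placeholder final file name

// Write exactly one part file into the temporary directory.
df.coalesce(1).write.option("header", "true").csv(tmpDir)

// Move the single part file to the desired name and drop the directory
// (including _SUCCESS and other marker files).
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val partFile = fs.globStatus(new Path(s"$tmpDir/part-*.csv"))(0).getPath
fs.rename(partFile, new Path(targetFile))
fs.delete(new Path(tmpDir), true)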