Re: Data correctness issue with Repartition + FetchFailure

2022-03-16 Thread Wenchen Fan
It's great if you can help with it! Basically, we need to propagate the
column-level deterministic information and sort the inputs if the partition
key lineage has a nondeterministic part.
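The failure mode, and the earlier sort-based mitigation, can be sketched with plain Python lists. This is a toy model of round-robin partitioning under recomputation, not Spark's actual implementation:

```python
def round_robin(rows, num_partitions):
    """Assign rows to partitions by position, like round-robin repartition."""
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts

# Two "attempts" at the same nondeterministic input: identical rows,
# but in a different order on recomputation.
attempt1 = [0, 1, 2, 3, 4, 5]
attempt2 = [5, 4, 3, 2, 1, 0]

# Without sorting: partition 1 survives from attempt 1, while a
# FetchFailure forces partition 0 to be recomputed from attempt 2.
mixed = round_robin(attempt2, 2)[0] + round_robin(attempt1, 2)[1]
print(sorted(mixed))  # [1, 1, 3, 3, 5, 5] -- rows 0, 2, 4 lost; 1, 3, 5 duplicated

# With the sort-based fix, every attempt partitions identically.
fixed = round_robin(sorted(attempt2), 2)[0] + round_robin(sorted(attempt1), 2)[1]
print(sorted(fixed))  # [0, 1, 2, 3, 4, 5] -- consistent across attempts
```

The point of the fix is that sorting makes the row-to-partition mapping a pure function of the row values, so a recomputed partition agrees with the surviving ones.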

On Wed, Mar 16, 2022 at 5:28 AM Jason Xu  wrote:

> Hi Wenchen, thanks for the insight. Agreed, the previous fix for
> repartition works for deterministic data. With non-deterministic data, I
> didn't find an API to pass DeterministicLevel to the underlying RDD.
> Do you plan to continue the work on integration with SQL operators? If not,
> I'm available to take a stab.
>
> On Mon, Mar 14, 2022 at 7:00 PM Wenchen Fan  wrote:
>
>> We fixed the repartition correctness bug before by sorting the data
>> before doing round-robin partitioning. But the issue is that we need to
>> propagate the isDeterministic property through SQL operators.
>>
>> On Tue, Mar 15, 2022 at 1:50 AM Jason Xu  wrote:
>>
>>> Hi Reynold, do you suggest removing RoundRobinPartitioning from the
>>> repartition(numPartitions: Int) API implementation? If that's the direction
>>> we're considering, then before we have a new implementation, should we suggest
>>> that users avoid using the repartition(numPartitions: Int) API?
>>>
>>> On Sat, Mar 12, 2022 at 1:47 PM Reynold Xin  wrote:
>>>
 This is why RoundRobinPartitioning shouldn't be used ...


 On Sat, Mar 12, 2022 at 12:08 PM, Jason Xu 
 wrote:

> Hi Spark community,
>
> I reported a data correctness issue in
> https://issues.apache.org/jira/browse/SPARK-38388. In short,
> non-deterministic data + Repartition + FetchFailure can result in
> incorrect data. This is an issue we ran into in production pipelines; I
> have an example that reproduces the bug in the ticket.
>
> I am reporting it here to bring more attention to it. Could you help confirm
> that it's a bug worth the effort to investigate and fix further? Thank you in
> advance for your help!
>
> Thanks,
> Jason Xu
>




Re: Apache Spark 3.3 Release

2022-03-16 Thread Wenchen Fan
+1 to define an allowlist of features that we want to backport to
branch-3.3. I also have a few in mind:
- complex type support in the vectorized Parquet reader:
  https://github.com/apache/spark/pull/34659
- refine the DS v2 filter API for JDBC v2:
  https://github.com/apache/spark/pull/35768
- a few new SQL functions that have been in development for a while: to_char,
  split_part, percentile_disc, try_sum, etc.

On Wed, Mar 16, 2022 at 2:41 PM Maxim Gekk
 wrote:

> Hi All,
>
> I have created the branch for Spark 3.3:
> https://github.com/apache/spark/commits/branch-3.3
>
> Please, backport important fixes to it, and if you have some doubts, ping
> me in the PR. Regarding new features, we are still building the allow list
> for branch-3.3.
>
> Best regards,
> Max Gekk
>
>
> On Wed, Mar 16, 2022 at 5:51 AM Dongjoon Hyun 
> wrote:
>
>> Yes, I agree with you for your whitelist approach for backporting. :)
>> Thank you for summarizing.
>>
>> Thanks,
>> Dongjoon.
>>
>>
>> On Tue, Mar 15, 2022 at 4:20 PM Xiao Li  wrote:
>>
>>> I think I finally got your point. What you want to keep unchanged is the
>>> branch cut date of Spark 3.3. Today? or this Friday? This is not a big
>>> deal.
>>>
>>> My major concern is whether we should keep merging the feature work or
>>> the dependency upgrade after the branch cut. To make our release time more
>>> predictable, I am suggesting we should finalize the exception PR list
>>> first, instead of merging them in an ad hoc way. In the past, we spent a
>>> lot of time on the revert of the PRs that were merged after the branch cut.
>>> I hope we can minimize unnecessary arguments in this release. Do you agree,
>>> Dongjoon?
>>>
>>>
>>>
>>> Dongjoon Hyun  于2022年3月15日周二 15:55写道:
>>>
That is not totally fine, Xiao. It sounds like you are asking for a change
of plan without a proper reason.

Although we cut the branch today according to our plan, you can still
collect the list and make a list of exceptions. I'm not blocking what you
want to do.

 Please let the community start to ramp down as we agreed before.

 Dongjoon



 On Tue, Mar 15, 2022 at 3:07 PM Xiao Li  wrote:

> Please do not get me wrong. If we don't cut a branch, we are allowing
> all patches to land in Apache Spark 3.3. That is totally fine. After we cut
> the branch, we should avoid merging feature work. In the next three
> days, let us collect the actively developed PRs that we want to make an
> exception for (i.e., merge to 3.3 after the upcoming branch cut). Does that
> make sense?
>
> Dongjoon Hyun  于2022年3月15日周二 14:54写道:
>
>> Xiao. You are working against what you are saying.
>> If you don't cut a branch, it means you are allowing all patches to
>> land in Apache Spark 3.3. No?
>>
>> > we need to avoid backporting the feature work that are not being
>> well discussed.
>>
>>
>>
>> On Tue, Mar 15, 2022 at 12:12 PM Xiao Li 
>> wrote:
>>
>>> Cutting the branch is simple, but we need to avoid backporting feature
>>> work that is not being well discussed. Not all the members are actively
>>> following the dev list. I think we should wait 3 more days to collect the
>>> PR list before cutting the branch.
>>>
>>> BTW, there is very little 3.4-only feature work that will be affected.
>>>
>>> Xiao
>>>
>>> Dongjoon Hyun  于2022年3月15日周二 11:49写道:
>>>
 Hi, Max, Chao, Xiao, Holden and all.

 I have a different idea.

Given the situation and small patch list, I don't think we need to
postpone the branch cut for those patches. It's easier to cut branch-3.3
and allow backporting.

As of today, we already have an obvious Apache Spark 3.4 patch in the
branch together. This situation only becomes worse and worse because
there is no way to block the other patches from landing unintentionally
if we don't cut a branch.

[SPARK-38335][SQL] Implement parser support for DEFAULT column values

 Let's cut `branch-3.3` Today for Apache Spark 3.3.0 preparation.

 Best,
 Dongjoon.


 On Tue, Mar 15, 2022 at 10:17 AM Chao Sun 
 wrote:

> Cool, thanks for clarifying!
>
> On Tue, Mar 15, 2022 at 10:11 AM Xiao Li 
> wrote:
> >>
> >> For the following list:
> >> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
> >> #34659 [SPARK-34863][SQL] Support complex types for Parquet
> vectorized reader
> >> #35848 [SPARK-38548][SQL] New SQL function: try_sum
> >> Do you mean we should include them, or exclude them from 3.3?
> >
> >
> > If possible, I hope these features can be shipped with Spark 3.3.
> >
> >

Skip single integration test case in Spark on K8s

2022-03-16 Thread Pralabh Kumar
Hi Spark team

I am running the Spark Kubernetes integration test suite in a cloud environment.

build/mvn install \
  -f pom.xml \
  -pl resource-managers/kubernetes/integration-tests -am \
  -Pscala-2.12 -Phadoop-3.1.1 -Phive -Phive-thriftserver -Pyarn \
  -Pkubernetes -Pkubernetes-integration-tests \
  -Djava.version=8 \
  -Dspark.kubernetes.test.sparkTgz= \
  -Dspark.kubernetes.test.imageTag=<> \
  -Dspark.kubernetes.test.imageRepo=< repo> \
  -Dspark.kubernetes.test.deployMode=cloud \
  -Dtest.include.tags=k8s \
  -Dspark.kubernetes.test.javaImageTag= \
  -Dspark.kubernetes.test.namespace= \
  -Dspark.kubernetes.test.serviceAccountName=spark \
  -Dspark.kubernetes.test.kubeConfigContext=<> \
  -Dspark.kubernetes.test.master=<> \
  -Dspark.kubernetes.test.jvmImage=<> \
  -Dspark.kubernetes.test.pythonImage=<> \
  -Dlog4j.logger.org.apache.spark=DEBUG



I am able to run some test cases successfully, while others are failing. For
example, "Run SparkRemoteFileTest using a Remote data file" in KubernetesSuite
is failing.

Is there a way to skip running some of the test cases?
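For what it's worth, the kubernetes integration-tests module supports ScalaTest tag filtering via system properties. Assuming your Spark version wires up the `test.exclude.tags` property and the scalatest-maven-plugin `suites` selector the same way recent branches do (worth verifying against your checkout), a sketch would be:

```shell
# Exclude every test carrying a given ScalaTest tag (the tag names here are
# assumptions; check which tags your Spark branch actually defines):
build/mvn install -pl resource-managers/kubernetes/integration-tests -am \
  -Pkubernetes -Pkubernetes-integration-tests \
  -Dtest.include.tags=k8s \
  -Dtest.exclude.tags=minikube,r

# Or run a single suite by fully qualified name through scalatest-maven-plugin:
build/mvn test -pl resource-managers/kubernetes/integration-tests -am \
  -Pkubernetes -Pkubernetes-integration-tests \
  -Dsuites=org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite
```

Note that skipping one named test inside a suite generally requires that test to carry its own tag, so excluding just "Run SparkRemoteFileTest using a Remote data file" may mean patching the suite locally.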



Please help me with this.


Regards

Pralabh Kumar


Re: Apache Spark 3.3 Release

2022-03-16 Thread Adam Binford
Also throwing my hat in for two of my PRs that should be ready and just need
final reviews/approval:
- Removing shuffles from deallocated executors using the shuffle service:
  https://github.com/apache/spark/pull/35085. This has been requested for
  several years across many issues.
- Configurable memory overhead factor:
  https://github.com/apache/spark/pull/35504

Adam

On Wed, Mar 16, 2022 at 8:53 AM Wenchen Fan  wrote:

> +1 to define an allowlist of features that we want to backport to branch
> 3.3. I also have a few in mind:
> complex type support in vectorized parquet reader:
> https://github.com/apache/spark/pull/34659
> refine the DS v2 filter API for JDBC v2:
> https://github.com/apache/spark/pull/35768
> a few new SQL functions that have been in development for a while:
> to_char, split_part, percentile_disc, try_sum, etc.
>
> On Wed, Mar 16, 2022 at 2:41 PM Maxim Gekk
>  wrote:
>
>> Hi All,
>>
>> I have created the branch for Spark 3.3:
>> https://github.com/apache/spark/commits/branch-3.3
>>
>> Please, backport important fixes to it, and if you have some doubts, ping
>> me in the PR. Regarding new features, we are still building the allow list
>> for branch-3.3.
>>
>> Best regards,
>> Max Gekk
>>
>>
>> On Wed, Mar 16, 2022 at 5:51 AM Dongjoon Hyun 
>> wrote:
>>
>>> Yes, I agree with you for your whitelist approach for backporting. :)
>>> Thank you for summarizing.
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>>
>>> On Tue, Mar 15, 2022 at 4:20 PM Xiao Li  wrote:
>>>
 I think I finally got your point. What you want to keep unchanged is
 the branch cut date of Spark 3.3. Today? or this Friday? This is not a big
 deal.

 My major concern is whether we should keep merging the feature work or
 the dependency upgrade after the branch cut. To make our release time more
 predictable, I am suggesting we should finalize the exception PR list
 first, instead of merging them in an ad hoc way. In the past, we spent a
 lot of time on the revert of the PRs that were merged after the branch cut.
 I hope we can minimize unnecessary arguments in this release. Do you agree,
 Dongjoon?



 Dongjoon Hyun  于2022年3月15日周二 15:55写道:

> That is not totally fine, Xiao. It sounds like you are asking for a change
> of plan without a proper reason.
>
> Although we cut the branch today according to our plan, you can still
> collect the list and make a list of exceptions. I'm not blocking what you
> want to do.
>
> Please let the community start to ramp down as we agreed before.
>
> Dongjoon
>
>
>
> On Tue, Mar 15, 2022 at 3:07 PM Xiao Li  wrote:
>
>> Please do not get me wrong. If we don't cut a branch, we are allowing
>> all patches to land Apache Spark 3.3. That is totally fine. After we cut
>> the branch, we should avoid merging the feature work. In the next three
>> days, let us collect the actively developed PRs that we want to make an
>> exception (i.e., merged to 3.3 after the upcoming branch cut). Does that
>> make sense?
>>
>> Dongjoon Hyun  于2022年3月15日周二 14:54写道:
>>
>>> Xiao. You are working against what you are saying.
>>> If you don't cut a branch, it means you are allowing all patches to
>>> land Apache Spark 3.3. No?
>>>
>>> > we need to avoid backporting the feature work that are not being
>>> well discussed.
>>>
>>>
>>>
>>> On Tue, Mar 15, 2022 at 12:12 PM Xiao Li 
>>> wrote:
>>>
Cutting the branch is simple, but we need to avoid backporting feature work
that is not being well discussed. Not all the members are actively following
the dev list. I think we should wait 3 more days to collect the PR list
before cutting the branch.

BTW, there is very little 3.4-only feature work that will be affected.

 Xiao

 Dongjoon Hyun  于2022年3月15日周二 11:49写道:

> Hi, Max, Chao, Xiao, Holden and all.
>
> I have a different idea.
>
> Given the situation and small patch list, I don't think we need to
> postpone the branch cut for those patches. It's easier to cut branch-3.3
> and allow backporting.
>
> As of today, we already have an obvious Apache Spark 3.4 patch in the
> branch together. This situation only becomes worse and worse because
> there is no way to block the other patches from landing unintentionally
> if we don't cut a branch.
>
> [SPARK-38335][SQL] Implement parser support for DEFAULT column values
>
> Let's cut `branch-3.3` Today for Apache Spark 3.3.0 preparation.
>
> Best,
> Dongjoon.
>
>
> On Tue, Mar 15, 2022 at 10:17 AM Chao Sun 
> wrote:
>
>> Cool, thanks for clarifying!
>>
>>

Re:Apache Spark 3.3 Release

2022-03-16 Thread beliefer
+1 Glad to see we will release 3.3.0.




At 2022-03-04 02:44:37, "Maxim Gekk"  wrote:

Hello All,

I would like to bring to the table the topic of the new Spark release, 3.3.
According to the public schedule at
https://spark.apache.org/versioning-policy.html, we planned to start the code
freeze and release branch cut on March 15th, 2022. Since this date is coming
soon, I would like to draw your attention to the topic and gather any
objections that you might have.

Below is the list of ongoing and active SPIPs:

Spark SQL:
- [SPARK-31357] DataSourceV2: Catalog API for view metadata
- [SPARK-35801] Row-level operations in Data Source V2
- [SPARK-37166] Storage Partitioned Join

Spark Core:
- [SPARK-20624] Add better handling for node shutdown
- [SPARK-25299] Use remote storage for persisting shuffle data

PySpark:
- [SPARK-26413] RDD Arrow Support in Spark Core and PySpark

Kubernetes:
- [SPARK-36057] Support Customized Kubernetes Schedulers

We should probably finish any remaining work for Spark 3.3, switch to QA
mode, cut a branch, and keep everything on track. I would like to volunteer
to help drive this process.



Best regards,
Max Gekk

Re: Apache Spark 3.3 Release

2022-03-16 Thread Jacky Lee
I also have a PR that has been ready to merge for a while; can we merge it
into 3.3.0?
[SPARK-37831][CORE] add task partition id in TaskInfo and Task Metrics
https://github.com/apache/spark/pull/35185

beliefer  于2022年3月16日周三 21:33写道:

> +1 Glad to see we will release 3.3.0.
>
>
> At 2022-03-04 02:44:37, "Maxim Gekk" 
> wrote:
>
> Hello All,
>
> I would like to bring to the table the topic of the new Spark release,
> 3.3. According to the public schedule at
> https://spark.apache.org/versioning-policy.html, we planned to start the
> code freeze and release branch cut on March 15th, 2022. Since this date is
> coming soon, I would like to draw your attention to the topic and gather
> any objections that you might have.
>
> Below is the list of ongoing and active SPIPs:
>
> Spark SQL:
> - [SPARK-31357] DataSourceV2: Catalog API for view metadata
> - [SPARK-35801] Row-level operations in Data Source V2
> - [SPARK-37166] Storage Partitioned Join
>
> Spark Core:
> - [SPARK-20624] Add better handling for node shutdown
> - [SPARK-25299] Use remote storage for persisting shuffle data
>
> PySpark:
> - [SPARK-26413] RDD Arrow Support in Spark Core and PySpark
>
> Kubernetes:
> - [SPARK-36057] Support Customized Kubernetes Schedulers
>
> We should probably finish any remaining work for Spark 3.3, switch to QA
> mode, cut a branch, and keep everything on track. I would like to volunteer
> to help drive this process.
>
> Best regards,
> Max Gekk
>
>
>
>
>


Re: Apache Spark 3.3 Release

2022-03-16 Thread Jacky Lee
I also have a PR that has been ready to merge for a while; can we merge it
into 3.3.0?
[SPARK-37831][CORE] add task partition id in TaskInfo and Task Metrics
https://github.com/apache/spark/pull/35185

Adam Binford  于2022年3月16日周三 21:16写道:

> Also throwing my hat in for two of my PRs that should be ready just need
> final reviews/approval:
> Removing shuffles from deallocated executors using the shuffle service:
> https://github.com/apache/spark/pull/35085. This has been asked for for
> several years across many issues.
> Configurable memory overhead factor:
> https://github.com/apache/spark/pull/35504
>
> Adam
>
> On Wed, Mar 16, 2022 at 8:53 AM Wenchen Fan  wrote:
>
>> +1 to define an allowlist of features that we want to backport to branch
>> 3.3. I also have a few in mind:
>> complex type support in vectorized parquet reader:
>> https://github.com/apache/spark/pull/34659
>> refine the DS v2 filter API for JDBC v2:
>> https://github.com/apache/spark/pull/35768
>> a few new SQL functions that have been in development for a while:
>> to_char, split_part, percentile_disc, try_sum, etc.
>>
>> On Wed, Mar 16, 2022 at 2:41 PM Maxim Gekk
>>  wrote:
>>
>>> Hi All,
>>>
>>> I have created the branch for Spark 3.3:
>>> https://github.com/apache/spark/commits/branch-3.3
>>>
>>> Please, backport important fixes to it, and if you have some doubts,
>>> ping me in the PR. Regarding new features, we are still building the allow
>>> list for branch-3.3.
>>>
>>> Best regards,
>>> Max Gekk
>>>
>>>
>>> On Wed, Mar 16, 2022 at 5:51 AM Dongjoon Hyun 
>>> wrote:
>>>
 Yes, I agree with you for your whitelist approach for backporting. :)
 Thank you for summarizing.

 Thanks,
 Dongjoon.


 On Tue, Mar 15, 2022 at 4:20 PM Xiao Li  wrote:

> I think I finally got your point. What you want to keep unchanged is
> the branch cut date of Spark 3.3. Today? or this Friday? This is not a big
> deal.
>
> My major concern is whether we should keep merging the feature work or
> the dependency upgrade after the branch cut. To make our release time more
> predictable, I am suggesting we should finalize the exception PR list
> first, instead of merging them in an ad hoc way. In the past, we spent a
> lot of time on the revert of the PRs that were merged after the branch 
> cut.
> I hope we can minimize unnecessary arguments in this release. Do you 
> agree,
> Dongjoon?
>
>
>
> Dongjoon Hyun  于2022年3月15日周二 15:55写道:
>
>> That is not totally fine, Xiao. It sounds like you are asking for a
>> change of plan without a proper reason.
>>
>> Although we cut the branch today according to our plan, you can still
>> collect the list and make a list of exceptions. I'm not blocking what you
>> want to do.
>>
>> Please let the community start to ramp down as we agreed before.
>>
>> Dongjoon
>>
>>
>>
>> On Tue, Mar 15, 2022 at 3:07 PM Xiao Li  wrote:
>>
>>> Please do not get me wrong. If we don't cut a branch, we are
>>> allowing all patches to land Apache Spark 3.3. That is totally fine. 
>>> After
>>> we cut the branch, we should avoid merging the feature work. In the next
>>> three days, let us collect the actively developed PRs that we want to 
>>> make
>>> an exception (i.e., merged to 3.3 after the upcoming branch cut). Does 
>>> that
>>> make sense?
>>>
>>> Dongjoon Hyun  于2022年3月15日周二 14:54写道:
>>>
 Xiao. You are working against what you are saying.
 If you don't cut a branch, it means you are allowing all patches to
 land Apache Spark 3.3. No?

 > we need to avoid backporting the feature work that are not being
 well discussed.



 On Tue, Mar 15, 2022 at 12:12 PM Xiao Li 
 wrote:

> Cutting the branch is simple, but we need to avoid backporting feature
> work that is not being well discussed. Not all the members are actively
> following the dev list. I think we should wait 3 more days to collect the
> PR list before cutting the branch.
>
> BTW, there is very little 3.4-only feature work that will be affected.
>
> Xiao
>
> Dongjoon Hyun  于2022年3月15日周二 11:49写道:
>
>> Hi, Max, Chao, Xiao, Holden and all.
>>
>> I have a different idea.
>>
>> Given the situation and small patch list, I don't think we need
>> to postpone the branch cut for those patches. It's easier to cut a
>> branch-3.3 and allow backporting.
>>
>> As of today, we already have an obvious Apache Spark 3.4 patch in
>> the branch together. This situation only becomes worse and worse because
>> there is no way to block the other patches from landing unintentional

Re: Apache Spark 3.3 Release

2022-03-16 Thread Tom Graves
It looks like the version hasn't been updated on master and still shows
3.3.0-SNAPSHOT; can you please update that?
Tom
On Wednesday, March 16, 2022, 01:41:00 AM CDT, Maxim Gekk 
 wrote:  
 
 Hi All,

I have created the branch for Spark 3.3:
https://github.com/apache/spark/commits/branch-3.3

Please, backport important fixes to it, and if you have some doubts, ping me in 
the PR. Regarding new features, we are still building the allow list for 
branch-3.3.
Best regards,
Max Gekk

On Wed, Mar 16, 2022 at 5:51 AM Dongjoon Hyun  wrote:

Yes, I agree with you for your whitelist approach for backporting. :)
Thank you for summarizing.

Thanks,
Dongjoon.

On Tue, Mar 15, 2022 at 4:20 PM Xiao Li  wrote:

I think I finally got your point. What you want to keep unchanged is the branch 
cut date of Spark 3.3. Today? or this Friday? This is not a big deal. 
My major concern is whether we should keep merging the feature work or the 
dependency upgrade after the branch cut. To make our release time more 
predictable, I am suggesting we should finalize the exception PR list first, 
instead of merging them in an ad hoc way. In the past, we spent a lot of time 
on the revert of the PRs that were merged after the branch cut. I hope we can 
minimize unnecessary arguments in this release. Do you agree, Dongjoon?


Dongjoon Hyun  于2022年3月15日周二 15:55写道:

That is not totally fine, Xiao. It sounds like you are asking for a change of
plan without a proper reason.
Although we cut the branch today according to our plan, you can still collect
the list and make a list of exceptions. I'm not blocking what you want to do.
Please let the community start to ramp down as we agreed before.
Dongjoon


On Tue, Mar 15, 2022 at 3:07 PM Xiao Li  wrote:

Please do not get me wrong. If we don't cut a branch, we are allowing all 
patches to land Apache Spark 3.3. That is totally fine. After we cut the 
branch, we should avoid merging the feature work. In the next three days, let 
us collect the actively developed PRs that we want to make an exception (i.e., 
merged to 3.3 after the upcoming branch cut). Does that make sense?
Dongjoon Hyun  于2022年3月15日周二 14:54写道:

Xiao. You are working against what you are saying. If you don't cut a branch,
it means you are allowing all patches to land in Apache Spark 3.3. No?

> we need to avoid backporting the feature work that are not being well 
> discussed.


On Tue, Mar 15, 2022 at 12:12 PM Xiao Li  wrote:

Cutting the branch is simple, but we need to avoid backporting feature work
that is not being well discussed. Not all the members are actively following
the dev list. I think we should wait 3 more days to collect the PR list
before cutting the branch.
BTW, there is very little 3.4-only feature work that will be affected.

Xiao
Dongjoon Hyun  于2022年3月15日周二 11:49写道:

Hi, Max, Chao, Xiao, Holden and all.
I have a different idea.
Given the situation and small patch list, I don't think we need to postpone the 
branch cut for those patches. It's easier to cut a branch-3.3 and allow 
backporting.
As of today, we already have an obvious Apache Spark 3.4 patch in the branch 
together. This situation only becomes worse and worse because there is no way 
to block the other patches from landing unintentionally if we don't cut a 
branch.
    [SPARK-38335][SQL] Implement parser support for DEFAULT column values

Let's cut `branch-3.3` Today for Apache Spark 3.3.0 preparation.
Best,
Dongjoon.

On Tue, Mar 15, 2022 at 10:17 AM Chao Sun  wrote:

Cool, thanks for clarifying!

On Tue, Mar 15, 2022 at 10:11 AM Xiao Li  wrote:
>>
>> For the following list:
>> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
>> #34659 [SPARK-34863][SQL] Support complex types for Parquet vectorized reader
>> #35848 [SPARK-38548][SQL] New SQL function: try_sum
>> Do you mean we should include them, or exclude them from 3.3?
>
>
> If possible, I hope these features can be shipped with Spark 3.3.
>
>
>
> Chao Sun  于2022年3月15日周二 10:06写道:
>>
>> Hi Xiao,
>>
>> For the following list:
>>
>> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
>> #34659 [SPARK-34863][SQL] Support complex types for Parquet vectorized reader
>> #35848 [SPARK-38548][SQL] New SQL function: try_sum
>>
>> Do you mean we should include them, or exclude them from 3.3?
>>
>> Thanks,
>> Chao
>>
>> On Tue, Mar 15, 2022 at 9:56 AM Dongjoon Hyun  
>> wrote:
>> >
>> > The following was tested and merged a few minutes ago. So, we can remove 
>> > it from the list.
>> >
>> > #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1
>> >
>> > Thanks,
>> > Dongjoon.
>> >
>> > On Tue, Mar 15, 2022 at 9:48 AM Xiao Li  wrote:
>> >>
>> >> Let me clarify my above suggestion. Maybe we can wait 3 more days to 
>> >> collect the list of actively developed PRs that we want to merge to 3.3 
>> >> after the branch cut?
>> >>
>> >> Please do not rush to merge the PRs that are not fully reviewed. We can 
>> >> cut the branch this Friday and continue merging the PRs that have been 
>>

Re: Apache Spark 3.3 Release

2022-03-16 Thread Chao Sun
There is one item on our side that we want to backport to 3.3:
- vectorized DELTA_BYTE_ARRAY/DELTA_LENGTH_BYTE_ARRAY encodings for
Parquet V2 support (https://github.com/apache/spark/pull/35262)

It's already reviewed and approved.

On Wed, Mar 16, 2022 at 9:13 AM Tom Graves  wrote:
>
> It looks like the version hasn't been updated on master and still shows 
> 3.3.0-SNAPSHOT, can you please update that.
>
> Tom
>
> On Wednesday, March 16, 2022, 01:41:00 AM CDT, Maxim Gekk 
>  wrote:
>
>
> Hi All,
>
> I have created the branch for Spark 3.3:
> https://github.com/apache/spark/commits/branch-3.3
>
> Please, backport important fixes to it, and if you have some doubts, ping me 
> in the PR. Regarding new features, we are still building the allow list for 
> branch-3.3.
>
> Best regards,
> Max Gekk
>
>
> On Wed, Mar 16, 2022 at 5:51 AM Dongjoon Hyun  wrote:
>
> Yes, I agree with you for your whitelist approach for backporting. :)
> Thank you for summarizing.
>
> Thanks,
> Dongjoon.
>
>
> On Tue, Mar 15, 2022 at 4:20 PM Xiao Li  wrote:
>
> I think I finally got your point. What you want to keep unchanged is the 
> branch cut date of Spark 3.3. Today? or this Friday? This is not a big deal.
>
> My major concern is whether we should keep merging the feature work or the 
> dependency upgrade after the branch cut. To make our release time more 
> predictable, I am suggesting we should finalize the exception PR list first, 
> instead of merging them in an ad hoc way. In the past, we spent a lot of time 
> on the revert of the PRs that were merged after the branch cut. I hope we can 
> minimize unnecessary arguments in this release. Do you agree, Dongjoon?
>
>
>
> Dongjoon Hyun  于2022年3月15日周二 15:55写道:
>
> That is not totally fine, Xiao. It sounds like you are asking for a change of
> plan without a proper reason.
>
> Although we cut the branch today according to our plan, you can still collect
> the list and make a list of exceptions. I'm not blocking what you want to do.
>
> Please let the community start to ramp down as we agreed before.
>
> Dongjoon
>
>
>
> On Tue, Mar 15, 2022 at 3:07 PM Xiao Li  wrote:
>
> Please do not get me wrong. If we don't cut a branch, we are allowing all 
> patches to land Apache Spark 3.3. That is totally fine. After we cut the 
> branch, we should avoid merging the feature work. In the next three days, let 
> us collect the actively developed PRs that we want to make an exception 
> (i.e., merged to 3.3 after the upcoming branch cut). Does that make sense?
>
> Dongjoon Hyun  于2022年3月15日周二 14:54写道:
>
> Xiao. You are working against what you are saying.
> If you don't cut a branch, it means you are allowing all patches to land 
> Apache Spark 3.3. No?
>
> > we need to avoid backporting the feature work that are not being well 
> > discussed.
>
>
>
> On Tue, Mar 15, 2022 at 12:12 PM Xiao Li  wrote:
>
> Cutting the branch is simple, but we need to avoid backporting feature
> work that is not being well discussed. Not all the members are actively
> following the dev list. I think we should wait 3 more days to collect the
> PR list before cutting the branch.
>
> BTW, there is very little 3.4-only feature work that will be affected.
>
> Xiao
>
> Dongjoon Hyun  于2022年3月15日周二 11:49写道:
>
> Hi, Max, Chao, Xiao, Holden and all.
>
> I have a different idea.
>
> Given the situation and small patch list, I don't think we need to postpone 
> the branch cut for those patches. It's easier to cut a branch-3.3 and allow 
> backporting.
>
> As of today, we already have an obvious Apache Spark 3.4 patch in the branch 
> together. This situation only becomes worse and worse because there is no way 
> to block the other patches from landing unintentionally if we don't cut a 
> branch.
>
> [SPARK-38335][SQL] Implement parser support for DEFAULT column values
>
> Let's cut `branch-3.3` Today for Apache Spark 3.3.0 preparation.
>
> Best,
> Dongjoon.
>
>
> On Tue, Mar 15, 2022 at 10:17 AM Chao Sun  wrote:
>
> Cool, thanks for clarifying!
>
> On Tue, Mar 15, 2022 at 10:11 AM Xiao Li  wrote:
> >>
> >> For the following list:
> >> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
> >> #34659 [SPARK-34863][SQL] Support complex types for Parquet vectorized 
> >> reader
> >> #35848 [SPARK-38548][SQL] New SQL function: try_sum
> >> Do you mean we should include them, or exclude them from 3.3?
> >
> >
> > If possible, I hope these features can be shipped with Spark 3.3.
> >
> >
> >
> > Chao Sun  于2022年3月15日周二 10:06写道:
> >>
> >> Hi Xiao,
> >>
> >> For the following list:
> >>
> >> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
> >> #34659 [SPARK-34863][SQL] Support complex types for Parquet vectorized 
> >> reader
> >> #35848 [SPARK-38548][SQL] New SQL function: try_sum
> >>
> >> Do you mean we should include them, or exclude them from 3.3?
> >>
> >> Thanks,
> >> Chao
> >>
> >> On Tue, Mar 15, 2022 at 9:56 AM Dongjoon Hyun  
> >> wrote:
> >> >
> >> > The following was tested and merged a few minutes ago.

Re: Apache Spark 3.3 Release

2022-03-16 Thread Holden Karau
I'd like to add/backport the logging in
https://github.com/apache/spark/pull/35881 PR so that when users submit
issues with dynamic allocation we can better debug what's going on.

On Wed, Mar 16, 2022 at 3:45 PM Chao Sun  wrote:

> There is one item on our side that we want to backport to 3.3:
> - vectorized DELTA_BYTE_ARRAY/DELTA_LENGTH_BYTE_ARRAY encodings for
> Parquet V2 support (https://github.com/apache/spark/pull/35262)
>
> It's already reviewed and approved.
>
> On Wed, Mar 16, 2022 at 9:13 AM Tom Graves 
> wrote:
> >
> > It looks like the version hasn't been updated on master and still shows
> 3.3.0-SNAPSHOT, can you please update that.
> >
> > Tom
> >
> > On Wednesday, March 16, 2022, 01:41:00 AM CDT, Maxim Gekk <
> maxim.g...@databricks.com.invalid> wrote:
> >
> >
> > Hi All,
> >
> > I have created the branch for Spark 3.3:
> > https://github.com/apache/spark/commits/branch-3.3
> >
> > Please, backport important fixes to it, and if you have some doubts,
> ping me in the PR. Regarding new features, we are still building the allow
> list for branch-3.3.
> >
> > Best regards,
> > Max Gekk
> >
> >
> > On Wed, Mar 16, 2022 at 5:51 AM Dongjoon Hyun 
> wrote:
> >
> > Yes, I agree with you for your whitelist approach for backporting. :)
> > Thank you for summarizing.
> >
> > Thanks,
> > Dongjoon.
> >
> >
> > On Tue, Mar 15, 2022 at 4:20 PM Xiao Li  wrote:
> >
> > I think I finally got your point. What you want to keep unchanged is the
> branch cut date of Spark 3.3. Today? or this Friday? This is not a big deal.
> >
> > My major concern is whether we should keep merging the feature work or
> the dependency upgrade after the branch cut. To make our release time more
> predictable, I am suggesting we should finalize the exception PR list
> first, instead of merging them in an ad hoc way. In the past, we spent a
> lot of time on the revert of the PRs that were merged after the branch cut.
> I hope we can minimize unnecessary arguments in this release. Do you agree,
> Dongjoon?
> >
> >
> >
> > Dongjoon Hyun  于2022年3月15日周二 15:55写道:
> >
> > That is not totally fine, Xiao. It sounds like you are asking for a change
> > of plan without a proper reason.
> >
> > Although we cut the branch today according to our plan, you can still
> > collect the list and make a list of exceptions. I'm not blocking what you
> > want to do.
> >
> > Please let the community start to ramp down as we agreed before.
> >
> > Dongjoon
> >
> >
> >
> > On Tue, Mar 15, 2022 at 3:07 PM Xiao Li  wrote:
> >
> > Please do not get me wrong. If we don't cut a branch, we are allowing
> all patches to land in Apache Spark 3.3. That is totally fine. After we cut
> the branch, we should avoid merging the feature work. In the next three
> days, let us collect the actively developed PRs that we want to make an
> exception (i.e., merged to 3.3 after the upcoming branch cut). Does that
> make sense?
> >
> > On Tue, Mar 15, 2022 at 2:54 PM Dongjoon Hyun wrote:
> >
> > Xiao. You are working against what you are saying.
> > If you don't cut a branch, it means you are allowing all patches to land
> in Apache Spark 3.3. No?
> >
> > > we need to avoid backporting feature work that has not been well
> discussed.
> >
> >
> >
> > On Tue, Mar 15, 2022 at 12:12 PM Xiao Li  wrote:
> >
> > Cutting the branch is simple, but we need to avoid backporting feature
> work that has not been well discussed. Not all the members are actively
> following the dev list. I think we should wait 3 more days to collect the
> PR list before cutting the branch.
> >
> > BTW, there is very little 3.4-only feature work that will be affected.
> >
> > Xiao
> >
> > On Tue, Mar 15, 2022 at 11:49 AM Dongjoon Hyun wrote:
> >
> > Hi, Max, Chao, Xiao, Holden and all.
> >
> > I have a different idea.
> >
> > Given the situation and small patch list, I don't think we need to
> postpone the branch cut for those patches. It's easier to cut a branch-3.3
> and allow backporting.
> >
> > As of today, we already have an obvious Apache Spark 3.4 patch in the
> branch. This situation will only get worse because there is no way to
> block other patches from landing unintentionally if we don't cut a
> branch.
> >
> > [SPARK-38335][SQL] Implement parser support for DEFAULT column values
> >
> > Let's cut `branch-3.3` Today for Apache Spark 3.3.0 preparation.
> >
> > Best,
> > Dongjoon.
> >
> >
> > On Tue, Mar 15, 2022 at 10:17 AM Chao Sun  wrote:
> >
> > Cool, thanks for clarifying!
> >
> > On Tue, Mar 15, 2022 at 10:11 AM Xiao Li  wrote:
> > >>
> > >> For the following list:
> > >> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
> > >> #34659 [SPARK-34863][SQL] Support complex types for Parquet
> vectorized reader
> > >> #35848 [SPARK-38548][SQL] New SQL function: try_sum
> > >> Do you mean we should include them, or exclude them from 3.3?
> > >
> > >
> > > If possible, I hope these features can be shipped with Spark 3.3.
> > >
> > >
> > >
> > > On Tue, Mar 15, 2022 at 10:06 AM Chao Sun wrote:
> > >>
> > >> Hi Xiao,
> > >>
> > 

Re: Apache Spark 3.3 Release

2022-03-16 Thread Andrew Melo
Hello,

I've been trying for a bit to get the following two PRs merged and
into a release, and I'm having some difficulty moving them forward:

https://github.com/apache/spark/pull/34903 - This passes the current
python interpreter to spark-env.sh to allow some currently-unavailable
customization to happen
https://github.com/apache/spark/pull/31774 - This fixes a bug in the
SparkUI reverse proxy-handling code where it does a greedy match for
"proxy" in the URL, and will mistakenly replace the App-ID in the
wrong place.

I'm not exactly sure of how to get attention of PRs that have been
sitting around for a while, but these are really important to our
use-cases, and it would be nice to have them merged in.

Cheers
Andrew

On Wed, Mar 16, 2022 at 6:21 PM Holden Karau  wrote:
>
> I'd like to add/backport the logging in 
> https://github.com/apache/spark/pull/35881 PR so that when users submit 
> issues with dynamic allocation we can better debug what's going on.
>
> On Wed, Mar 16, 2022 at 3:45 PM Chao Sun  wrote:
>>
>> There is one item on our side that we want to backport to 3.3:
>> - vectorized DELTA_BYTE_ARRAY/DELTA_LENGTH_BYTE_ARRAY encodings for
>> Parquet V2 support (https://github.com/apache/spark/pull/35262)
>>
>> It's already reviewed and approved.
>>

Re: Skip single integration test case in Spark on K8s

2022-03-16 Thread Dongjoon Hyun
-user@spark

For a cloud backend, you need to exclude the Minikube-specific tests and
the local-only test (SparkRemoteFileTest):

-Dtest.exclude.tags=minikube,local

You can find more options including SBT commands here.


https://github.com/apache/spark/tree/master/resource-managers/kubernetes/integration-tests
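
As a sketch, the exclusion flag slots into an invocation like the one in
your message (the module path, profiles, and properties below are taken
from this thread; the remaining placeholder values are illustrative and
must be filled in for your environment):

```shell
# Sketch only: run the K8s integration tests against a cloud backend
# while skipping Minikube-only and local-only suites such as
# "Run SparkRemoteFileTest using a Remote data file".
build/mvn install \
  -pl resource-managers/kubernetes/integration-tests -am \
  -Pkubernetes -Pkubernetes-integration-tests \
  -Dspark.kubernetes.test.deployMode=cloud \
  -Dtest.include.tags=k8s \
  -Dtest.exclude.tags=minikube,local
```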

Dongjoon.


On Wed, Mar 16, 2022 at 6:11 AM Pralabh Kumar 
wrote:

> Hi Spark team
>
> I am running the Spark Kubernetes integration test suite on a cloud backend.
>
> build/mvn install \
>
> -f  pom.xml \
>
> -pl resource-managers/kubernetes/integration-tests -am -Pscala-2.12
> -Phadoop-3.1.1 -Phive -Phive-thriftserver -Pyarn -Pkubernetes
> -Pkubernetes-integration-tests \
>
> -Djava.version=8 \
>
> -Dspark.kubernetes.test.sparkTgz= \
>
> -Dspark.kubernetes.test.imageTag=<> \
>
> -Dspark.kubernetes.test.imageRepo=< repo> \
>
> -Dspark.kubernetes.test.deployMode=cloud \
>
> -Dtest.include.tags=k8s \
>
> -Dspark.kubernetes.test.javaImageTag= \
>
> -Dspark.kubernetes.test.namespace= \
>
> -Dspark.kubernetes.test.serviceAccountName=spark \
>
> -Dspark.kubernetes.test.kubeConfigContext=<> \
>
> -Dspark.kubernetes.test.master=<> \
>
> -Dspark.kubernetes.test.jvmImage=<> \
>
> -Dspark.kubernetes.test.pythonImage=<> \
>
> -Dlog4j.logger.org.apache.spark=DEBUG
>
>
>
> I am able to run some test cases successfully, while some are failing. For
> example, "Run SparkRemoteFileTest using a Remote data file" in
> KubernetesSuite is failing.
>
>
> Is there a way to skip running some of the test cases?
>
>
>
> Please help me on the same.
>
>
> Regards
>
> Pralabh Kumar
>