Re: What is correct behavior for spark.task.maxFailures?

2017-04-24 Thread Ryan Blue
Chawla,

We hit this issue, too. I worked around it by setting
spark.scheduler.executorTaskBlacklistTime=5000. The problem for us was that
the scheduler was using locality to select the executor, even though it had
already failed there. The executor task blacklist time controls how long
the scheduler will avoid using an executor for a failed task, which will
cause it to avoid rescheduling on the executor. The default was 0, so the
executor was put back into consideration immediately.

In 2.1.0 that setting has changed to spark.blacklist.timeout. I’m not sure
if that does exactly the same thing. The default for that setting is 1h
instead of 0. It’s better to have a non-zero default to avoid what you’re
seeing.
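
For reference, a minimal sketch of wiring that workaround into a pre-2.1 job (the app name is illustrative and 5000 ms is just the value mentioned above; master/deploy settings are assumed to come from spark-submit):

    import org.apache.spark.{SparkConf, SparkContext}

    // Keep an executor off a failed task's candidate list for 5 seconds so
    // retries have a chance to land on a different executor.
    val conf = new SparkConf()
      .setAppName("blacklist-workaround")
      .set("spark.task.maxFailures", "8")
      .set("spark.scheduler.executorTaskBlacklistTime", "5000")
    val sc = new SparkContext(conf)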

rb
​

On Fri, Apr 21, 2017 at 1:32 PM, Chawla,Sumit 
wrote:

> I am seeing a strange issue. I had a badly behaving slave that failed the
> entire job.  I have set spark.task.maxFailures to 8 for my job.  It seems
> like all task retries happen on the same slave in case of failure.  My
> expectation was that a task would be retried on a different slave in case of
> failure, and the chance of all 8 retries happening on the same slave is very low.
>
>
> Regards
> Sumit Chawla
>
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: What is correct behavior for spark.task.maxFailures?

2017-04-24 Thread Ryan Blue
Looking at the code a bit more, it appears that blacklisting is disabled by
default. To enable it, set spark.blacklist.enabled=true.

The updates in 2.1.0 appear to provide much more fine-grained settings for
this, like the number of tasks that can fail before an executor is
blacklisted for a stage. In that version, you probably want to set
spark.blacklist.task.maxTaskAttemptsPerExecutor. See the settings docs
 and search for
“blacklist” to see all the options.
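
A rough sketch of what that looks like when building a 2.1.0 session (the app name and the per-executor attempt limit of 1 are illustrative, not recommendations):

    import org.apache.spark.sql.SparkSession

    // Blacklisting is off by default in 2.1.0; enable it explicitly.
    val spark = SparkSession.builder()
      .appName("blacklist-example")
      .config("spark.blacklist.enabled", "true")
      // Max attempts of the same task on one executor before that executor is
      // blacklisted for the task.
      .config("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1")
      // How long an executor stays blacklisted.
      .config("spark.blacklist.timeout", "1h")
      .getOrCreate()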

rb
​

On Mon, Apr 24, 2017 at 9:41 AM, Ryan Blue  wrote:

> Chawla,
>
> We hit this issue, too. I worked around it by setting spark.scheduler.
> executorTaskBlacklistTime=5000. The problem for us was that the scheduler
> was using locality to select the executor, even though it had already
> failed there. The executor task blacklist time controls how long the
> scheduler will avoid using an executor for a failed task, which will cause
> it to avoid rescheduling on the executor. The default was 0, so the
> executor was put back into consideration immediately.
>
> In 2.1.0 that setting has changed to spark.blacklist.timeout. I’m not
> sure if that does exactly the same thing. The default for that setting is
> 1h instead of 0. It’s better to have a non-zero default to avoid what
> you’re seeing.
>
> rb
> ​
>
> On Fri, Apr 21, 2017 at 1:32 PM, Chawla,Sumit 
> wrote:
>
>> I am seeing a strange issue. I had a badly behaving slave that failed the
>> entire job.  I have set spark.task.maxFailures to 8 for my job.  It seems
>> like all task retries happen on the same slave in case of failure.  My
>> expectation was that a task would be retried on a different slave in case of
>> failure, and the chance of all 8 retries happening on the same slave is very low.
>>
>>
>> Regards
>> Sumit Chawla
>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>



-- 
Ryan Blue
Software Engineer
Netflix


Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-24 Thread Michael Allman
The trouble we ran into is that this upgrade was blocking access to our tables, 
and we didn't know why. This sounds like a kind of migration operation, but it 
was not apparent that this was the case. It took an expert examining a stack 
trace and source code to figure this out. Would a more naive end user be able 
to debug this issue? Maybe we're an unusual case, but our particular experience 
was pretty bad. I have my doubts that the schema inference on our largest 
tables would ever complete without throwing some kind of timeout (which we were 
in fact receiving) or the end user just giving up and killing our job. We ended 
up doing a rollback while we investigated the source of the issue. In our case, 
INFER_NEVER is clearly the best configuration. We're going to add that to our 
default configuration files.
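
For anyone else hitting this, a minimal sketch of opting out (the same setting can live in spark-defaults.conf as "spark.sql.hive.caseSensitiveInferenceMode NEVER_INFER"; the app name is illustrative):

    import org.apache.spark.sql.SparkSession

    // Skip the new Hive schema-inference pass entirely.
    val spark = SparkSession.builder()
      .appName("hive-tables")
      .enableHiveSupport()
      .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
      .getOrCreate()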

My expectation is that a minor point release is a pretty safe bug fix release. 
We were a bit hasty in not doing better due diligence pre-upgrade.

One suggestion the Spark team might consider is releasing 2.1.1 with 
INFER_NEVER and 2.2.0 with INFER_AND_SAVE. Clearly some kind of up-front 
migration notes would help in identifying this new behavior in 2.2.

Thanks,

Michael


> On Apr 24, 2017, at 2:09 AM, Wenchen Fan  wrote:
> 
> see https://issues.apache.org/jira/browse/SPARK-19611 
> 
> 
> On Mon, Apr 24, 2017 at 2:22 PM, Holden Karau  > wrote:
> What's the regression this fixed in 2.1 from 2.0?
> 
> On Fri, Apr 21, 2017 at 7:45 PM, Wenchen Fan  > wrote:
> IIRC, the new "spark.sql.hive.caseSensitiveInferenceMode" stuff will only 
> scan all table files once, and write back the inferred schema to 
> metastore so that we don't need to do the schema inference again.
> 
> So technically this will introduce a performance regression for the first 
> query, but compared to branch-2.0, it's not performance regression. And this 
> patch fixed a regression in branch-2.1, which can run in branch-2.0. 
> Personally, I think we should keep INFER_AND_SAVE as the default mode.
> 
> + [Eric], what do you think?
> 
> On Sat, Apr 22, 2017 at 1:37 AM, Michael Armbrust  > wrote:
> Thanks for pointing this out, Michael.  Based on the conversation on the PR 
>  this 
> seems like a risky change to include in a release branch with a default other 
> than NEVER_INFER.
> 
> +Wenchen?  What do you think?
> 
> On Thu, Apr 20, 2017 at 4:14 PM, Michael Allman  > wrote:
> We've identified the cause of the change in behavior. It is related to the 
> SQL conf key "spark.sql.hive.caseSensitiveInferenceMode". This key and its 
> related functionality was absent from our previous build. The default setting 
> in the current build was causing Spark to attempt to scan all table files 
> during query analysis. Changing this setting to NEVER_INFER disabled this 
> operation and resolved the issue we had.
> 
> Michael
> 
> 
>> On Apr 20, 2017, at 3:42 PM, Michael Allman > > wrote:
>> 
>> I want to caution that in testing a build from this morning's branch-2.1 we 
>> found that Hive partition pruning was not working. We found that Spark SQL 
>> was fetching all Hive table partitions for a very simple query whereas in a 
>> build from several weeks ago it was fetching only the required partitions. I 
>> cannot currently think of a reason for the regression outside of some 
>> difference between branch-2.1 from our previous build and branch-2.1 from 
>> this morning.
>> 
>> That's all I know right now. We are actively investigating to find the root 
>> cause of this problem, and specifically whether this is a problem in the 
>> Spark codebase or not. I will report back when I have an answer to that 
>> question.
>> 
>> Michael
>> 
>> 
>>> On Apr 18, 2017, at 11:59 AM, Michael Armbrust >> > wrote:
>>> 
>>> Please vote on releasing the following candidate as Apache Spark version 
>>> 2.1.1. The vote is open until Fri, April 21st, 2017 at 13:00 PST and passes 
>>> if a majority of at least 3 +1 PMC votes are cast.
>>> 
>>> [ ] +1 Release this package as Apache Spark 2.1.1
>>> [ ] -1 Do not release this package because ...
>>> 
>>> 
>>> To learn more about Apache Spark, please see http://spark.apache.org/ 
>>> 
>>> 
>>> The tag to be voted on is v2.1.1-rc3 
>>>  
>>> (2ed19cff2f6ab79a718526e5d16633412d8c4dd4)
>>> 
>>> List of JIRA tickets resolved can be found with this filter 
>>> .
>>> 
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-bin/ 
>>> 

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-24 Thread Holden Karau
It

On Mon, Apr 24, 2017 at 10:33 AM, Michael Allman 
wrote:

> The trouble we ran into is that this upgrade was blocking access to our
> tables, and we didn't know why. This sounds like a kind of migration
> operation, but it was not apparent that this was the case. It took an
> expert examining a stack trace and source code to figure this out. Would a
> more naive end user be able to debug this issue? Maybe we're an unusual
> case, but our particular experience was pretty bad. I have my doubts that
> the schema inference on our largest tables would ever complete without
> throwing some kind of timeout (which we were in fact receiving) or the end
> user just giving up and killing our job. We ended up doing a rollback while
> we investigated the source of the issue. In our case, INFER_NEVER is
> clearly the best configuration. We're going to add that to our default
> configuration files.
>
> My expectation is that a minor point release is a pretty safe bug fix
> release. We were a bit hasty in not doing better due diligence pre-upgrade.
>
> One suggestion the Spark team might consider is releasing 2.1.1 with
> INFER_NEVER and 2.2.0 with INFER_AND_SAVE. Clearly some kind of up-front
> migration notes would help in identifying this new behavior in 2.2.
>
> Thanks,
>
> Michael
>
>
> On Apr 24, 2017, at 2:09 AM, Wenchen Fan  wrote:
>
> see https://issues.apache.org/jira/browse/SPARK-19611
>
> On Mon, Apr 24, 2017 at 2:22 PM, Holden Karau 
> wrote:
>
>> What's the regression this fixed in 2.1 from 2.0?
>>
>> On Fri, Apr 21, 2017 at 7:45 PM, Wenchen Fan 
>> wrote:
>>
>>> IIRC, the new "spark.sql.hive.caseSensitiveInferenceMode" stuff will
>>> only scan all table files once, and write back the inferred schema to
>>> metastore so that we don't need to do the schema inference again.
>>>
>>> So technically this will introduce a performance regression for the
>>> first query, but compared to branch-2.0, it's not performance regression.
>>> And this patch fixed a regression in branch-2.1, which can run in
>>> branch-2.0. Personally, I think we should keep INFER_AND_SAVE as the
>>> default mode.
>>>
>>> + [Eric], what do you think?
>>>
>>> On Sat, Apr 22, 2017 at 1:37 AM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
 Thanks for pointing this out, Michael.  Based on the conversation on
 the PR
 
 this seems like a risky change to include in a release branch with a
 default other than NEVER_INFER.

 +Wenchen?  What do you think?

 On Thu, Apr 20, 2017 at 4:14 PM, Michael Allman 
 wrote:

> We've identified the cause of the change in behavior. It is related to
> the SQL conf key "spark.sql.hive.caseSensitiveInferenceMode". This
> key and its related functionality was absent from our previous build. The
> default setting in the current build was causing Spark to attempt to scan
> all table files during query analysis. Changing this setting to 
> NEVER_INFER
> disabled this operation and resolved the issue we had.
>
> Michael
>
>
> On Apr 20, 2017, at 3:42 PM, Michael Allman 
> wrote:
>
> I want to caution that in testing a build from this morning's
> branch-2.1 we found that Hive partition pruning was not working. We found
> that Spark SQL was fetching all Hive table partitions for a very simple
> query whereas in a build from several weeks ago it was fetching only the
> required partitions. I cannot currently think of a reason for the
> regression outside of some difference between branch-2.1 from our previous
> build and branch-2.1 from this morning.
>
> That's all I know right now. We are actively investigating to find the
> root cause of this problem, and specifically whether this is a problem in
> the Spark codebase or not. I will report back when I have an answer to 
> that
> question.
>
> Michael
>
>
> On Apr 18, 2017, at 11:59 AM, Michael Armbrust 
> wrote:
>
> Please vote on releasing the following candidate as Apache Spark
> version 2.1.1. The vote is open until Fri, April 21st, 2017 at 13:00
> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.1
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.1.1-rc3
>  (2ed19cff2f6ab79
> a718526e5d16633412d8c4dd4)
>
> List of JIRA tickets resolved can be found with this filter
> 
> .
>
> The release files, including signatures, digests, etc. can be found at:
> http://home.apache.org/~pwendell/spark-releases/spa

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-24 Thread Holden Karau
Whoops, sorry finger slipped on that last message.
It sounds like whatever we do is going to break some existing users (either
those who depend on case-sensitive table schemas or those hit by the unexpected scan).

Personally I agree with Michael Allman on this, I believe we should
use INFER_NEVER for 2.1.1.

On Mon, Apr 24, 2017 at 11:01 AM, Holden Karau  wrote:

> It
>
> On Mon, Apr 24, 2017 at 10:33 AM, Michael Allman 
> wrote:
>
>> The trouble we ran into is that this upgrade was blocking access to our
>> tables, and we didn't know why. This sounds like a kind of migration
>> operation, but it was not apparent that this was the case. It took an
>> expert examining a stack trace and source code to figure this out. Would a
>> more naive end user be able to debug this issue? Maybe we're an unusual
>> case, but our particular experience was pretty bad. I have my doubts that
>> the schema inference on our largest tables would ever complete without
>> throwing some kind of timeout (which we were in fact receiving) or the end
>> user just giving up and killing our job. We ended up doing a rollback while
>> we investigated the source of the issue. In our case, INFER_NEVER is
>> clearly the best configuration. We're going to add that to our default
>> configuration files.
>>
>> My expectation is that a minor point release is a pretty safe bug fix
>> release. We were a bit hasty in not doing better due diligence pre-upgrade.
>>
>> One suggestion the Spark team might consider is releasing 2.1.1 with
>> INFER_NEVER and 2.2.0 with INFER_AND_SAVE. Clearly some kind of up-front
>> migration notes would help in identifying this new behavior in 2.2.
>>
>> Thanks,
>>
>> Michael
>>
>>
>> On Apr 24, 2017, at 2:09 AM, Wenchen Fan  wrote:
>>
>> see https://issues.apache.org/jira/browse/SPARK-19611
>>
>> On Mon, Apr 24, 2017 at 2:22 PM, Holden Karau 
>> wrote:
>>
>>> What's the regression this fixed in 2.1 from 2.0?
>>>
>>> On Fri, Apr 21, 2017 at 7:45 PM, Wenchen Fan 
>>> wrote:
>>>
 IIRC, the new "spark.sql.hive.caseSensitiveInferenceMode" stuff will
 only scan all table files once, and write back the inferred schema to
 metastore so that we don't need to do the schema inference again.

 So technically this will introduce a performance regression for the
 first query, but compared to branch-2.0, it's not performance regression.
 And this patch fixed a regression in branch-2.1, which can run in
 branch-2.0. Personally, I think we should keep INFER_AND_SAVE as the
 default mode.

 + [Eric], what do you think?

 On Sat, Apr 22, 2017 at 1:37 AM, Michael Armbrust <
 mich...@databricks.com> wrote:

> Thanks for pointing this out, Michael.  Based on the conversation on
> the PR
> 
> this seems like a risky change to include in a release branch with a
> default other than NEVER_INFER.
>
> +Wenchen?  What do you think?
>
> On Thu, Apr 20, 2017 at 4:14 PM, Michael Allman 
> wrote:
>
>> We've identified the cause of the change in behavior. It is related
>> to the SQL conf key "spark.sql.hive.caseSensitiveInferenceMode".
>> This key and its related functionality was absent from our previous 
>> build.
>> The default setting in the current build was causing Spark to attempt to
>> scan all table files during query analysis. Changing this setting
>> to NEVER_INFER disabled this operation and resolved the issue we had.
>>
>> Michael
>>
>>
>> On Apr 20, 2017, at 3:42 PM, Michael Allman 
>> wrote:
>>
>> I want to caution that in testing a build from this morning's
>> branch-2.1 we found that Hive partition pruning was not working. We found
>> that Spark SQL was fetching all Hive table partitions for a very simple
>> query whereas in a build from several weeks ago it was fetching only the
>> required partitions. I cannot currently think of a reason for the
>> regression outside of some difference between branch-2.1 from our 
>> previous
>> build and branch-2.1 from this morning.
>>
>> That's all I know right now. We are actively investigating to find
>> the root cause of this problem, and specifically whether this is a 
>> problem
>> in the Spark codebase or not. I will report back when I have an answer to
>> that question.
>>
>> Michael
>>
>>
>> On Apr 18, 2017, at 11:59 AM, Michael Armbrust <
>> mich...@databricks.com> wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark
>> version 2.1.1. The vote is open until Fri, April 21st, 2017 at 13:00
>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.1.1
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-24 Thread Eric Liang
-1 (non-binding)

I also agree with using NEVER_INFER for 2.1.1. The migration cost is
unexpected for a point release.

On Mon, Apr 24, 2017 at 11:08 AM Holden Karau  wrote:

> Whoops, sorry finger slipped on that last message.
> It sounds like whatever we do is going to break some existing users
> (either with the tables by case sensitivity or with the unexpected scan).
>
> Personally I agree with Michael Allman on this, I believe we should
> use INFER_NEVER for 2.1.1.
>
> On Mon, Apr 24, 2017 at 11:01 AM, Holden Karau 
> wrote:
>
>> It
>>
>> On Mon, Apr 24, 2017 at 10:33 AM, Michael Allman 
>> wrote:
>>
>>> The trouble we ran into is that this upgrade was blocking access to our
>>> tables, and we didn't know why. This sounds like a kind of migration
>>> operation, but it was not apparent that this was the case. It took an
>>> expert examining a stack trace and source code to figure this out. Would a
>>> more naive end user be able to debug this issue? Maybe we're an unusual
>>> case, but our particular experience was pretty bad. I have my doubts that
>>> the schema inference on our largest tables would ever complete without
>>> throwing some kind of timeout (which we were in fact receiving) or the end
>>> user just giving up and killing our job. We ended up doing a rollback while
>>> we investigated the source of the issue. In our case, INFER_NEVER is
>>> clearly the best configuration. We're going to add that to our default
>>> configuration files.
>>>
>>> My expectation is that a minor point release is a pretty safe bug fix
>>> release. We were a bit hasty in not doing better due diligence pre-upgrade.
>>>
>>> One suggestion the Spark team might consider is releasing 2.1.1 with
>>> INFER_NEVER and 2.2.0 with INFER_AND_SAVE. Clearly some kind of
>>> up-front migration notes would help in identifying this new behavior in 2.2.
>>>
>>> Thanks,
>>>
>>> Michael
>>>
>>>
>>> On Apr 24, 2017, at 2:09 AM, Wenchen Fan  wrote:
>>>
>>> see https://issues.apache.org/jira/browse/SPARK-19611
>>>
>>> On Mon, Apr 24, 2017 at 2:22 PM, Holden Karau 
>>> wrote:
>>>
 What's the regression this fixed in 2.1 from 2.0?

 On Fri, Apr 21, 2017 at 7:45 PM, Wenchen Fan 
 wrote:

> IIRC, the new "spark.sql.hive.caseSensitiveInferenceMode" stuff will
> only scan all table files once, and write back the inferred schema to
> metastore so that we don't need to do the schema inference again.
>
> So technically this will introduce a performance regression for the
> first query, but compared to branch-2.0, it's not performance regression.
> And this patch fixed a regression in branch-2.1, which can run in
> branch-2.0. Personally, I think we should keep INFER_AND_SAVE as the
> default mode.
>
> + [Eric], what do you think?
>
> On Sat, Apr 22, 2017 at 1:37 AM, Michael Armbrust <
> mich...@databricks.com> wrote:
>
>> Thanks for pointing this out, Michael.  Based on the conversation on
>> the PR
>> 
>> this seems like a risky change to include in a release branch with a
>> default other than NEVER_INFER.
>>
>> +Wenchen?  What do you think?
>>
>> On Thu, Apr 20, 2017 at 4:14 PM, Michael Allman > > wrote:
>>
>>> We've identified the cause of the change in behavior. It is related
>>> to the SQL conf key "spark.sql.hive.caseSensitiveInferenceMode". This 
>>> key
>>> and its related functionality was absent from our previous build. The
>>> default setting in the current build was causing Spark to attempt to 
>>> scan
>>> all table files during query analysis. Changing this setting to 
>>> NEVER_INFER
>>> disabled this operation and resolved the issue we had.
>>>
>>> Michael
>>>
>>>
>>> On Apr 20, 2017, at 3:42 PM, Michael Allman 
>>> wrote:
>>>
>>> I want to caution that in testing a build from this morning's
>>> branch-2.1 we found that Hive partition pruning was not working. We 
>>> found
>>> that Spark SQL was fetching all Hive table partitions for a very simple
>>> query whereas in a build from several weeks ago it was fetching only the
>>> required partitions. I cannot currently think of a reason for the
>>> regression outside of some difference between branch-2.1 from our 
>>> previous
>>> build and branch-2.1 from this morning.
>>>
>>> That's all I know right now. We are actively investigating to find
>>> the root cause of this problem, and specifically whether this is a 
>>> problem
>>> in the Spark codebase or not. I will report back when I have an answer 
>>> to
>>> that question.
>>>
>>> Michael
>>>
>>>
>>> On Apr 18, 2017, at 11:59 AM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.1.1. The

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-24 Thread Michael Armbrust
Yeah, I agree.

-1 (binding)

This vote fails, and I'll cut a new RC after #17749
 is merged.

On Mon, Apr 24, 2017 at 12:18 PM, Eric Liang  wrote:

> -1 (non-binding)
>
> I also agree with using NEVER_INFER for 2.1.1. The migration cost is
> unexpected for a point release.
>
> On Mon, Apr 24, 2017 at 11:08 AM Holden Karau 
> wrote:
>
>> Whoops, sorry finger slipped on that last message.
>> It sounds like whatever we do is going to break some existing users
>> (either with the tables by case sensitivity or with the unexpected scan).
>>
>> Personally I agree with Michael Allman on this, I believe we should
>> use INFER_NEVER for 2.1.1.
>>
>> On Mon, Apr 24, 2017 at 11:01 AM, Holden Karau 
>> wrote:
>>
>>> It
>>>
>>> On Mon, Apr 24, 2017 at 10:33 AM, Michael Allman 
>>> wrote:
>>>
 The trouble we ran into is that this upgrade was blocking access to our
 tables, and we didn't know why. This sounds like a kind of migration
 operation, but it was not apparent that this was the case. It took an
 expert examining a stack trace and source code to figure this out. Would a
 more naive end user be able to debug this issue? Maybe we're an unusual
 case, but our particular experience was pretty bad. I have my doubts that
 the schema inference on our largest tables would ever complete without
 throwing some kind of timeout (which we were in fact receiving) or the end
 user just giving up and killing our job. We ended up doing a rollback while
 we investigated the source of the issue. In our case, INFER_NEVER is
 clearly the best configuration. We're going to add that to our default
 configuration files.

 My expectation is that a minor point release is a pretty safe bug fix
 release. We were a bit hasty in not doing better due diligence pre-upgrade.

 One suggestion the Spark team might consider is releasing 2.1.1 with
 INFER_NEVER and 2.2.0 with INFER_AND_SAVE. Clearly some kind of
 up-front migration notes would help in identifying this new behavior in 
 2.2.

 Thanks,

 Michael


 On Apr 24, 2017, at 2:09 AM, Wenchen Fan 
 wrote:

 see https://issues.apache.org/jira/browse/SPARK-19611

 On Mon, Apr 24, 2017 at 2:22 PM, Holden Karau 
 wrote:

> What's the regression this fixed in 2.1 from 2.0?
>
> On Fri, Apr 21, 2017 at 7:45 PM, Wenchen Fan 
> wrote:
>
>> IIRC, the new "spark.sql.hive.caseSensitiveInferenceMode" stuff will
>> only scan all table files once, and write back the inferred schema 
>> to
>> metastore so that we don't need to do the schema inference again.
>>
>> So technically this will introduce a performance regression for the
>> first query, but compared to branch-2.0, it's not performance regression.
>> And this patch fixed a regression in branch-2.1, which can run in
>> branch-2.0. Personally, I think we should keep INFER_AND_SAVE as the
>> default mode.
>>
>> + [Eric], what do you think?
>>
>> On Sat, Apr 22, 2017 at 1:37 AM, Michael Armbrust <
>> mich...@databricks.com> wrote:
>>
>>> Thanks for pointing this out, Michael.  Based on the conversation
>>> on the PR
>>> 
>>> this seems like a risky change to include in a release branch with a
>>> default other than NEVER_INFER.
>>>
>>> +Wenchen?  What do you think?
>>>
>>> On Thu, Apr 20, 2017 at 4:14 PM, Michael Allman <
>>> mich...@videoamp.com> wrote:
>>>
 We've identified the cause of the change in behavior. It is related
 to the SQL conf key "spark.sql.hive.caseSensitiveInferenceMode".
 This key and its related functionality was absent from our previous 
 build.
 The default setting in the current build was causing Spark to attempt 
 to
 scan all table files during query analysis. Changing this setting
 to NEVER_INFER disabled this operation and resolved the issue we had.

 Michael


 On Apr 20, 2017, at 3:42 PM, Michael Allman 
 wrote:

 I want to caution that in testing a build from this morning's
 branch-2.1 we found that Hive partition pruning was not working. We 
 found
 that Spark SQL was fetching all Hive table partitions for a very simple
 query whereas in a build from several weeks ago it was fetching only 
 the
 required partitions. I cannot currently think of a reason for the
 regression outside of some difference between branch-2.1 from our 
 previous
 build and branch-2.1 from this morning.

 That's all I know right now. We are actively investigating to find
 the root cause of this problem, and specifically whether this is a 
 problem
>>

Spark Conf Problem with CacheLoader

2017-04-24 Thread John Compitello
Hey all, 

I’ve been working on contributing to Spark a little bit over the past few weeks, but 
I’ve suddenly encountered a problem I’m having some trouble with. Specifically, no 
matter how I’ve built Spark on my laptop, I am unable to create a SparkConf. I’ve 
done it in the past, but somehow it’s now broken. When I try to create a 
SparkContext, I get a “NoClassDefFoundError” on “CacheLoader” from 
com/google/common/cache/CacheLoader. Has anyone ever run into this before?

Thanks,

John
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: branch-2.2 has been cut

2017-04-24 Thread Josh Rosen
I've created the Jenkins jobs for branch-2.2, including the nightly
snapshot, packaging, and docs jobs.

You can view the latest nightly package at
https://home.apache.org/~pwendell/spark-nightly/spark-branch-2.2-bin/latest/
and
nightly docs at
https://home.apache.org/~pwendell/spark-nightly/spark-branch-2.2-docs/latest/

We need to bump the version in the master branch's POM in order for nightly
Maven snapshots to work as expected (
https://issues.apache.org/jira/browse/SPARK-20453). I'll submit a PR for
that now.

On Wed, Apr 19, 2017 at 5:04 AM Sean Owen  wrote:

> (PS we'll need new Jenkins jobs to test 2.2 now -- I don't think I have
> access to create them)
>
> On Tue, Apr 18, 2017 at 5:50 PM Michael Armbrust 
> wrote:
>
>> I just cut the release branch for Spark 2.2.  If you are merging
>> important bug fixes, please backport as appropriate.  If you have doubts if
>> something should be backported, please ping me.  I'll follow with an RC
>> later this week.
>>
>


Re: What is correct behavior for spark.task.maxFailures?

2017-04-24 Thread Chawla,Sumit
Thanks a lot @Dongjin, @Ryan

I am using Spark 1.6.  I agree with your assessment, Ryan.  Further
investigation suggested that our cluster was probably at 100% capacity at
that point in time.  Though tasks were failing on that slave, it was still
accepting tasks, and the task retries were exhausted much faster than the
other slaves could free up to accept them.

Regards
Sumit Chawla


On Mon, Apr 24, 2017 at 9:48 AM, Ryan Blue  wrote:

> Looking at the code a bit more, it appears that blacklisting is disabled
> by default. To enable it, set spark.blacklist.enabled=true.
>
> The updates in 2.1.0 appear to provide much more fine-grained settings for
> this, like the number of tasks that can fail before an executor is
> blacklisted for a stage. In that version, you probably want to set
> spark.blacklist.task.maxTaskAttemptsPerExecutor. See the settings docs
>  and search for
> “blacklist” to see all the options.
>
> rb
> ​
>
> On Mon, Apr 24, 2017 at 9:41 AM, Ryan Blue  wrote:
>
>> Chawla,
>>
>> We hit this issue, too. I worked around it by setting
>> spark.scheduler.executorTaskBlacklistTime=5000. The problem for us was
>> that the scheduler was using locality to select the executor, even though
>> it had already failed there. The executor task blacklist time controls how
>> long the scheduler will avoid using an executor for a failed task, which
>> will cause it to avoid rescheduling on the executor. The default was 0, so
>> the executor was put back into consideration immediately.
>>
>> In 2.1.0 that setting has changed to spark.blacklist.timeout. I’m not
>> sure if that does exactly the same thing. The default for that setting is
>> 1h instead of 0. It’s better to have a non-zero default to avoid what
>> you’re seeing.
>>
>> rb
>> ​
>>
>> On Fri, Apr 21, 2017 at 1:32 PM, Chawla,Sumit 
>> wrote:
>>
>>> I am seeing a strange issue. I had a badly behaving slave that failed the
>>> entire job.  I have set spark.task.maxFailures to 8 for my job.  It seems
>>> like all task retries happen on the same slave in case of failure.  My
>>> expectation was that a task would be retried on a different slave in case of
>>> failure, and the chance of all 8 retries happening on the same slave is very low.
>>>
>>>
>>> Regards
>>> Sumit Chawla
>>>
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-24 Thread Wenchen Fan
see https://issues.apache.org/jira/browse/SPARK-19611

On Mon, Apr 24, 2017 at 2:22 PM, Holden Karau  wrote:

> What's the regression this fixed in 2.1 from 2.0?
>
> On Fri, Apr 21, 2017 at 7:45 PM, Wenchen Fan 
> wrote:
>
>> IIRC, the new "spark.sql.hive.caseSensitiveInferenceMode" stuff will
>> only scan all table files once, and write back the inferred schema to
>> metastore so that we don't need to do the schema inference again.
>>
>> So technically this will introduce a performance regression for the first
>> query, but compared to branch-2.0, it's not performance regression. And
>> this patch fixed a regression in branch-2.1, which can run in branch-2.0.
>> Personally, I think we should keep INFER_AND_SAVE as the default mode.
>>
>> + [Eric], what do you think?
>>
>> On Sat, Apr 22, 2017 at 1:37 AM, Michael Armbrust > > wrote:
>>
>>> Thanks for pointing this out, Michael.  Based on the conversation on
>>> the PR
>>> 
>>> this seems like a risky change to include in a release branch with a
>>> default other than NEVER_INFER.
>>>
>>> +Wenchen?  What do you think?
>>>
>>> On Thu, Apr 20, 2017 at 4:14 PM, Michael Allman 
>>> wrote:
>>>
 We've identified the cause of the change in behavior. It is related to
 the SQL conf key "spark.sql.hive.caseSensitiveInferenceMode". This key
 and its related functionality was absent from our previous build. The
 default setting in the current build was causing Spark to attempt to scan
 all table files during query analysis. Changing this setting to NEVER_INFER
 disabled this operation and resolved the issue we had.

 Michael


 On Apr 20, 2017, at 3:42 PM, Michael Allman 
 wrote:

 I want to caution that in testing a build from this morning's
 branch-2.1 we found that Hive partition pruning was not working. We found
 that Spark SQL was fetching all Hive table partitions for a very simple
 query whereas in a build from several weeks ago it was fetching only the
 required partitions. I cannot currently think of a reason for the
 regression outside of some difference between branch-2.1 from our previous
 build and branch-2.1 from this morning.

 That's all I know right now. We are actively investigating to find the
 root cause of this problem, and specifically whether this is a problem in
 the Spark codebase or not. I will report back when I have an answer to that
 question.

 Michael


 On Apr 18, 2017, at 11:59 AM, Michael Armbrust 
 wrote:

 Please vote on releasing the following candidate as Apache Spark
 version 2.1.1. The vote is open until Fri, April 21st, 2017 at 13:00
 PST and passes if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 2.1.1
 [ ] -1 Do not release this package because ...


 To learn more about Apache Spark, please see http://spark.apache.org/

 The tag to be voted on is v2.1.1-rc3
  (2ed19cff2f6ab79
 a718526e5d16633412d8c4dd4)

 List of JIRA tickets resolved can be found with this filter
 
 .

 The release files, including signatures, digests, etc. can be found at:
 http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1230/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-docs/


 *FAQ*

 *How can I help test this release?*

 If you are a Spark user, you can help us test this release by taking an
 existing Spark workload and running on this release candidate, then
 reporting any regressions.

 *What should happen to JIRA tickets still targeting 2.1.1?*

 Committers should look at those and triage. Extremely important bug
 fixes, documentation, and API tweaks that impact compatibility should be
 worked on immediately. Everything else please retarget to 2.1.2 or 2.2.0.

 *But my bug isn't fixed!??!*

 In order to make timely releases, we will typically not hold the
 release unless the bug in question is a regression from 2.1.0.

 *What happened to RC1?*

 There were issues with the release packaging and as a result was
 skipped.




>>>
>>
>
>
> --
> Cell : 425-233-8271 <(425)%20233-8271>
> Twitter: https://twitter.com/holdenkarau
>