Re: There is no space for new record

2018-02-13 Thread SNEHASISH DUTTA
Hi,

In which version of Spark will this fix be available?
The deployment is on EMR

Regards,
Snehasish

On Fri, Feb 9, 2018 at 8:51 PM, Wenchen Fan  wrote:

> It should be fixed by https://github.com/apache/spark/pull/20561 soon.
>
> On Fri, Feb 9, 2018 at 6:16 PM, Wenchen Fan  wrote:
>
>> This has been reported before:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/java-lang-IllegalStateException-There-is-no-space-for-new-record-tc20108.html
>>
>> I think we may have a real bug here, but we need a reproduce. Can you
>> provide one? thanks!
>>
>> On Fri, Feb 9, 2018 at 5:59 PM, SNEHASISH DUTTA <info.snehas...@gmail.com> wrote:
>>
>>> Hi ,
>>>
>>> I am facing the following when running on EMR
>>>
>>> Caused by: java.lang.IllegalStateException: There is no space for new record
>>> at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.insertRecord(UnsafeInMemorySorter.java:226)
>>> at org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:132)
>>> at org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:250)
>>>
>>> I am using pyspark 2.2; what Spark configuration should be changed/modified to get this resolved?
>>>
>>>
>>> Regards,
>>> Snehasish
>>
>


Re: There is no space for new record

2018-02-13 Thread Marco Gaido
You can check all the versions where the fix is available on the JIRA issue,
SPARK-23376. Anyway, it will be available in the upcoming 2.3.0 release.

Thanks.



redundant decision tree model

2018-02-13 Thread Alessandro Solimando
Hello community,
I have recently manually inspected some decision trees computed with Spark
(2.2.1, but the behavior is the same with the latest code on the repo).

I have observed that the trees are always complete, even if an entire
subtree leads to the same prediction in its different leaves.

In such a case, the root of the subtree, instead of being an InternalNode,
could simply be a LeafNode with the (shared) prediction.

I know that decision trees computed by scikit-learn exhibit the same behavior;
I understand that this arises by construction, because you only realize the
redundancy at the end.

So my question is, why is this "post-pruning" missing?

Three hypotheses:

1) It is not suitable (for a reason I fail to see)
2) Such an addition to the code is not considered worth it (in terms of code
complexity, maybe)
3) It has been overlooked, but could be a favorable addition

For clarity, I have managed to isolate a small case to reproduce this, in
what follows.

This is the dataset:

> +-+-+
> |label|features |
> +-+-+
> |1.0  |[1.0,0.0,1.0]|
> |1.0  |[0.0,1.0,0.0]|
> |1.0  |[1.0,1.0,0.0]|
> |0.0  |[0.0,0.0,0.0]|
> |1.0  |[1.0,1.0,0.0]|
> |0.0  |[0.0,1.0,1.0]|
> |1.0  |[0.0,0.0,0.0]|
> |0.0  |[0.0,1.0,1.0]|
> |1.0  |[0.0,1.0,1.0]|
> |0.0  |[1.0,0.0,0.0]|
> |0.0  |[1.0,0.0,1.0]|
> |1.0  |[0.0,1.0,1.0]|
> |0.0  |[0.0,0.0,1.0]|
> |0.0  |[1.0,0.0,1.0]|
> |0.0  |[0.0,0.0,1.0]|
> |0.0  |[1.0,1.0,1.0]|
> |0.0  |[1.0,1.0,0.0]|
> |1.0  |[1.0,1.0,1.0]|
> |0.0  |[1.0,0.0,1.0]|
> +-+-+


Which generates the following model:

DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 3 with 15
> nodes
>   If (feature 1 <= 0.5)
>If (feature 2 <= 0.5)
> If (feature 0 <= 0.5)
>  Predict: 0.0
> Else (feature 0 > 0.5)
>  Predict: 0.0
>Else (feature 2 > 0.5)
> If (feature 0 <= 0.5)
>  Predict: 0.0
> Else (feature 0 > 0.5)
>  Predict: 0.0
>   Else (feature 1 > 0.5)
>If (feature 2 <= 0.5)
> If (feature 0 <= 0.5)
>  Predict: 1.0
> Else (feature 0 > 0.5)
>  Predict: 1.0
>Else (feature 2 > 0.5)
> If (feature 0 <= 0.5)
>  Predict: 0.0
> Else (feature 0 > 0.5)
>  Predict: 0.0


As you can see, the following model would be equivalent, but smaller and cheaper to evaluate:

DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 3 with 15
> nodes
>   If (feature 1 <= 0.5)
>Predict: 0.0
>   Else (feature 1 > 0.5)
>If (feature 2 <= 0.5)
> Predict: 1.0
>Else (feature 2 > 0.5)
> Predict: 0.0


This happens pretty often in real cases, and despite the small gain in a
single model invocation for the "optimized" version, it can become
non-negligible when the number of calls is massive, as one can expect in a Big
Data context.
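
As a rough illustration of what I mean, here is a small sketch (the helper
names are mine, using only the public org.apache.spark.ml.tree node API) that
counts the internal nodes whose entire subtree yields a single prediction and
could therefore be collapsed into a LeafNode:

import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.tree.{InternalNode, LeafNode, Node}

// Distinct predictions reachable from the leaves under `node`.
def leafPredictions(node: Node): Set[Double] = node match {
  case leaf: LeafNode => Set(leaf.prediction)
  case internal: InternalNode =>
    leafPredictions(internal.leftChild) ++ leafPredictions(internal.rightChild)
}

// Internal nodes whose whole subtree yields a single prediction, i.e. nodes
// that could be replaced by one LeafNode without changing any prediction.
def redundantSubtrees(node: Node): Int = node match {
  case _: LeafNode => 0
  case internal: InternalNode =>
    val collapsible = if (leafPredictions(internal).size == 1) 1 else 0
    collapsible +
      redundantSubtrees(internal.leftChild) +
      redundantSubtrees(internal.rightChild)
}

// `model` is a fitted DecisionTreeClassificationModel such as the one above.
def countCollapsible(model: DecisionTreeClassificationModel): Int =
  redundantSubtrees(model.rootNode)

On the model above, this counts 5 collapsible internal nodes (every internal
node except the root and the root's right child).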

I would appreciate your opinion on this matter (whether it is relevant for a
PR or not, pros/cons, etc.).

Best regards,
Alessandro


Re: redundant decision tree model

2018-02-13 Thread Nick Pentreath
There is a long outstanding JIRA issue about it:
https://issues.apache.org/jira/browse/SPARK-3155.

It is probably still a useful feature to have for trees but the priority is
not that high since it may not be that useful for the tree ensemble models.



Re: redundant decision tree model

2018-02-13 Thread Alessandro Solimando
Hello Nick,
thanks for the pointer, that's interesting.

However, there seems to be a major difference from what I was discussing.

The JIRA issue relates to overfitting and considerations on information
gain, while what I propose is a much simpler "syntactic" pruning.

Consider a fragment of the example above, the leftmost subtree in
particular:

If (feature 1 <= 0.5)
>If (feature 2 <= 0.5)
> If (feature 0 <= 0.5)
>  Predict: 0.0
> Else (feature 0 > 0.5)
>  Predict: 0.0
>Else (feature 2 > 0.5)
> If (feature 0 <= 0.5)
>  Predict: 0.0
> Else (feature 0 > 0.5)
>  Predict: 0.0


Which corresponds to the following "objects":

-InternalNode(prediction = 0.0, impurity = 0.48753462603878117, split =
> org.apache.spark.ml.tree.ContinuousSplit@fdf0)
> --InternalNode(prediction = 0.0, impurity = 0.345679012345679, split =
> org.apache.spark.ml.tree.ContinuousSplit@ffe0)
> ---InternalNode(prediction = 0.0, impurity = 0.4445, split =
> org.apache.spark.ml.tree.ContinuousSplit@3fe0)
> LeafNode(prediction = 0.0, impurity = -1.0)
> LeafNode(prediction = 0.0, impurity = 0.0)
> ---InternalNode(prediction = 0.0, impurity = 0.2777, split =
> org.apache.spark.ml.tree.ContinuousSplit@3fe0)
> LeafNode(prediction = 0.0, impurity = 0.0)
> LeafNode(prediction = 0.0, impurity = -1.0)


For sure a more comprehensive node-splitting policy based on impurity might
prevent this situation (by splitting node "ffe0" you have an impurity gain on
one child and a loss on the other), but independently of this, once the tree
is built, I would cut the redundant subtree and obtain the following:

-InternalNode(prediction = 0.0, impurity = 0.48753462603878117, split =
> org.apache.spark.ml.tree.ContinuousSplit@fdf0)
> --LeafNode(prediction = 0.0, impurity = ...)


I cannot say that this is relevant for all the tree ensemble methods, but it
certainly is for RF, even more than for DT, as the leverage effect will be
even higher (and the code generating them is the same: DT calls RF with
numTrees = 1 from what I can see).
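
To give an idea of the RF case, and reusing the hypothetical redundantSubtrees
helper I sketched in my first message, the same check extends across a fitted
forest (rfModel below is an assumed RandomForestClassificationModel):

// `rfModel` is an assumed fitted org.apache.spark.ml.classification.RandomForestClassificationModel;
// `redundantSubtrees` is the hypothetical helper from my first message.
val perTree = rfModel.trees.map(tree => redundantSubtrees(tree.rootNode))
println(s"Collapsible internal nodes per tree: ${perTree.mkString(", ")}")
println(s"Total across the forest: ${perTree.sum}")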

Being an optimization aimed at reducing the model's memory footprint and
invocation time, it is independent of any statistical consideration about
overfitting, which your reply seems to imply.

Am I missing something?

Best regards,
Alessandro




Re: redundant decision tree model

2018-02-13 Thread Sean Owen
I think the simple pruning you have in mind was just never implemented.

That sort of pruning wouldn't help much if the nodes maintained a
distribution over classes, as those are rarely identical; but they just
maintain a single class prediction. After training, I see no value in
keeping those nodes. Whatever impurity gain the split managed on the
training data is 'lost' when the prediction is collapsed to a single class
anyway.

Whether it's easy to implement in the code I don't know, but it's
straightforward conceptually.


Re: There is no space for new record

2018-02-13 Thread SNEHASISH DUTTA
Hi,

Will it be possible to overcome this with some Spark configuration tweak,
since EMR only has Spark versions available up to 2.2.1?

Regards,
Snehasish



Re: Corrupt parquet file

2018-02-13 Thread Steve Loughran


On 12 Feb 2018, at 20:21, Ryan Blue <rb...@netflix.com> wrote:

I wouldn't say we have a primary failure mode that we deal with. What we 
concluded was that all the schemes we came up with to avoid corruption couldn't 
cover all cases. For example, what about when memory holding a value is 
corrupted just before it is handed off to the writer?

That's why we track down the source of the corruption and remove it from our 
clusters and let Amazon know to remove the instance from the hardware pool. We 
also structure our ETL so we have some time to reprocess.


I see.

I could remove memory/disk buffering of the blocks as a source of corruption,
leaving only working-memory failures which somehow get past ECC, or bus errors
of some form.

Filed https://issues.apache.org/jira/browse/HADOOP-15224 to add to the todo 
list, Hadoop >= 3.2





Re: redundant decision tree model

2018-02-13 Thread Alessandro Solimando
Thanks for your feedback, Sean; I agree with you.

I have logged a JIRA case (https://issues.apache.org/jira/browse/SPARK-23409);
I will take a look at the code in more detail and see if I can come up with a
PR to handle this.


Re: [VOTE] Spark 2.3.0 (RC3)

2018-02-13 Thread Sean Owen
+1 from me. Again, licenses and sigs look fine. I built the source
distribution with "-Phive -Phadoop-2.7 -Pyarn -Pkubernetes" and all tests
passed.

Remaining issues for 2.3.0, none of which are a Blocker:

SPARK-22797 Add multiple column support to PySpark Bucketizer
SPARK-23083 Adding Kubernetes as an option to https://spark.apache.org/
SPARK-23292 python tests related to pandas are skipped
SPARK-23309 Spark 2.3 cached query performance 20-30% worse then spark 2.2
SPARK-23316 AnalysisException after max iteration reached for IN query

... though the pandas tests issue is "Critical".

(SPARK-23083 is an update to the main site that should happen as the
artifacts are released, so it's OK.)

On Tue, Feb 13, 2018 at 12:30 AM Sameer Agarwal  wrote:

> Now that all known blockers have once again been resolved, please vote on
> releasing the following candidate as Apache Spark version 2.3.0. The vote
> is open until Friday February 16, 2018 at 8:00:00 am UTC and passes if a
> majority of at least 3 PMC +1 votes are cast.
>
>
> [ ] +1 Release this package as Apache Spark 2.3.0
>
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v2.3.0-rc3:
> https://github.com/apache/spark/tree/v2.3.0-rc3
> (89f6fcbafcfb0a7aeb897fba6036cb085bd35121)
>
> List of JIRA tickets resolved in this release can be found here:
> https://issues.apache.org/jira/projects/SPARK/versions/12339551
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1264/
>
> The documentation corresponding to this release can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc3-docs/_site/index.html
>
>
> FAQ
>
> ===
> What are the unresolved issues targeted for 2.3.0?
> ===
>
> Please see https://s.apache.org/oXKi. At the time of writing, there are
> currently no known release blockers.
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install the
> current RC and see if anything important breaks, in the Java/Scala you can
> add the staging repository to your projects resolvers and test with the RC
> (make sure to clean up the artifact cache before/after so you don't end up
> building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.3.0?
> ===
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.3.1 or 2.4.0 as
> appropriate.
>
> ===
> Why is my bug not fixed?
> ===
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.2.0. That being said, if
> there is something which is a regression from 2.2.0 and has not been
> correctly targeted please ping me or a committer to help target the issue
> (you can see the open issues listed as impacting Spark 2.3.0 at
> https://s.apache.org/WmoI).
>
>
> Regards,
> Sameer
>


Re: [VOTE] Spark 2.3.0 (RC3)

2018-02-13 Thread Sameer Agarwal
The issue with SPARK-23292 is that we currently run the Python tests
related to pandas and pyarrow with Python 3 (which is already installed on
all AMPLab Jenkins machines). Since the code path is fully tested, we
decided not to mark it as a blocker; I've reworded the title to better
indicate that.



-- 
Sameer Agarwal
Computer Science | UC Berkeley
http://cs.berkeley.edu/~sameerag


Re: Regarding NimbusDS JOSE JWT jar 3.9 security vulnerability

2018-02-13 Thread PJ Fanning
Hi Sujith,
I didn't find the nimbusds dependency in any Spark 2.2 jars. Maybe I missed
something. Could you tell us which Spark jar has the nimbusds dependency?








A new external catalog

2018-02-13 Thread Tayyebi, Ameen
Hello everyone,

For those of you not familiar with the AWS Glue Catalog, it's a Hive Metastore 
implemented as a web service. The Glue service is composed of different 
components, but the one I'm interested in is the Catalog. Today, there's a Hive 
metastore implementation and you can plug the catalog into Spark as instructed 
here. Basically, the Hive metastore Java class is swapped with an implementation 
that calls into Glue's web service.

I don’t like this implementation because:

  *   It puts Hive as a middle-man between Spark and Glue
  *   It prevents Glue-specific implementations

As an example of the second issue, the Hive version embedded in Spark today 
does not support partition pruning for column types that are fractionals or 
timestamps. I have a pull request to fix 
this, but as rxin correctly pointed 
out, I have to fake a new Hive version called Glue or something and put this 
under the Hive shim for it.

I have locally implemented a version of ExternalCatalog on top of Glue and 
would like to productionize it and submit it as a pull request. You can set the 
spark.catalog.implementation config to “glue” and then it will use Glue instead 
of either the in-memory catalog or Hive.
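
Purely as an illustration of how this would be wired up: the sketch below is a 
hypothetical usage under the proposal in this message, not something Spark 
supports today; the underlying setting assumed here is Spark's existing internal 
spark.sql.catalogImplementation static conf, which currently only accepts 
"hive" and "in-memory".

import org.apache.spark.sql.SparkSession

// Hypothetical usage sketch: "glue" does not exist as a catalog implementation
// in Spark today; only "hive" and "in-memory" are valid values.
val spark = SparkSession.builder()
  .appName("glue-catalog-example")
  .config("spark.sql.catalogImplementation", "glue")
  .getOrCreate()

// With the external catalog backed by Glue, ordinary catalog calls such as
// spark.catalog.listDatabases() or spark.sql("SHOW TABLES") would be answered
// by the Glue web service instead of a Hive metastore.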

Rudimentary tests are promising, and I can hook up Parquet tables directly 
without going through Hive at all. I really need this because I need to fix a 
data consistency issue with the InsertIntoHiveTable command when data is backed 
by S3, but that's a different topic.

The biggest challenge is that I had to upgrade the AWS SDK to a newer version 
so that it includes the Glue client, since Glue is a new service. So far, I 
haven't seen any jar hell issues, but that's the main drawback I can see. I've 
made sure the version is in sync with the Kinesis client used by the 
spark-streaming module.

Are there any objections to this? Any guidance around upgrading the AWS client? 
Who would be a good person to review this pull request?

Thanks,
-Ameen





Re: A new external catalog

2018-02-13 Thread Steve Loughran


On 13 Feb 2018, at 19:50, Tayyebi, Ameen <tayye...@amazon.com> wrote:


The biggest challenge is that I had to upgrade the AWS SDK to a newer version 
so that it includes the Glue client since Glue is a new service. So far, I 
haven’t see any jar hell issues, but that’s the main drawback I can see. I’ve 
made sure the version is in sync with the Kinesis client used by 
spark-streaming module.

Funnily enough, I'm currently updating the s3a troubleshooting doc, the latest 
version up front saying

"Whatever problem you have, changing the AWS SDK version will not fix things, 
only change the stack traces you see."

https://github.com/steveloughran/hadoop/blob/s3/HADOOP-15076-trouble-and-perf/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/troubleshooting_s3a.md

Upgrading AWS SDKs is, sadly, often viewed with almost the same fear as Guava, 
especially if it's the unshaded version, which forces in a version of Jackson.

Which SDK version are you proposing? 1.11.x ?


Inefficient state management in stream to stream join in 2.3

2018-02-13 Thread Yogesh Mahajan
In 2.3, stream-to-stream joins (both inner and outer) are implemented using
the symmetric hash join (SHJ) algorithm, and that is a good choice; I am sure
you compared it with other families of algorithms like XJoin and non-blocking
sort-based algorithms like progressive merge join (PMJ).

*From a functional point of view -*
1. It considers most of the stream-to-stream join use cases, and all the
considerations around event time and watermarks as join keys are well thought
through.
2. It also adopts an effective approach towards join state management:
exploiting 'hard' constraints in the input streams to reduce state, rather
than exploiting statistical properties as 'soft' constraints (see the sketch
below).
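
As a concrete example of such a hard constraint, here is a minimal sketch (the
sources, column names and time bounds are assumed purely for illustration) of
a 2.3 stream-stream inner join where the watermarks plus the event-time range
predicate are what allow the engine to purge old rows from the join state:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().appName("join-state-sketch").getOrCreate()

// Two toy streams derived from the built-in rate source.
val impressions = spark.readStream.format("rate").load()
  .selectExpr("value AS adId", "timestamp AS impressionTime")
val clicks = spark.readStream.format("rate").load()
  .selectExpr("value AS clickAdId", "timestamp AS clickTime")

// The watermarks bound event-time lateness on each side, and the range
// predicate bounds how far apart matching rows can be; together they tell the
// engine when buffered rows can never match again and can be dropped.
val joined = impressions
  .withWatermark("impressionTime", "10 minutes")
  .join(
    clicks.withWatermark("clickTime", "20 minutes"),
    expr("clickAdId = adId AND " +
         "clickTime >= impressionTime AND " +
         "clickTime <= impressionTime + interval 1 hour"))

val query = joined.writeStream.format("console").start()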

*From a performance point of view -*
SHJ assumes that the entire join state can be kept in main memory, but the
StateStore in Spark is backed by an HDFS-compatible file system. Also, looking
at the code of StreamingSymmetricHashJoinExec, two StateStores
(KeyToNumValuesStore, KeyWithIndexToValueStore) are used, and the multiple
lookups to them in each StreamExecution (MicroBatch/ContinuousExecution), per
partition per operator, will carry a huge performance penalty even for a
moderate state size, as in queries like groupBy “SYMBOL”.

Even if we implement our own efficient in-memory StateStore to overcome this
performance hit, there is no way to avoid these multiple lookups unless and
until you have your own StreamingSymmetricHashJoinExec implementation.

We should consider using the efficient main-memory data structures described
in this paper, which are suited for storing sliding windows, with efficient
support for removing tuples that have fallen out of the state.

Another way to reduce unnecessary state is to use punctuations (in contrast to
the existing approach, where constraints have to be known a priori). A
punctuation is a tuple of patterns specifying a predicate that must evaluate
to false for all future data tuples in the stream, and punctuations can be
inserted dynamically.

For example, consider a join of two streams, auctionStream and bidStream. When
a particular auction closes, the system inserts a punctuation into the
bidStream to signal that there will be no more bids for that particular
auction, and purges those tuples that cannot possibly join with future
arrivals. PJoin is one example of a stream join algorithm that exploits
punctuations.

Thanks,
http://www.snappydata.io/blog 


Re: A new external catalog

2018-02-13 Thread Tayyebi, Ameen
Yes, I’m thinking about upgrading the Kinesis client from 1.7.3 to 1.9.0 and 
the AWS Java SDK from 1.11.76 to 1.11.272.

1.11.272 is the earliest that has Glue.

How about I let the build system run the tests and if things start breaking I 
fall back to shading Glue’s specific SDK?

From: Steve Loughran 
Date: Tuesday, February 13, 2018 at 3:34 PM
To: "Tayyebi, Ameen" 
Cc: Apache Spark Dev 
Subject: Re: A new external catalog






Re: A new external catalog

2018-02-13 Thread Steve Loughran


On 13 Feb 2018, at 21:20, Tayyebi, Ameen <tayye...@amazon.com> wrote:


How about I let the build system run the tests and if things start breaking I 
fall back to shading Glue’s specific SDK?


FWIW, some of the other troublespots are not functional, they're log overflow

https://issues.apache.org/jira/browse/HADOOP-15040
https://issues.apache.org/jira/browse/HADOOP-14596

Myself and Cloudera collaborators are testing the shaded 1.11.271 JAR & will go 
with that into Hadoop 3.1 if we're happy, but that's not so much for new 
features but "stack traces throughout the log", which seems to be a recurrent 
issue with the JARs, and one which often slips by CI build runs. If it wasn't 
for that, we'd have stuck with 1.11.199 because it didn't have any issues that 
we hadn't already got under control 
(https://github.com/aws/aws-sdk-java/issues/1211)

Like I said: upgrades bring fear

