Re: [DISCUSS] Java specific APIs design concern and choice

2020-05-07 Thread Hyukjin Kwon
Hi all, I would like to proceed with this. Are there more thoughts on it? If
not, I will go ahead with the proposal here.

2020년 4월 30일 (목) 오후 10:54, Hyukjin Kwon 님이 작성:

> Nothing is urgent. I just don't want to leave it undecided and just keep
> adding Java APIs inconsistently as it's currently happening.
>
> We should have a coherent set of APIs. It's very difficult to change APIs
> once they are out in releases. At least I have seen people here agree with
> having general guidance for the same reason - please let me know
> if I'm reading it wrong.
>
> I don't think we should assume Java programmers know how Scala works with
> Java types. Fewer assumptions might be better.
>
> I feel like we have enough on the table to consider at this moment, and not
> much point in waiting indefinitely.
>
> But sure maybe I am wrong. We can wait for more feedback for a couple of
> days.
>
>
> On Thu, 30 Apr 2020, 18:59 ZHANG Wei,  wrote:
>
>> I feel a little pushed... :-) I still don't get why it's
>> urgent to make the decision now. AFAIK, it's common practice for Java
>> programmers to handle Scala type conversions themselves when they
>> invoke Scala libraries. I'm not sure which one is the Java programmers'
>> root complaint: Scala type instances or the Scala jar file.
>>
>> My 2 cents.
>>
>> --
>> Cheers,
>> -z
>>
>> On Thu, 30 Apr 2020 09:17:37 +0900
>> Hyukjin Kwon  wrote:
>>
>> > There was a typo in the previous email. I am re-sending:
>> >
>> > Hm, I thought you meant you prefer 3. over 4. but don't mind much
>> > either way.
>> > I don't mean to wait for more feedback. That looks likely to end in a
>> > deadlock, which would be the worst case.
>> > I was suggesting we pick one way first and stick to it. If we find out
>> > something later, we can discuss changing it then.
>> >
>> > Having a separate Java-specific API (the 3. way)
>> >   - causes maintenance cost
>> >   - makes users search for the Java-specific API every time
>> >   - goes against the unified API set Spark has targeted so far.
>> >
>> > I don't completely buy the Scala/Java-friendliness argument, because using
>> > Java instances from Scala is already documented in the official Scala
>> > documentation. Users would still need to check whether we have
>> > Java-specific methods for *some* APIs.
>> >
>> > 2020년 4월 30일 (목) 오전 8:58, Hyukjin Kwon 님이 작성:
>> >
>> > > Hm, I thought you meant you prefer 3. over 4. but don't mind much
>> > > either way.
>> > > I don't mean to wait for more feedback. That looks likely to end in a
>> > > deadlock, which would be the worst case.
>> > > I was suggesting we pick one way first and stick to it. If we find out
>> > > something later, we can discuss changing it then.
>> > >
>> > > Having a separate Java-specific API (the 4. way)
>> > >   - causes maintenance cost
>> > >   - makes users search for the Java-specific API every time
>> > >   - goes against the unified API set Spark has targeted so far.
>> > >
>> > > I don't completely buy the Scala/Java-friendliness argument, because
>> > > using Java instances from Scala is already documented in the official
>> > > Scala documentation. Users would still need to check whether we have
>> > > Java-specific methods for *some* APIs.
>> > >
>> > >
>> > >
>> > > On Thu, 30 Apr 2020, 00:06 Tom Graves,  wrote:
>> > >
>> > >> Sorry, I'm not sure what your last email means. Does it mean you are
>> > >> putting it up for a vote or just waiting to get more feedback? I
>> > >> disagree with saying option 4 is the rule, but I agree having a general
>> > >> rule makes sense. I think we need a lot more input to make the rule,
>> > >> as it affects the APIs.
>> > >>
>> > >> Tom
>> > >>
>> > >> On Wednesday, April 29, 2020, 09:53:22 AM CDT, Hyukjin Kwon <
>> > >> gurwls...@gmail.com> wrote:
>> > >>
>> > >>
>> > >> I am not seeing explicit objections here; rather, people seem to agree
>> > >> with the proposal in general.
>> > >> I would like to step forward rather than leaving this deadlocked - the
>> > >> worst choice here would be to postpone and abandon this discussion and
>> > >> keep the inconsistency.
>> > >>
>> > >> I don't currently plan to document this, as the cases are rather rare,
>> > >> and we haven't really documented the JavaRDD <> RDD vs DataFrame case
>> > >> either.
>> > >> Let's keep monitoring and see whether this discussion thread clarifies
>> > >> things enough in the cases I mentioned.
>> > >>
>> > >> Let me know if you guys think differently.
>> > >>
>> > >>
>> > >> 2020년 4월 28일 (화) 오후 5:03, Hyukjin Kwon 님이 작성:
>> > >>
>> > >> Spark has targeted a unified API set rather than separate Java classes
>> > >> in order to reduce the maintenance cost,
>> > >> e.g. JavaRDD <> RDD vs DataFrame. The JavaXXX classes are mostly legacy.
>> > >>
>> > >> I think it's best to stick to approach 4. in general cases.
>> > >> Other options might have to be considered in specific contexts.
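
(For illustration only - a hypothetical Scala sketch, not an actual Spark API;
the class and method names are made up. It shows the trade-off discussed above:
the "3. way" adds a parallel Java-specific method that must be kept in sync,
while the "4. way" keeps a single method and relies on the documented
Scala/Java converters on the caller side.)

    import scala.collection.JavaConverters._

    class ExampleApi {
      // "4. way": one unified method; Java callers convert their
      // java.util.Map via the documented converters before calling.
      def configure(options: Map[String, String]): Unit = { /* ... */ }

      // "3. way": a separate Java-specific overload that must be kept in
      // sync with the Scala one - the maintenance cost mentioned above.
      def configure(options: java.util.Map[String, String]): Unit =
        configure(options.asScala.toMap)
    }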

Re: Spark FP-growth

2020-05-07 Thread Aditya Addepalli
Hi,

I understand that this is not a priority with everything going on, but if
you think generating rules for only a single consequent adds value, I would
like to contribute.

Thanks & Regards,
Aditya

On Sat, May 2, 2020 at 9:34 PM Aditya Addepalli  wrote:

> Hi Sean,
>
> I understand your approach, but there's a slight problem.
>
> If we generate rules after filtering for our desired consequent, we are
> introducing some bias into our rules.
> The confidence of the rules on the filtered input can be very high but
> this may not be the case on the entire dataset.
> Thus we can get biased rules which wrongly depict the patterns in the data.
> This is why I think having a parameter to mention the consequent would
> help greatly.
>
> Reducing the support doesn't really work in my case simply because rules
> for the consequents I am mining for occur very rarely in the data.
> Sometimes this can be 1e-4 or 1e-5, so my minSupport has to be less than
> that to capture the rules for that consequent.
>
> Thanks for your reply. Let me know what you think.
>
> Regards.
> Aditya Addepalli
>
>
>
>
> On Sat, 2 May, 2020, 9:13 pm Sean Owen,  wrote:
>
>> You could just filter the input for sets containing the desired item,
>> and discard the rest. That doesn't mean all of the item sets have that
>> item, and you'd still have to filter, but may be much faster to
>> compute.
>> Increasing min support might generally have the effect of smaller
>> rules, though it doesn't impose a cap. That could help perf, if that's
>> what you're trying to improve.
>> I don't know if it's worth new params in the implementation, maybe. I
>> think there would have to be an argument this generalizes.
>>
>> On Sat, May 2, 2020 at 3:13 AM Aditya Addepalli 
>> wrote:
>> >
>> > Hi Everyone,
>> >
>> > I was wondering if we could make any enhancements to the FP-Growth
>> algorithm in spark/pyspark.
>> >
>> > Many times I am looking for a rule for a particular consequent, so I
>> don't need the rules for all the other consequents. I know I can filter the
>> rules to get the desired output, but if I could input this in the algorithm
>> itself, the execution time would reduce drastically.
>> >
>> > Also, sometimes I want the rules to be small, maybe of length 5-6.
>> Again, I can filter on length but I was wondering if we could take this as
>> input into the algo. Given the Depth first nature of FP-Growth, I am not
>> sure that is feasible.
>> >
>> >  I am willing to work on these suggestions, if someone thinks they are
>> feasible. Thanks to the dev team for all the hard work!
>> >
>> > Regards,
>> > Aditya Addepalli
>>
>
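
(A minimal Scala sketch of what is possible today without API changes, as
discussed above: filter the generated rules rather than the input data, so
confidence is still computed over the whole dataset - at the cost of still
generating all rules. `transactions` is assumed to be a DataFrame with an
"items" array column, and "pen" stands in for the consequent of interest.)

    import org.apache.spark.ml.fpm.FPGrowth
    import org.apache.spark.sql.functions._

    val model = new FPGrowth()
      .setItemsCol("items")
      .setMinSupport(1e-5)     // very low support, as in the use case above
      .setMinConfidence(0.5)
      .fit(transactions)

    // Post-hoc filtering of the rules; confidence values stay unbiased.
    val rulesForPen = model.associationRules
      .filter(array_contains(col("consequent"), "pen"))  // single consequent
      .filter(size(col("antecedent")) <= 5)              // short rules only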


Re: Spark FP-growth

2020-05-07 Thread Sean Owen
Yes, you can get the correct support this way by accounting for how
many rows were filtered out, but not the right confidence, as it
depends on counting support in rows without the items of interest.

But computing confidence depends on computing all that support; how
would you optimize it even if you knew the consequent you cared about?
Maybe there's a way, sure - I don't know the code well, but it wasn't
obvious at a glance how to take advantage of it.

I can see how limiting the rule size could help.

On Sat, May 2, 2020 at 11:04 AM Aditya Addepalli  wrote:
>
> Hi Sean,
>
> I understand your approach, but there's a slight problem.
>
> If we generate rules after filtering for our desired consequent, we are 
> introducing some bias into our rules.
> The confidence of the rules on the filtered input can be very high but this 
> may not be the case on the entire dataset.
> Thus we can get biased rules which wrongly depict the patterns in the data.
> This is why I think having a parameter to mention the consequent would help 
> greatly.
>
> Reducing the support doesn't really work in my case simply because rules for 
> the consequents I am mining for occur very rarely in the data.
> Sometimes this can be 1e-4 or 1e-5, so my minSupport has to be less than that 
> to capture the rules for that consequent.
>
> Thanks for your reply. Let me know what you think.
>
> Regards.
> Aditya Addepalli
>
>
>
>
> On Sat, 2 May, 2020, 9:13 pm Sean Owen,  wrote:
>>
>> You could just filter the input for sets containing the desired item,
>> and discard the rest. That doesn't mean all of the item sets have that
>> item, and you'd still have to filter, but may be much faster to
>> compute.
>> Increasing min support might generally have the effect of smaller
>> rules, though it doesn't impose a cap. That could help perf, if that's
>> what you're trying to improve.
>> I don't know if it's worth new params in the implementation, maybe. I
>> think there would have to be an argument this generalizes.
>>
>> On Sat, May 2, 2020 at 3:13 AM Aditya Addepalli  wrote:
>> >
>> > Hi Everyone,
>> >
>> > I was wondering if we could make any enhancements to the FP-Growth 
>> > algorithm in spark/pyspark.
>> >
>> > Many times I am looking for a rule for a particular consequent, so I don't 
>> > need the rules for all the other consequents. I know I can filter the 
>> > rules to get the desired output, but if I could input this in the 
>> > algorithm itself, the execution time would reduce drastically.
>> >
>> > Also, sometimes I want the rules to be small, maybe of length 5-6. Again, 
>> > I can filter on length but I was wondering if we could take this as input 
>> > into the algo. Given the Depth first nature of FP-Growth, I am not sure 
>> > that is feasible.
>> >
>> >  I am willing to work on these suggestions, if someone thinks they are 
>> > feasible. Thanks to the dev team for all the hard work!
>> >
>> > Regards,
>> > Aditya Addepalli

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark FP-growth

2020-05-07 Thread Aditya Addepalli
Hi Sean,

1.
I was thinking that by specifying the consequent we can (somehow?) skip the
confidence calculation for all the other consequents.

This would greatly reduce the time taken as we avoid computation for
consequents we don't care about.


2.
Is limiting rule size even possible? I thought that because of FP-Growth's
depth-first nature it might not be possible.

My experience with FP-Growth has largely been in Python, where the API is
limited. I will take a look at the Scala source code and get back to you
with more concrete answers.

Thanks & Regards,
Aditya
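
(Sketching point 1 above as pure post-processing on the existing model output -
hypothetical user code, not existing FPGrowth behavior, assuming Spark 2.4+ for
array_sort/array_remove and that `model` is an ml.fpm FPGrowthModel. Because
every subset of a frequent itemset is itself frequent, the antecedent supports
needed for one consequent "x" are already present in model.freqItemsets.)

    import org.apache.spark.sql.functions._

    val freq = model.freqItemsets   // columns: items (array), freq (long)

    val withX = freq
      .filter(array_contains(col("items"), "x"))
      .withColumn("antecedent", array_sort(array_remove(col("items"), "x")))
      .filter(size(col("antecedent")) > 0)
      .select(col("antecedent"), col("freq").as("unionFreq"))

    val antecedents = freq
      .select(array_sort(col("items")).as("antecedent"), col("freq").as("antFreq"))

    // confidence(antecedent => x) = freq(antecedent union {x}) / freq(antecedent)
    val rulesForX = withX
      .join(antecedents, "antecedent")
      .withColumn("confidence", col("unionFreq") / col("antFreq"))
      // .filter(col("confidence") >= 0.5)  // then apply the usual threshold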

On Thu, 7 May, 2020, 11:21 pm Sean Owen,  wrote:

> Yes, you can get the correct support this way by accounting for how
> many rows were filtered out, but not the right confidence, as it
> depends on counting support in rows without the items of interest.
>
> But computing confidence depends on computing all that support; how
> would you optimize it even if you knew the consequent you cared about?
> maybe there's a way, sure, I don't know the code well but it wasn't
> obvious at a glance how to take advantage of it.
>
> I can see how limiting the rule size could help.
>
> On Sat, May 2, 2020 at 11:04 AM Aditya Addepalli 
> wrote:
> >
> > Hi Sean,
> >
> > I understand your approach, but there's a slight problem.
> >
> > If we generate rules after filtering for our desired consequent, we are
> introducing some bias into our rules.
> > The confidence of the rules on the filtered input can be very high but
> this may not be the case on the entire dataset.
> > Thus we can get biased rules which wrongly depict the patterns in the
> data.
> > This is why I think having a parameter to mention the consequent would
> help greatly.
> >
> > Reducing the support doesn't really work in my case simply because rules
> for the consequents I am mining for occur very rarely in the data.
> > Sometimes this can be 1e-4 or 1e-5, so my minSupport has to be less than
> that to capture the rules for that consequent.
> >
> > Thanks for your reply. Let me know what you think.
> >
> > Regards.
> > Aditya Addepalli
> >
> >
> >
> >
> > On Sat, 2 May, 2020, 9:13 pm Sean Owen,  wrote:
> >>
> >> You could just filter the input for sets containing the desired item,
> >> and discard the rest. That doesn't mean all of the item sets have that
> >> item, and you'd still have to filter, but may be much faster to
> >> compute.
> >> Increasing min support might generally have the effect of smaller
> >> rules, though it doesn't impose a cap. That could help perf, if that's
> >> what you're trying to improve.
> >> I don't know if it's worth new params in the implementation, maybe. I
> >> think there would have to be an argument this generalizes.
> >>
> >> On Sat, May 2, 2020 at 3:13 AM Aditya Addepalli 
> wrote:
> >> >
> >> > Hi Everyone,
> >> >
> >> > I was wondering if we could make any enhancements to the FP-Growth
> algorithm in spark/pyspark.
> >> >
> >> > Many times I am looking for a rule for a particular consequent, so I
> don't need the rules for all the other consequents. I know I can filter the
> rules to get the desired output, but if I could input this in the algorithm
> itself, the execution time would reduce drastically.
> >> >
> >> > Also, sometimes I want the rules to be small, maybe of length 5-6.
> Again, I can filter on length but I was wondering if we could take this as
> input into the algo. Given the Depth first nature of FP-Growth, I am not
> sure that is feasible.
> >> >
> >> >  I am willing to work on these suggestions, if someone thinks they
> are feasible. Thanks to the dev team for all the hard work!
> >> >
> >> > Regards,
> >> > Aditya Addepalli
>


Re: Spark FP-growth

2020-05-07 Thread Sean Owen
The confidence calculation itself is pretty trivial; the work is finding the
supports needed. Not sure how to optimize that.
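
(For reference, the textbook definitions - not taken from the Spark code - which
make that dependence explicit: even for a single consequent B, every candidate
antecedent A still needs its own support computed over all transactions T.)

    \mathrm{support}(X) = \frac{|\{t \in T : X \subseteq t\}|}{|T|},
    \qquad
    \mathrm{confidence}(A \Rightarrow B) = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)}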

On Thu, May 7, 2020, 1:12 PM Aditya Addepalli  wrote:

> Hi Sean,
>
> 1.
> I was thinking that by specifying the consequent we can (somehow?) skip
> the confidence calculation for all the other consequents.
>
> This would greatly reduce the time taken as we avoid computation for
> consequents we don't care about.
>
>
> 2.
> Is limiting rule size even possible? I thought because of FP growth's
> depth first nature it might not be possible.
>
> My experience with Fp-growth has largely been in python where the API is
> limited. I will take a look at the scala source code and get back to you
> with more concrete answers.
>
> Thanks & Regards,
> Aditya
>
> On Thu, 7 May, 2020, 11:21 pm Sean Owen,  wrote:
>
>> Yes, you can get the correct support this way by accounting for how
>> many rows were filtered out, but not the right confidence, as it
>> depends on counting support in rows without the items of interest.
>>
>> But computing confidence depends on computing all that support; how
>> would you optimize it even if you knew the consequent you cared about?
>> maybe there's a way, sure, I don't know the code well but it wasn't
>> obvious at a glance how to take advantage of it.
>>
>> I can see how limiting the rule size could help.
>>
>> On Sat, May 2, 2020 at 11:04 AM Aditya Addepalli 
>> wrote:
>> >
>> > Hi Sean,
>> >
>> > I understand your approach, but there's a slight problem.
>> >
>> > If we generate rules after filtering for our desired consequent, we are
>> introducing some bias into our rules.
>> > The confidence of the rules on the filtered input can be very high but
>> this may not be the case on the entire dataset.
>> > Thus we can get biased rules which wrongly depict the patterns in the
>> data.
>> > This is why I think having a parameter to mention the consequent would
>> help greatly.
>> >
>> > Reducing the support doesn't really work in my case simply because
>> rules for the consequents I am mining for occur very rarely in the data.
>> > Sometimes this can be 1e-4 or 1e-5, so my minSupport has to be less
>> than that to capture the rules for that consequent.
>> >
>> > Thanks for your reply. Let me know what you think.
>> >
>> > Regards.
>> > Aditya Addepalli
>> >
>> >
>> >
>> >
>> > On Sat, 2 May, 2020, 9:13 pm Sean Owen,  wrote:
>> >>
>> >> You could just filter the input for sets containing the desired item,
>> >> and discard the rest. That doesn't mean all of the item sets have that
>> >> item, and you'd still have to filter, but may be much faster to
>> >> compute.
>> >> Increasing min support might generally have the effect of smaller
>> >> rules, though it doesn't impose a cap. That could help perf, if that's
>> >> what you're trying to improve.
>> >> I don't know if it's worth new params in the implementation, maybe. I
>> >> think there would have to be an argument this generalizes.
>> >>
>> >> On Sat, May 2, 2020 at 3:13 AM Aditya Addepalli 
>> wrote:
>> >> >
>> >> > Hi Everyone,
>> >> >
>> >> > I was wondering if we could make any enhancements to the FP-Growth
>> algorithm in spark/pyspark.
>> >> >
>> >> > Many times I am looking for a rule for a particular consequent, so I
>> don't need the rules for all the other consequents. I know I can filter the
>> rules to get the desired output, but if I could input this in the algorithm
>> itself, the execution time would reduce drastically.
>> >> >
>> >> > Also, sometimes I want the rules to be small, maybe of length 5-6.
>> Again, I can filter on length but I was wondering if we could take this as
>> input into the algo. Given the Depth first nature of FP-Growth, I am not
>> sure that is feasible.
>> >> >
>> >> >  I am willing to work on these suggestions, if someone thinks they
>> are feasible. Thanks to the dev team for all the hard work!
>> >> >
>> >> > Regards,
>> >> > Aditya Addepalli
>>
>


Re: Spark FP-growth

2020-05-07 Thread Aditya Addepalli
Absolutely. I meant that since the confidence calculation depends on the
support calculations, skipping the ones we don't need would reduce the time.
Thanks for pointing that out.

On Thu, 7 May, 2020, 11:56 pm Sean Owen,  wrote:

> The confidence calculation is pretty trivial, the work is finding the
> supports needed. Not sure how to optimize that.
>
> On Thu, May 7, 2020, 1:12 PM Aditya Addepalli  wrote:
>
>> Hi Sean,
>>
>> 1.
>> I was thinking that by specifying the consequent we can (somehow?) skip
>> the confidence calculation for all the other consequents.
>>
>> This would greatly reduce the time taken as we avoid computation for
>> consequents we don't care about.
>>
>>
>> 2.
>> Is limiting rule size even possible? I thought because of FP growth's
>> depth first nature it might not be possible.
>>
>> My experience with Fp-growth has largely been in python where the API is
>> limited. I will take a look at the scala source code and get back to you
>> with more concrete answers.
>>
>> Thanks & Regards,
>> Aditya
>>
>> On Thu, 7 May, 2020, 11:21 pm Sean Owen,  wrote:
>>
>>> Yes, you can get the correct support this way by accounting for how
>>> many rows were filtered out, but not the right confidence, as it
>>> depends on counting support in rows without the items of interest.
>>>
>>> But computing confidence depends on computing all that support; how
>>> would you optimize it even if you knew the consequent you cared about?
>>> maybe there's a way, sure, I don't know the code well but it wasn't
>>> obvious at a glance how to take advantage of it.
>>>
>>> I can see how limiting the rule size could help.
>>>
>>> On Sat, May 2, 2020 at 11:04 AM Aditya Addepalli 
>>> wrote:
>>> >
>>> > Hi Sean,
>>> >
>>> > I understand your approach, but there's a slight problem.
>>> >
>>> > If we generate rules after filtering for our desired consequent, we
>>> are introducing some bias into our rules.
>>> > The confidence of the rules on the filtered input can be very high but
>>> this may not be the case on the entire dataset.
>>> > Thus we can get biased rules which wrongly depict the patterns in the
>>> data.
>>> > This is why I think having a parameter to mention the consequent would
>>> help greatly.
>>> >
>>> > Reducing the support doesn't really work in my case simply because
>>> rules for the consequents I am mining for occur very rarely in the data.
>>> > Sometimes this can be 1e-4 or 1e-5, so my minSupport has to be less
>>> than that to capture the rules for that consequent.
>>> >
>>> > Thanks for your reply. Let me know what you think.
>>> >
>>> > Regards.
>>> > Aditya Addepalli
>>> >
>>> >
>>> >
>>> >
>>> > On Sat, 2 May, 2020, 9:13 pm Sean Owen,  wrote:
>>> >>
>>> >> You could just filter the input for sets containing the desired item,
>>> >> and discard the rest. That doesn't mean all of the item sets have that
>>> >> item, and you'd still have to filter, but may be much faster to
>>> >> compute.
>>> >> Increasing min support might generally have the effect of smaller
>>> >> rules, though it doesn't impose a cap. That could help perf, if that's
>>> >> what you're trying to improve.
>>> >> I don't know if it's worth new params in the implementation, maybe. I
>>> >> think there would have to be an argument this generalizes.
>>> >>
>>> >> On Sat, May 2, 2020 at 3:13 AM Aditya Addepalli 
>>> wrote:
>>> >> >
>>> >> > Hi Everyone,
>>> >> >
>>> >> > I was wondering if we could make any enhancements to the FP-Growth
>>> algorithm in spark/pyspark.
>>> >> >
>>> >> > Many times I am looking for a rule for a particular consequent, so
>>> I don't need the rules for all the other consequents. I know I can filter
>>> the rules to get the desired output, but if I could input this in the
>>> algorithm itself, the execution time would reduce drastically.
>>> >> >
>>> >> > Also, sometimes I want the rules to be small, maybe of length 5-6.
>>> Again, I can filter on length but I was wondering if we could take this as
>>> input into the algo. Given the Depth first nature of FP-Growth, I am not
>>> sure that is feasible.
>>> >> >
>>> >> >  I am willing to work on these suggestions, if someone thinks they
>>> are feasible. Thanks to the dev team for all the hard work!
>>> >> >
>>> >> > Regards,
>>> >> > Aditya Addepalli
>>>
>>


[VOTE] Release Spark 2.4.6 (RC1)

2020-05-07 Thread Holden Karau
Please vote on releasing the following candidate as Apache Spark version
2.4.6.

The vote is open until February 5th 11PM PST and passes if a majority +1
PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.6
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

There are currently no issues targeting 2.4.6 (try project = SPARK AND
"Target Version/s" = "2.4.6" AND status in (Open, Reopened, "In Progress"))

We _may_ want to hold the 2.4.6 release for something targeted to 2.4.7
(project = SPARK AND "Target Version/s" = "2.4.7"); currently that is
SPARK-24266 & SPARK-26908, and I believe there is some discussion on whether we
should include SPARK-31399 in this release.

The tag to be voted on is v2.4.5-rc2 (commit
a3cffc997035d11e1f6c092c1186e943f2f63544):
https://github.com/apache/spark/tree/v2.4.6-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1340/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc1-docs/

The list of bug fixes going into 2.4.6 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12346781

This release is using the release script of the tag v2.4.6-rc1.

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env, install
the current RC, and see if anything important breaks; in Java/Scala
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
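
(For the Java/Scala path, a minimal build.sbt sketch - assuming an sbt project;
the staging URL below is orgapachespark-1343, which is confirmed later in this
thread as the repository for this RC.)

    // build.sbt (hypothetical test project)
    resolvers += "Apache Spark 2.4.6 RC1 staging" at
      "https://repository.apache.org/content/repositories/orgapachespark-1343/"

    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.6"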

===
What should happen to JIRA tickets still targeting 2.4.6?
===

The current list of open tickets targeted at 2.4.6 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 2.4.6

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [VOTE] Release Spark 2.4.6 (RC1)

2020-05-07 Thread Holden Karau
Sorry, correction: this vote is open until Monday, May 11th at 9am Pacific.

On Thu, May 7, 2020 at 11:29 AM Holden Karau  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.4.6.
>
> The vote is open until February 5th 11PM PST and passes if a majority +1
> PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.6
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> There are currently no issues targeting 2.4.6 (try project = SPARK AND
> "Target Version/s" = "2.4.6" AND status in (Open, Reopened, "In Progress"))
>
> We _may_ want to hold the 2.4.6 release for something targetted to 2.4.7
> ( project = SPARK AND "Target Version/s" = "2.4.7") , currently,
> SPARK-24266 & SPARK-26908 and I believe there is some discussion on if we
> should include SPARK-31399 in this release.
>
> The tag to be voted on is v2.4.5-rc2 (commit
> a3cffc997035d11e1f6c092c1186e943f2f63544):
> https://github.com/apache/spark/tree/v2.4.6-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1340/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc1-docs/
>
> The list of bug fixes going into 2.4.6 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12346781
>
> This release is using the release script of the tag v2.4.6-rc1.
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.6?
> ===
>
> The current list of open tickets targeted at 2.4.5 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.6
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [VOTE] Apache Spark 3.0.0 RC1

2020-05-07 Thread Sean Owen
So, this RC1 doesn't pass of course, but what's the status of RC2 - are
there outstanding issues?

On Tue, Mar 31, 2020 at 10:04 PM Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.0.0.
>
> The vote is open until 11:59pm Pacific time Fri Apr 3, and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.0.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.0.0-rc1 (commit
> 6550d0d5283efdbbd838f3aeaf0476c7f52a0fb1):
> https://github.com/apache/spark/tree/v3.0.0-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1341/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-docs/
>
> The list of bug fixes going into 2.4.5 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12339177
>
> This release is using the release script of the tag v3.0.0-rc1.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.0.0?
> ===
> The current list of open tickets targeted at 3.0.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.0.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
> Note: I fully expect this RC to fail.
>
>
>
>


Re: [VOTE] Apache Spark 3.0.0 RC1

2020-05-07 Thread Xiao Li
Below are the three major blockers. I think we should start discussing how
to unblock the release.


   
   - https://issues.apache.org/jira/browse/SPARK-31257
   - https://issues.apache.org/jira/browse/SPARK-31399
   - https://issues.apache.org/jira/browse/SPARK-31404

At this stage, for features/functions that were not supported in previous
releases, we should simply throw an exception and document it as a
limitation. We do not need to fix all of these things before the release;
not all of them are blockers.

Can we start RC2 next week?

Xiao


On Thu, May 7, 2020 at 5:28 PM Sean Owen  wrote:

> So, this RC1 doesn't pass of course, but what's the status of RC2 - are
> there outstanding issues?
>
> On Tue, Mar 31, 2020 at 10:04 PM Reynold Xin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.0.0.
>>
>> The vote is open until 11:59pm Pacific time Fri Apr 3, and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.0.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.0.0-rc1 (commit
>> 6550d0d5283efdbbd838f3aeaf0476c7f52a0fb1):
>> https://github.com/apache/spark/tree/v3.0.0-rc1
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1341/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-docs/
>>
>> The list of bug fixes going into 2.4.5 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12339177
>>
>> This release is using the release script of the tag v3.0.0-rc1.
>>
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with a out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.0.0?
>> ===
>> The current list of open tickets targeted at 3.0.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.0.0
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>
>> Note: I fully expect this RC to fail.
>>
>>
>>
>>




Re: [VOTE] Release Spark 2.4.6 (RC1)

2020-05-07 Thread Dongjoon Hyun
Hi, Holden.

The following link looks outdated. It was a link used at Spark 2.4.5 RC2.
- https://repository.apache.org/content/repositories/orgapachespark-1340/

Instead, in the Apache repo, there are three candidates. Is 1343 the one we
vote?
- https://repository.apache.org/content/repositories/orgapachespark-1341/
- https://repository.apache.org/content/repositories/orgapachespark-1342/
- https://repository.apache.org/content/repositories/orgapachespark-1343/

Bests,
Dongjoon.


On Thu, May 7, 2020 at 2:31 PM Holden Karau  wrote:

> Sorry correction: this vote is open until Monday May 11th at 9am pacific.
>
> On Thu, May 7, 2020 at 11:29 AM Holden Karau  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.4.6.
>>
>> The vote is open until February 5th 11PM PST and passes if a majority +1
>> PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 2.4.6
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> There are currently no issues targeting 2.4.6 (try project = SPARK AND
>> "Target Version/s" = "2.4.6" AND status in (Open, Reopened, "In Progress"))
>>
>> We _may_ want to hold the 2.4.6 release for something targetted to 2.4.7
>> ( project = SPARK AND "Target Version/s" = "2.4.7") , currently,
>> SPARK-24266 & SPARK-26908 and I believe there is some discussion on if we
>> should include SPARK-31399 in this release.
>>
>> The tag to be voted on is v2.4.5-rc2 (commit
>> a3cffc997035d11e1f6c092c1186e943f2f63544):
>> https://github.com/apache/spark/tree/v2.4.6-rc1
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc1-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1340/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc1-docs/
>>
>> The list of bug fixes going into 2.4.6 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12346781
>>
>> This release is using the release script of the tag v2.4.6-rc1.
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with a out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.4.6?
>> ===
>>
>> The current list of open tickets targeted at 2.4.5 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 2.4.6
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [VOTE] Release Spark 2.4.6 (RC1)

2020-05-07 Thread Holden Karau
Thanks for catching that, I was looking at the old release and looked for
all the refs to 2.4.5 but missed that.

1343 (
https://repository.apache.org/content/repositories/orgapachespark-1343/ )
is the one to vote on; the others were dry runs of the release script that
were a little less dry than I was expecting.

On Thu, May 7, 2020 at 6:43 PM Dongjoon Hyun 
wrote:

> Hi, Holden.
>
> The following link looks outdated. It was a link used at Spark 2.4.5 RC2.
> - https://repository.apache.org/content/repositories/orgapachespark-1340/
>
> Instead, in the Apache repo, there are three candidates. Is 1343 the one
> we vote?
> - https://repository.apache.org/content/repositories/orgapachespark-1341/
> - https://repository.apache.org/content/repositories/orgapachespark-1342/
> - https://repository.apache.org/content/repositories/orgapachespark-1343/
>
> Bests,
> Dongjoon.
>
>
> On Thu, May 7, 2020 at 2:31 PM Holden Karau  wrote:
>
>> Sorry correction: this vote is open until Monday May 11th at 9am pacific.
>>
>> On Thu, May 7, 2020 at 11:29 AM Holden Karau 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.4.6.
>>>
>>> The vote is open until February 5th 11PM PST and passes if a majority +1
>>> PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.4.6
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> There are currently no issues targeting 2.4.6 (try project = SPARK AND
>>> "Target Version/s" = "2.4.6" AND status in (Open, Reopened, "In Progress"))
>>>
>>> We _may_ want to hold the 2.4.6 release for something targetted to 2.4.7
>>> ( project = SPARK AND "Target Version/s" = "2.4.7") , currently,
>>> SPARK-24266 & SPARK-26908 and I believe there is some discussion on if we
>>> should include SPARK-31399 in this release.
>>>
>>> The tag to be voted on is v2.4.5-rc2 (commit
>>> a3cffc997035d11e1f6c092c1186e943f2f63544):
>>> https://github.com/apache/spark/tree/v2.4.6-rc1
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc1-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1340/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc1-docs/
>>>
>>> The list of bug fixes going into 2.4.6 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12346781
>>>
>>> This release is using the release script of the tag v2.4.6-rc1.
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with a out of date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 2.4.6?
>>> ===
>>>
>>> The current list of open tickets targeted at 2.4.5 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 2.4.6
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>

-- 
Twitter: https://twitter.com/holdenkarau

Re: [VOTE] Apache Spark 3.0.0 RC1

2020-05-07 Thread Jungtaek Lim
I don't see any new features/functions among these blockers.

For SPARK-31257 (which I filed and marked as a blocker), I agree that
unifying the create table syntax shouldn't be a blocker for Spark 3.0.0, as
that is a new feature, but even if we put the proposal aside, the problem
remains the same and I think it's still a blocker.

We have a discussion thread for SPARK-31257 - let's revive that thread if we
cannot adopt the proposed solution in Spark 3.0.0.
https://lists.apache.org/thread.html/rf1acfaaa3de2d3129575199c28e7d529d38f2783e7d3c5be2ac8923d%40%3Cdev.spark.apache.org%3E

On Fri, May 8, 2020 at 9:41 AM Xiao Li  wrote:

> Below are the three major blockers. I think we should start discussing how
> to unblock the release.
> 
>
>
>    - https://issues.apache.org/jira/browse/SPARK-31257
>    - https://issues.apache.org/jira/browse/SPARK-31399
>    - https://issues.apache.org/jira/browse/SPARK-31404
>
> In this stage, for the features/functions that are not supported in the
> previous releases, we should just simply throw an exception and document it
> as a limitation. We do not need to fix all the things to block the release.
> Not all of them are blockers.
>
> Can we start RC2 next week?
>
> Xiao
>
>
> On Thu, May 7, 2020 at 5:28 PM Sean Owen  wrote:
>
>> So, this RC1 doesn't pass of course, but what's the status of RC2 - are
>> there outstanding issues?
>>
>> On Tue, Mar 31, 2020 at 10:04 PM Reynold Xin  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 3.0.0.
>>>
>>> The vote is open until 11:59pm Pacific time Fri Apr 3, and passes if a
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.0.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.0.0-rc1 (commit
>>> 6550d0d5283efdbbd838f3aeaf0476c7f52a0fb1):
>>> https://github.com/apache/spark/tree/v3.0.0-rc1
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1341/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-docs/
>>>
>>> The list of bug fixes going into 2.4.5 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12339177
>>>
>>> This release is using the release script of the tag v3.0.0-rc1.
>>>
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with a out of date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.0.0?
>>> ===
>>> The current list of open tickets targeted at 3.0.0 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.0.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>>
>>> Note: I fully expect this RC to fail.
>>>
>>>
>>>
>>>
>
> --
> 
>


Executor exceptions stacktrace omitted by HotSpot in long running application

2020-05-07 Thread ZHANG Wei
Hi,

I'm considering improving the experience of hitting exceptions whose
stack traces are omitted in long-running applications[1], which is caused by
a JVM HotSpot optimization, as Shixiong (Ryan) commented[2].

There might be 2 options:
1. Add `-XX:-OmitStackTraceInFastThrow` as a common Executor JVM
option.
2. Add an Executor replacement workflow that starts a fresh JVM when
exception stack traces start being omitted.

I prefer option 1, although there might be performance concerns. But option 2
isn't expected to perform better either, since the JVM is replaced by a new
one precisely to stop it from running in the fast-throw mode that omits
stack traces.
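
(A minimal Scala sketch of option 1 set from application code, using the
standard spark.executor.extraJavaOptions setting; the same flag could equally
go into spark-defaults.conf or on the spark-submit command line.)

    import org.apache.spark.sql.SparkSession

    // Disable HotSpot's fast-throw optimization on executors so repeated
    // built-in exceptions (e.g. NullPointerException) keep their stack traces.
    val spark = SparkSession.builder()
      .appName("keep-executor-stack-traces")
      .config("spark.executor.extraJavaOptions", "-XX:-OmitStackTraceInFastThrow")
      .getOrCreate()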

May I get any concerns or opinions on this?

-- 
Thanks,
-z

[1] http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-NullPointerException-in-long-running-query-td37453.html
[2] https://stackoverflow.com/a/3010106

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org