ApacheCon North America 2020, project participation

2019-10-01 Thread Rich Bowen
Hi, folks,

(Note: You're receiving this email because you're on the dev@ list for
one or more Apache Software Foundation projects.)

For ApacheCon North America 2019, we asked projects to participate in
the creation of project/topic-specific tracks. This was very successful,
with about 15 projects stepping up to curate the content for their
track/summit/event.

We need to know if you're going to do the same for 2020. This informs
how large a venue we book for the event, how long the event runs, and
many other considerations.

If you intend to participate again in 2020, we need to hear from you on
the plann...@apachecon.com mailing list. This is not a firm commitment,
but we need to know if you're, say, 75% confident that you'll be
participating.

And, no, we do not have any details at all, but assume that it will be
in roughly the same calendar space as this year's event, i.e., somewhere
in the August-October timeframe.

Thanks.

-- 
Rich Bowen
VP Conferences
The Apache Software Foundation
@apachecon




[SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?

2019-10-01 Thread Jacek Laskowski
Hi,

I think I'm stuck, and without your help I won't get any further.
Please help.

I'm on Spark 2.4.4, developing a streaming Source (DSv1, MicroBatch).
In the getBatch phase, when requested for a DataFrame, there is this
assert [1] that I can't get past with any DataFrame I've managed to
create, since none of them is streaming.

  assert(batch.isStreaming,
    s"DataFrame returned by getBatch from $source did not have isStreaming=true\n" +
      s"${batch.queryExecution.logical}")

[1]
https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L439-L441

All I could find is private[sql] API,
e.g. SQLContext.internalCreateDataFrame(..., isStreaming = true) [2] or [3].

[2]
https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L422-L428
[3]
https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L62-L81
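
For reference, here is a minimal sketch of one possible workaround: a
hypothetical helper compiled into a package under org.apache.spark.sql so
that it can reach the private[sql] API. This is not a public Spark API and
may break across versions:

  package org.apache.spark.sql

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.types.StructType

  // Hypothetical helper -- it compiles only because it lives in the
  // org.apache.spark.sql package and can therefore see private[sql] members.
  object StreamingDataFrameHelper {
    // Delegates to SQLContext.internalCreateDataFrame [2] with
    // isStreaming = true, which produces a DataFrame that passes the
    // assert in MicroBatchExecution [1].
    def createStreamingDataFrame(
        sqlContext: SQLContext,
        rows: RDD[InternalRow],
        schema: StructType): DataFrame = {
      sqlContext.internalCreateDataFrame(rows, schema, isStreaming = true)
    }
  }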

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
The Internals of Spark SQL https://bit.ly/spark-sql-internals
The Internals of Spark Structured Streaming
https://bit.ly/spark-structured-streaming
The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
Follow me at https://twitter.com/jaceklaskowski


[DISCUSS] Preferred approach on dealing with SPARK-29322

2019-10-01 Thread Jungtaek Lim
Hi devs,

I've discovered an issue with the event logger, specifically when reading
an incomplete event log file compressed with 'zstd': the reader thread
gets stuck on reading that file.

This is very easy to reproduce: set the configuration as below

- spark.eventLog.enabled=true
- spark.eventLog.compress=true
- spark.eventLog.compression.codec=zstd

and start a Spark application. While the application is running, load the
application in the SHS (Spark History Server) web page. It may succeed in
replaying the event log, but most likely it will get stuck, and the
loading page will be stuck as well.
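
In code form, the reproduction setup amounts to roughly this minimal
sketch (the app name and the application logic are hypothetical
placeholders; the job only needs to run long enough to leave an
in-progress event log behind):

  import org.apache.spark.sql.SparkSession

  // Minimal sketch of the reproduction setup described above.
  val spark = SparkSession.builder()
    .appName("zstd-eventlog-repro") // hypothetical app name
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.compress", "true")
    .config("spark.eventLog.compression.codec", "zstd")
    .getOrCreate()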

Please refer to SPARK-29322 for more details.

As the issue only occurs with 'zstd', the simplest approach is dropping
support for 'zstd' for the event log. A more general approach would be
introducing a timeout on reading the event log file, but it would need to
differentiate a thread that is stuck from a thread that is busy reading a
huge event log file.

Which approach would the Spark community prefer? Or does someone have a
better idea for handling this?

Thanks,
Jungtaek Lim (HeartSaVioR)


Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?

2019-10-01 Thread Jungtaek Lim
Looks like it's missing, or it's intended to force custom streaming
sources to be implemented as DSv2.

I'm not sure the Spark community wants to expand the DSv1 API; I could
propose the change if we get some support here.

To the Spark community: given that we are bringing major changes to DSv2,
some people may want to rely on DSv1 while the transition from the old
DSv2 to the new DSv2 happens and the new DSv2 stabilizes. Would we like
to provide the necessary changes to DSv1?

Thanks,
Jungtaek Lim (HeartSaVioR)



Re: [DISCUSS] Preferred approach on dealing with SPARK-29322

2019-10-01 Thread Mridul Muralidharan
It makes more sense to drop support for zstd, assuming the fix is not
something on the Spark end (configuration, etc.).
It does not make sense to try to detect a deadlock in the codec.

Regards,
Mridul




Re: [DISCUSS] Preferred approach on dealing with SPARK-29322

2019-10-01 Thread Dongjoon Hyun
Thank you for reporting, Jungtaek.

Can we try to upgrade it to the newer version first?

Since we are at 1.4.2, the newer version is 1.4.3.

Bests,
Dongjoon.





[SS] Possible inconsistent semantics on metric "updated" between stateful operators

2019-10-01 Thread Jungtaek Lim
Hi devs,

I've noticed that the semantics of the metric "updated" differ between
(Flat)MapGroupsWithState and the other stateful operators.

* (Flat)MapGroupsWithState: removal is counted as updated
* others: removal is not counted as updated

Technically, the meanings of "removal" are different:
(Flat)MapGroupsWithState requires the state function to remove state
(removal via user logic), whereas the others evict state based on the
watermark. So it's removal via user logic vs. removal done automatically
by Spark's own mechanism.

Even taking that difference into account, it may still be confusing: end
users who have played with streaming aggregations or stream-stream joins
would assume total state rows >= updated rows, and when they start to use
(Flat)MapGroupsWithState they would find that assumption is incorrect -
it's possible for FlatMapGroupsWithState to report metrics of (total 0,
updated 1), which might look odd to them.
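
For illustration, here is a hypothetical state function (my sketch, not
code from the JIRA) where removal via user logic produces exactly that
combination:

  import org.apache.spark.sql.streaming.GroupState

  // Hypothetical function for flatMapGroupsWithState. When the user code
  // calls state.remove(), the row disappears from the state store but is
  // still counted in "updated" -- so a batch can end with total 0, updated 1.
  def stateFunc(key: String, values: Iterator[Long], state: GroupState[Long]): Iterator[String] = {
    val sum = state.getOption.getOrElse(0L) + values.sum
    if (sum > 100L) {
      state.remove()    // removal via user logic; bumps "updated" today
      Iterator(s"$key finished with $sum")
    } else {
      state.update(sum) // a genuine update; also bumps "updated"
      Iterator.empty
    }
  }

(This would be wired in via something like
groupByKey(...).flatMapGroupsWithState(...)(stateFunc).)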

We have some options here:

1) It's intentional and works as expected. Leave it as it is.
2) Don't increase "updated" when state is removed for FlatMapGroupsWithState
3) Add a new metric "removed" and apply this to all stateful operators
(both removal and eviction)

I would like to hear voices on this.

Thanks in advance,
Jungtaek Lim (HeartSaVioR)

* JIRA issue: https://issues.apache.org/jira/browse/SPARK-29312


Re: [DISCUSS] Preferred approach on dealing with SPARK-29322

2019-10-01 Thread Jungtaek Lim
The changelog for zstd v1.4.3 suggests to me that the changes aren't
related:

https://github.com/facebook/zstd/blob/dev/CHANGELOG#L1-L5

v1.4.3
bug: Fix Dictionary Compression Ratio Regression by @cyan4973 (#1709)
bug: Fix Buffer Overflow in v0.3 Decompression by @felixhandte (#1722)
build: Add support for IAR C/C++ Compiler for Arm by @joseph0918 (#1705)
misc: Add NULL pointer check in util.c by @leeyoung624 (#1706)

But it's only a matter of a dependency update and rebuild, so I'll try it
out.

Before that, I just noticed that ZstdOutputStream has a parameter
"closeFrameOnFlush" which seems to deal with flushing. We leave it at its
default value, which is "false". Let me set it to "true" and see whether
that helps. Please let me know if someone knows why we picked "false" (or
left it at the default).
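
To make that concrete, a sketch of the change I have in mind, assuming
the three-argument zstd-jni constructor
ZstdOutputStream(OutputStream, level, closeFrameOnFlush) available in the
1.4.x line (the actual wiring inside Spark's CompressionCodec would
differ):

  import java.io.{BufferedOutputStream, OutputStream}
  import com.github.luben.zstd.ZstdOutputStream

  // Hypothetical sketch: build the event log output stream with
  // closeFrameOnFlush = true, so that each flush() closes the current zstd
  // frame and a concurrent reader (e.g. the SHS) can decode everything
  // written so far instead of blocking on an open frame.
  def eventLogOutputStream(out: OutputStream, level: Int): OutputStream =
    new BufferedOutputStream(new ZstdOutputStream(out, level, true))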




Re: [DISCUSS] Preferred approach on dealing with SPARK-29322

2019-10-01 Thread Jungtaek Lim
I need to do a full manual test to make sure, but according to an
experiment (a small unit test), "closeFrameOnFlush" seems to work.

There was a relevant change on the master branch, SPARK-26283 [1], which
changed the way the zstd event log file is read to "continuous", which
apparently reads an open frame. With "closeFrameOnFlush" being false for
ZstdOutputStream, a frame is never closed (even when flushing the output
stream) until the output stream itself is closed.
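
The experiment was roughly shaped like this (my sketch under the same
zstd-jni constructor assumption as before, not the actual unit test):

  import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
  import com.github.luben.zstd.{ZstdInputStream, ZstdOutputStream}

  // Write through a zstd stream and flush WITHOUT closing it, mimicking an
  // in-progress event log. With closeFrameOnFlush = true, the flushed frame
  // is complete, so the reader can decode it rather than wait on an open frame.
  val bytes = new ByteArrayOutputStream()
  val zout = new ZstdOutputStream(bytes, 1, true) // closeFrameOnFlush = true
  zout.write("incomplete event log content".getBytes("UTF-8"))
  zout.flush() // closes the current frame; the stream itself stays open

  val zin = new ZstdInputStream(new ByteArrayInputStream(bytes.toByteArray))
  val buf = new Array[Byte](1024)
  val n = zin.read(buf)
  assert(new String(buf, 0, n, "UTF-8") == "incomplete event log content")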

I'll raise a patch once the manual test passes. Sorry for the false alarm.

Thanks,
Jungtaek Lim (HeartSaVioR)

1. https://issues.apache.org/jira/browse/SPARK-26283
