ApacheCon North America 2020, project participation
Hi, folks,

(Note: You're receiving this email because you're on the dev@ list for one or more Apache Software Foundation projects.)

For ApacheCon North America 2019, we asked projects to participate in the creation of project/topic-specific tracks. This was very successful, with about 15 projects stepping up to curate the content for their track/summit/event.

We need to know if you're going to do the same for 2020. This informs how large a venue we book for the event, how long the event runs, and many other considerations.

If you intend to participate again in 2020, we need to hear from you on the plann...@apachecon.com mailing list. This is not a firm commitment, but we need to know if you're, say, 75% confident that you'll be participating.

And, no, we do not have any details at all, but assume that it will be in roughly the same calendar space as this year's event, i.e., somewhere in the August-October timeframe.

Thanks.

--
Rich Bowen
VP Conferences
The Apache Software Foundation
@apachecon
[SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?
Hi,

I think I've got stuck and without your help I won't move any further. Please help.

I'm on Spark 2.4.4, developing a streaming Source (DSv1, MicroBatch). In the getBatch phase, when a DataFrame is requested, there is this assert [1] that I can't seem to get past with any DataFrame I've managed to create, since none of them is streaming:

    assert(batch.isStreaming,
      s"DataFrame returned by getBatch from $source did not have isStreaming=true\n" +
        s"${batch.queryExecution.logical}")

[1] https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L439-L441

All I could find is private[sql], e.g. SQLContext.internalCreateDataFrame(..., isStreaming = true) [2] or [3].

[2] https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L422-L428
[3] https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L62-L81

Regards,
Jacek Laskowski

https://about.me/JacekLaskowski
The Internals of Spark SQL https://bit.ly/spark-sql-internals
The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming
The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
Follow me at https://twitter.com/jaceklaskowski
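One possible (unsupported) workaround sketch for the above: since internalCreateDataFrame is private[sql], a small helper compiled into the org.apache.spark.sql package can reach it. The object and method names here are made up, and the approach relies on internal APIs that may change between releases.

    // Sketch only: relies on Spark's internal, package-private API.
    // Placing this file in the org.apache.spark.sql package grants access to private[sql] members.
    package org.apache.spark.sql

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.types.StructType

    // Hypothetical helper; not part of Spark.
    object StreamingDataFrameHelper {
      def createStreamingDataFrame(
          sqlContext: SQLContext,
          rows: RDD[InternalRow],
          schema: StructType): DataFrame = {
        // isStreaming = true is what lets the result pass the assert in MicroBatchExecution.
        sqlContext.internalCreateDataFrame(rows, schema, isStreaming = true)
      }
    }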
[DISCUSS] Preferred approach on dealing with SPARK-29322
Hi devs,

I've discovered an issue with the event logger, specifically when reading an incomplete event log file compressed with 'zstd' - the reader thread gets stuck reading that file.

This is very easy to reproduce: set the configuration as below

- spark.eventLog.enabled=true
- spark.eventLog.compress=true
- spark.eventLog.compression.codec=zstd

and start a Spark application. While the application is running, load the application in the SHS web page. It may succeed in replaying the event log, but most likely it will get stuck and the loading page will be stuck as well.

Please refer to SPARK-29322 for more details.

As the issue only occurs with 'zstd', the simplest approach is dropping support for 'zstd' for the event log. A more general approach would be introducing a timeout on reading the event log file, but it would have to differentiate between a thread being stuck and a thread busy reading a huge event log file.

Which approach would be preferred in the Spark community, or would someone propose better ideas for handling this?

Thanks,
Jungtaek Lim (HeartSaVioR)
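For reference, a minimal sketch of the reproduction setup described above, assuming a local run; the master, app name, and event log directory are placeholders:

    import org.apache.spark.sql.SparkSession

    // Sketch of the configuration from the report above; paths and master are placeholders.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("zstd-event-log-repro")
      .config("spark.eventLog.enabled", "true")
      .config("spark.eventLog.compress", "true")
      .config("spark.eventLog.compression.codec", "zstd")
      .config("spark.eventLog.dir", "file:///tmp/spark-events") // placeholder directory
      .getOrCreate()

    // Produce some events, then keep the application alive so its event log stays incomplete.
    spark.range(0, 10000000L).count()
    Thread.sleep(10 * 60 * 1000) // open the app in the History Server UI during this window

    spark.stop()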
Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?
Looks like it's missing, or it's intentional, to force custom streaming sources to be implemented as DSv2. I'm not sure the Spark community wants to expand the DSv1 API; I could propose the change if we get some support here.

To the Spark community: given that we are bringing major changes to DSv2, someone may want to rely on DSv1 while the transition from the old DSv2 to the new DSv2 happens and the new DSv2 stabilizes. Would we like to provide the necessary changes to DSv1?

Thanks,
Jungtaek Lim (HeartSaVioR)

On Wed, Oct 2, 2019 at 4:27 AM Jacek Laskowski wrote:

> Hi,
>
> I think I've got stuck and without your help I won't move any further.
> Please help.
>
> I'm with Spark 2.4.4 and am developing a streaming Source (DSv1,
> MicroBatch) and in getBatch phase when requested for a DataFrame, there is
> this assert [1] I can't seem to go past with any DataFrame I managed to
> create as it's not streaming.
>
> assert(batch.isStreaming,
>   s"DataFrame returned by getBatch from $source did not have isStreaming=true\n" +
>   s"${batch.queryExecution.logical}")
>
> [1] https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L439-L441
>
> All I could find is private[sql],
> e.g. SQLContext.internalCreateDataFrame(..., isStreaming = true) [2] or [3]
>
> [2] https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L422-L428
> [3] https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L62-L81
>
> Pozdrawiam,
> Jacek Laskowski
>
> https://about.me/JacekLaskowski
> The Internals of Spark SQL https://bit.ly/spark-sql-internals
> The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming
> The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
> Follow me at https://twitter.com/jaceklaskowski
Re: [DISCUSS] Preferred approach on dealing with SPARK-29322
It makes more sense to drop support for zstd, assuming the fix is not something at the Spark end (configuration, etc.). It does not make sense to try to detect a deadlock in the codec.

Regards,
Mridul

On Tue, Oct 1, 2019 at 8:39 PM Jungtaek Lim wrote:
>
> Hi devs,
>
> I've discovered an issue with event logger, specifically reading incomplete
> event log file which is compressed with 'zstd' - the reader thread got stuck
> on reading that file.
>
> This is very easy to reproduce: setting configuration as below
>
> - spark.eventLog.enabled=true
> - spark.eventLog.compress=true
> - spark.eventLog.compression.codec=zstd
>
> and start Spark application. While the application is running, load the
> application in SHS webpage. It may succeed to replay the event log, but high
> likely it will be stuck and loading page will be also stuck.
>
> Please refer SPARK-29322 for more details.
>
> As the issue only occurs with 'zstd', the simplest approach is dropping
> support of 'zstd' for event log. More general approach would be introducing
> timeout on reading event log file, but it should be able to differentiate
> thread being stuck vs thread busy with reading huge event log file.
>
> Which approach would be preferred in Spark community, or would someone
> propose better ideas for handling this?
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
Re: [DISCUSS] Preferred approach on dealing with SPARK-29322
Thank you for reporting, Jungtaek.

Can we try to upgrade it to the newer version first? Since we are at 1.4.2, the newer version is 1.4.3.

Bests,
Dongjoon.

On Tue, Oct 1, 2019 at 9:18 PM Mridul Muralidharan wrote:

> Makes more sense to drop support for zstd assuming the fix is not
> something at spark end (configuration, etc).
> Does not make sense to try to detect deadlock in codec.
>
> Regards,
> Mridul
[SS] Possible inconsistent semantics on metric "updated" between stateful operators
Hi devs,

I've noticed different semantics for the metric "updated" between (Flat)MapGroupsWithState and the other stateful operators.

* (Flat)MapGroupsWithState: removal is counted as updated
* others: removal is not counted as updated

Technically, the meanings of "removal" are different: (Flat)MapGroupsWithState requires the state function to remove state (removed via user logic), whereas the others evict state based on the watermark. So it's removed via user logic vs. removed automatically via Spark's mechanism.

Even taking that difference into account, it may still be confusing - end users would assume total state rows >= updated rows when they are playing with streaming aggregations or stream-stream joins, and when they start to use (Flat)MapGroupsWithState, they would find that their assumption is incorrect - it's possible for FlatMapGroupsWithState to have metrics (total 0, updated 1), which might look odd to them.

We have some options here:

1) It's by intention and it works as expected. Leave it as it is.
2) Don't increase "updated" when state is removed for FlatMapGroupsWithState.
3) Add a new metric "removed" and apply it to all stateful operators (both removal and eviction).

Would like to hear voices on this.

Thanks in advance,
Jungtaek Lim (HeartSaVioR)

* JIRA issue: https://issues.apache.org/jira/browse/SPARK-29312
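To make the "(total 0, updated 1)" case concrete, here is a hedged sketch of a flatMapGroupsWithState state function where user logic removes the state; the Event type, the threshold, and the query wiring are made up for illustration:

    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

    // Hypothetical event type and threshold, just for illustration.
    case class Event(key: String, value: Long)

    def stateFunc(key: String, events: Iterator[Event], state: GroupState[Long]): Iterator[String] = {
      val sum = state.getOption.getOrElse(0L) + events.map(_.value).sum
      if (sum > 100L) {
        // Removal via user logic: today this bumps "updated" for (Flat)MapGroupsWithState,
        // while the total state row count drops - hence (total 0, updated 1) is possible.
        state.remove()
        Iterator(s"$key done")
      } else {
        state.update(sum) // also counted as "updated"
        Iterator.empty
      }
    }

    // Sketch of wiring it into a streaming query (assuming `events: Dataset[Event]`):
    // events.groupByKey(_.key)
    //   .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.NoTimeout)(stateFunc)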
Re: [DISCUSS] Preferred approach on dealing with SPARK-29322
The changelog for zstd v1.4.3 suggests to me that the changes are not related:

https://github.com/facebook/zstd/blob/dev/CHANGELOG#L1-L5

v1.4.3
bug: Fix Dictionary Compression Ratio Regression by @cyan4973 (#1709)
bug: Fix Buffer Overflow in v0.3 Decompression by @felixhandte (#1722)
build: Add support for IAR C/C++ Compiler for Arm by @joseph0918 (#1705)
misc: Add NULL pointer check in util.c by @leeyoung624 (#1706)

But it's only a matter of a dependency update and rebuild, so I'll try it out.

Before that, I just noticed that ZstdOutputStream has a parameter "closeFrameOnFlush" which seems to deal with flushing. We leave it at its default value, which is "false". Let me set it to "true" and see whether it helps. Please let me know if someone knows why we picked false (or left it at the default).

On Wed, Oct 2, 2019 at 1:48 PM Dongjoon Hyun wrote:

> Thank you for reporting, Jungtaek.
>
> Can we try to upgrade it to the newer version first?
>
> Since we are at 1.4.2, the newer version is 1.4.3.
>
> Bests,
> Dongjoon.
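A rough sketch of what flipping the flag could look like where the zstd stream for the event log is built (loosely modeled on Spark 2.4's ZStdCompressionCodec; the constructor overload taking closeFrameOnFlush, the method name, and the default values here are assumptions, not a verified patch):

    import java.io.{BufferedOutputStream, OutputStream}
    import com.github.luben.zstd.ZstdOutputStream

    // Sketch only: assumes zstd-jni exposes an overload taking closeFrameOnFlush.
    // level/bufferSize mirror what the codec already configures; the values are placeholders.
    def compressedEventLogStream(s: OutputStream, level: Int = 1, bufferSize: Int = 32 * 1024): OutputStream = {
      // closeFrameOnFlush = true closes the zstd frame on flush(), so a reader tailing an
      // in-progress (incomplete) event log does not block on an open frame.
      new BufferedOutputStream(new ZstdOutputStream(s, level, /* closeFrameOnFlush = */ true), bufferSize)
    }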
Re: [DISCUSS] Preferred approach on dealing with SPARK-29322
I need to do a full manual test to make sure, but according to an experiment (a small unit test), "closeFrameOnFlush" seems to work.

There was a relevant change on the master branch, SPARK-26283 [1], which changed the way the zstd event log file is read to "continuous", which seems to read an open frame. With "closeFrameOnFlush" being false for ZstdOutputStream, the frame is never closed (even when flushing the output stream) until the output stream itself is closed.

I'll raise a patch once the manual test passes. Sorry for the false alarm.

Thanks,
Jungtaek Lim (HeartSaVioR)

1. https://issues.apache.org/jira/browse/SPARK-26283

On Wed, Oct 2, 2019 at 2:33 PM Jungtaek Lim wrote:

> The change log for zstd v1.4.3 feels me that the changes don't seem to be related.
>
> https://github.com/facebook/zstd/blob/dev/CHANGELOG#L1-L5
>
> v1.4.3
> bug: Fix Dictionary Compression Ratio Regression by @cyan4973 (#1709)
> bug: Fix Buffer Overflow in v0.3 Decompression by @felixhandte (#1722)
> build: Add support for IAR C/C++ Compiler for Arm by @joseph0918 (#1705)
> misc: Add NULL pointer check in util.c by @leeyoung624 (#1706)
>
> But it's only the matter of dependency update and rebuild, so I'll try it out.
>
> Before that, I just indicated ZstdOutputStream has a parameter
> "closeFrameOnFlush" which seems to deal with flush. We let the value as the
> default value which is "false". Let me pass the value to "true" and see it
> helps. Please let me know if someone knows why we pick the value as false
> (or let it by default).
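Along the lines of the small unit test mentioned above, a hedged sketch of the kind of check involved (assuming zstd-jni's ZstdOutputStream/ZstdInputStream and the closeFrameOnFlush overload; this is not the actual test from the patch):

    import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
    import com.github.luben.zstd.{ZstdInputStream, ZstdOutputStream}

    // With closeFrameOnFlush = true, data written and flushed (but not yet closed)
    // should already form a complete zstd frame that a reader can decode - which is
    // what continuous reading of an in-progress event log needs.
    val buffer = new ByteArrayOutputStream()
    val out = new ZstdOutputStream(buffer, 3, /* closeFrameOnFlush = */ true)
    out.write("some event log line\n".getBytes("UTF-8"))
    out.flush() // the frame is closed here, even though the stream stays open

    val in = new ZstdInputStream(new ByteArrayInputStream(buffer.toByteArray))
    val read = new Array[Byte](1024)
    val n = in.read(read)
    assert(n > 0) // with closeFrameOnFlush = false the frame would still be open and not yet decodable
    println(new String(read, 0, n, "UTF-8"))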