correctness issue on chained streaming-streaming join

2019-06-12 Thread Jungtaek Lim
Hi devs, While helping a user on the user mailing list, I started to suspect that chained streaming-streaming joins work incorrectly, yet Structured Streaming doesn't prevent them. The reason is similar to why chained streaming aggregations are not supported in Structured Streaming: global watermark

Clean up method for DataSourceReader

2019-06-12 Thread Shubham Chaurasia
Hi All, Is there any way to receive an event when a DataSourceReader is finished? I want to do some cleanup after all the DataReaders have finished reading, and hence need some kind of cleanUp() mechanism at the DataSourceReader (driver) level. How can I achieve this? For instance, in DataSourceWriter

Re: Clean up method for DataSourceReader

2019-06-12 Thread Shubham Chaurasia
FYI, I am already using QueryExecutionListener, which satisfies the requirements. But that only works for the DataFrame APIs. If someone does df.rdd().someAction(), QueryExecutionListener is never invoked. I want something like QueryExecutionListener that also works for df.rdd().someAction(). I explor

[StructuredStreaming] HDFSBackedStateStoreProvider is leaking .crc files.

2019-06-12 Thread Gerard Maas
Hi! I would like to socialize this issue we are currently facing: The Structured Streaming default CheckpointFileManager leaks .crc files by leaving them behind after users of this class (like HDFSBackedStateStoreProvider) apply their cleanup methods. This results in an unbounded creation of tiny
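
To make the symptom concrete, here is a minimal sketch of how leftover checksum files could be spotted. It is not Spark or Hadoop code. It assumes the checkpoint directory is reachable as a local path and that checksum sidecars follow the Hadoop `.<name>.crc` naming; the helper `find_orphaned_crc_files` and the sample file names are hypothetical.

```python
import os
import tempfile


def find_orphaned_crc_files(directory):
    """Return '.<name>.crc' sidecar files whose data file '<name>' is gone.

    Hypothetical helper for illustration only; it inspects a local
    directory and assumes the '.<name>.crc' sidecar naming convention.
    """
    names = set(os.listdir(directory))
    suffix = ".crc"
    orphans = []
    for name in sorted(names):
        is_sidecar = (
            name.startswith(".")
            and name.endswith(suffix)
            and len(name) > len(suffix) + 1
        )
        if is_sidecar:
            data_name = name[1:-len(suffix)]  # strip leading '.' and '.crc'
            if data_name not in names:
                orphans.append(name)
    return orphans


# Simulate a directory where '2.delta' was cleaned up but its sidecar was not.
with tempfile.TemporaryDirectory() as d:
    for name in ["1.delta", ".1.delta.crc", ".2.delta.crc"]:
        open(os.path.join(d, name), "w").close()
    orphans = find_orphaned_crc_files(d)

print(orphans)
```

In this simulated directory only `.2.delta.crc` is reported, because its data file no longer exists while the sidecar was left behind.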

Re: [StructuredStreaming] HDFSBackedStateStoreProvider is leaking .crc files.

2019-06-12 Thread Gerard Maas
Ooops - linked the wrong JIRA ticket: (that other one is related) https://issues.apache.org/jira/browse/SPARK-28025 On Wed, Jun 12, 2019 at 1:21 PM Gerard Maas wrote: > Hi! > I would like to socialize this issue we are currently facing: > The Structured Streaming default CheckpointFileManager

Re: [StructuredStreaming] HDFSBackedStateStoreProvider is leaking .crc files.

2019-06-12 Thread Jungtaek Lim
Nice find! Given that you already pointed out a previous issue that fixed a similar problem, it should also be easy for you to craft a patch and verify whether it resolves your issue. Looking forward to seeing your patch. Thanks, Jungtaek Lim (HeartSaVioR) On Wed, Jun 12, 2019 at 8:23 PM Gerard Maas

RE: Adding Custom finalize method to RDDs.

2019-06-12 Thread Nasrulla Khan Haris
We cannot control when an RDD goes out of scope in memory, since that is handled by the JVM, so I am not sure try and finalize will help. I want some mechanism to clean up temporary data created by the RDD as soon as it goes out of scope. Any ideas? Thanks,

Exposing JIRA issue types at GitHub PRs

2019-06-12 Thread Dongjoon Hyun
Hi, All. Since we use both Apache JIRA and GitHub actively for Apache Spark contributions, we have lots of JIRAs and PRs consequently. One specific thing I've been longing to see is `Jira Issue Type` in GitHub. How about exposing JIRA issue types at GitHub PRs as GitHub `Labels`? There are two ma

Re: correctness issue on chained streaming-streaming join

2019-06-12 Thread Jungtaek Lim
A general representation of this issue would be: - a stateful operator evicts rows from state when the watermark passes them - in append mode, evicted rows are used as output rows, in other words, as input rows for the next stateful operator - the next stateful operator discards late input rows using the same wat
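
The three steps above can be sketched as a toy simulation. This is not Spark code: it assumes one global watermark shared by both operators, models event times as plain integers, and uses hypothetical names (`BufferingOperator`, `drop_late`, `run_chain`) purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BufferingOperator:
    """Toy first stateful operator: holds rows until the watermark passes them."""
    state: List[int] = field(default_factory=list)

    def process(self, rows, watermark):
        # Buffer the new rows, then evict everything older than the watermark.
        # In append mode the evicted rows become this operator's output.
        self.state.extend(rows)
        evicted = sorted(t for t in self.state if t < watermark)
        self.state = [t for t in self.state if t >= watermark]
        return evicted


def drop_late(rows, watermark):
    """Toy second stateful operator: discards inputs older than the watermark."""
    return [t for t in rows if t >= watermark]


def run_chain(batches, watermarks):
    """Feed each batch through both operators using the same watermark."""
    first = BufferingOperator()
    delivered, dropped = [], []
    for rows, wm in zip(batches, watermarks):
        emitted = first.process(rows, wm)
        delivered.extend(drop_late(emitted, wm))
        dropped.extend(t for t in emitted if t < wm)
    return delivered, dropped


# Batch 1 arrives while the watermark is 0; batch 2 arrives after it moved to 10.
delivered, dropped = run_chain(batches=[[1, 2, 3], [11, 12]], watermarks=[0, 10])
print(delivered, dropped)
```

Under these assumptions, every row the first operator emits is already older than the shared watermark, so the second operator treats all of them as late: rows 1, 2 and 3 are dropped and nothing is delivered downstream.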

Re: Exposing JIRA issue types at GitHub PRs

2019-06-12 Thread Hyukjin Kwon
Yea, I think we can automate this process via, for instance, https://github.com/apache/spark/blob/master/dev/github_jira_sync.py +1 for this sort of automatic categorization and metadata matching between JIRA and GitHub. Adding Josh and Sean as well. On Thu, 13 Jun 2019, 13:17 Dongjoon Hyun, wrote

Re: Exposing JIRA issue types at GitHub PRs

2019-06-12 Thread Reynold Xin
Seems like a good idea. Can we test this with a component first? On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun wrote: > Hi, All. > > Since we use both Apache JIRA and GitHub actively for Apache Spark > contributions, we have lots of JIRAs and PRs consequently. One specific > thing I've been long

Re: Exposing JIRA issue types at GitHub PRs

2019-06-12 Thread Marco Gaido
Hi Dongjoon, Thanks for the proposal! I like the idea. Maybe we can extend it to components too, and to some JIRA labels such as correctness, which may be worth highlighting in PRs as well. My only concern is that in many cases JIRAs are not created very carefully, so they may be incorrect at the moment of

Re: [DISCUSS][SPARK-25299] SPIP: Shuffle storage API

2019-06-12 Thread Saisai Shao
I think maybe we could start a vote on this SPIP. This has been discussed for a while, and the current doc is pretty complete for now. We have also seen a lot of demand in the community for building their own shuffle storage. Thanks Saisai Imran Rashid wrote on Tue, Jun 11, 2019 at 3:27 AM: > I would be happy