Hi devs,
While helping a user on the user mailing list, I started to suspect that
chained streaming-streaming joins work incorrectly, but Structured Streaming
doesn't prevent them. The reason is actually similar to why chained streaming
aggregations are not supported in Structured Streaming: the global watermark
Hi All,
Is there any way to receive some event when a DataSourceReader is finished?
I want to do some cleanup after all the DataReaders have finished reading,
and hence need some kind of cleanUp() mechanism at the DataSourceReader
(driver) level.
How can I achieve this?
For instance, in DataSourceWriter
FYI, I am already using QueryExecutionListener, which satisfies the
requirements.
But that only works for DataFrame APIs. If someone does
df.rdd().someAction(), QueryExecutionListener is never invoked. I want
something like QueryExecutionListener that also works in the
df.rdd().someAction() case.
I explor
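The cleanup hook being asked for above can be sketched in plain Python. This is a hypothetical illustration, not the Spark DataSource V2 API: the class and method names (`DriverSideReader`, `cleanUp`-style hook) are invented to show the desired lifecycle, where the driver-side reader runs its cleanup exactly once, after every partition reader finishes, mirroring how DataSourceWriter gets commit()/abort().

```python
# Hypothetical sketch (NOT the real Spark API): a driver-side reader that
# counts down its partition readers and runs clean_up() once all of them
# have finished, even if a partition read fails.

class PartitionReader:
    def __init__(self, rows, on_finish):
        self._rows = rows
        self._on_finish = on_finish  # callback into the driver-side reader

    def read_all(self):
        try:
            return list(self._rows)
        finally:
            self._on_finish()  # signal completion even on failure

class DriverSideReader:
    def __init__(self, partitions):
        self._pending = len(partitions)
        self.cleaned_up = False
        self.readers = [PartitionReader(p, self._reader_finished)
                        for p in partitions]

    def _reader_finished(self):
        self._pending -= 1
        if self._pending == 0:
            self.clean_up()

    def clean_up(self):
        # e.g. delete temporary data created for this scan
        self.cleaned_up = True

reader = DriverSideReader([[1, 2], [3, 4]])
data = [row for pr in reader.readers for row in pr.read_all()]
```

After the last `read_all()` returns, `reader.cleaned_up` is true; that "last reader done" signal is what the current DataSourceReader interface lacks.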
Hi!
I would like to socialize this issue we are currently facing:
The Structured Streaming default CheckpointFileManager leaks .crc files by
leaving them behind after users of this class (like
HDFSBackedStateStoreProvider) run their cleanup methods.
This results in an unbounded creation of tiny
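The leak described above can be reproduced with a toy local-filesystem simulation. This is a sketch, not Spark's actual code paths: the file names and helper functions are stand-ins. It only mimics the shape of the bug, namely that each state file gets a hidden `.<name>.crc` sibling (as Hadoop's local checksummed filesystem creates), while the cleanup path deletes only the state file itself.

```python
# Toy reproduction of the .crc leak: writes create a data file plus a
# hidden ".<name>.crc" sibling, but cleanup removes only the data file,
# so the tiny .crc files accumulate without bound.
import os
import tempfile

ckpt = tempfile.mkdtemp()

def write_state_file(name, data):
    # simulate a checksummed filesystem: data file + hidden .crc sibling
    with open(os.path.join(ckpt, name), "w") as f:
        f.write(data)
    with open(os.path.join(ckpt, "." + name + ".crc"), "w") as f:
        f.write("crc")

def leaky_cleanup(name):
    # simulates the buggy cleanup: removes the state file only
    os.remove(os.path.join(ckpt, name))

for batch in range(3):
    write_state_file("%d.delta" % batch, "state")
    leaky_cleanup("%d.delta" % batch)

leftovers = sorted(os.listdir(ckpt))
# every .delta file is gone, but all three .crc siblings remain
```

A fix would make the cleanup delete the .crc sibling alongside the state file (or avoid creating it in the first place).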
Ooops - linked the wrong JIRA ticket: (that other one is related)
https://issues.apache.org/jira/browse/SPARK-28025
On Wed, Jun 12, 2019 at 1:21 PM Gerard Maas wrote:
> Hi!
> I would like to socialize this issue we are currently facing:
> The Structured Streaming default CheckpointFileManager
Nice finding!
Given that you already pointed out a previous issue that fixed a similar
problem, it should also be easy for you to craft the patch and verify whether
the fix resolves your issue. Looking forward to seeing your patch.
Thanks,
Jungtaek Lim (HeartSaVioR)
On Wed, Jun 12, 2019 at 8:23 PM Gerard Maas
We cannot control when an RDD goes out of scope in memory, as that is
handled by the JVM. Thus I am not sure try/finalize will help.
Hence I wanted some mechanism to clean up temporary data created by an RDD
as soon as it goes out of scope.
Any ideas ?
Thanks,
Hi, All.
Since we use both Apache JIRA and GitHub actively for Apache Spark
contributions, we consequently have lots of JIRAs and PRs. One specific
thing I've been longing to see is the `Jira Issue Type` in GitHub.
How about exposing JIRA issue types on GitHub PRs as GitHub `Labels`? There
are two ma
A general representation of this issue would be:
- a stateful operator evicts rows from state when the watermark passes them
- in append mode, the evicted rows are used as output rows, in other words,
as input rows for the next stateful operator
- the next stateful operator would discard late input rows using the same wat
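The three bullets above can be sketched as a toy simulation. This is illustrative Python, not Spark code; the two functions are hypothetical stand-ins for two chained stateful operators sharing one global watermark.

```python
# Toy simulation of the chained-stateful-operator problem: stage 1 evicts
# rows at or behind the watermark and (in append mode) emits them
# downstream; stage 2 applies the SAME global watermark and drops those
# very rows as "late", so they never enter stage 2's state.

WATERMARK = 10  # global watermark shared by both stateful operators

def stage1_evict(state_rows):
    # append mode: rows evicted from state become the output rows
    return [t for t in state_rows if t <= WATERMARK]

def stage2_filter_late(input_rows):
    # the next stateful operator discards input at/behind the watermark
    return [t for t in input_rows if t > WATERMARK]

emitted = stage1_evict([5, 8, 12])      # 5 and 8 are evicted and emitted
survived = stage2_filter_late(emitted)  # both are discarded as late
```

Every row stage 1 emits is, by construction, already behind the watermark, so stage 2 silently drops all of it: exactly why chaining such operators under a single global watermark produces incorrect results.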
Yea, I think we can automate this process via, for instance,
https://github.com/apache/spark/blob/master/dev/github_jira_sync.py
+1 for this sort of automatic categorizing and matching of metadata between
JIRA and GitHub.
Adding Josh and Sean as well.
On Thu, 13 Jun 2019 at 13:17, Dongjoon Hyun wrote:
Seems like a good idea. Can we test this with a component first?
On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun
wrote:
> Hi, All.
>
> Since we use both Apache JIRA and GitHub actively for Apache Spark
> contributions, we have lots of JIRAs and PRs consequently. One specific
> thing I've been long
Hi Dongjoon,
Thanks for the proposal! I like the idea. Maybe we can extend it to
components too, and to some JIRA labels such as `correctness`, which may be
worth highlighting in PRs as well. My only concern is that in many cases JIRAs
are not created very carefully, so they may be incorrect at the moment of
I think maybe we could start a vote on this SPIP.
This has been discussed for a while, and the current doc is pretty complete
for now. Also, we have seen a lot of demand in the community for building
their own shuffle storage.
Thanks
Saisai
Imran Rashid wrote on Tue, Jun 11, 2019 at 3:27 AM:
> I would be happy