Hello there,
I have a quick question for the following case:
Situation:
A Spark consumer is able to process 5 batches in 10 sec (where the batch
interval is zero by default - correct me if this is wrong). The window size
is 10 sec (non-overlapping sliding).
There are some fluctuations in the inc
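For reference, a minimal sketch of the setup described above, assuming the DStream API and a hypothetical socket source; note that in this API the batch interval is passed explicitly to StreamingContext rather than defaulting to zero, and the 2-second interval below is just an assumption:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("windowed-consumer")
    // Batch interval must be specified; 2 seconds is an illustrative choice.
    val ssc = new StreamingContext(conf, Seconds(2))

    // Hypothetical source for illustration only.
    val lines = ssc.socketTextStream("localhost", 9999)

    // 10-second window with a 10-second slide, i.e. non-overlapping windows.
    val windowedCounts = lines.window(Seconds(10), Seconds(10)).count()
    windowedCounts.print()

    ssc.start()
    ssc.awaitTermination()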
@holdenK et al ping on next steps.
Sent from my iPhone
On Jul 12, 2018, at 3:47 PM, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
Thanks so much, Maximiliano, for responding; I didn't want this discussion to
disappear into the wilderness of dev emails :). Here's what I would like to see
or
I'd agree it might make sense to bundle this into an API. We'd have to
think about whether it's a common enough use case to justify the API
complexity.
It might be worth exploring decoupling state and partitions, but I wouldn't
want to start making decisions based on it without a clearer design
pi
Coalesce might work.
Say "spark.sql.shuffle.partitions" = 200; then
"input.readStream.map.filter.groupByKey(..).coalesce(2)" would still
create 200 instances for state but execute just 2 tasks.
However, I think further groupByKey operations downstream would need a similar
coalesce.
And thi
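A minimal sketch of that idea; the source, host, and port are placeholders, and the 200-state-instances-vs-2-tasks behaviour is the claim from the message above, not something this snippet verifies:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("coalesce-after-state").getOrCreate()
    import spark.implicits._

    // 200 shuffle partitions -> 200 state store instances for the aggregation.
    spark.conf.set("spark.sql.shuffle.partitions", "200")

    val input = spark.readStream
      .format("socket")                 // hypothetical source
      .option("host", "localhost")
      .option("port", "9999")
      .load()
      .as[String]

    val counts = input
      .map(_.trim)
      .filter(_.nonEmpty)
      .groupByKey(v => v)               // stateful aggregation over 200 partitions
      .count()
      .coalesce(2)                      // downstream execution collapsed to 2 tasks

    val query = counts.writeStream
      .outputMode("update")
      .format("console")
      .start()
    query.awaitTermination()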
I'd like to propose adding a plugin API for Executors, primarily for
instrumentation and debugging
(https://issues.apache.org/jira/browse/SPARK-24918). The changes are small,
but as it's adding a new API, it might be SPIP-worthy. I mentioned it as
well in a recent email I sent about memory monitoring
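For context, a rough sketch of the kind of plugin this could enable; the ExecutorPlugin name and the init/shutdown hooks are my reading of the SPARK-24918 proposal, so treat the exact interface and the spark.executor.plugins config as assumptions and check the JIRA for the real API:

    import org.apache.spark.ExecutorPlugin

    // Illustrative only: logs executor heap usage every 10 seconds.
    class MemoryMonitorPlugin extends ExecutorPlugin {
      @volatile private var running = true

      private val monitor = new Thread(new Runnable {
        override def run(): Unit = {
          while (running) {
            val rt = Runtime.getRuntime
            val usedMb = (rt.totalMemory - rt.freeMemory) / (1024 * 1024)
            println(s"executor heap used: $usedMb MB")
            Thread.sleep(10000)
          }
        }
      }, "executor-memory-monitor")

      override def init(): Unit = {
        monitor.setDaemon(true)
        monitor.start()
      }

      override def shutdown(): Unit = {
        running = false
      }
    }

    // Hypothetical usage: --conf spark.executor.plugins=com.example.MemoryMonitorPlugin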
A coalesced RDD will definitely maintain any within-partition invariants
that the original RDD maintained. It pretty much just runs its input
partitions sequentially.
There'd still be some DataFrame API work needed to get the coalesce
operation where you want it to be, but this is much simpler tha
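A tiny illustration of that point with made-up data: coalesce without shuffle concatenates whole input partitions, so an ordering that held within each original partition still holds afterwards.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("coalesce-invariants").getOrCreate()

    // 4 partitions, each sorted internally (the within-partition invariant).
    val rdd = spark.sparkContext
      .parallelize(1 to 12, numSlices = 4)
      .mapPartitions(it => it.toSeq.sorted.iterator)

    // No shuffle: each output partition is its input partitions run back to back,
    // so every original partition's contents stay contiguous and in order.
    val coalesced = rdd.coalesce(2, shuffle = false)
    coalesced.glom().collect().foreach(p => println(p.mkString(", ")))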
I'm afraid I don't know the details of coalesce(), but from some resources I
found about coalesce, it looks like it helps reduce the number of actual
partitions.
For streaming aggregation, state for all partitions (200 by default) must
be initialized and committed even if it is unchanged. Otherwise an error
occ
Scheduling multiple partitions in the same task is basically what
coalesce() does. Is there a reason that doesn't work here?
On Fri, Aug 3, 2018 at 5:55 AM, Jungtaek Lim wrote:
> Here's a link for Google docs (anyone can comment):
> https://docs.google.com/document/d/1DEOW3WQcPUq0YFgazkZx6Ei6EOd
Here's a link to the Google doc (anyone can comment):
https://docs.google.com/document/d/1DEOW3WQcPUq0YFgazkZx6Ei6EOdj_3pXEsyq4LGpyNs/edit?usp=sharing
Please note that I just copied the content into the Google doc, so someone
could point out a lack of details. I would like to start with an explanation of
Hi everyone,
I am building a framework on top of Spark in which users specify SQL
queries and we analyze them in order to extract some metadata. Moreover,
SQL queries can be composed, meaning that if a user writes a query X to
build a dataset, another user can use X in his own query to refer to it
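One hedged illustration of the composition part; the table and view names are made up, and this only shows temp views as the composition mechanism plus the schema and plans as examples of extractable metadata, not the framework itself:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("composed-queries").getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for a user-provided dataset.
    Seq(("u1", 150.0), ("u2", 90.0), ("u1", 300.0))
      .toDF("user_id", "amount")
      .createOrReplaceTempView("sales")

    // User A's query X, registered under a name other users can refer to.
    val x = spark.sql("SELECT user_id, amount FROM sales WHERE amount > 100")
    x.createOrReplaceTempView("big_sales")

    // User B's query composed on top of X.
    val y = spark.sql("SELECT user_id, SUM(amount) AS total FROM big_sales GROUP BY user_id")

    // Examples of metadata that can be extracted by analyzing the composed query.
    println(y.schema.treeString)    // output columns and their types
    y.explain(true)                 // parsed/analyzed/optimized plans for deeper analysis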
Hi All,
I have a scenario where my streaming application is running, and in between
the application has restarted. So basically I want to start from the most recent
data and go back to
old data. Is there any way in Spark we can do this, or are we planning to
provide such kind of flexibility in the future
Can you share this in a Google doc to make the discussions easier?
Thanks for coming up with ideas to improve upon the current restrictions
with the SS state store.
If I understood correctly, the plan is to introduce a logical partitioning
scheme for state storage (based on keys) independent