Here's the proposal for supporting it in "append" mode:
https://github.com/apache/spark/pull/23576. You could check whether it
addresses your requirement and post your feedback on the PR.
For "update" mode it's going to be much harder to support this without first
adding support for "retractions", otherwis
+1 to upgrade Jackson. It has come up multiple times due to CVEs and the
backport has worked out, but it may be good to include if it's not going to
delay the release.
On Thu, 18 Apr 2019 at 19:53, Wenchen Fan wrote:
> I've cut RC1. If people think we must upgrade Jackson in 2.4, I can cut
> RC2
I am not sure how it would cause a leak, though. When a Spark session or the
underlying context is stopped, it should clean up everything. getOrCreate is
supposed to return the active thread-local or the global session. Maybe if
you keep creating new sessions after explicitly clearing the defaul
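The thread-local vs. global lookup order described above can be sketched in plain Python. This is an illustrative stand-in for the getOrCreate pattern, not Spark's actual implementation; the `Session` class and its fields are hypothetical:

```python
import threading

class Session:
    """Minimal stand-in for a SparkSession-like object (sketch only)."""
    _active = threading.local()   # per-thread active session
    _default = None               # process-wide default session
    _lock = threading.Lock()

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True

    @classmethod
    def get_or_create(cls):
        # 1. Prefer the thread-local active session, if any.
        active = getattr(cls._active, "session", None)
        if active is not None and not active.stopped:
            return active
        # 2. Fall back to the global default, creating it if missing/stopped.
        with cls._lock:
            if cls._default is None or cls._default.stopped:
                cls._default = cls()
            cls._active.session = cls._default
            return cls._default
```

With this lookup order, repeated calls on one thread return the same object until it is stopped, after which a fresh default is created; a stopped session should become unreachable rather than leak.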
Yes, I agree that it's a valid concern and leads to individual contributors
giving up on new ideas or major improvements.
On Tue, 26 Feb 2019 at 15:24, Jungtaek Lim wrote:
> Adding one more, it implicitly leads individual contributors to give up
> with challenging major things and just focus on
Unless it's some sink metadata to be maintained by the framework (e.g. sink
state that needs to be passed back to the sink), would it make sense
to keep it under the checkpoint dir?
Maybe I am missing the motivation of the proposed approach, but I guess
the sink mostly needs to store the last se
Congrats Jose! Well deserved.
On Tue, 29 Jan 2019 at 11:15, Jules Damji wrote:
> Congrats Jose!
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Jan 29, 2019, at 11:07 AM, shane knapp wrote:
>
> congrats, and welcome!
>
> On Tue, Jan 29, 2019 at 11:07 AM Dean Wampler
> wrote:
>
There have been efforts to come up with a unified syntax for streaming (see
[1] [2]), but I guess there will be differences based on the streaming
features supported by a system.
Agreed that it needs a detailed design, and it can be as close to the Spark
batch SQL syntax as possible.
Also I am not sure i
I guess that's roughly it.
As of now there's no built-in support to coordinate the commits across the
executors in an atomic way, so you need to commit the batch (global commit)
at the driver.
And when the batch is replayed, if any of the intermediate operations
are not idempotent or can cause
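One common way to make the driver-side global commit safe against batch replay is to key the commit on the batch id, so a replayed batch becomes a no-op. A minimal sketch of that idea, with a hypothetical `BatchIdLedger` standing in for the external sink's commit protocol (not a Spark API):

```python
class BatchIdLedger:
    """Driver-side ledger: commit each micro-batch at most once by
    remembering the last committed batch id (illustrative sketch)."""

    def __init__(self):
        self.last_committed = -1
        self.sink = []  # stands in for the external sink

    def commit(self, batch_id, rows):
        if batch_id <= self.last_committed:
            # Replayed batch: skip the write to keep output idempotent.
            return False
        self.sink.extend(rows)
        # Record the id only after the write succeeds.
        self.last_committed = batch_id
        return True

ledger = BatchIdLedger()
ledger.commit(0, ["a", "b"])
ledger.commit(0, ["a", "b"])  # replay of batch 0: ignored
ledger.commit(1, ["c"])
```

In a real sink the `(batch_id, write)` pair would need to be recorded atomically in the external system, otherwise a failure between the write and the ledger update still duplicates output.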
IMO, the currentOffset should not be optional.
For continuous mode I assume this offset gets periodically checkpointed
(so it is mandatory)?
For micro-batch mode the currentOffset would be the start offset for a
micro-batch.
And if the micro-batch could be executed without knowing the 'latest'
off
Thanks for bringing up the custom metrics API in the list; it's something
that needs to be addressed.
A couple more items worth considering:
1. Possibility to unify the batch, micro-batch and continuous sources
(similar to SPARK-25000).
Right now there is significant code duplication even
I don't think a separate API or RPCs would be necessary for queryable
state if the state can be exposed as just another data source. Then SQL
queries can be issued against it just like SQL queries against
any other data source.
For now I think the "memory" sink could be used as a s
Congratulations everyone!
On Wed, 3 Oct 2018 at 09:16, Dilip Biswal wrote:
> Congratulations to Shane Knapp, Dongjoon Hyun, Kazuaki Ishizaki, Xingbo
> Jiang, Yinan Li and Takeshi Yamamuro !!
>
> Regards,
> Dilip Biswal
> Tel: 408-463-4980
> dbis...@us.ibm.com
>
>
>
> - Original message ---
nism to take down
>>>> tasks after they have prepared and bring up the tasks during the commit
>>>> phase.
>>>>
>>>> I guess we already got into too much details here, but if it is based
>>>> on client transaction Spark must a
he
>>>> final commit.
>>>>
>>>> XA is also not trivial to get correct with the current execution
>>>> model: Spark doesn't require writer tasks to run at the same time, but to
>>>> achieve 2PC they should run until the end of the transact
In some cases the implementations may be OK with eventual consistency (and
do not care whether the output is written out atomically).
XA can be one option for data sources that support it and require
atomicity, but I am not sure how one would implement it with the current
API.
Maybe we need to discu
Ryan's proposal makes a lot of sense. It's better not to release half-baked
changes in 2.4 that not only break a lot of the APIs released in 2.3, but
are also expected to change further due to redesigns before 3.0, so I don't
see much value in releasing it in 2.4.
On Sun, 9 Sep 2018 at 22:42, Wenchen Fan wro
om/document/d/1DEOW3WQcPUq0YFgazkZx6Ei6
>>>> EOdj_3pXEsyq4LGpyNs/edit?usp=sharing
>>>>
>>>> Please note that I just copied the content to the google docs, so
>>>> someone could point out lack of details. I would like to start with
>>>> explanation o
Can you share this in a Google doc to make the discussions easier?
Thanks for coming up with ideas to improve upon the current restrictions
of the SS state store.
If I understood correctly, the plan is to introduce a logical partitioning
scheme for state storage (based on keys) independent
I guess sorting would make sense only when you have the complete data set. In
streaming you don’t know what record is coming next, so it doesn’t make sense
to sort (except in the aggregated complete output mode, where the entire result
table is emitted each time and the results can be sorted).
Th