Thanks for the quick response. This is a batch use case in the as-is
world; we are redesigning it and intend to use streaming. Good to know
that Spark streaming will refresh the data on every micro-batch.
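To make sure I understand the streaming approach correctly, here is a
rough, untested sketch of what we are planning. The Kafka topic, file
paths, and column names below are made up purely for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("reference-refresh-sketch")
  .getOrCreate()

// Schema of the streaming records that need to be enriched.
val tradeSchema = StructType(Seq(
  StructField("trade_id", StringType),
  StructField("rate_id", StringType),
  StructField("amount", DoubleType),
  StructField("event_ts", TimestampType)))

// Streaming side: the records to be enriched.
val trades = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "trades")
  .load()
  .select(from_json(col("value").cast("string"), tradeSchema).as("t"))
  .select("t.*")

// Static side: reference data from a non-streaming (batch) source. It is
// deliberately not cached, so it is read again when each micro-batch runs
// and picks up whatever the other system has written in the meantime.
val rates = spark.read.parquet("/data/ref/rates")  // columns: rate_id, rate

// Stream-static join: every micro-batch is enriched with the reference
// data as it looks at that point in time.
val enriched = trades.join(rates, Seq("rate_id"), "left")

enriched.writeStream
  .format("parquet")
  .option("path", "/data/out/enriched")
  .option("checkpointLocation", "/data/chk/enriched")
  .start()

Based on your explanation, we will also be careful not to call cache() /
persist() on the reference DataFrame, since that would pin the old copy in
memory and defeat the per-micro-batch refresh.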
When you say the refresh happens only for batch or non-streaming sources, I
am assuming that covers all kinds of sources -- RDBMS, distributed data
stores, file systems, etc. -- as long as they are read as batch sources.
Please correct me if that is wrong. (I have also put a small sketch of the
"as of date" lookup you describe at the bottom of this mail, to check my
understanding.)

Thanks & regards,
Arti Pande

On Sat, Nov 14, 2020 at 12:11 AM Lalwani, Jayesh <jlalw...@amazon.com> wrote:

> Is this a streaming application or a batch application?
>
> Normally, for batch applications, you want to keep data consistent. If
> you have a portfolio of mortgages that you are computing payments for and
> the interest rate changes while you are computing payments, you don't want
> to compute half the mortgages with the older interest rate and the other
> half with the newer interest rate. And if you run the same mortgages
> tomorrow, you don't want to get completely different results than what you
> got yesterday. The finance industry is kind of sensitive about things like
> this. You can't just change things willy-nilly.
>
> In the past, I've worked in fintech for about 8 years, and IMO, I've
> never heard of changing the reference data in the middle of a computation
> being a requirement. I would have given people heart attacks if I had told
> them that the reference data was changing halfway through. I am pretty
> sure there are scenarios where this is required, but I have a hard time
> believing that it is a common one. Maybe things in finance have changed in
> 2020.
>
> Normally, any reference data has an "as of date" associated with it, and
> every record being processed has a timestamp associated with it. You match
> up your input with the reference data by matching the as-of date with the
> timestamp. When the reference data changes, you don't remove the old
> records from the reference data; you add records with the new "as of
> date". Essentially, you keep the history of the reference data. So, if you
> have to rerun an old computation, your results don't change. There might
> be scenarios where you want to correct old reference data; in that case,
> you update your reference table and rerun your computation.
>
> Now, if you are talking about streaming applications, then it's a
> different story. You want to refresh your reference data. Spark reloads
> the dataframes from batch sources at the beginning of every microbatch. As
> long as you are reading the data from a non-streaming source, it will get
> refreshed in every microbatch. Alternatively, you can send updates to the
> reference data through a stream, and then merge your historic reference
> data with the updates that you are getting from the streaming source.
>
> *From: *Arti Pande <pande.a...@gmail.com>
> *Date: *Friday, November 13, 2020 at 1:04 PM
> *To: *"user@spark.apache.org" <user@spark.apache.org>
> *Subject: *[EXTERNAL] Refreshing Data in Spark Memory (DataFrames)
>
> Hi
>
> In the financial systems world, if some data is updated very frequently
> and that data is to be used as reference data by a Spark job that runs for
> 6/7 hours, the Spark job will most likely read that data at the beginning,
> keep it in memory as a DataFrame, and keep running for the remaining 6/7
> hours. Meanwhile, if the reference data is updated by some other system,
> the Spark job's in-memory copy of that data (the data frame) goes out of
> sync.
>
> Is there a way to refresh that reference data in Spark memory / dataframe
> by some means?
>
> This seems to be a very common scenario. Is there a solution / workaround
> for this?
>
> Thanks & regards,
> Arti Pande
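P.S. To check my understanding of the "as of date" lookup you describe,
here is a minimal sketch. The table paths and the rate_id / as_of_date /
event_ts column names are again made up, and it assumes a SparkSession
named spark is already available:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Reference history keeps every version: (rate_id, as_of_date, rate).
val refHistory = spark.read.parquet("/data/ref/rates_history")

// Input records each carry a timestamp: (trade_id, rate_id, amount, event_ts).
val trades = spark.read.parquet("/data/in/trades")

// Join each trade to every reference version that was already effective
// at the trade's timestamp, then keep only the latest such version.
val candidates = trades
  .join(refHistory, Seq("rate_id"))
  .where(col("as_of_date") <= col("event_ts"))

val latestPerTrade = Window
  .partitionBy("trade_id")
  .orderBy(col("as_of_date").desc)

val enriched = candidates
  .withColumn("rn", row_number().over(latestPerTrade))
  .where(col("rn") === 1)
  .drop("rn")

// Because the history is append-only, rerunning the same input later gives
// the same results, even after newer reference versions have been added.

If that matches what you had in mind, we would use this pattern for the
parts of the pipeline that remain batch.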