* When you say refresh happens only for batch or non-streaming sources, I am assuming all kinds of DB sources like RDBMS, distributed data stores, file systems, etc. count as batch sources. Please correct if required.
It depends on how you read the dataframe. Any dataframe that you get by doing spark.readStream is a streaming dataframe. Any dataframe read by doing spark.read is a non-streaming dataframe. It doesn't matter what the underlying format is. Spark will refresh the entire non-streaming dataframe at the beginning of every microbatch. Note that if you cache the non-streaming dataframe, it won't be refreshed. Keep in mind that refreshing the dataframe on every microbatch introduces latency, because re-reading a non-streaming dataframe incurs IO. That latency depends on the size of the data being read; if your reference data is large, you will incur IO overhead on every microbatch.

From: Arti Pande <pande.a...@gmail.com>
Date: Friday, November 13, 2020 at 2:19 PM
To: "Lalwani, Jayesh" <jlalw...@amazon.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: RE: [EXTERNAL] Refreshing Data in Spark Memory (DataFrames)

Thanks for the quick response. This is a batch use case in the as-is world. We are redesigning it and intend to use streaming. Good to know that Spark streaming will refresh the data for every microbatch.

When you say refresh happens only for batch or non-streaming sources, I am assuming all kinds of DB sources like RDBMS, distributed data stores, file systems, etc. count as batch sources. Please correct if required.

Thanks & regards,
Arti Pande

On Sat, Nov 14, 2020 at 12:11 AM Lalwani, Jayesh <jlalw...@amazon.com> wrote:

Is this a streaming application or a batch application? Normally, for batch applications, you want to keep data consistent. If you have a portfolio of mortgages that you are computing payments for and the interest rate changes while you are computing payments, you don't want to compute half the mortgages with the older interest rate and the other half with the newer one. And if you run the same mortgages tomorrow, you don't want to get completely different results than what you got yesterday. The finance industry is kind of sensitive about things like this. You can't just change things willy-nilly.

In the past, I've worked in fintech for about 8 years, and IMO I've never heard of changing the reference data in the middle of a computation being a requirement. I would have given people heart attacks if I had told them the reference data was changing halfway through. I am pretty sure there are scenarios where this is required, but I have a hard time believing it is a common one. Maybe things in finance have changed in 2020.

Normally, any reference data has an "as of date" associated with it, and every record being processed has a timestamp associated with it. You match up your input with the reference data by matching the as-of date with the record's timestamp. When the reference data changes, you don't remove the old records from the reference data; you add records with the new "as of date". Essentially, you keep the history of the reference data, so if you have to rerun an old computation, your results don't change. There might be scenarios where you want to correct old reference data. In that case you update your reference table and rerun your computation.

Now, if you are talking about streaming applications, then it's a different story. You want to refresh your reference data. Spark reloads dataframes from batch sources at the beginning of every microbatch.
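As a rough sketch of that, here is what a stream-static join against a non-streaming reference table can look like. Everything concrete below (the paths, the schema, the column names, and the choice of a file-based streaming source) is made up purely for illustration:

import org.apache.spark.sql.types._

// Assumes an existing SparkSession named spark. Paths, schema, and column
// names are illustrative only.
val factSchema = new StructType()
  .add("key", StringType)
  .add("amount", DoubleType)
  .add("event_ts", TimestampType)

// Streaming side: new JSON files arriving in a directory.
val facts = spark.readStream
  .schema(factSchema)
  .json("s3://my-bucket/incoming/")

// Non-streaming side: loaded with spark.read, so (as described above) it is
// re-read at the beginning of every microbatch. Calling .cache() on it would
// pin one snapshot in memory and stop the refresh.
val reference = spark.read.parquet("s3://my-bucket/reference/")  // key, rate, ...

// Stream-static join: each microbatch of facts is joined against the
// freshly loaded reference data.
val enriched = facts.join(reference, Seq("key"), "left")

enriched.writeStream
  .format("console")
  .outputMode("append")
  .start()
  .awaitTermination()

The same idea applies whether the reference side is files, an RDBMS table, or a distributed store; what matters is only that it is loaded with spark.read rather than spark.readStream.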
As long as you are reading the data from a non-streaming source, it will get refreshed in every microbatch. Alternatively, you can send updates to the reference data through a stream, and then merge your historic reference data with the updates you are getting from the streaming source. (A rough sketch of one way to do that is at the bottom of this message.)

From: Arti Pande <pande.a...@gmail.com>
Date: Friday, November 13, 2020 at 1:04 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: [EXTERNAL] Refreshing Data in Spark Memory (DataFrames)

Hi,

In the financial systems world, if some data is being updated very frequently, and that data is to be used as reference data by a Spark job that runs for 6-7 hours, the job will most likely read that data at the beginning, keep it in memory as a DataFrame, and keep running for the remaining 6-7 hours. Meanwhile, if the reference data is updated by some other system, the Spark job's in-memory copy of that data (the dataframe) goes out of sync.

Is there a way to refresh that reference data in Spark memory / the dataframe by some means? This seems to be a very common scenario. Is there a solution or workaround for this?

Thanks & regards,
Arti Pande
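A rough sketch of the "send the updates through a stream" alternative mentioned above. Everything concrete here (paths, schema, column names, and the choice to append updates to a history and pick the latest version per key) is an assumption for illustration, not the only way to do it:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Assumes an existing SparkSession named spark. Paths, schema, and column
// names are illustrative only.
val updateSchema = new StructType()
  .add("key", StringType)
  .add("rate", DoubleType)
  .add("as_of_ts", TimestampType)

// Reference-data updates arriving as a stream (here, new JSON files).
val refUpdates = spark.readStream
  .schema(updateSchema)
  .json("s3://my-bucket/reference-updates/")

// Append every update to the reference history, keeping the old versions
// (the "as of date" history described earlier in this thread).
refUpdates.writeStream
  .format("parquet")
  .option("path", "s3://my-bucket/reference-history/")
  .option("checkpointLocation", "s3://my-bucket/checkpoints/reference-history/")
  .outputMode("append")
  .start()

// The main job reads the history with spark.read (non-streaming), so, as
// described above, it is refreshed at the beginning of every microbatch.
// It keeps only the latest row per key on the static side before joining.
// (Assumes the history path already holds an initial snapshot of the data.)
val refHistory = spark.read.parquet("s3://my-bucket/reference-history/")
val latestRef = refHistory
  .withColumn("rn", row_number().over(
    Window.partitionBy("key").orderBy(col("as_of_ts").desc)))
  .filter(col("rn") === 1)
  .drop("rn")

// latestRef can then be used as the static side of the stream-static join
// shown in the earlier sketch.

If you need reproducible reruns, you would instead join against the full history, matching each record's timestamp to the latest as-of timestamp at or before it, as described in the "as of date" paragraph above.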