Thanks for the quick response. This is a batch use case in the as-is
world; we are redesigning it and intend to use streaming. Good to know
that Spark streaming will refresh the data on every micro-batch.
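To make sure I understand the streaming approach correctly, here is a
rough, untested sketch of what we are planning. The Kafka topic, file
paths, and column names below are made up purely for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("reference-refresh-sketch")
  .getOrCreate()

// Schema of the streaming records that need to be enriched.
val tradeSchema = StructType(Seq(
  StructField("trade_id", StringType),
  StructField("rate_id", StringType),
  StructField("amount", DoubleType),
  StructField("event_ts", TimestampType)))

// Streaming side: the records to be enriched.
val trades = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "trades")
  .load()
  .select(from_json(col("value").cast("string"), tradeSchema).as("t"))
  .select("t.*")

// Static side: reference data from a non-streaming (batch) source. It is
// deliberately not cached, so it is read again when each micro-batch runs
// and picks up whatever the other system has written in the meantime.
val rates = spark.read.parquet("/data/ref/rates")  // columns: rate_id, rate

// Stream-static join: every micro-batch is enriched with the reference
// data as it looks at that point in time.
val enriched = trades.join(rates, Seq("rate_id"), "left")

enriched.writeStream
  .format("parquet")
  .option("path", "/data/out/enriched")
  .option("checkpointLocation", "/data/chk/enriched")
  .start()

Based on your explanation, we will also be careful not to call cache() /
persist() on the reference DataFrame, since that would pin the old copy in
memory and defeat the per-micro-batch refresh.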
When you say the refresh happens only for batch or non-streaming sources, I
am assuming that covers all kinds of sources -- RDBMS, distributed data
stores, file systems, etc. -- as long as they are read as batch sources.
Please correct me if that is wrong. (I have also put a small sketch of the
"as of date" lookup you describe at the bottom of this mail, to check my
understanding.)

Thanks & regards,
Arti Pande

On Sat, Nov 14, 2020 at 12:11 AM Lalwani, Jayesh <jlalw...@amazon.com> wrote:

> Is this a streaming application or a batch application?
>
> Normally, for batch applications, you want to keep data consistent. If
> you have a portfolio of mortgages that you are computing payments for and
> the interest rate changes while you are computing payments, you don't want
> to compute half the mortgages with the older interest rate and the other
> half with the newer interest rate. And if you run the same mortgages
> tomorrow, you don't want to get completely different results than what you
> got yesterday. The finance industry is kind of sensitive about things like
> this. You can't just change things willy-nilly.
>
> In the past, I've worked in fintech for about 8 years, and IMO, I've
> never heard of changing the reference data in the middle of a computation
> being a requirement. I would have given people heart attacks if I had told
> them that the reference data was changing halfway through. I am pretty
> sure there are scenarios where this is required, but I have a hard time
> believing that it is a common one. Maybe things in finance have changed in
> 2020.
>
> Normally, any reference data has an "as of date" associated with it, and
> every record being processed has a timestamp associated with it. You match
> up your input with the reference data by matching the as-of date with the
> timestamp. When the reference data changes, you don't remove the old
> records from the reference data; you add records with the new "as of
> date". Essentially, you keep the history of the reference data. So, if you
> have to rerun an old computation, your results don't change. There might
> be scenarios where you want to correct old reference data; in that case,
> you update your reference table and rerun your computation.
>
> Now, if you are talking about streaming applications, then it's a
> different story. You want to refresh your reference data. Spark reloads
> the dataframes from batch sources at the beginning of every microbatch. As
> long as you are reading the data from a non-streaming source, it will get
> refreshed in every microbatch. Alternatively, you can send updates to the
> reference data through a stream, and then merge your historic reference
> data with the updates that you are getting from the streaming source.
>
> *From: *Arti Pande <pande.a...@gmail.com>
> *Date: *Friday, November 13, 2020 at 1:04 PM
> *To: *"user@spark.apache.org" <user@spark.apache.org>
> *Subject: *[EXTERNAL] Refreshing Data in Spark Memory (DataFrames)
>
> Hi
>
> In the financial systems world, if some data is updated very frequently
> and that data is to be used as reference data by a Spark job that runs for
> 6/7 hours, the Spark job will most likely read that data at the beginning,
> keep it in memory as a DataFrame, and keep running for the remaining 6/7
> hours. Meanwhile, if the reference data is updated by some other system,
> the Spark job's in-memory copy of that data (the data frame) goes out of
> sync.
>
> Is there a way to refresh that reference data in Spark memory / dataframe
> by some means?
>
> This seems to be a very common scenario. Is there a solution / workaround
> for this?
>
> Thanks & regards,
> Arti Pande
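P.S. To check my understanding of the "as of date" lookup you describe,
here is a minimal sketch. The table paths and the rate_id / as_of_date /
event_ts column names are again made up, and it assumes a SparkSession
named spark is already available:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Reference history keeps every version: (rate_id, as_of_date, rate).
val refHistory = spark.read.parquet("/data/ref/rates_history")

// Input records each carry a timestamp: (trade_id, rate_id, amount, event_ts).
val trades = spark.read.parquet("/data/in/trades")

// Join each trade to every reference version that was already effective
// at the trade's timestamp, then keep only the latest such version.
val candidates = trades
  .join(refHistory, Seq("rate_id"))
  .where(col("as_of_date") <= col("event_ts"))

val latestPerTrade = Window
  .partitionBy("trade_id")
  .orderBy(col("as_of_date").desc)

val enriched = candidates
  .withColumn("rn", row_number().over(latestPerTrade))
  .where(col("rn") === 1)
  .drop("rn")

// Because the history is append-only, rerunning the same input later gives
// the same results, even after newer reference versions have been added.

If that matches what you had in mind, we would use this pattern for the
parts of the pipeline that remain batch.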