* When you say refresh happens only for batch or non-streaming sources, I am assuming all kinds of DB sources like RDBMS, distributed data stores, file systems, etc. count as batch sources. Please correct if required.
It depends on how you read the dataframe. Any dataframe that you get by doing spark.readStream is a streaming dataframe. Any dataframe read by doing spark.read is a non-streaming dataframe. It doesn't matter what the underlying format is. Spark will refresh the entire non-streaming dataframe at the beginning of every microbatch. Note that if you cache the non-streaming dataframe, it won't be refreshed. Keep in mind that refreshing the dataframe on every microbatch introduces latency, because re-reading a non-streaming dataframe incurs IO. That latency depends on the size of the data being read; if your reference data is large, you will incur IO overhead on every microbatch.

From: Arti Pande <pande.a...@gmail.com>
Date: Friday, November 13, 2020 at 2:19 PM
To: "Lalwani, Jayesh" <jlalw...@amazon.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: RE: [EXTERNAL] Refreshing Data in Spark Memory (DataFrames)

Thanks for the quick response. This is a batch use case in the as-is world. We are redesigning it and intend to use streaming. Good to know that Spark streaming will refresh the data for every microbatch.

When you say refresh happens only for batch or non-streaming sources, I am assuming all kinds of DB sources like RDBMS, distributed data stores, file systems, etc. count as batch sources. Please correct if required.

Thanks & regards,
Arti Pande

On Sat, Nov 14, 2020 at 12:11 AM Lalwani, Jayesh <jlalw...@amazon.com> wrote:

Is this a streaming application or a batch application? Normally, for batch applications, you want to keep data consistent. If you have a portfolio of mortgages that you are computing payments for and the interest rate changes while you are computing payments, you don't want to compute half the mortgages with the older interest rate and the other half with the newer one. And if you run the same mortgages tomorrow, you don't want to get completely different results than what you got yesterday. The finance industry is kind of sensitive about things like this. You can't just change things willy-nilly.

In the past, I've worked in fintech for about 8 years, and IMO I've never heard of changing the reference data in the middle of a computation being a requirement. I would have given people heart attacks if I had told them the reference data was changing halfway through. I am pretty sure there are scenarios where this is required, but I have a hard time believing it is a common one. Maybe things in finance have changed in 2020.

Normally, any reference data has an "as of date" associated with it, and every record being processed has a timestamp associated with it. You match up your input with the reference data by matching the as-of date with the record's timestamp. When the reference data changes, you don't remove the old records from the reference data; you add records with the new "as of date". Essentially, you keep the history of the reference data, so if you have to rerun an old computation, your results don't change. There might be scenarios where you want to correct old reference data. In that case you update your reference table and rerun your computation.

Now, if you are talking about streaming applications, then it's a different story. You want to refresh your reference data. Spark reloads dataframes from batch sources at the beginning of every microbatch.
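As a rough sketch of that, here is what a stream-static join against a non-streaming reference table can look like. Everything concrete below (the paths, the schema, the column names, and the choice of a file-based streaming source) is made up purely for illustration:

import org.apache.spark.sql.types._

// Assumes an existing SparkSession named spark. Paths, schema, and column
// names are illustrative only.
val factSchema = new StructType()
  .add("key", StringType)
  .add("amount", DoubleType)
  .add("event_ts", TimestampType)

// Streaming side: new JSON files arriving in a directory.
val facts = spark.readStream
  .schema(factSchema)
  .json("s3://my-bucket/incoming/")

// Non-streaming side: loaded with spark.read, so (as described above) it is
// re-read at the beginning of every microbatch. Calling .cache() on it would
// pin one snapshot in memory and stop the refresh.
val reference = spark.read.parquet("s3://my-bucket/reference/")  // key, rate, ...

// Stream-static join: each microbatch of facts is joined against the
// freshly loaded reference data.
val enriched = facts.join(reference, Seq("key"), "left")

enriched.writeStream
  .format("console")
  .outputMode("append")
  .start()
  .awaitTermination()

The same idea applies whether the reference side is files, an RDBMS table, or a distributed store; what matters is only that it is loaded with spark.read rather than spark.readStream.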
As long as you are reading the data from a non-streaming source, it will get refreshed in every microbatch. Alternatively, you can send updates to the reference data through a stream, and then merge your historic reference data with the updates you are getting from the streaming source. (A rough sketch of one way to do that is at the bottom of this message.)

From: Arti Pande <pande.a...@gmail.com>
Date: Friday, November 13, 2020 at 1:04 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: [EXTERNAL] Refreshing Data in Spark Memory (DataFrames)

Hi,

In the financial systems world, if some data is being updated very frequently, and that data is to be used as reference data by a Spark job that runs for 6-7 hours, the job will most likely read that data at the beginning, keep it in memory as a DataFrame, and keep running for the remaining 6-7 hours. Meanwhile, if the reference data is updated by some other system, the Spark job's in-memory copy of that data (the dataframe) goes out of sync.

Is there a way to refresh that reference data in Spark memory / the dataframe by some means? This seems to be a very common scenario. Is there a solution or workaround for this?

Thanks & regards,
Arti Pande
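A rough sketch of the "send the updates through a stream" alternative mentioned above. Everything concrete here (paths, schema, column names, and the choice to append updates to a history and pick the latest version per key) is an assumption for illustration, not the only way to do it:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Assumes an existing SparkSession named spark. Paths, schema, and column
// names are illustrative only.
val updateSchema = new StructType()
  .add("key", StringType)
  .add("rate", DoubleType)
  .add("as_of_ts", TimestampType)

// Reference-data updates arriving as a stream (here, new JSON files).
val refUpdates = spark.readStream
  .schema(updateSchema)
  .json("s3://my-bucket/reference-updates/")

// Append every update to the reference history, keeping the old versions
// (the "as of date" history described earlier in this thread).
refUpdates.writeStream
  .format("parquet")
  .option("path", "s3://my-bucket/reference-history/")
  .option("checkpointLocation", "s3://my-bucket/checkpoints/reference-history/")
  .outputMode("append")
  .start()

// The main job reads the history with spark.read (non-streaming), so, as
// described above, it is refreshed at the beginning of every microbatch.
// It keeps only the latest row per key on the static side before joining.
// (Assumes the history path already holds an initial snapshot of the data.)
val refHistory = spark.read.parquet("s3://my-bucket/reference-history/")
val latestRef = refHistory
  .withColumn("rn", row_number().over(
    Window.partitionBy("key").orderBy(col("as_of_ts").desc)))
  .filter(col("rn") === 1)
  .drop("rn")

// latestRef can then be used as the static side of the stream-static join
// shown in the earlier sketch.

If you need reproducible reruns, you would instead join against the full history, matching each record's timestamp to the latest as-of timestamp at or before it, as described in the "as of date" paragraph above.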