Hi,

I was just curious if anyone has ever used Spark as an application server
cache?

My use case is:
 * I have large datasets which need to be updated / inserted (upserted)
into the database
 * I have found that it is much easier to run a Spark submit job that
pulls the existing data from the database, compares it in memory with the
incoming new data, and upserts only the necessary rows, dropping all
duplicates (rough sketch below)
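
Roughly, the current batch job does something like this (a minimal
PySpark sketch; the JDBC settings, the "events" table and the "event_id"
key column are just placeholders, and the final write is a plain append
rather than a true upsert, which would depend on the target database):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedupe-upsert").getOrCreate()

# Existing rows already in the database (placeholder JDBC settings)
existing = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://dbhost/mydb")
    .option("dbtable", "events")
    .option("user", "...")
    .option("password", "...")
    .load())

# New batch of incoming data (placeholder path / format)
incoming = spark.read.parquet("/data/incoming/batch")

# Keep only the incoming rows that don't already exist, keyed on event_id
to_upsert = incoming.join(existing, on="event_id", how="left_anti")

# Write just the new rows back (append here; a real upsert / merge
# depends on what the target database supports)
(to_upsert.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://dbhost/mydb")
    .option("dbtable", "events")
    .option("user", "...")
    .option("password", "...")
    .mode("append")
    .save())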

I was thinking that if I keep the Spark DataFrame cached in memory in a
long-running Spark session, I could speed this process up further by
skipping the database query on each batch run.
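
What I have in mind is something along these lines, where a cached copy
of the existing keys lives in the long-running session and gets folded
forward after each batch (again only a sketch with placeholder names;
keeping the cached set consistent with the database is the part I'm
least sure about):

from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("long-running-cache").getOrCreate()

# Load the existing keys once at startup and keep them cached in memory
existing_keys = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://dbhost/mydb")
    .option("dbtable", "(select event_id from events) t")
    .load()
    .persist())
existing_keys.count()  # force materialization of the cache

def process_batch(incoming: DataFrame) -> DataFrame:
    """Dedupe one incoming batch against the cached keys, no DB read."""
    global existing_keys
    new_rows = incoming.join(existing_keys, on="event_id", how="left_anti")
    # ... upsert new_rows into the database here ...
    # Fold the new keys into the cached set so the next batch sees them
    updated = existing_keys.union(new_rows.select("event_id")).persist()
    updated.count()
    existing_keys.unpersist()
    existing_keys = updated
    return new_rows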

I have a data pipeline that subscribes to what is essentially a firehose
of information. I want to save everything, but I don't want to update or
save any duplicate data, and I'd like to eliminate the duplicates in
memory before making the database IO call.
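
To make "eliminate in memory before the IO call" concrete, the behaviour
I'm after looks roughly like Structured Streaming's dedup with a
watermark (just a sketch; the Kafka source, column names and JDBC
settings are made up, and I haven't committed to this approach):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("firehose-dedupe").getOrCreate()

# Placeholder firehose source (Kafka topic / brokers are made up)
stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "firehose")
    .load()
    .selectExpr("CAST(key AS STRING) AS event_id",
                "CAST(value AS STRING) AS payload",
                "timestamp"))

# Drop duplicate event_ids in memory before any database IO.
# dropDuplicatesWithinWatermark (Spark 3.5+) keeps the dedupe state
# bounded by the watermark instead of growing forever.
deduped = (stream
    .withWatermark("timestamp", "1 hour")
    .dropDuplicatesWithinWatermark(["event_id"]))

# Each micro-batch then only contains rows not seen within the watermark
def write_batch(batch_df, batch_id):
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://dbhost/mydb")
        .option("dbtable", "events")
        .mode("append")
        .save())

query = deduped.writeStream.foreachBatch(write_batch).start()
query.awaitTermination()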

If anyone has used Spark like this, I would appreciate your input, or a
suggestion for a different solution if Spark is not appropriate.

Thx
