Dear all,

Here is a requirement I am thinking of implementing in Spark core. Please let me know if this is possible, and kindly provide your thoughts.
A user executes a query to fetch, say, one million records from a database. We let the user store the result as a DataFrame, partitioned across the cluster. Later, another user executes the same query from a different session. Is there any way to let the second user reuse the DataFrame created by the first user?

Could we have a master catalog (a DataFrame or RDD) that records which DataFrames are currently loaded and matches incoming queries from other users against it? That way we would have a wonderful system that never allows the same query to be executed and loaded into cluster memory twice. A rough sketch of what I have in mind follows.
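To make the idea concrete, here is a minimal sketch of such a registry, assuming all user sessions live inside one Spark application (e.g. created via spark.newSession(), as the Spark Thrift Server does), since cached DataFrames cannot outlive their SparkContext. The QueryCache object and its normalize helper are hypothetical names for illustration only:

```scala
import org.apache.spark.sql.DataFrame
import scala.collection.concurrent.TrieMap

// Hypothetical registry mapping a normalized query string to the cached
// DataFrame first produced for it. All sessions must belong to the same
// Spark application, because cached data is tied to the SparkContext.
object QueryCache {
  private val loaded = TrieMap.empty[String, DataFrame]

  // Crude normalization so trivially different spellings of the same
  // query still hit the cache; a real system would compare query plans.
  private def normalize(sql: String): String =
    sql.trim.toLowerCase.replaceAll("\\s+", " ")

  // Return the DataFrame already registered for this query if one exists;
  // otherwise run `load`, mark the result for caching, and register it.
  def getOrLoad(sql: String)(load: => DataFrame): DataFrame =
    loaded.getOrElseUpdate(normalize(sql), load.cache())
}
```

Two sessions in the same application could then share the load (the query and table name here are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("shared-cache").getOrCreate()
val user2 = spark.newSession() // second user's session, same SparkContext

val q = "SELECT * FROM orders"
val df1 = QueryCache.getOrLoad(q)(spark.sql(q)) // first user: hits the database
val df2 = QueryCache.getOrLoad(q)(user2.sql(q)) // second user: reuses df1's cache
```

Spark's built-in global temporary views (df.createGlobalTempView("orders"), queried as global_temp.orders from any session) already cover part of this, but the matching of incoming queries against what is loaded would still need a layer like the one above. Across completely separate applications, something external (the Thrift Server, or an off-heap store) would be required.

Best,
Ravion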