Re: RDD staleness

Michael Armbrust Sun, 31 May 2015 16:38:46 -0700

Each time you run a Spark SQL query, we create new RDDs that load the
data, so you should see the newest results.  There is one caveat:
formats that use the native Data Source API (Parquet, ORC (in Spark 1.4),
JSON (in Spark 1.5)) cache file metadata to speed up interactive querying.
To clear the metadata cache, run sql("REFRESH TABLE <tableName>").
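
A minimal sketch of the refresh-then-query pattern described above, in
Python. It assumes you pass in a callable that executes a SQL string (for
example the bound method sqlContext.sql of an existing context); the helper
name refresh_and_query and the table name "events" are hypothetical and
used only for illustration.

```python
import re

# Accept only plain or database-qualified identifiers ("events", "db.events").
_TABLE_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def refresh_and_query(sql, table_name, query):
    """Clear cached file metadata for table_name, then run query.

    sql is any callable that executes a SQL string, e.g. sqlContext.sql
    (an assumption about the caller's setup, not a fixed API).
    """
    if not _TABLE_NAME.match(table_name):
        raise ValueError("unexpected table name: %r" % (table_name,))
    # Statement taken from the message: drops the cached file metadata.
    sql("REFRESH TABLE " + table_name)
    # A fresh query creates new RDDs, so it loads the data as it is now.
    return sql(query)
```

Usage would look something like
refresh_and_query(sqlContext.sql, "events", "SELECT COUNT(*) FROM events"),
assuming sqlContext already exists and "events" is a registered table.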


On Sun, May 31, 2015 at 10:46 PM, DW @ Gmail <[email protected]> wrote:

> There is no mechanism for keeping an RDD up to date with a changing
> source. However, you could set up a stream that watches for changes to the
> directory and processes the new files, or use the Hive integration in
> Spark SQL to run Hive queries directly. (However, old query results will
> still grow stale.)
>
> Sent from my rotary phone.
>
>
> > On May 31, 2015, at 7:11 AM, Ashish Mukherjee <
> [email protected]> wrote:
> >
> > Hello,
> >
> > Since RDDs are created from data from Hive tables or HDFS, how do we
> > ensure they are invalidated when the source data is updated?
> >
> > Regards,
> > Ashish
