Gents, I am investigating Spark with a view to performing reporting on a large data set, one that receives additional data in the form of log files on an hourly basis.
Because the data set is large, there is a possibility we will create a range of aggregate tables to reduce the volume of data that has to be processed. Having spent a little while reading up on Spark, my thought was that I could create an RDD holding an aggregate, persist it to disk, run the reporting queries against that RDD, and then, when new data arrives, convert the new log file into an aggregate and add it to the aggregate RDD.

However, I am now getting the impression that RDDs cannot be persisted across jobs: I can generate an RDD and I can persist it, but I can see no way for a later job to load a previously persisted RDD (and I begin to think it will have been garbage collected at the end of the first job anyway). Is this correct?
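For what it's worth, below is a rough sketch (in Scala) of the workflow I have in mind. The paths, the tab-delimited key field, and in particular the saveAsObjectFile / objectFile round trip are my own guesses at how an RDD might be handed from one job to the next, which is exactly the part I am unsure about; in reality the two halves would be separate applications run hours apart.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD implicits for older Spark versions

object HourlyAggregateSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("hourly-aggregate-sketch")
    val sc   = new SparkContext(conf)

    // Job 1: build the initial aggregate from the existing logs and
    // write it out so that a later job can (hopefully) pick it up again.
    val logs = sc.textFile("hdfs:///logs/initial/*")
    val agg = logs
      .map(line => (line.split("\t")(0), 1L))   // count events per key (first tab-delimited field)
      .reduceByKey(_ + _)
    agg.saveAsObjectFile("hdfs:///aggregates/base")   // is this the right way to keep it for later jobs?

    // Job 2 (in reality a separate, later application): reload the saved
    // aggregate, aggregate the new hourly log file, and merge the two.
    val previous = sc.objectFile[(String, Long)]("hdfs:///aggregates/base")
    val newAgg = sc.textFile("hdfs:///logs/latest-hour/*")
      .map(line => (line.split("\t")(0), 1L))
      .reduceByKey(_ + _)
    val combined = previous.union(newAgg).reduceByKey(_ + _)
    combined.saveAsObjectFile("hdfs:///aggregates/updated")

    sc.stop()
  }
}

If there is a better-supported way for a later job to reload an aggregate like this, that is really what I am after.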