Ah, I see. So basically what you need is something like cache write-through support, which exists in Shark but is not yet implemented in Spark SQL. In Shark, when data is inserted into a table that has already been cached, the newly inserted data is automatically cached and "union"-ed with the existing table contents. SPARK-1671 <https://issues.apache.org/jira/browse/SPARK-1671> was created to track this feature. We'll work on that.
Currently, as a workaround, instead of doing the union at the RDD level, you may try caching the new table, unioning it with the old table, and then querying the unioned table. The drawbacks are higher code complexity and a pile of temporary tables, but the performance should be reasonable.

On Fri, Sep 12, 2014 at 1:19 PM, Archit Thakur <[email protected]> wrote:

> Little code snippet:
>
> line1: cacheTable(existingRDDTableName)
> line2: // some operations which will materialize the existingRDD dataset
> line3: existingRDD.union(newRDD).registerAsTable(new_existingRDDTableName)
> line4: cacheTable(new_existingRDDTableName)
> line5: // some operation that will materialize new_existingRDD
>
> Now, what we expect is that in line4, rather than caching both
> existingRDDTableName and new_existingRDDTableName, it should cache only
> new_existingRDDTableName. But we cannot explicitly uncache
> existingRDDTableName, because we want the union to use the cached
> existingRDDTableName. Since evaluation is lazy, new_existingRDDTableName
> could be materialized later, and until then we can't lose
> existingRDDTableName from the cache.
>
> What if we keep the same name for the new table?
>
> So:
> cacheTable(existingRDDTableName)
> existingRDD.union(newRDD).registerAsTable(existingRDDTableName)
> cacheTable(existingRDDTableName) // might not be needed again
>
> Would that satisfy both our cases: the union uses existingRDDTableName
> from the cache, and the data is not duplicated in the cache but somehow
> appended to the older cached table?
>
> Thanks and Regards,
>
> Archit Thakur.
> Sr Software Developer,
> Guavus, Inc.
>
> On Sat, Sep 13, 2014 at 12:01 AM, pankaj arora <[email protected]> wrote:
>
>> I think I should elaborate on the use case a little more.
>>
>> We have a UI dashboard whose response time is quite fast because all the
>> data is cached. Users query data based on a time range, and new data is
>> always coming into the system at a predefined frequency, say every hour.
>>
>> As you said, I can uncache the tables, but that basically drops all the
>> data from memory. I cannot afford to lose my cache even for a short
>> interval: all queries from the UI will be slow until the cache loads
>> again, and the UI response time needs to be predictable and fast enough
>> that users do not get irritated.
>>
>> Also, I cannot keep two copies of the data (until the new RDD
>> materializes) in memory, as that would surpass the total memory
>> available in the system.
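For concreteness, here is a minimal sketch of the workaround described at the top of this thread, written against the Spark 1.x `SQLContext` API (`cacheTable`, `registerAsTable`, `unionAll`). It assumes a running `SparkContext` `sc` and two SchemaRDDs, `existingRDD` (already registered and cached as `"existing_table"`) and `newRDD` holding the freshly arrived batch; all table and variable names are illustrative, not from the original thread.

```scala
import org.apache.spark.sql.SQLContext

// Assumption: `sc` is an existing SparkContext, and `existingRDD` /
// `newRDD` are SchemaRDDs; "existing_table" is already registered
// and cached.
val sqlContext = new SQLContext(sc)

// Cache the new batch under its own temporary name first.
newRDD.registerAsTable("new_batch")
sqlContext.cacheTable("new_batch")

// Register the union under a fresh name and cache it. Note unionAll
// (not RDD.union) so the schema is preserved; the union reads the old
// rows from the still-cached "existing_table".
existingRDD.unionAll(newRDD).registerAsTable("combined_table")
sqlContext.cacheTable("combined_table")

// Force materialization of the combined cache *before* evicting
// anything, so queries never fall back to recomputation.
sqlContext.sql("SELECT COUNT(*) FROM combined_table").collect()

// Only now is it safe to drop the old cache entries.
sqlContext.uncacheTable("existing_table")
sqlContext.uncacheTable("new_batch")
```

Note the trade-off called out above: between `cacheTable("combined_table")` and the two `uncacheTable` calls, both the old and combined copies are resident, which is exactly the temporary memory-doubling pankaj's use case cannot tolerate; the write-through support tracked in SPARK-1671 is what would remove that window.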
