The unknown slowdown might be addressed by https://github.com/apache/spark/commit/f858f466862541c3faad76a1fa2391f1c17ec9dd
On Sun, Sep 14, 2014 at 10:40 PM, Evan Chan <velvia.git...@gmail.com> wrote:
> SPARK-1671 looks really promising.
>
> Note that even right now, you don't need to un-cache the existing
> table. You can do something like this:
>
>     newAdditionRdd.registerTempTable("table2")
>     sqlContext.cacheTable("table2")
>     val unionedRdd =
>       sqlContext.table("table1").unionAll(sqlContext.table("table2"))
>
> When you use "table", it returns the cached representation, so the
> union executes much faster.
>
> However, there is some unknown slowdown; it's not quite as fast as
> you would expect.
>
> On Fri, Sep 12, 2014 at 2:09 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
> > Ah, I see. So basically what you need is something like cache
> > write-through support, which exists in Shark but is not implemented in
> > Spark SQL yet. In Shark, when data is inserted into a table that has
> > already been cached, the newly inserted data is automatically cached
> > and "union"-ed with the existing table content. SPARK-1671 was created
> > to track this feature. We'll work on that.
> >
> > Currently, as a workaround, instead of doing the union at the RDD
> > level, you may try caching the new table, unioning it with the old
> > table, and then querying the unioned table. The drawbacks are higher
> > code complexity and ending up with lots of temporary tables, but the
> > performance should be reasonable.
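Put together, the workaround Evan and Cheng describe looks roughly like the
sketch below. It is written against the Spark 1.1-era SchemaRDD API; the
appendBatch helper and the table names ("events", "events_new", "events_all")
are illustrative, not from the thread.

    import org.apache.spark.sql.{SQLContext, SchemaRDD}

    // Cache the incoming batch as its own table, leave the old cached table
    // untouched, and point queries at a union of the two.
    def appendBatch(sqlContext: SQLContext, newBatch: SchemaRDD): Unit = {
      newBatch.registerTempTable("events_new")
      sqlContext.cacheTable("events_new") // materializes only the new rows

      // Both sides resolve to their cached representations, so the union
      // itself does not recompute either input.
      val unioned: SchemaRDD =
        sqlContext.table("events").unionAll(sqlContext.table("events_new"))
      unioned.registerTempTable("events_all") // queries now hit "events_all"
    }

The drawback Cheng mentions still applies: each batch adds another temporary
table, so the names have to be rotated or cleaned up over time.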
> > On Fri, Sep 12, 2014 at 1:19 PM, Archit Thakur
> > <archit279tha...@gmail.com> wrote:
> >> Little code snippet:
> >>
> >> line1: cacheTable(existingRDDTableName)
> >> line2: // some operations which will materialize the existingRDD dataset
> >> line3: existingRDD.union(newRDD).registerAsTable(new_existingRDDTableName)
> >> line4: cacheTable(new_existingRDDTableName)
> >> line5: // some operation that will materialize the new existingRDD
> >>
> >> Now, what we expect is that line4, rather than caching both
> >> existingRDDTableName and new_existingRDDTableName, caches only
> >> new_existingRDDTableName. But we cannot explicitly uncache
> >> existingRDDTableName, because we want the union to use the cached
> >> existingRDDTableName. Since evaluation is lazy, new_existingRDDTableName
> >> may be materialized later, and until then we can't lose
> >> existingRDDTableName from the cache.
> >>
> >> What if we keep the same name for the new table?
> >>
> >>     cacheTable(existingRDDTableName)
> >>     existingRDD.union(newRDD).registerAsTable(existingRDDTableName)
> >>     cacheTable(existingRDDTableName) // might not be needed again
> >>
> >> Would both of our requirements be satisfied: that the union uses
> >> existingRDDTableName from the cache, and that the data is not duplicated
> >> in the cache but is somehow appended to the older cached table?
> >>
> >> Thanks and Regards,
> >>
> >> Archit Thakur.
> >> Sr Software Developer,
> >> Guavus, Inc.
> >>
> >> On Sat, Sep 13, 2014 at 12:01 AM, pankaj arora
> >> <pankajarora.n...@gmail.com> wrote:
> >>> I think I should elaborate on the use case a little more.
> >>>
> >>> We have a UI dashboard whose response time is quite fast because all
> >>> the data is cached. Users query data based on a time range, and new
> >>> data is always coming into the system at a predefined frequency, let's
> >>> say every hour.
> >>>
> >>> As you said, I can uncache tables, but that basically drops all the
> >>> data from memory. I cannot afford to lose my cache even for a short
> >>> interval, as all queries from the UI will be slow until the cache
> >>> loads again. UI response time needs to be predictable and should be
> >>> fast enough that users do not get irritated.
> >>>
> >>> Also, I cannot keep two copies of the data in memory (until the new
> >>> RDD materializes), as that would exceed the total memory available in
> >>> the system.
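For completeness, Archit's same-name idea would look something like the sketch
below. Whether re-registering and re-caching under the same name actually
appends to the existing cache entry instead of duplicating it is exactly the
open question in this thread (the behavior SPARK-1671 is meant to provide), so
treat this as an experiment rather than a recipe. The swapInPlace helper and
the "events" table name are illustrative.

    import org.apache.spark.sql.{SQLContext, SchemaRDD}

    // Same-name variant: re-register the union under the old name instead of
    // rotating table names. The union plan still reads the cached "events"
    // data; whether re-caching duplicates it in memory should be verified
    // against your Spark version before relying on this pattern.
    def swapInPlace(sqlContext: SQLContext, newBatch: SchemaRDD): Unit = {
      val grown = sqlContext.table("events").unionAll(newBatch)
      grown.registerTempTable("events") // shadows the old registration
      sqlContext.cacheTable("events")   // may not be needed again
    }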