The unknown slowdown might be addressed by https://github.com/apache/spark/commit/f858f466862541c3faad76a1fa2391f1c17ec9dd
On Sun, Sep 14, 2014 at 10:40 PM, Evan Chan <velvia.git...@gmail.com> wrote:
> SPARK-1671 looks really promising.
>
> Note that even right now, you don't need to un-cache the existing
> table. You can do something like this:
>
>     newAdditionRdd.registerTempTable("table2")
>     sqlContext.cacheTable("table2")
>     val unionedRdd =
>       sqlContext.table("table1").unionAll(sqlContext.table("table2"))
>
> When you use "table", it returns the cached representation, so the
> union executes much faster.
>
> However, there is some unknown slowdown; it's not quite as fast as
> you would expect.
>
> On Fri, Sep 12, 2014 at 2:09 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
> > Ah, I see. So basically what you need is something like cache
> > write-through support, which exists in Shark but is not implemented in
> > Spark SQL yet. In Shark, when data is inserted into a table that has
> > already been cached, the newly inserted data is automatically cached
> > and "union"-ed with the existing table content. SPARK-1671 was created
> > to track this feature. We'll work on that.
> >
> > Currently, as a workaround, instead of doing the union at the RDD
> > level, you may try caching the new table, unioning it with the old
> > table, and then querying the unioned table. The drawbacks are higher
> > code complexity and ending up with lots of temporary tables, but the
> > performance should be reasonable.
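Put together, the workaround Evan and Cheng describe looks roughly like the
sketch below. It is written against the Spark 1.1-era SchemaRDD API; the
appendBatch helper and the table names ("events", "events_new", "events_all")
are illustrative, not from the thread.

    import org.apache.spark.sql.{SQLContext, SchemaRDD}

    // Cache the incoming batch as its own table, leave the old cached table
    // untouched, and point queries at a union of the two.
    def appendBatch(sqlContext: SQLContext, newBatch: SchemaRDD): Unit = {
      newBatch.registerTempTable("events_new")
      sqlContext.cacheTable("events_new") // materializes only the new rows

      // Both sides resolve to their cached representations, so the union
      // itself does not recompute either input.
      val unioned: SchemaRDD =
        sqlContext.table("events").unionAll(sqlContext.table("events_new"))
      unioned.registerTempTable("events_all") // queries now hit "events_all"
    }

The drawback Cheng mentions still applies: each batch adds another temporary
table, so the names have to be rotated or cleaned up over time.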
> > On Fri, Sep 12, 2014 at 1:19 PM, Archit Thakur
> > <archit279tha...@gmail.com> wrote:
> >> Little code snippet:
> >>
> >> line1: cacheTable(existingRDDTableName)
> >> line2: // some operations which will materialize the existingRDD dataset
> >> line3: existingRDD.union(newRDD).registerAsTable(new_existingRDDTableName)
> >> line4: cacheTable(new_existingRDDTableName)
> >> line5: // some operation that will materialize the new existingRDD
> >>
> >> Now, what we expect is that line4, rather than caching both
> >> existingRDDTableName and new_existingRDDTableName, caches only
> >> new_existingRDDTableName. But we cannot explicitly uncache
> >> existingRDDTableName, because we want the union to use the cached
> >> existingRDDTableName. Since evaluation is lazy, new_existingRDDTableName
> >> may be materialized later, and until then we can't lose
> >> existingRDDTableName from the cache.
> >>
> >> What if we keep the same name for the new table?
> >>
> >>     cacheTable(existingRDDTableName)
> >>     existingRDD.union(newRDD).registerAsTable(existingRDDTableName)
> >>     cacheTable(existingRDDTableName) // might not be needed again
> >>
> >> Would both of our requirements be satisfied: that the union uses
> >> existingRDDTableName from the cache, and that the data is not duplicated
> >> in the cache but is somehow appended to the older cached table?
> >>
> >> Thanks and Regards,
> >>
> >> Archit Thakur.
> >> Sr Software Developer,
> >> Guavus, Inc.
> >>
> >> On Sat, Sep 13, 2014 at 12:01 AM, pankaj arora
> >> <pankajarora.n...@gmail.com> wrote:
> >>> I think I should elaborate on the use case a little more.
> >>>
> >>> We have a UI dashboard whose response time is quite fast because all
> >>> the data is cached. Users query data based on a time range, and new
> >>> data is always coming into the system at a predefined frequency, let's
> >>> say every hour.
> >>>
> >>> As you said, I can uncache tables, but that basically drops all the
> >>> data from memory. I cannot afford to lose my cache even for a short
> >>> interval, as all queries from the UI will be slow until the cache
> >>> loads again. UI response time needs to be predictable and should be
> >>> fast enough that users do not get irritated.
> >>>
> >>> Also, I cannot keep two copies of the data in memory (until the new
> >>> RDD materializes), as that would exceed the total memory available in
> >>> the system.
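For completeness, Archit's same-name idea would look something like the sketch
below. Whether re-registering and re-caching under the same name actually
appends to the existing cache entry instead of duplicating it is exactly the
open question in this thread (the behavior SPARK-1671 is meant to provide), so
treat this as an experiment rather than a recipe. The swapInPlace helper and
the "events" table name are illustrative.

    import org.apache.spark.sql.{SQLContext, SchemaRDD}

    // Same-name variant: re-register the union under the old name instead of
    // rotating table names. The union plan still reads the cached "events"
    // data; whether re-caching duplicates it in memory should be verified
    // against your Spark version before relying on this pattern.
    def swapInPlace(sqlContext: SQLContext, newBatch: SchemaRDD): Unit = {
      val grown = sqlContext.table("events").unionAll(newBatch)
      grown.registerTempTable("events") // shadows the old registration
      sqlContext.cacheTable("events")   // may not be needed again
    }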