Hi,

There might be a bug in how Dataset analysis or cached-Dataset lookup works. I'm on master.
➜  spark git:(master) ✗ ./bin/spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0-SNAPSHOT
      /_/

Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_112
Branch master
Compiled by user jacek on 2016-11-19T08:39:43Z
Revision 2a40de408b5eb47edba92f9fe92a42ed1e78bf98
Url https://github.com/apache/spark.git
Type --help for more information.

After reviewing CacheManager and how caching works for Datasets, I thought the following query would use the cached Dataset, but it does not.

// Cache the Dataset -- it is lazy
scala> val df = spark.range(1).cache
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

// Trigger caching
scala> df.show
+---+
| id|
+---+
|  0|
+---+

// Visit http://localhost:4040/storage to see the Dataset cached. And it is.

// Use the cached Dataset in another query.
// Notice InMemoryRelation in use for cached queries -- it works as expected.
scala> df.withColumn("newId", 'id).explain(extended = true)
== Parsed Logical Plan ==
'Project [*, 'id AS newId#16]
+- Range (0, 1, step=1, splits=Some(8))

== Analyzed Logical Plan ==
id: bigint, newId: bigint
Project [id#0L, id#0L AS newId#16L]
+- Range (0, 1, step=1, splits=Some(8))

== Optimized Logical Plan ==
Project [id#0L, id#0L AS newId#16L]
+- InMemoryRelation [id#0L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
      +- *Range (0, 1, step=1, splits=Some(8))

== Physical Plan ==
*Project [id#0L, id#0L AS newId#16L]
+- InMemoryTableScan [id#0L]
      +- InMemoryRelation [id#0L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
            +- *Range (0, 1, step=1, splits=Some(8))

I hoped that the following query would use the cached one, but it does not. Should it? I thought that QueryExecution.withCachedData [1] would do the trick.
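My mental model of the lookup is that the cache is keyed by structural plan equality after instance-specific expression IDs are normalized away, so the fresh IDs (id#0L vs id#26L) should not matter. Here is a toy, Spark-free sketch of that idea; `Attribute`, `RangePlan`, `canonicalized`, and `sameResult` below are simplified stand-ins I made up for this illustration, not Catalyst's actual classes:

```scala
// Toy sketch of cached-plan lookup by plan equality (NOT Spark's actual code).
// Assumption being illustrated: a cache hit occurs when a new query's plan
// equals a cached plan once per-instantiation expression IDs are erased.

// Simplified stand-in for Catalyst's Attribute.
case class Attribute(name: String, exprId: Long)

sealed trait Plan {
  def canonicalized: Plan
}

// A spark.range(...)-like leaf node.
case class RangePlan(start: Long, end: Long, output: Attribute) extends Plan {
  // Canonicalization zeroes out the expression ID, so two plans that
  // compute the same result compare equal even with different IDs.
  def canonicalized: Plan = RangePlan(start, end, output.copy(exprId = 0L))
}

// Conceptual cache-hit test.
def sameResult(a: Plan, b: Plan): Boolean = a.canonicalized == b.canonicalized

// The cached plan got id#0L; a fresh spark.range(1) got id#26L...
val cachedPlan = RangePlan(0L, 1L, Attribute("id", exprId = 0L))
val newQuery   = RangePlan(0L, 1L, Attribute("id", exprId = 26L))

// ...they differ as-is, but canonicalization should make the lookup match.
assert(cachedPlan != newQuery)
assert(sameResult(cachedPlan, newQuery))
```

If the lookup really works this way, the second query below ought to hit the InMemoryRelation; that it doesn't is what makes me suspect a bug.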
[1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala#L70

// The following snippet uses spark.range(1), which is the same as the Dataset cached above.
// Why does the physical plan not use InMemoryTableScan and InMemoryRelation?
scala> spark.range(1).withColumn("new", 'id).explain(extended = true)
== Parsed Logical Plan ==
'Project [*, 'id AS new#29]
+- Range (0, 1, step=1, splits=Some(8))

== Analyzed Logical Plan ==
id: bigint, new: bigint
Project [id#26L, id#26L AS new#29L]
+- Range (0, 1, step=1, splits=Some(8))

== Optimized Logical Plan ==
Project [id#26L, id#26L AS new#29L]
+- Range (0, 1, step=1, splits=Some(8))

== Physical Plan ==
*Project [id#26L, id#26L AS new#29L]
+- *Range (0, 1, step=1, splits=Some(8))

Best regards,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski