Hi,

 

A Resilient Distributed Dataset (RDD) is a collection of data distributed 
across the nodes of the cluster. It is basically raw data, with little 
optimization applied to it. Remember, data is not of much value until it is 
turned into information.
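
 

As a quick illustration, here is a minimal sketch in the Spark shell (where sc 
is the pre-defined SparkContext; the names are just for illustration). The RDD 
is simply raw values that you transform with functions:

//create an RDD from a local collection; Spark treats the elements as opaque data

val nums = sc.parallelize(1 to 10)

//transformations are lazy; the reduce action triggers the actual computation

val total = nums.map(_ * 2).reduce(_ + _)

println(total)   // 110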

 

On the other hand, a DataFrame is equivalent to a table in an RDBMS, akin to a 
table in Oracle or Sybase. In other words, it is a two-dimensional, array-like 
structure in which each column contains measurements on one variable and each 
row contains one case.

 

So a DataFrame by definition carries additional metadata due to its tabular 
format, which allows the Spark optimizer, AKA Catalyst, to take advantage of 
that format for certain optimizations. Even after so many years, the 
relational model is arguably the most elegant model known, and it is used and 
emulated everywhere.
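
 

You can see Catalyst at work by asking a DataFrame to explain its plans. A 
minimal sketch, assuming a DataFrame df with a string column "line" as in the 
example further down (the $ column syntax comes from sqlContext.implicits._, 
which the Spark shell imports for you):

//explain(true) prints the parsed, analyzed, optimized and physical plans

df.filter($"line".like("%ERROR%")).explain(true)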

 

Much like a table in an RDBMS, a DataFrame keeps track of its schema and 
supports various relational operations that lead to more optimized execution. 
Essentially, each DataFrame object represents a logical plan, but because of 
their "lazy" nature no execution occurs until the user calls a specific 
"output operation". This is very important to remember. You can go from a 
DataFrame to an RDD via its rdd method, and from an RDD to a DataFrame (if the 
RDD is in a tabular format) via the toDF method.
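
 

For example, a minimal sketch in the Spark shell (the implicits import, which 
the shell does for you, is what provides toDF; the names here are just for 
illustration):

import sqlContext.implicits._

//RDD of pairs -> DataFrame with named columns

val pairs = sc.parallelize(Seq((1, "a"), (2, "b")))

val pairsDF = pairs.toDF("id", "value")

//and back again: each element comes out as an org.apache.spark.sql.Row

val backToRdd = pairsDF.rdd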

 

In general it is recommended to use a DataFrame where possible, because of the 
built-in query optimization.

 

For those familiar with SQL, a DataFrame can be conveniently registered as a 
temporary table, and SQL operations can be performed on it.

 

Case in point: I am searching my replication server log files, compressed and 
stored in an HDFS directory, for an error on a specific connection:

 

//create an RDD (in the Spark shell, sc is the pre-defined SparkContext)

val rdd = sc.textFile("/test/REP_LOG.gz")

//convert it to a DataFrame with a single column called "line"

val df = rdd.toDF("line")

//register the DataFrame as a temporary table

df.registerTempTable("t")

println("\n Search for ERROR plus another word in table t\n")

//run the query (in the shell, sql is sqlContext.sql, imported automatically)

sql("select * from t WHERE line like '%ERROR%' and line like '%hiveserver2.asehadoop%'").collect().foreach(println)

 

Alternatively, you can use method calls on the DataFrame itself to filter for 
the word:

 

//col lives in org.apache.spark.sql.functions

import org.apache.spark.sql.functions.col

df.filter(col("line").like("%ERROR%")).collect.foreach(println)
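
 

To match both words in one go, the same way the SQL above does, the predicates 
can be combined with Column's && operator (a sketch on the same df):

df.filter(col("line").like("%ERROR%") && col("line").like("%hiveserver2.asehadoop%")).collect.foreach(println)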

 

HTH,

 

Dr Mich Talebzadeh

 

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only; if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free; 
therefore neither Peridale Technology Ltd, its subsidiaries nor their 
employees accept any responsibility.

 

 

From: Ashok Kumar [mailto:ashok34...@yahoo.com.INVALID] 
Sent: 16 February 2016 16:06
To: User <user@spark.apache.org>
Subject: Use case for RDD and Data Frame

 

Gurus,

 

What are the main differences between a Resilient Distributed Dataset (RDD) 
and a Data Frame (DF)?

 

Where can one use an RDD without transforming it to a DF?

 

Regards and obliged
