Thanks Chandeep.
Andy Grove, the author, pointed to that article in an earlier thread :)

Dr Mich Talebzadeh

LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only; if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free; therefore neither Peridale Technology Ltd, its subsidiaries nor their employees accept any responsibility.

From: Chandeep Singh [mailto:c...@chandeep.com]
Sent: 16 February 2016 18:17
To: Mich Talebzadeh <m...@peridale.co.uk>
Cc: Ashok Kumar <ashok34...@yahoo.com>; User <user@spark.apache.org>
Subject: Re: Use case for RDD and Data Frame

Here is another interesting post.

http://www.kdnuggets.com/2016/02/apache-spark-rdd-dataframe-dataset.html?utm_content=buffer31ce5&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer

On Feb 16, 2016, at 6:01 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

Hi,

A Resilient Distributed Dataset (RDD) is a collection of data partitioned across the nodes of the cluster. It is essentially raw data, with little optimization applied to it. Remember that data is of little value until it is turned into information. A DataFrame, on the other hand, is equivalent to a table in an RDBMS such as Oracle or Sybase.
In other words, it is a two-dimensional, array-like structure in which each column contains measurements on one variable and each row contains one case. By definition, then, a DataFrame carries additional metadata due to its tabular format, which the Spark optimizer (AKA Catalyst) can exploit for certain optimizations. After so many years, the relational model is still arguably the most elegant model known, used and emulated everywhere.

Much like a table in an RDBMS, a DataFrame keeps track of its schema and supports various relational operations that lead to more optimized execution. Each DataFrame object essentially represents a logical plan, but because of its "lazy" nature no execution occurs until the user calls a specific "output operation". This is very important to remember.

You can go from a DataFrame to an RDD via its rdd method, and from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method. In general it is recommended to use a DataFrame where possible because of the built-in query optimization. For those familiar with SQL, a DataFrame can conveniently be registered as a temporary table, and SQL operations can then be performed on it.
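The two conversions above can be sketched as follows. This is a minimal sketch, assuming a Spark 1.x spark-shell session (as in the code later in this thread) where sc (SparkContext) and sqlContext (SQLContext) are already defined; the sample data is made up for illustration:

```scala
// Assumes a spark-shell session, so sc and sqlContext already exist.
// The implicits import brings the toDF method into scope.
import sqlContext.implicits._

// RDD -> DataFrame: the RDD must have a tabular shape,
// here an RDD of (name, age) pairs with hypothetical values.
val rdd = sc.parallelize(Seq(("alice", 30), ("bob", 25)))
val df = rdd.toDF("name", "age")

// DataFrame -> RDD: the rdd method returns an RDD[Row].
val rows = df.rdd
rows.collect().foreach(println)
```

Note that toDF only works when Spark can infer a schema from the RDD's element type (tuples or case classes); a plain RDD of arbitrary objects has no tabular shape to convert.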
Case in point: I am searching all my replication server log files, compressed and stored in an HDFS directory, for errors on a specific connection.

//create an RDD
val rdd = sc.textFile("/test/REP_LOG.gz")
//convert it to a DataFrame with a single column named "line"
//(in spark-shell the sqlContext implicits needed by toDF are already imported)
val df = rdd.toDF("line")
//register the DataFrame as a temporary table
df.registerTempTable("t")
println("\n Search for ERROR plus another word in table t\n")
sql("select * from t WHERE line like '%ERROR%' and line like '%hiveserver2.asehadoop%'").collect().foreach(println)

Alternatively, you can use method calls on the DataFrame itself to filter on the word:

//col comes from org.apache.spark.sql.functions
df.filter(col("line").like("%ERROR%")).collect.foreach(println)

HTH,

Dr Mich Talebzadeh

From: Ashok Kumar [mailto:ashok34...@yahoo.com.INVALID]
Sent: 16 February 2016 16:06
To: User <user@spark.apache.org>
Subject: Use case for RDD and Data Frame

Gurus,

What are the main differences between a Resilient Distributed Dataset (RDD) and a Data Frame (DF)? Where can one use an RDD without transforming it to a DF?

Regards and obliged