Here is another interesting post: http://www.kdnuggets.com/2016/02/apache-spark-rdd-dataframe-dataset.html?utm_content=buffer31ce5&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
> On Feb 16, 2016, at 6:01 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>
> Hi,
>
> A Resilient Distributed Dataset (RDD) is a collection of data distributed
> across all nodes of the cluster. It is essentially raw data with little
> optimization applied to it. Remember that data is of little value until it
> is turned into information.
>
> A DataFrame, on the other hand, is the equivalent of a table in an RDBMS,
> akin to a table in Oracle or Sybase: a two-dimensional, array-like
> structure in which each column contains measurements on one variable and
> each row contains one case.
>
> A DataFrame therefore carries additional metadata by virtue of its tabular
> format, which the Spark optimizer, AKA Catalyst, can exploit for certain
> optimizations. After so many years, the relational model is arguably still
> the most elegant model known; it is used and emulated everywhere.
>
> Much like a table in an RDBMS, a DataFrame keeps track of its schema and
> supports various relational operations that lead to more optimized
> execution. Each DataFrame object represents a logical plan, but because
> DataFrames are "lazy", no execution occurs until the user calls a specific
> "output operation". This is very important to remember. You can go from a
> DataFrame to an RDD via its rdd method, and from an RDD to a DataFrame (if
> the RDD is in a tabular format) via the toDF method; a sketch of this
> round trip follows below.
>
> In general it is recommended to use a DataFrame where possible, because of
> the built-in query optimization.
>
> For those familiar with SQL, a DataFrame can be conveniently registered as
> a temporary table, and SQL operations can then be performed on it.
>
> Case in point: I am searching all my replication server log files,
> compressed and stored in an HDFS directory, for errors on a specific
> connection:
>
> // Needed for rdd.toDF; in spark-shell, sc and sqlContext are predefined
> import sqlContext.implicits._
>
> // Create an RDD of lines from the compressed log files
> val rdd = sc.textFile("/test/REP_LOG.gz")
> // Convert it to a DataFrame with a single column named "line"
> val df = rdd.toDF("line")
> // Register the DataFrame as a temporary table
> df.registerTempTable("t")
> println("\n Search for ERROR plus another word in table t\n")
> sqlContext.sql("SELECT * FROM t WHERE line LIKE '%ERROR%' AND line LIKE '%hiveserver2.asehadoop%'").collect().foreach(println)
>
> Alternatively, you can filter for the word with method calls on the
> DataFrame itself:
>
> import org.apache.spark.sql.functions.col // needed for col()
> df.filter(col("line").like("%ERROR%")).collect().foreach(println)
>
> HTH,
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
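To make the rdd/toDF round trip and the lazy-plan point above concrete, here is a minimal sketch. It assumes a Spark 1.x spark-shell where sc and sqlContext are predefined; the sample data and column names are made up for illustration:

// Needed for toDF and the $"..." column syntax
import sqlContext.implicits._

// Hypothetical tabular RDD of (connection, log line) pairs
val pairs = sc.parallelize(Seq(("conn1", "ERROR timeout"), ("conn2", "OK")))

// RDD -> DataFrame via toDF, naming the two columns
val df2 = pairs.toDF("connection", "line")
df2.printSchema() // the DataFrame carries its schema with it

// DataFrame -> RDD via the rdd method; elements come back as Row objects
val rows = df2.rdd

// filter merely extends the logical plan; nothing runs yet...
val errors = df2.filter($"line".like("%ERROR%"))
errors.explain() // ...and the plan can be inspected before any job runs

// Only an output operation such as collect() triggers execution
errors.collect().foreach(println)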
>
> From: Ashok Kumar [mailto:ashok34...@yahoo.com.INVALID]
> Sent: 16 February 2016 16:06
> To: User <user@spark.apache.org>
> Subject: Use case for RDD and Data Frame
>
> Gurus,
>
> What are the main differences between a Resilient Distributed Dataset
> (RDD) and a DataFrame (DF)?
>
> Where can one use an RDD without transforming it into a DF?
>
> Regards and obliged
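On the last question above: one common case for staying with an RDD is low-level work on unstructured data, where there is no schema for Catalyst to exploit, so a DataFrame buys you nothing. A minimal sketch (the input path is made up):

// Classic RDD-only word count: free-form text has no tabular structure,
// so there is no schema to hand to the optimizer and an RDD is a natural fit
val counts = sc.textFile("/test/unstructured.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.take(10).foreach(println)

Once the data has acquired a tabular shape, here (word, count) pairs, it can of course be converted with toDF for further relational processing.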