Thanks Chandeep.
Andy Grove, the author, pointed to that article in an earlier thread :)

Dr Mich Talebzadeh

LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only; if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free; therefore neither Peridale Technology Ltd, its subsidiaries nor their employees accept any responsibility.

From: Chandeep Singh [mailto:c...@chandeep.com]
Sent: 16 February 2016 18:17
To: Mich Talebzadeh <m...@peridale.co.uk>
Cc: Ashok Kumar <ashok34...@yahoo.com>; User <user@spark.apache.org>
Subject: Re: Use case for RDD and Data Frame

Here is another interesting post.

http://www.kdnuggets.com/2016/02/apache-spark-rdd-dataframe-dataset.html?utm_content=buffer31ce5&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer

On Feb 16, 2016, at 6:01 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

Hi,

A Resilient Distributed Dataset (RDD) is a collection of data partitioned across the nodes of the cluster. It is essentially raw data, with little optimization applied to it. Remember that data is of little value until it is turned into information. A DataFrame, on the other hand, is equivalent to a table in an RDBMS such as Oracle or Sybase.
In other words, it is a two-dimensional, array-like structure in which each column contains measurements on one variable and each row contains one case. By definition, then, a DataFrame carries additional metadata due to its tabular format, which the Spark optimizer (AKA Catalyst) can exploit for certain optimizations. After so many years, the relational model is still arguably the most elegant model known, used and emulated everywhere.

Much like a table in an RDBMS, a DataFrame keeps track of its schema and supports various relational operations that lead to more optimized execution. Each DataFrame object essentially represents a logical plan, but because of its "lazy" nature no execution occurs until the user calls a specific "output operation". This is very important to remember.

You can go from a DataFrame to an RDD via its rdd method, and from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method. In general it is recommended to use a DataFrame where possible because of the built-in query optimization. For those familiar with SQL, a DataFrame can conveniently be registered as a temporary table, and SQL operations can then be performed on it.
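The two conversions above can be sketched as follows. This is a minimal sketch, assuming a Spark 1.x spark-shell session (as in the code later in this thread) where sc (SparkContext) and sqlContext (SQLContext) are already defined; the sample data is made up for illustration:

```scala
// Assumes a spark-shell session, so sc and sqlContext already exist.
// The implicits import brings the toDF method into scope.
import sqlContext.implicits._

// RDD -> DataFrame: the RDD must have a tabular shape,
// here an RDD of (name, age) pairs with hypothetical values.
val rdd = sc.parallelize(Seq(("alice", 30), ("bob", 25)))
val df = rdd.toDF("name", "age")

// DataFrame -> RDD: the rdd method returns an RDD[Row].
val rows = df.rdd
rows.collect().foreach(println)
```

Note that toDF only works when Spark can infer a schema from the RDD's element type (tuples or case classes); a plain RDD of arbitrary objects has no tabular shape to convert.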
Case in point: I am searching all my replication server log files, compressed and stored in an HDFS directory, for errors on a specific connection.

//create an RDD
val rdd = sc.textFile("/test/REP_LOG.gz")
//convert it to a DataFrame with a single column named "line"
//(in spark-shell the sqlContext implicits needed by toDF are already imported)
val df = rdd.toDF("line")
//register the DataFrame as a temporary table
df.registerTempTable("t")
println("\n Search for ERROR plus another word in table t\n")
sql("select * from t WHERE line like '%ERROR%' and line like '%hiveserver2.asehadoop%'").collect().foreach(println)

Alternatively, you can use method calls on the DataFrame itself to filter on the word:

//col comes from org.apache.spark.sql.functions
df.filter(col("line").like("%ERROR%")).collect.foreach(println)

HTH,

Dr Mich Talebzadeh

From: Ashok Kumar [mailto:ashok34...@yahoo.com.INVALID]
Sent: 16 February 2016 16:06
To: User <user@spark.apache.org>
Subject: Use case for RDD and Data Frame

Gurus,

What are the main differences between a Resilient Distributed Dataset (RDD) and a Data Frame (DF)? Where can one use an RDD without transforming it to a DF?

Regards and obliged